Nonstationary and Sparsely-Correlated Multioutput Gaussian Process with Spike-and-Slab Prior
Abstract
Multioutput Gaussian process (MGP) is commonly used as a transfer learning method to leverage information among multiple outputs. A key advantage of MGP is providing uncertainty quantification for prediction, which is highly important for subsequent decision-making tasks. However, traditional MGP may not be sufficiently flexible to handle multivariate data with dynamic characteristics, particularly when dealing with complex temporal correlations. Additionally, because some outputs may lack correlation, transferring information among them may lead to negative transfer. To address these issues, this study proposes a nonstationary MGP model that can capture both the dynamic and sparse correlation among outputs. Specifically, the covariance functions of MGP are constructed using convolutions of time-varying kernel functions. Then a dynamic spike-and-slab prior is placed on correlation parameters to automatically decide which sources are informative to the target output in the training process. An expectation-maximization (EM) algorithm is proposed for efficient model fitting. Both numerical studies and real cases demonstrate its efficacy in capturing dynamic and sparse correlation structure and mitigating negative transfer for high-dimensional time-series data modeling. A mountain-car reinforcement learning case highlights that transferring knowledge from source expertise can enhance the efficiency of the target decision-making process. Our method holds promise for application in more complex multitask decision-making challenges within nonstationary environments in the future.
History: Rema Padman served as the senior editor for this article.
Funding: This work was supported by NSFC [Grants NSFC-72171003, NSFC-71932006, and NSFC-72101147].
Data Ethics & Reproducibility Note: The code capsule is available on Code Ocean at https://doi.org/10.24433/CO.4010696.v1 and in the e-Companion to this article (available at https://doi.org/10.1287/ijds.2023.0022).
1. Introduction
Gaussian process (GP) provides an elegant and flexible Bayesian nonparametric framework for modeling nonlinear mappings (Williams and Rasmussen 2006). Characterized solely by mean and covariance functions, it is capable of capturing complex input-output relationships, as well as measuring prediction uncertainty, which is critical for decision making. As a result, GP has been widely applied in various fields, such as Bayesian optimization (Frazier 2018), experiment design (Gramacy 2020), and product quality monitoring (Zhang et al. 2016). However, standard GP is designed for only a single output, which limits its use in multioutput or multitask scenarios arising in various fields, such as Bayesian optimization (Shen et al. 2023), traffic networks (Rodrigues et al. 2019), and computer simulation emulators (Fricker et al. 2013). Consequently, multioutput Gaussian process (MGP) has been gaining increasing attention from researchers and has emerged as an important member in the vast family of transfer learning (Pan and Yang 2009) and multitask learning (Zhang and Yang 2021) methods.
Stationary GP and MGP models are commonly used with covariance functions that depend only on the distance between data points. However, this invariance to input-space translation makes them unsuitable for nonstationary environments, where the data characteristics vary across the input domain (Williams and Rasmussen 2006). This phenomenon is quite common in time-series data. For instance, in the energy field, the mean household power consumption differs across seasons (Paun et al. 2023). In clinical studies, sepsis is very likely to cause changes in the cross-correlation among vital signs at its early onset (Fairchild et al. 2017). In kinesiology, the cooperation patterns of human joints vary across different gestures: for example, both hands move jointly in a “shoot” action but separately in a “throw” action (Xu et al. 2022). In such cases, nonstationary models, which allow all or a subset of parameters to vary, are generally more appropriate. Modeling and capturing such structural changes is important for subsequent decision-making tasks, such as identifying the risk of disease and providing medical care (Paun et al. 2023).
Two main kinds of methods have been proposed to capture the dynamic characteristics of nonstationary data. The first category assumes that the parameters are the same within local regions but different across regions. For example, a Bayesian tree-based GP (Gramacy and Lee 2008) uses a tree structure to partition the input space of a computer simulation emulator. Another method, called jump GP, cuts the input space into several segments to model piece-wise continuous functions (Park 2022). This model is optimized using the expectation-maximization (EM) algorithm or variational inference. Further, a clustering-based GP (Heaton et al. 2017) partitions the spatial data into groups by calculating a cluster dissimilarity and constructs stationary GPs for each group of data. A space-partitioning-based GP is further extended to the active learning area to accelerate the design of experiments with heterogeneous systems (Lee et al. 2023). However, these methods are not suitable for data with gradually changing characteristics. To address this issue, the methods in the second category abandon the locally stationary assumption. They allow all or some parameters to be input/time-dependent, and model those parameters by additional GPs. For instance, the nonstationary GPs introduced in Heinonen et al. (2016) and Paun et al. (2023) apply GP priors to the amplitude and length-scale parameters of a squared-exponential covariance function. In addition, based on these single-output nonstationary GPs, researchers have explored MGP models for multivariate data with dynamic characteristics. For example, a nonstationary MGP is established to model the varying correlation between vital signals, where a GP prior is imposed on the time-dependent correlation matrix (Meng et al. 2021). However, in this state-of-the-art MGP model, the GP prior has no shrinkage effect, which does not encourage a sparse estimation of cross-correlation among multiple outputs.
The pursuit of a sparse estimate of cross-correlation is rooted in negative information transfer, another critical challenge when using MGP. Transfer learning is very promising for leveraging information from correlated domains, called source domains, to a data-insufficient domain, called the target domain. However, not all the data from the source domains are necessarily correlated with the target domain. If knowledge is transferred from uncorrelated domains, it may reduce the performance of target learning, a phenomenon known as negative transfer (Pan and Yang 2009, Yoon and Li 2018). For example, the motion signal of a specific human joint may be correlated with only a subset of the other joints. In order to recover the joint’s motion information by borrowing information from others, it is necessary to detect which joints share similar moving trajectories with the target joint. Therefore, it is crucial that researchers and engineers can choose wisely which sources to transfer information from.
Negative transfer exists widely in transfer learning, often stemming from the excessive inclusion of source data. To handle this issue, one straightforward approach is to measure the relatedness of each source to the target and choose the most closely related one for information transfer. For example, the method proposed in Ye and Dai (2021) takes Jensen-Shannon (JS) divergence as a criterion and selects the source with the least divergence for knowledge transfer. However, such a method only takes pairwise transferability into account and ignores the global structure between the target and the sources. Moreover, this choice is made independently of the specific model and before training, which is far from optimal. An alternative approach is the regularized MGP (Wang et al. 2022), which jointly models all outputs and selects informative sources during the training process. However, all these approaches assume that the source-target cross-correlation is fixed in the time space, and thus cannot model the dynamic and sparse structure among multiple outputs.
To this end, we propose a nonstationary MGP model to capture the varying characteristics of data and mitigate the negative transfer simultaneously. Specifically, we focus on modeling the dynamic and sparse correlations between the sources and the target. In the proposed framework, we first construct a convolution-process-based MGP for transfer learning, whose covariance function parameters are allowed to vary in the time space. We then apply a spike-and-slab prior to the parameters that are related to the sparse correlation between the sources and the target. The slab part mainly accounts for smoothly changing or constant correlation parameters, whereas the spike part is responsible for shrinking some parameters to zero, thereby removing the corresponding uninformative sources. To the best of our knowledge, this is the first research on MGP that simultaneously handles dynamic relationships and negative transfer. Our contributions can be summarized as follows:
A novel nonstationary MGP model is established using the convolution of latent processes and time-dependent kernel functions, which is suitable for modeling multiple outputs with varying characteristics.
A dynamic spike-and-slab prior is applied to capture the temporal and sparse correlations among outputs, deciding from which sources to transfer information to the target.
The mixture weight of the spike-and-slab priors is automatically adjusted during the training process using an EM-based optimization algorithm, which can effectively prevent placing shrinkage effects on nonzero elements.
The rest of this paper is organized as follows. In Section 2, we revisit the related literature and the static MGP model. Section 3 presents the proposed nonstationary MGP and an efficient EM algorithm for model training. In Section 4, we evaluate the effectiveness of our model on simulated data. In Section 5, we perform one time-series analysis case on human gesture data (Fothergill et al. 2012) and one control policy optimization case on the mountain-car problem (Moore 1990). In Section 6, we conclude the paper with a discussion.
2. Preliminaries
In this section, we first review research that is related to our work. We then introduce the static MGP based on the convolution process, which has been widely applied in various areas because of its flexibility (Boyle and Frean 2004, Kontar et al. 2018, Hu and Wang 2021).
2.1. Related Work
To deal with nonstationary data, a natural extension of GPs is to relax the restriction that the parameters of the covariance functions are invariant throughout the input space. Most of the existing approaches either encourage the parameters to be constant in a local area and construct a piece-wise model (the first category, e.g., Gramacy and Lee 2008, Heaton et al. 2017, Park 2022, Lee et al. 2023), or allow them to vary at each point and model them using other GPs (the second category, e.g., Paciorek and Schervish 2003, Heinonen et al. 2016, Paun et al. 2023). The methods in the second category have a similar structure to that of a two-layer Deep Gaussian Process (Deep GP), where the input is first transformed by the first GP layer into a latent input, and then fed into the second GP layer to obtain the output (Damianou and Lawrence 2013, AlBahar et al. 2022, Ko and Kim 2022). However, the parameters of a Deep GP are stationary, which differs from the second category where the covariance parameters are dynamic.
Nonstationary GPs mainly focus on modeling the dynamic mean, smoothness, and amplitude parameters. With regard to MGP, dynamic correlation is another key characteristic that needs to be considered. In the classical Linear Model of Coregionalization (LMC), each output is a linear combination of several latent Gaussian processes, and the covariance matrix is modeled by a Kronecker product of a correlation matrix and a single GP’s covariance matrix (Goulard and Voltz 1992). The existing nonstationary MGPs are mainly extensions of the classical LMC model. For example, the approach in Gelfand et al. (2004) allows the correlation matrix to vary with inputs to model the dynamic relationship among outputs. In Meng et al. (2021), a nonstationary MGP combines the time-varying correlation (across outputs) and smoothness (within each output) together. However, as extensions of the traditional LMC, these methods also suffer from the limitation that all outputs possess the same covariance structure. More flexible MGPs are proposed by constructing each output through the convolution process and modeling them with individual parameters (Boyle and Frean 2004). However, these approaches are for stationary data. Furthermore, all existing approaches fail to capture a sparse correlation structure in a nonstationary environment.
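For intuition, the separable LMC covariance described above can be assembled directly as a Kronecker product. The sketch below is a minimal illustration with made-up sizes and correlation values, not parameters from any cited model.

```python
import numpy as np

def se_cov(t, amplitude=1.0, length_scale=1.0):
    """Squared-exponential covariance matrix over a 1-D input vector t."""
    d = t[:, None] - t[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / length_scale) ** 2)

# Illustrative sizes: m = 3 outputs observed at n = 50 common time points.
t = np.linspace(0.0, 10.0, 50)
K_t = se_cov(t)                       # n x n covariance shared by all outputs

B = np.array([[1.0, 0.8, 0.0],        # output-by-output correlation matrix
              [0.8, 1.0, 0.3],
              [0.0, 0.3, 1.0]])

# Separable LMC-style covariance of the stacked outputs: (m*n) x (m*n).
K = np.kron(B, K_t)
print(K.shape)                        # (150, 150)
```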
Spatial-temporal modeling of nonstationary data is closely related to our work. In comparison with normal time-series modeling, spatial-temporal analysis requires modeling the spatial correlation to enhance the prediction accuracy. A large number of spatial-temporal models have been investigated, such as the spatial-temporal autoregressive integrated moving average method (ST-ARIMA) (Stathopoulos and Karlaftis 2003), spatial-temporal k-nearest neighbors (ST-KNN) (Xia et al. 2016), spatial-temporal random fields (Christakos 2012, Yun et al. 2022), and spatial-temporal deep neural networks (Wang et al. 2020b, Wen et al. 2023). Based on the aforementioned methods, a number of recent works try to extend them to handle nonstationary spatial-temporal data. One popular and efficient solution is utilizing some change detection algorithm to partition the time-series into several stationary periods, and then applying the stationary model for each period, for example, an ST-KNN with a wrapped K-means partition algorithm (Cheng et al. 2021) or an autoregressive model coupled with a block-fused-Lasso change-point detection algorithm (Bai et al. 2022). Besides partitioning the time-series into stationary parts, the method proposed by Shand and Li (2017) maps the nonstationary space-time process into a high-dimensional stationary process through augmenting new dimensions. Another type of nonstationary spatial-temporal model is Bayesian random fields with nonstationary space-time kernels (Garg et al. 2012, Ton et al. 2018, Wang et al. 2020a, Zou et al. 2023), whose hyper-parameters change over time or location. Deep learning methods are also explored on nonstationary data recently, such as nonstationary recurrent neural networks (Rangapuram et al. 2018, Liu et al. 2020), long short-term memory networks (Wang et al. 2019), and transformer-based networks (Liu et al. 2022, Wen et al. 2023). In contrast to the spatial-temporal model, MGP does not impose a restriction that the source outputs must be sampled during the same period as the target outputs. Furthermore, it does not depend on spatial distance to establish correlations among outputs.
It is important to mention that we use a dynamic spike-and-slab prior in our model. The classical spike-and-slab prior is a Bayesian selection approach (George and McCulloch 1993) that has been used for feature selection in (generalized) linear models, additive models, and Gaussian processes (Ishwaran and Rao 2005, Scheipl et al. 2012, Dance and Paige 2022). With this prior, smaller parameters tend to be more influenced by the spike prior to reach zero, whereas the larger ones are mainly dominated by the slab part and bear little shrinkage. However, the classical spike-and-slab prior cannot account for the modeling of dynamic and sparse correlation parameters in our model. Therefore, we propose to extend this prior to a dynamic version. Although current works have explored the dynamic variable selection for the varying-coefficient linear models (Kalli and Griffin 2014, Huber et al. 2021, Rockova and McAlinn 2021), no work utilizes a dynamic spike-and-slab prior to model the dynamic and sparse correlation among outputs in a nonstationary MGP.
2.2. Static MGP Based on Convolution Process
Consider a set of m outputs , where is a d-dimensional input domain applied to all outputs. Suppose that the observation yi is accompanied with independent and identically distributed (i.i.d.) noise , that is,
As the convolution of a GP and a smoothing kernel is still a GP, we can construct each output fi through convolving a group of shared latent processes and kernel functions in the following way (Boyle and Frean 2004, Alvarez and Lawrence 2011):
At a new point , the posterior distribution of given data is
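To make the convolution construction concrete, the following sketch uses the standard closed form for the covariance of outputs obtained by convolving a shared white-noise process with one-dimensional Gaussian kernels, and then forms the usual GP posterior. All parameter values are illustrative; the paper's equations give the exact expressions used in the model.

```python
import numpy as np

def convolved_cov(t1, t2, a1, l1, a2, l2):
    """Covariance between outputs built by convolving one white-noise process
    with Gaussian kernels k_i(u) = a_i * exp(-u^2 / (2 l_i^2)) (1-D inputs).
    The convolution integral has the closed form below: an SE covariance
    whose squared length-scales add."""
    d = t1[:, None] - t2[None, :]
    var = l1**2 + l2**2
    scale = a1 * a2 * np.sqrt(2.0 * np.pi) * l1 * l2 / np.sqrt(var)
    return scale * np.exp(-0.5 * d**2 / var)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 40)

# Two outputs sharing one latent process: auto- and cross-covariances.
K11 = convolved_cov(t, t, a1=1.0, l1=0.5, a2=1.0, l2=0.5)
K22 = convolved_cov(t, t, a1=0.7, l1=1.0, a2=0.7, l2=1.0)
K12 = convolved_cov(t, t, a1=1.0, l1=0.5, a2=0.7, l2=1.0)

# GP posterior of output 2 given noisy observations of output 1.
sigma2 = 0.01                                    # observation noise variance
y1 = np.sin(t) + np.sqrt(sigma2) * rng.standard_normal(t.size)
L = np.linalg.cholesky(K11 + sigma2 * np.eye(t.size))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y1))
post_mean = K12.T @ alpha
post_cov = K22 - K12.T @ np.linalg.solve(L.T, np.linalg.solve(L, K12))
```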
3. Model Development
We propose a nonstationary MGP model for transfer learning that can capture sparse source-target correlation in a dynamic environment. Specifically, we assume that the correlations between the target and each source vary over time. Besides, some sources may not be related to the target during certain time periods. Under such a circumstance, a spike-and-slab prior is utilized to model the varying and sparse correlation structure.
3.1. The Proposed Model
The structure of our hierarchical model is illustrated in Figure 1. The first layer constructs outputs through the convolution of time-dependent kernel functions and latent white Gaussian noise processes, and the second layer consists of priors on function parameters designed to encourage desired properties, such as smoothness and sparsity.
[Figure 1]
Notes. Latent processes and kernel functions are shown on a gray background, whereas the parameters are shown on a white background. The parameters’ priors are shown in rectangles.
3.1.1. Dynamic MGP.
In this subsection, we introduce the major part of the proposed nonstationary MGP, which corresponds to the first layer in Figure 1. To simplify the notation, we slightly abuse the notation used in the previous section. Specifically, we take the first m − 1 outputs as the sources, and the last one as the target. We still assume that the observation yi is accompanied with the i.i.d. measurement noise . Let be the index set of all outputs, and contain the indices of all sources. For simplicity yet without loss of generality, we assume the source and target data are sampled at the time , that is, and . In Online Appendix A, we will show a more general case where each output is observed only at a subset of time stamps.
Based on the above assumption, our dynamic MGP model is formulated as
This flexible model allows each source to have its own kernel gii, thereby allowing for heterogeneity among the sources. In order to transfer knowledge from the sources to the target, the target is connected to through the kernel function . Regarding the parameters, are responsible for the nonstationary behavior within each output, whereas capture the dynamic correlation between the target and the sources. More specifically, the amplitude parameter controls the knowledge transfer. For example, if , then fi will not transfer information to fm at time t.
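As a minimal numerical illustration of how a time-varying amplitude switches knowledge transfer on and off, the sketch below uses a simplified separable cross-covariance of the form a(t) a_src(t') r(t, t'); the paper's convolved construction in Equations (10a)-(10c) gives the exact nonseparable form.

```python
import numpy as np

def base_corr(t1, t2, length_scale=1.0):
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Amplitude of the i-th source in the target's kernel: active only for t < 5.
def a_mi(t):
    return np.where(t < 5.0, 1.0, 0.0)

t = np.linspace(0.0, 10.0, 11)
a_src = np.ones_like(t)                           # source amplitude (constant here)
K_cross = a_mi(t)[:, None] * a_src[None, :] * base_corr(t, t)

# Rows with a_mi(t) = 0 vanish: the source transfers nothing to the target there.
print(np.allclose(K_cross[t >= 5.0, :], 0.0))     # True
```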
As the latent processes are independent of each other, the covariance between any two distinct sources is zero. Therefore, the covariance matrix can be repartitioned as
The proposed nonstationary MGP covariance matrix in Equation (9) is positive-definite, that is, ,
The proof is provided in Online Appendix B.
Based on Equation (9), the joint distribution of all sources and the target is expressed as
In the proposed model, the most important and challenging task is to estimate those time-varying parameters. If no restriction is applied, model training may suffer from a serious overfitting problem. To address this issue, Gaussian processes are typically employed to model the kernel parameters, for example, , where are covariance functions for the amplitude and length-scale parameters, respectively. Although this technique can force the parameters to vary smoothly and reduce overfitting, it cannot model a sparse correlation between the sources and the target. Consequently, this approach cannot avoid the negative transfer caused by unrelated sources.
3.1.2. Dynamic Spike-and-Slab Prior.
The classical spike-and-slab prior (George and McCulloch 1993, Dance and Paige 2022) only handles the shrinkage of a single parameter and cannot model smoothly varying parameters. To take both the dynamic and sparse properties of the correlation into account, we propose a dynamic spike-and-slab prior placed on :
Note that is a diagonal matrix, so that the slab prior is placed on its d diagonal elements independently, that is, . By using the conditional distributions as priors, we can control the change of amplitude or smoothness from the previous time step to the current one, for example, from to . Compared with the dynamic spike-and-slab prior used in linear models (Rockova and McAlinn 2021), our method does not constrain the slab prior to be a stable autoregressive process. Besides, we use a simpler but more flexible prior for , whereas the work in Rockova and McAlinn (2021) uses a prior conditional on the coefficients of the linear model.
The spike prior is responsible for shrinking the parameters to zero and cutting down the information transfer channel to the target. Common choices for this prior include point mass distribution, Laplace distribution, and Gaussian distribution with a small variance. Considering the shrinkage performance and optimization convenience, we choose the Laplace distribution as the spike prior, that is,
At time t, η can be interpreted as the prior probability that belongs to the slab process. It influences the strength of the shrinkage effect that bears. Ideally, for nonzero , the posterior mean of should be close to one so that is barely impacted by the spike prior. In the optimization algorithm developed in the next subsection, we will show that the estimated mean of is automatically adjusted based on the estimated to avoid shrinking nonzero elements. This makes our method superior to traditional Lasso methods where the sparse penalty weights are identical for zero and nonzero parameters.
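For concreteness, the sketch below evaluates the two prior components for a single correlation parameter, assuming a Laplace spike with scale ν0 and a Gaussian random-walk slab with variance ν1 centered at the previous value; the exact parameterization in the paper may differ slightly.

```python
import numpy as np

def log_spike(theta_t, nu0=0.01):
    """Laplace spike centered at zero with scale nu0: shrinks theta_t toward 0."""
    return -np.abs(theta_t) / nu0 - np.log(2.0 * nu0)

def log_slab(theta_t, theta_prev, nu1=0.1):
    """Gaussian slab centered at the previous value: encourages smooth variation."""
    return -0.5 * (theta_t - theta_prev) ** 2 / nu1 - 0.5 * np.log(2.0 * np.pi * nu1)

def log_prior(theta_t, theta_prev, gamma_t, nu0=0.01, nu1=0.1):
    """Mixture selected by the binary indicator gamma_t (1 = slab, 0 = spike)."""
    return gamma_t * log_slab(theta_t, theta_prev, nu1) + (1 - gamma_t) * log_spike(theta_t, nu0)

# Example: a smoothly varying nonzero parameter under the slab component.
print(log_prior(theta_t=0.9, theta_prev=0.85, gamma_t=1))
```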
Finally, based on the above discussion, the whole hierarchical model of the proposed nonstationary MGP can be expressed as follows:
3.2. Expectation-Maximization-Based Optimization Algorithm
A widely used inference algorithm for Bayesian models is Markov chain Monte Carlo (MCMC) sampling, but it is computationally inefficient for the proposed nonstationary model with its considerable number of time-varying parameters. Therefore, we develop an efficient EM algorithm. Instead of directly maximizing the posterior , we proceed iteratively in terms of the complete log posterior , where the binary parameters are treated as “missing data.” Because this function is not observable, in the Expectation-step (E-step), we calculate its conditional expectation given the observed data and the currently estimated parameters. Then in the Maximization-step (M-step), we maximize the expected complete log-posterior with respect to . More precisely, the E-step and M-step at the th iteration can be expressed as
Based on Bayes’ theorem and the property of multivariate normal distribution, the expectation of can be as follows (derivation details can be found in Online Appendix C):
In the E-step, because is only dependent on , the posterior of is calculated as
Then the conditional expectation of can be updated as
The posterior mean can be interpreted as the posterior probability of classifying the current parameter into a slab process as opposed to a spike process, based on the past value . For example, we set η to 0.5 as a noninformative prior, and take a small ν0 (e.g., 0.01) for the spike prior and a large ν1 (e.g., 0.1) for the soft slab prior. Based on Equation (21), supposing that is small and is large, the expectation will tend toward one, indicating that is more likely from the slab prior. On the other hand, if is small, is close to zero, enforcing strong shrinkage on .
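In code, this E-step update is simply the posterior probability of the slab component. The self-contained sketch below assumes a Laplace spike with scale ν0 and a Gaussian random-walk slab with variance ν1, as in the previous sketch; the precise densities in the paper's Equation (21) may be parameterized differently.

```python
import numpy as np

def slab_pdf(theta_t, theta_prev, nu1=0.1):
    return np.exp(-0.5 * (theta_t - theta_prev) ** 2 / nu1) / np.sqrt(2.0 * np.pi * nu1)

def spike_pdf(theta_t, nu0=0.01):
    return np.exp(-np.abs(theta_t) / nu0) / (2.0 * nu0)

def e_step_gamma(theta_t, theta_prev, eta=0.5, nu0=0.01, nu1=0.1):
    """Posterior expectation of the slab indicator given the current parameter values."""
    slab = eta * slab_pdf(theta_t, theta_prev, nu1)
    spike = (1.0 - eta) * spike_pdf(theta_t, nu0)
    return slab / (slab + spike)

# A clearly nonzero, smoothly varying parameter is assigned to the slab ...
print(e_step_gamma(theta_t=0.9, theta_prev=0.85))   # close to 1: negligible shrinkage
# ... whereas a parameter near zero is assigned to the spike and gets shrunk.
print(e_step_gamma(theta_t=0.001, theta_prev=0.0))  # close to 0: strong shrinkage
```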
In the M-step, we can optimize the objective function in Equation (19) with various gradient ascent methods, such as (stochastic) ADAM, L-BFGS, etc. (Zhu et al. 1997, Kingma and Ba 2015). This objective function is actually a standard Gaussian process log-likelihood with additional regularization terms. The regularization terms penalize the difference between parameters at successive time points and shrink the amplitude parameters to facilitate source selection. The weights of the regularization terms are modulated by the expectation of . Ideally, this expectation equals one for a nonzero parameter, so no shrinkage effect is placed on it. In other words, the strength of the sparsity penalty is automatically adjusted through Equation (21), and this adjustment has explicit statistical interpretability.
In our case studies, we find the algorithm converges rapidly, for example, achieving convergence after only five iterations. The whole algorithm is summarized in Algorithm 1, where an ADAM method (Kingma and Ba 2015) is utilized in the M-step. Note that the large parameter space of poses a challenge in identifying the modes of the full posterior. To speed up the convergence, we propose to initialize the source parameter by maximizing the sum of sources’ marginal log-likelihood and source parameter prior:
For target parameters, we find simple random initialization can achieve satisfactory performance in experiments.
3.3. Computational Challenge
There are three main computational challenges that we need to address. The first one is calculating the integral induced by the convolution kernel. To avoid an intractable integral in the covariance, we utilize the Gaussian kernel in Equation (8) and derive closed-form covariance functions in Equations (10a)–(10c).
The second challenge is calculating the inverse of the covariance matrix, which is a critical issue for all GPs. The computational complexity of Algorithm 1 is approximately , where ni is the number of data points for each output. In comparison, the computational cost for traditional MGP is with the same size of data and nonseparable covariance. The main computational load of our model is in calculating the inverse of and in the source marginal distribution and the target conditional distribution , respectively, in Equation (19). Because all the latent processes are independent of each other, is a block-diagonal matrix and the complexity is reduced from to . The calculation of the inverse of is . Therefore, the overall computational complexity is reduced to .
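The savings from the block structure can be checked directly: the sketch below solves a system with a block-diagonal covariance both block by block (cost summed over blocks) and as one dense matrix, with illustrative block sizes.

```python
import numpy as np
from scipy.linalg import block_diag, cho_factor, cho_solve

rng = np.random.default_rng(0)

def random_spd(n):
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

# Four independent sources, each with n_i = 100 observations.
blocks = [random_spd(100) for _ in range(4)]
rhs = [rng.standard_normal(100) for _ in range(4)]

# Block-wise solve: cost is sum_i O(n_i^3).
sol_blockwise = np.concatenate(
    [cho_solve(cho_factor(b), r) for b, r in zip(blocks, rhs)])

# Equivalent dense solve on the full block-diagonal matrix: cost is O((sum_i n_i)^3).
K_full = block_diag(*blocks)
sol_dense = cho_solve(cho_factor(K_full), np.concatenate(rhs))

print(np.allclose(sol_blockwise, sol_dense))  # True: same result, far cheaper per block
```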
Finally, it is a hard task to estimate a considerable number of time-varying parameters. Therefore, we develop the EM-based algorithm to fit the model rather than using a sampling method. Based on the results of a nonstationary linear model (Rockova and McAlinn 2021), the MCMC and EM algorithm lead to very close prediction errors, but the running time of MCMC is about 10 times longer than that of the EM.
Algorithm 1
Input: Data , ν0, ν1, ρ, η.
1: Set starting value: .
2: Initialization: obtain through Equation (22).
3: for kout iterations do
4: E-step given : Update through Equation (21).
5: M-step given :
6: for kin epochs do
7: Calculate , and according to Equation (10).
8: Calculate the value and gradient of objective function in Equation (19).
9: Obtain using the ADAM method.
10: end for
11: end for
3.4. Tuning Parameter Selection
The tuning parameters for our model are ν0, ν1 (for the hard slab prior), and ν2 (for the soft slab prior). Here, because the key of our method is selecting the most informative sources, we propose to maximize the following criterion:
3.5. Model Prediction
Because the EM algorithm only estimates the value of parameters at the observed time points, given a new input of interest, we first need to estimate the target parameter and at the new time point , then derive the predictive distribution of .
3.5.1. Forecasting.
In the forecasting task, . The estimated value of is
3.5.2. Recovery.
In the recovery task, the target data are unobserved at some specific time points and we aim to recover the missing data, that is, . Define tbe and taf to be the nearest observed time points before and after , respectively. Denote the nearest observation time to as tnear, that is,
As the parameters before and after are already optimized by the EM algorithm, the estimation of becomes
Then, given the parameter and , and the new input point , the joint distribution of and observations can be expressed as
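A small sketch of the parameter-carrying step described above, assuming that forecasting carries forward the estimates at the last observed time (a natural choice under the random-walk slab prior) and that recovery reuses the estimates at the nearest observed time; the paper's equations give the exact estimators.

```python
import numpy as np

def params_for_new_time(t_new, t_obs, theta_obs, task="recovery"):
    """Pick target parameters at a new time point.

    t_obs:     (T,) observed time stamps (sorted).
    theta_obs: (T, p) parameter estimates from the EM algorithm.
    """
    if task == "forecasting":
        # t_new lies beyond the observed range: carry the last estimate forward.
        return theta_obs[-1]
    # Recovery: t_new lies inside the observed range; use the nearest time stamp.
    idx = np.argmin(np.abs(t_obs - t_new))
    return theta_obs[idx]

t_obs = np.arange(0.0, 10.0)
theta_obs = np.random.default_rng(0).standard_normal((10, 3))
print(params_for_new_time(10.5, t_obs, theta_obs, task="forecasting"))
print(params_for_new_time(4.2, t_obs, theta_obs, task="recovery"))
```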
4. Numerical Study
In this section, we verify the effectiveness of the proposed nonstationary MGP with the spike-and-slab prior (denoted as DMGP-SS) using synthetic data. In Section 4.1, we briefly describe the general settings for the numerical study and benchmark methods. In Section 4.2, we introduce the design of simulation cases, where the cross-correlation of the sources and the target is dynamic and sparse. In Section 4.3, we demonstrate our model’s capability in detecting the underlying correlation pattern as well as improving target prediction performance on the synthetic data.
4.1. General Settings
Similar to the assumption made in the model development section, we generate m sequences consisting of m − 1 sources and one target, sampled at the same time stamps. The input space is simply time. To investigate the source selection capability of DMGP-SS, we assign only sources to be correlated with the target at time t. Besides, a source will remain either correlated or uncorrelated continuously for a certain period of time.
For comparison, we consider three benchmarks:
GP. The target is modeled using a single-output GP, with a squared-exponential covariance function.
MGP-L1. It is a state-of-the-art static method introduced in Wang et al. (2022). MGP-L1 models the target and sources in one MGP model, with the same covariance structure as in Equation (9). The scaling parameters are penalized by an L1 term to achieve source selection. The regularized log-likelihood of this model is
where K is calculated using the static covariance functions in Wang et al. (2022) and λ is a tuning parameter.
DMGP-GP. This is a state-of-the-art nonstationary MGP model, which constructs an LMC model for all outputs and assumes the hyper-parameters follow other GPs (Meng et al. 2021). More details can be found in Online Appendix D. In this model, the covariance for the m outputs is
where is the correlation matrix of m outputs, and is the diagonal matrix with . In this study, we focus on the correlation between the sources and target, which corresponds to the last column of (except the last element ), that is, .
All methods are implemented in Python 3.8 and executed on a computer with an Intel® Core™ i5-7400 CPU with 3.00 GHz and 16 GB RAM. Both GP and MGP-L1 are implemented using GPflow (Matthews et al. 2017). DMGP-GP is implemented using TensorFlow and optimized with ADAM (Kingma and Ba 2015), with a maximum iteration limit of 500. The EM algorithm for DMGP-SS is also based on TensorFlow, and we use stochastic ADAM with four batches in the M-step. We set the kout and kin in Algorithm 1 to 5 and 400, respectively.
For MGP-L1, the weight of the L1 penalty λ is a tuning parameter. For DMGP-GP, we use squared-exponential functions for and . For simplicity, we apply the same tuning parameters for both kernels, that is, the amplitude and length-scale . Those parameters are tuned by cross-validation. In the case of DMGP-SS, the prior sparsity parameter η is set to 0.5. We repeat each case 50 times and report the prediction performance through averaging the results.
4.2. Simulation Cases
The main objective of this section is to demonstrate the effectiveness of our method in capturing the nonstationary and sparse cross-correlation between the sources and the target. For simplicity, we hold the other characteristics constant over time, for example, the smoothness of each output. Specifically, we design two simulation cases with different cross-correlation patterns. The first case involves a piece-wise constant cross-correlation, whereas the second case has a smoothly changing correlation. In each case, the input data are evenly spaced in . The observed data are generated from sine functions with measurement noise .
Case 1. In this case, we define four kinds of source functions:
where is a random phase to make the sampled outputs different from each other, and “%” represents the remainder operation. The term is used to deviate the shape of Y2 from the standard sine function. In each experiment, we generate 4k sources through sampling each kind of function k times, that is, . Specifically, the sources are sampled from Yi. Then, we define the dynamic target output as
In this case, we simulate a piece-wise constant cross-correlation by setting
Therefore, there are three segments, , and . Only the first, the second, and the second and third sources are correlated with the target in the three periods, respectively. The other sources remain uncorrelated with the target at all times.
Case 2. Compared with Case 1, we only change the coefficients into smoothly changing ones in this case. Specifically, we let them vary along sine-cosine trajectories,
In all cases, we set k = 1, 4 to generate 4 and 16 source outputs for each case. In order to compare the prediction performance of different methods, we randomly remove three short sequences of length 10 from , and respectively. These 30 data points are treated as missing data, whereas the others are used as training data.
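For readers who want a qualitatively similar setting, the sketch below generates phase-shifted sinusoidal sources and a target that switches which sources it combines over three equal segments; the exact source functions, coefficients, and change times used in the paper are elided above, so every constant here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 15.0, 130)
noise_sd = 0.1

# Four illustrative sources: phase-shifted sines with different frequencies.
phases = rng.uniform(0.0, 2.0 * np.pi, size=4)
sources = np.stack([np.sin(2.0 * np.pi * t / (2.0 + i) + phases[i]) for i in range(4)])

# Piece-wise constant combination weights over three equal segments:
# only source 1, then only source 2, then sources 2 and 3 drive the target.
weights = np.zeros((4, t.size))
seg = t.size // 3
weights[0, :seg] = 1.0
weights[1, seg:2 * seg] = 1.0
weights[1, 2 * seg:] = 0.7
weights[2, 2 * seg:] = 0.7

target = (weights * sources).sum(axis=0) + noise_sd * rng.standard_normal(t.size)
observations = sources + noise_sd * rng.standard_normal(sources.shape)
```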
4.3. Simulation Results
To begin with, we demonstrate that the proposed DMGP-SS is capable of capturing the dynamic and sparse correlation between the sources and the target. Figure 2 illustrates the estimated for MGP-L1, the for DMGP-SS, and the estimated for DMGP-GP in Cases 1 and 2 with four sources. Figure 3 visualizes the sources and target prediction results in Case 1 with four sources. DMGP-SS is implemented with the hard and soft slab priors for Cases 1 and 2, respectively. Note that the values of and are not identical, because is a linear combination weight rather than the correlation parameter in MGP.
[Figure 2]
Notes. (a) Case 1; (b) Case 2. The first column is the true , the second and fourth columns are the estimated for MGP-L1 and DMGP-SS, respectively, and the third column shows the estimated for DMGP-GP.
[Figure 3]
Note. The shaded region represents the 99% confidence interval.
Overall, DMGP-SS successfully recovers the true dynamic structural sparsity of the correlation shown in the first column of Figure 2. First, DMGP-SS closely tracks the periods of correlation between each source and the target, achieving a dynamic selection of sources. Because the target’s characteristics do not change abruptly (as shown in Figure 3), it is reasonable that the estimated correlation change points are about 10 time steps before or after the designed change times. Second, the hard slab prior encourages nearly piece-wise constant correlation, whereas the soft slab prior allows smoothly changing correlation. Because of the appropriate selection of sources at different times in Figure 3, the proposed DMGP-SS achieves precise prediction with the lowest prediction uncertainty, which greatly improves the confidence of decision making when using the recovered series. The confidence intervals of the three MGP models differ because the uncorrelated sources “poison” the correlation structure and decrease the influence of the truly correlated sources, resulting in a lower value of the variance-reduction term in the posterior prediction Equation (25).
In contrast, another nonstationary method, DMGP-GP, fails to estimate a sparse structure in the source-target correlation because the GP prior on parameters lacks the shrinkage effect. The proposed DMGP-SS addresses this problem through combining the smooth slab prior and the shrinking spike prior. Regarding MGP-L1, although it can estimate a sparse structure of , the estimated values of nonzero parameters are constant over time and cannot reflect the change of correlation. As a result, it performs even worse than GP in recovering the target output in Figure 3.
Another advantage of DMGP-SS is the adaptive adjustment of the spike-and-slab combination weight, . In our settings, in the spike prior is much larger than in the hard slab prior (or in the soft slab prior), to put more penalty on the correlation sparsity. For example, we set and in Case 1. However, this sparse penalty does not cause significant bias on nonzero because of the automatically adjusted in the EM algorithm. Specifically, we initialize it with 0.99 to barely shrink parameters at the beginning. Then is updated in the E-step of the EM algorithm. Figure 4 shows its estimated value after five iterations. For the correlated sources (e.g., the first source during ), their corresponding is very close to one, so they bear a negligible shrinkage effect from the spike prior. In contrast, for the uncorrelated sources (e.g., the first source after t = 50), is approximately 0.2, implying a substantial shrinkage effect. The value 0.2 can be derived based on Equation (21). For consecutive zero elements, , and , resulting in .

Table 1 summarizes the results of 40 repetitions for each case. DMGP-SS outperforms all the other methods in both cases. Further, increasing the number of sources does not affect its prediction accuracy, demonstrating its robustness in dealing with a moderately large number of sources. MGP-L1 exhibits slightly higher prediction accuracy than GP. The static covariance structure limits its ability to transfer accurate information at all times. Under some circumstances, this limitation will cause a negative transfer effect on the learning of the target (for example, the result shown in Figure 3). DMGP-GP performs better on target prediction than GP and MGP-L1 because of its ability to model dynamic correlation. However, it does not achieve the same prediction accuracy as DMGP-SS in these cases. On the one hand, it lacks the ability to exclude the impact of uncorrelated sources. On the other hand, DMGP-GP is an extension of LMC and uses the same function to model the autocovariance of every output, which makes it unsuitable for our cases where the source covariance functions have four kinds of length-scales. In contrast, the proposed DMGP-SS models each source with separate kernels and latent functions, greatly increasing its flexibility.
Table 1. Prediction Error for Cases 1 and 2

| Outputs | Case 1: GP | Case 1: MGP-L1 | Case 1: DMGP-GP | Case 1: DMGP-SS | Case 2: GP | Case 2: MGP-L1 | Case 2: DMGP-GP | Case 2: DMGP-SS |
|---|---|---|---|---|---|---|---|---|
| 5 | 0.99 (0.26) | 0.89 (0.21) | 0.76 (0.23) | 0.56 (0.13) | 1.12 (0.27) | 1.04 (0.24) | 0.85 (0.26) | 0.66 (0.17) |
| 17 | 0.95 (0.25) | 0.91 (0.21) | 0.75 (0.19) | 0.50 (0.15) | 1.15 (0.26) | 1.02 (0.21) | 0.78 (0.17) | 0.64 (0.16) |

Note. Values in parentheses are standard deviations.
Table 2 lists the computational time of the four methods in Case 1. Between the two nonstationary methods, DMGP-SS requires much less computation time than DMGP-GP. This verifies the analysis in Section 3.3: because of its block-sparse covariance matrix, our model saves substantial time compared with the traditional nonstationary MGP method. In fact, our model is scalable to larger data sizes, as described in Online Appendix E.
Table 2. Computational Time for the Different Methods in Case 1

| Outputs | n | GP | MGP-L1 | DMGP-GP | DMGP-SS |
|---|---|---|---|---|---|
| 5 | 130 | 0.13 | 8.6 | 130 | 73 |
| 17 | 130 | 0.10 | 26 | 2,100 | 300 |
5. Case Study
In this section, we apply DMGP-SS to two cases: human movement signal modeling and control policy optimization. In the first case, movement signals are recorded by sensors attached to different joints, such as hands and feet. As the movement of joints is dynamically correlated across different gestures (Xu et al. 2022), it is reasonable to utilize a nonstationary method to capture the varying correlation and leverage information among the signals of the joints. Regarding control policy optimization, we study a classical reinforcement learning (RL) problem, the mountain-car problem, and evaluate the performance of DMGP-SS in leveraging knowledge between different systems when the environment is nonstationary.
5.1. Movement Signal Modeling
5.1.1. Data Description.
We use the MSRC-12 gesture data set (Fothergill et al. 2012) consisting of sequences of human skeletal body movement. Twenty sensors are distributed across the body, and each sensor can record three-dimensional coordinates of joint positions. The motion signal has a sampling frequency of 30 Hz and an accuracy of . The data set comprises 12 different gestures performed by 30 participants. The experiment involves each participant starting from a standing position, performing one gesture, and returning to the standing state. This process repeats several times continuously within each sample.
To demonstrate the effectiveness of DMGP-SS, we connect the instances of three gestures (“Goggles,” “Shoot,” and “Throw”) performed by the same individual. Figure 5(a) shows the snapshots of the standing position and the selected gestures. In the first two gestures, the participant stretches both arms in front of him to perform searching or shooting motions. In the third gesture, the participant only uses his left arm to make an overarm throwing movement. In these gestures, the main acting joints are hands, wrists, and elbows, where the trajectories of hands and wrists are almost identical. Therefore, we select the movement signals of four joints (left wrist, left elbow, right wrist, and right elbow) as 12 outputs. We choose the z-coordinate movement of the left elbow as a target output, while using the remaining 11 movement signals as sources.
[Figure 5]
Notes. The snapshots of “Stand” (initial state), “Goggles” (Put on vision goggles), “Shoot” (shoot a pistol), and “Throw” (throw an object). The 20 joints are represented by circles. We mark out the four selected joints for our study, where “L” and “R” represent “left side” and “right side,” respectively, and “E” and “W” represent the elbow and wrist joints, respectively. (b) The 12 movement signals of the selected joints in the three gestures’ data, where the red signal is taken as the target signal and the others are the source signals. In the label of the vertical axis, “x,” “y,” and “z” represent different coordinates. Each signal has a vertical range of , with the horizontal ticks representing time indices.
We select two 120-frame-long instances for each gesture and down-sample each instance to 30 frames. Therefore, there are 180 points for each output. To eliminate the difference in initial coordinate values, we reset the initial three-dimensional coordinate value to (0, 0, 2) across different recordings. Additionally, we rescale all outputs to . Figure 5(b) displays the 12 outputs and the change of joints’ positions.
5.1.2. Results.
Intuitively, the cross-correlation between the source and target signals should remain constant within a single gesture, so a hard slab prior is used in DMGP-SS. All other settings for this case are identical to those used in the simulation studies. We again simulate consecutive missing data for the target: from the 60-point-long time-series of each gesture, we randomly remove a sequence of length 10 as missing data.
First of all, Figure 6(a) shows the estimated correlation between the sources and the target. MGP-L1 selects six sources (the third “R-W-z,” the fourth “R-E-x,” the sixth “R-E-z,” the seventh “L-W-x,” the ninth “L-W-z,” and the 11th “L-E-y”) as the correlated signals for the whole time period, in which the ninth source has the strongest correlation. DMGP-SS selects the sixth source (“R-E-z”) and the ninth source (“L-W-z”) as correlated sources when and , respectively. DMGP-GP does not provide a sparse estimation of cross-correlation. Among the three methods, the proposed DMGP-SS accurately captures the underlying physical movements of each gesture. In the “Goggles” and “Shoot” gestures, both elbows have almost the same trajectory. In the “Throw” gesture, the left elbow’s movement is highly correlated with that of the left wrist. These findings align well with the signals shown in Figure 5(b). In contrast, limited by the static correlation structure, MGP-L1 can only select some sources and force them to be correlated with the target at all times, but no such persistently correlated signals exist in this dynamic environment. For DMGP-GP, the results do not provide an intuitive understanding of the joints’ relationships.
[Figure 6]
Notes. (a) Correlation estimation results for the data of three gestures. The first and third rows are the estimated for MGP-L1 and DMGP-SS, respectively, and the second row shows the estimated for DMGP-GP. (b) Recovery results of four methods for the movement signal of the right wrist. The shaded region represents the 99% confidence interval.
Figure 6(b) displays the recovered target signal in one experiment. Notably, DMGP-SS accurately recovers the missing data with high precision. As in the numerical studies, it also yields predictions with the lowest uncertainty, because it selects the most correlated sources at each time; such high correlation improves the confidence in the prediction results and, in turn, in the subsequent decision making. Conversely, the predictions of MGP-L1 and DMGP-GP display an obvious bias in the first gap and higher uncertainty across all gaps.
We further repeat the experiments 36 times. Table 3 compares the prediction accuracy of the four methods in terms of both the mean absolute error (MAE) and the continuous-ranked-probability score (CRPS). The CRPS measure is a widely used metric to evaluate the probabilistic forecast for a real-valued outcome (Hersbach 2000); it is computed from the predicted output and the cumulative density function (CDF) of the predicted distribution, and a low CRPS value suggests that the predicted posteriors are concentrated around the true value, indicating a high probabilistic forecast performance. (A sketch of evaluating the CRPS under a Gaussian predictive distribution follows Table 3.) As expected, DMGP-SS outperforms the other methods because of its ability to capture the dynamic and sparse correlation accurately. MGP-L1 performs better than GP, benefiting from the information borrowed from the other sources. However, it cannot model a dynamic correlation, resulting in lower prediction accuracy than DMGP-SS. Regarding DMGP-GP, although it captures the change of correlations between the target and some sources, its prediction accuracy is even lower than that of GP. This result may be attributed to the fact that nonsparse correlations lead to potential negative transfer effects. Besides, because the sources’ smoothness is heterogeneous, it is inappropriate to use the same autocovariance function to model all outputs.

Table 3. Prediction Error for Real Case

| | GP | MGP-L1 | DMGP-GP | DMGP-SS |
|---|---|---|---|---|
| MAE-mean | 0.43 | 0.22 | 0.50 | 0.18 |
| MAE-std. | (0.12) | (0.10) | (0.25) | (0.08) |
| CRPS-mean | 0.30 | 0.16 | 0.41 | 0.15 |

Note. The values in parentheses are standard deviations.
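For reference, when the predictive distribution at a point is Gaussian, as in the GP-based methods compared here, the CRPS has a standard closed form that can be evaluated directly; the example values below are illustrative.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS when the predictive distribution is N(mu, sigma^2).

    CRPS compares the predictive CDF with the step function at the realized
    value y; for a Gaussian forecast it reduces to the expression below."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

# A prediction centered near the truth with small variance scores lower (better).
print(crps_gaussian(y=1.0, mu=1.05, sigma=0.1))   # small CRPS
print(crps_gaussian(y=1.0, mu=1.5, sigma=0.5))    # larger CRPS
```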
5.2. Control Policy Optimization
In reinforcement learning problems, there are five important elements for the agent: state s, action a, reward r, value V, and policy W. Starting from one state s, the agent takes an action a, transitions into a new state u, and gets an immediate reward r(u) from the environment. In general, a reinforcement learning framework includes two primary procedures:
Model estimation: estimating the state transition function U(s, a) based on the observed transition samples .
Policy iteration: iteratively estimating the state value V(s) (the long-term expected reward) for each state and improving the control policy W(s).
More details can be found in the works on GP-based reinforcement learning (Kuss and Rasmussen 2003, Verstraeten et al. 2020).
In this study, we employ the well-known reinforcement learning problem, the mountain-car problem, to demonstrate the application of our model in decision making. Figure 7 illustrates such a problem. A car begins in a valley and aims to reach a goal position on the rightmost hill. Because of the steep slope of the hill, the car cannot directly accelerate to the goal position. Instead, it needs to drive up the opposite side, turn back, and accelerate to reach the goal position. In the system, the agent state is described by the position and the velocity of the car, that is, . The agent action is a horizontal force . The car starts at the state with and , aiming to reach the goal state . The reward function is the probability density function of . The dynamic equation of this system is approximated by

where P is the horizontal power unit and G is the vertical force unit.
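As an illustration of these dynamics, the sketch below implements the common discrete-time mountain-car transition, in which P scales the action and G scales the gravity term cos(3·position); the constants are the usual textbook values and may differ from the environment factors used in this study.

```python
import numpy as np

def mountain_car_step(position, velocity, action, P=0.001, G=0.0025,
                      pos_bounds=(-1.2, 0.6), vel_bounds=(-0.07, 0.07)):
    """One transition of the deterministic mountain-car dynamics:
    velocity' = velocity + action * P - G * cos(3 * position),
    position' = position + velocity'."""
    velocity = np.clip(velocity + action * P - G * np.cos(3.0 * position), *vel_bounds)
    position = np.clip(position + velocity, *pos_bounds)
    return position, velocity

# Starting in the valley with zero velocity and pushing right repeatedly
# does not reach the goal directly; the car must first swing back and forth.
p, v = -0.5, 0.0
for _ in range(100):
    p, v = mountain_car_step(p, v, action=1.0)
print(p, v)   # the car has not reached the goal position of 0.6
```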
We consider the control problem involving a car in a nonstationary environment with limited transition samples. During , the target car runs in an environment with and we obtain 20 random samples. However, at t = 20, an unknown change occurs, altering the environment factors to . We sample another 20 random samples during . After t = 40, we start to control the car to reach the goal position, and the environment does not change any more. Given that these samples are too limited to build an accurate transition model, we transfer information from two historical source data sets, each consisting of 200 samples from stationary environments with and , respectively. We can see that the target environment is close to the first source environment before t = 20, and is close to the second one after that.
Algorithm 2 summarizes the workflow of reinforcement learning, where we employ the DMGP-SS as a transition model.
Algorithm 2
Input: Source transition samples , target transition samples , reward function , 256 support points .
1: Initialize: policy random policy, value .
2: Fit two DMGP-SS models for the state transition using both the source and the target transition samples, one for the position state and the other for the velocity state.
3: Fit a GP for the state value model using .
4: Improve policy and value model iteratively based on the state-transition model until converges.
5: Execute the optimized policy starting at the state :
For simplicity, we model the source transition data with stationary GPs in DMGP-SS. The sources and the target are expressed as
Here, u is the position or velocity state to which the agent transitioned from a state s after taking an action a. After fitting this model to both the target and source samples, we train a GP for the value model with even-spaced support points and rewards . Based on the transition model , we iteratively improve the policy and value model until the prediction of converges. Details on the policy improvement procedure can be found in Kuss and Rasmussen (2003) and Verstraeten et al. (2020). Although we focus on the offline setting in this case, it is worth noting that this framework can be readily extended to accommodate an online setting, wherein the training sample consists of the visited states and the transition models are updated every few steps.
We choose two reinforcement learning benchmark methods (Kuss and Rasmussen 2003, Verstraeten et al. 2020) with the stationary GP and MGP as the state transition model, respectively. The maximum execution steps are 600 for each method. In Table 4, we report the mean of absolute distances to the goal state for the three methods. The DMGP-SS-based control policy has the shortest average distance to the goal state in 600 steps. Specifically, Figure 8 compares the position of the car controlled by the three policies. With DMGP-SS as the transition model, the car reaches the goal position and stays there after about 250 time steps. However, the other methods cannot find a good policy to reach the goal position within 600 moves, because their stationary transition models cannot account for the environment change, which makes the policy iteration hard to converge.
Table 4. Predictive Accuracy on State Transition and the Mean of Absolute Distances to the Goal State in the Mountain-Car Case

| Metric | RL-GP | RL-MGP | RL-DMGP-SS |
|---|---|---|---|
| The mean of absolute distances to the goal position | 0.98 | 0.83 | 0.27 |
| Predictive MAE on velocity transition () | 3.8 | 5.8 | 0.51 |
| Predictive MAE on position transition () | 3.6 | 4.9 | 8.2 |
[Figure 8]
Notes. (a) RL-GP, (b) RL-MGP, and (c) RL-DMGP-SS. The blue points mark the positions every 50 steps.
In Table 4, we further compare the predictive MAE on state transition for the three methods. The proposed method has a significantly lower prediction error on the velocity state transition than the other methods, because it can capture the change of environment and transfer information from the related sources. Figure 9 illustrates the estimated from DMGP-SS trained on the velocity transition data. We can see that the proposed method successfully detects that the correlation between the sources and the target changes at t = 20. Therefore, during the policy improvement stage, it can leverage information from the similar source (the second one) and avoid negative transfer from the uncorrelated source (the first one). Regarding the prediction error on the position transition, although RL-DMGP-SS has a higher MAE than the other benchmarks, the difference is minor considering the position range . Therefore, our method can provide the best control policy.

6. Conclusion
This paper proposes a flexible nonstationary multioutput Gaussian process for modeling multivariate data in transfer learning. The novelty of our approach lies in its ability to capture dynamic and sparse cross-correlations between sources and targets. We achieve this by allowing correlation parameters to follow a spike-and-slab prior, where the slab prior ensures correlation variation over time, and the spike prior encourages parameters to shrink to zero, eliminating negative transfer effects from uncorrelated sources. The ratio of these two priors is automatically adjusted in the proposed EM algorithm, preventing the shrinkage effect on nonzero correlation parameters. Through the experiments on simulation and a human gesture data set, we demonstrate that our method is well suited for both capturing nonstationary characteristics and mitigating negative transfer.
The proposed data-driven method provides a powerful tool for researchers and engineers to select the most informative sources from which to transfer knowledge. Beyond high-dimensional time-series modeling, our approach could also find applications in sequential decision making and change-point detection. For instance, transfer learning has arisen to handle the critical challenge of sparse feedback in RL, a popular framework for solving sequential decision-making problems. However, negative transfer is still a notable challenge for multitask reinforcement learning, and it is risky to naively share information across all tasks (Yang et al. 2020, Zhu et al. 2023). Therefore, we embedded our model in the offline RL transfer framework to automatically select correlated sources to share knowledge in a nonstationary environment (Padakandla et al. 2020). In the future, we can further develop our method to handle online RL tasks. Beyond that, the estimated dynamic correlation among outputs can help us understand the structure of a time series even with some missing data, as well as detect structural change points for subsequent decision making.
Many extensions are possible for our model. Although the proposed methodology is flexible enough to capture complex and dynamic correlations, scaling it up to large data sets is computationally challenging. Possible solutions include utilizing a sparse approximation for the covariance matrix or developing a more efficient optimization algorithm.
In addition, the proposed EM optimization algorithm solely provides estimated values of the model parameters without incorporating any uncertainty measurement. To address this issue, we can use variational inference to approximate the true posterior distribution of parameters, thereby capturing the inherent uncertainty associated with these parameters. The approximated uncertainty would be propagated into the prediction at new points.
References
- (2022) Physics-constrained Bayesian optimization for optimal actuators placement in composite structures assembly. IEEE Trans. Automation Sci. Engrg. 20(4):2772–2783.
- (2011) Computationally efficient convolved multiple output Gaussian processes. J. Machine Learn. Res. 12:1459–1500.
- (2022) Hybrid modeling of regional Covid-19 transmission dynamics in the U.S. IEEE J. Selected Topics Signal Processing 16(2):261–275.
- (2004) Dependent Gaussian processes. Saul L, Weiss Y, Bottou B, eds. Advances in Neural Information Processing Systems, vol. 17 (MIT Press, Cambridge, MA), 217–224.
- (2021) Short-term traffic forecasting by mining the non-stationarity of spatiotemporal patterns. IEEE Trans. Intelligent Transportation Systems 22(10):6365–6383.
- (2009) Autoregressive-model-based missing value estimation for DNA microarray time series data. IEEE Trans. Inform. Tech. Biomedicine 13(1):131–137.
- (2012) Random Field Models in Earth Sciences (Courier Corporation, North Chelmsford, MA).
- (2013) Deep Gaussian processes. Carvalho CM, Ravikumar P, eds. Proc. 16th Internat. Conf. Artificial Intelligence Statist., Proc. Machine Learn. Res., vol. 31 (JMLR.org, Scottsdale, AZ), 207–215.
- (2022) Fast and scalable spike and slab variable selection in high-dimensional Gaussian processes. Camps-Valls G, Ruiz FJR, Valera I, eds. Proc. 25th Internat. Conf. Artificial Intelligence Statist., Proc. Machine Learn. Res., vol. 151 (JMLR.org), 7976–8002.
- (2017) Vital signs and their cross-correlation in sepsis and NEC: A study of 1,065 very-low-birth-weight infants in two NICUs. Pediatric Res. 81(2):315–321.
- (2012) Instructing people for training gestural interactive systems. Konstan JA, Chi H, Hook K, eds. Proc. SIGCHI Conf. Human Factors Comput. Systems (Association for Computing Machinery, New York), 1737–1746.
- (2018) Bayesian optimization. Gel E, Ntaimo L, eds. Recent Advances in Optimization and Modeling of Contemporary Problems. Tutorials in Operations Research (INFORMS, Catonsville, MD), 255–278.
- (2013) Multivariate Gaussian process emulators with nonseparable covariance structures. Technometrics 55(1):47–56.
- (2012) Learning non-stationary space-time models for environmental monitoring. Proc. 26th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 288–294.
- (2004) Nonstationary multivariate process modeling through spatially varying coregionalization. Test 13(2):263–312.
- (1993) Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88(423):881–889.
- (1992) Linear coregionalization model: Tools for estimation and choice of cross-variogram matrix. Math. Geology 24(3):269–286.
- (2020) Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences (CRC Press, Boca Raton, FL).
- (2008) Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Statist. Assoc. 103(483):1119–1130.
- (2017) Nonstationary Gaussian process models using spatial hierarchical clustering from finite differences. Technometrics 59(1):93–101.
- (2016) Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. Gretton A, Robert CC, eds. Proc. 19th Internat. Conf. Artificial Intelligence Statist., Proc. Machine Learn. Res., vol. 51 (JMLR.org, Cadiz, Spain), 732–740.
- (2000) Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather Forecasting 15(5):559–570.
- (2021) Nonlinear online multioutput Gaussian process for multistream data informatics. IEEE Trans. Indust. Informatics 18(6):3885–3893.
- (2021) Inducing sparsity and shrinkage in time-varying parameter models. J. Bus. Econom. Statist. 39(3):669–683.
- 2005) Spike and slab variable selection: Frequentist and Bayesian strategies. Ann. Statist. 33(2):730–773.Google Scholar (
- 2014) Time-varying sparsity in dynamic regression models. J. Econometrics 178(2):779–793.Google Scholar (
- 2015) Adam: A method for stochastic optimization. Bengio Y, LeCun Y, eds. Proc. 3rd Internat. Conf. Learn. Representation (ICLR, San Diego).Google Scholar (
- 2022) Deep Gaussian process models for integrating multifidelity experiments with nonstationary relationships. IISE Trans. 54(7):686–698.Google Scholar (
- 2018) Nonparametric modeling and prognosis of condition monitoring signals using multivariate gaussian convolution processes. Technometrics 60(4):484–496.Google Scholar (
- 2003) Gaussian processes in reinforcement learning. Thrun S, Saul L, Schölkopf B, eds. Advances in Neural Information Processing Systems (NIPS 2003), vol. 16 (Whistler, British Columbia, Canada), 751–758.Google Scholar (
- 2023) Partitioned active learning for heterogeneous systems. J. Comput. Inform. Sci. Engrg. 23(4):041009-1–041009-11.Google Scholar (
- 2020) DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Systems Appl. 143:113082-1–113082-12.Google Scholar (
- 2022) Non-stationary transformers: Exploring the stationarity in time series forecasting. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Advances in Neural Information Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 9881–9893.Google Scholar (
- 2017) GPflow: A Gaussian process library using TensorFlow. J. Machine Learn. Res. 18(40):1–6.Google Scholar (
- 2021) Nonstationary multivariate Gaussian processes for electronic health records. J. Biomedical Informatics 117:103698-1–103698-11.Google Scholar (
- 1990) Efficient memory-based learning for robot control. Technical Report Number 209, University of Cambridge, Cambridge, UK.Google Scholar (
- 2003) Nonstationary covariance functions for Gaussian process regression. Thrun S, Saul L, Schölkopf B, eds. Advances in Neural Information Processing Systems, vol. 16 (MIT Press, Cambridge, MA), 273–280.Google Scholar (
- 2020) Reinforcement learning algorithm for non-stationary environments. Appl. Intelligence 50(11):3590–3606.Google Scholar (
- 2009) A survey on transfer learning. IEEE Trans. Knowledge Data Engrg. 22(10):1345–1359.Google Scholar (
- 2022) Jump Gaussian process model for estimating piecewise continuous regression functions. J. Machine Learn. Res. 23(278):1–37.Google Scholar (
- 2023) Stochastic variational inference for scalable non-stationary gaussian process regression. Statist. Comput. 33(2):44-1–44-21.Google Scholar (
- 2018) Deep state space models for time series forecasting. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates Inc., Red Hook, NY), 7796–7805.Google Scholar (
- 2014) EMVS: The EM approach to Bayesian variable selection. J. Amer. Statist. Assoc. 109(506):828–846.Google Scholar (
- 2021) Dynamic variable selection with spike-and-slab process priors. Bayesian Anal. 16(1):233–269.Google Scholar (
- 2019) Multi-output Gaussian processes for crowdsourced traffic data imputation. IEEE Trans. Intelligent Transportation Systems 20(2):594–603.Google Scholar (
- 2012) Spike-and-slab priors for function selection in structured additive regression models. J. Amer. Statist. Assoc. 107(500):1518–1532.Google Scholar (
- 2017) Modeling nonstationarity in space and time. Biometrics 73(3):759–768.Google Scholar (
- 2023) Multi-task Gaussian process upper confidence bound for hyperparameter tuning and its application for simulation studies of additive manufacturing. IISE Trans. 55(5):496–508.Google Scholar (
- 2003) A multivariate state space approach for urban traffic flow modeling and prediction. Transportation Res. Part C Emerging Tech. 11(2):121–135.Google Scholar (
- 2018) Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial Statist. 28:59–78.Google Scholar (
- 2020) Fleet control using coregionalized Gaussian process policy iteration. De Giacomo G, Catala A, Dilkina B, Milano M, Barro S, Bugarin A, Lang J, eds. Proc. 24th Eur. Conf. Artificial Intelligence ECAI 2020 (IOS Press, Amsterdam), 1571–1578.Google Scholar (
- 2020b) Deep learning for spatio-temporal data mining: A survey. IEEE Trans. Knowledge Data Engrg. 34(8):3681–3700.Google Scholar (
- 2020a) Non-separable non-stationary random fields. Daume H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn., Proc. Machine Learn. Res., vol. 119 (JMLR.org), 9887–9897.Google Scholar (
- 2022) Regularized multi-output Gaussian convolution process with domain adaptation. IEEE Trans. Pattern Anal. Machine Intelligence 45(5):6142–6156.Google Scholar (
- 2019) Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. Davis L, Torr P, Zhu SC, eds. Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (IEEE Computer Society, Washington, DC), 9154–9162.Google Scholar (
- 2023) Transformers in time series: A survey. Proc. 32nd Internat. Joint Conf. Artificial Intelligence (IJCAI, Macao, SAR), 6778–6786.Google Scholar (
- 2006) Gaussian Processes for Machine Learning, vol. 2 (MIT Press, Cambridge, MA).Google Scholar (
- 2016) A distributed spatial–temporal weighted model on MapReduce for short-term traffic flow forecasting. Neurocomputing 179:246–263.Google Scholar (
- 2022) Online structural change-point detection of high-dimensional streaming data via dynamic sparse subspace learning. Technometrics 65(1):19–32.Google Scholar (
- 2020) Multi-task reinforcement learning with soft modularization. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. NIPS ‘20 Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 4767–4777.Google Scholar (
- 2021) Implementing transfer learning across different datasets for time series forecasting. Pattern Recognition 109:106717-1–106717-12.Google Scholar (
- 2018) A novel positive transfer learning approach for telemonitoring of Parkinson’s disease. IEEE Trans. Automation Sci. Engrg. 16(1):180–191.Google Scholar (
- 2022) Detection of local differences in spatial characteristics between two spatiotemporal random fields. J. Amer. Statist. Assoc. 117(537):291–306.Google Scholar (
- 2021) A survey on multi-task learning. IEEE Trans. Knowledge Data Engrg. 34(12):5586–5609.Google Scholar (
- 2016) Monitoring wafers’ geometric quality using an additive gaussian process model. IIE Trans. 48(1):1–15.Google Scholar (
- 2021) Dynamic multivariate functional data modeling via sparse subspace learning. Technometrics 63(3):370–383.Google Scholar (
- 1997) Algorithm 778: L-bfgs-b: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Software 23(4):550–560.Google Scholar (
- 2023) Transfer learning in deep reinforcement learning: A survey. IEEE Trans. Pattern Anal. Machine Intelligence 45(11):13344–13362.Google Scholar (
- 2023) Joint spatio-temporal precoding for practical non-stationary wireless channels. IEEE Trans. Comm. 71(4):2396–2409.Google Scholar (