Bayesian marginal analysis
Motivation and general form of prior
Berger, De Oliveira, and Sansó have shown that the ML estimation of Kriging models often gives estimated ranges \(\widehat{\theta}_k = 0\) or \(\widehat{\theta}_k = \infty\), leading to poor predictions. Although finite positive bounds can be imposed in the optimization to address this issue, such bounds are quite arbitrary. The same authors have shown that one can instead replace the ML estimates by the marginal posterior mode in a Bayesian analysis. Provided that suitable priors are used, the estimated ranges are then guaranteed to be both finite and positive: \(0 < \widehat{\theta}_k < \infty\).
Note In libKriging the Bayesian approach is used only to provide alternatives to the ML estimation of the range or correlation parameters; a full Bayesian inference on these parameters is not performed. Rather than the profile likelihood, a so-called marginal likelihood is used.
In this section we switch to a Bayesian style of notation. The vector of parameters is formed by three blocks: the vector \(\boldsymbol{\theta}\) of correlation ranges, the GP variance \(\sigma^2\), and the vector \(\boldsymbol{\beta}\) of trend parameters. A major concern is the elicitation of the prior density \(\pi(\boldsymbol{\theta}, \, \sigma^2, \,\boldsymbol{\beta})\).
Objective priors of Gu et al
A natural idea is that the prior should not provide information about \(\boldsymbol{\beta}\), implying the use of improper probability densities. With the factorization
\[
\pi(\boldsymbol{\theta}, \, \sigma^2, \, \boldsymbol{\beta}) = \pi(\boldsymbol{\theta}, \, \sigma^2) \, \pi(\boldsymbol{\beta} \, \vert \, \boldsymbol{\theta}, \, \sigma^2),
\]
a further assumption is that the trend parameter vector \(\boldsymbol{\beta}\) is a priori independent of the covariance parameters \(\boldsymbol{\theta}\) and \(\sigma^2\), and that the prior for \(\boldsymbol{\beta}\) is an improper prior with constant density
\[
\pi(\boldsymbol{\beta} \, \vert \, \boldsymbol{\theta}, \, \sigma^2) = \pi(\boldsymbol{\beta}) \propto 1.
\]
Then the problem boils down to the choice of the joint prior \(\pi(\boldsymbol{\theta}, \, \sigma^2)\).
In the case where no nugget or noise is used, an interesting choice is
\[
\pi(\boldsymbol{\theta}, \, \sigma^2) = \frac{\pi(\boldsymbol{\theta})}{(\sigma^2)^{a}} \tag{1}
\]
with \(a > 0\). With this specific form, the result of the integration of the likelihood or of the posterior density with respect to \(\sigma^2\) and \(\boldsymbol{\beta}\) is known in closed form.
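As a hedged sketch of this closed form (consistent with the Kriging row of the table further below, up to constants), integrating the Gaussian likelihood against this prior gives

```latex
\iint p(\mathbf{y} \,\vert\, \boldsymbol{\theta},\, \sigma^2,\, \boldsymbol{\beta})\,
      \frac{\pi(\boldsymbol{\theta})}{(\sigma^2)^{a}}\,
      \mathrm{d}\boldsymbol{\beta}\, \mathrm{d}\sigma^2
\;\propto\;
\pi(\boldsymbol{\theta})\,
\lvert \mathbf{R} \rvert^{-1/2}\,
\lvert \mathbf{F}^\top \mathbf{R}^{-1} \mathbf{F} \rvert^{-1/2}\,
\left(S^2\right)^{-(n - p + 2a - 2)/2} .
```

Taking \(-2\) times the logarithm of the \(\boldsymbol{\theta}\)-dependent factor recovers the marginal log-likelihood given in the table below.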
Fit: Bayesian marginal analysis
In the Kriging case, the marginal likelihood (a.k.a. integrated likelihood) for \(\boldsymbol{\theta}\) is obtained by marginalizing the GP variance \(\sigma^2\) and the trend parameter vector \(\boldsymbol{\beta}\) out of the likelihood according to
\[
L_{\texttt{marg}}(\boldsymbol{\theta};\, \mathbf{y}) := p(\mathbf{y} \, \vert \, \boldsymbol{\theta}) = \iint p(\mathbf{y} \, \vert \,\boldsymbol{\theta}, \, \sigma^2, \, \boldsymbol{\beta}) \, \pi(\sigma^2, \, \boldsymbol{\beta} \, \vert \, \boldsymbol{\theta}) \, \text{d}\sigma^2 \, \text{d}\boldsymbol{\beta},
\]
where \(p(\mathbf{y} \, \vert \,\boldsymbol{\theta}, \, \sigma^2, \, \boldsymbol{\beta})\) is the likelihood \(L(\boldsymbol{\theta}, \, \sigma^2, \, \boldsymbol{\beta};\, \mathbf{y})\). We get a closed-form expression given in the table below. Now for a prior having the form (1), the marginal posterior factorizes as
\[
p_{\texttt{marg}}(\boldsymbol{\theta} \, \vert \, \mathbf{y}) \propto \pi(\boldsymbol{\theta}) \, L_{\texttt{marg}}(\boldsymbol{\theta};\, \mathbf{y}).
\]
In the NuggetKriging case, the same approach can be used, but the parameter used for the nugget is not marginalized out, so it remains an argument of the marginal likelihood. In libKriging the nugget parameter is taken as \(\alpha := \sigma^2 / (\sigma^2 + \tau^2)\) where \(\tau^2\) is the nugget variance. We then have the factorization
\[
p_{\texttt{marg}}(\boldsymbol{\theta}, \, \alpha \, \vert \, \mathbf{y}) \propto \pi(\boldsymbol{\theta}, \, \alpha) \, L_{\texttt{marg}}(\boldsymbol{\theta}, \, \alpha;\, \mathbf{y}).
\]
Note The marginal likelihood differs from the frequentist notion attached to this name. It also differs from the marginal likelihood as often used in the GP community, e.g. in Rasmussen and Williams [RW06], where the marginalization is over the values \(\boldsymbol{\zeta}\) of the unobserved GP, hence is nothing but the likelihood described in this section.
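As an illustration of the nugget parameterization above, the following sketch (plain NumPy, not libKriging code; the kernel and all names are illustrative) checks the relation \(\eta = (1-\alpha)/\alpha = \tau^2/\sigma^2\) and the identity \(\sigma^2 \mathbf{R} + \tau^2 \mathbf{I} = (\sigma^2 + \tau^2)\,\mathbf{R}_\alpha\), assuming, as the table below suggests, that \(\mathbf{R}_\alpha\) denotes \(\alpha \mathbf{R} + (1 - \alpha)\mathbf{I}_n\).

```python
import numpy as np

# Illustrative sketch, not libKriging code. It assumes
# R_alpha = alpha * R + (1 - alpha) * I, so that the full covariance
# factorizes as sigma^2 R + tau^2 I = (sigma^2 + tau^2) R_alpha.
sigma2, tau2 = 2.0, 0.5
alpha = sigma2 / (sigma2 + tau2)   # libKriging nugget parameter
eta = (1.0 - alpha) / alpha        # variance ratio of Gu [Gu19]

x = np.linspace(0.0, 1.0, 5)
R = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.3)   # toy correlation matrix
R_alpha = alpha * R + (1.0 - alpha) * np.eye(5)
```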
Table of marginal likelihood functions
The following table gives the marginal log-likelihood for the different forms of Kriging models. The sum of squares \(S^2\) is given by \(S^2 := \mathbf{e}^\top \mathring{\mathbf{C}}^{-1} \mathbf{e}\) where \(\mathbf{e} := \mathbf{y} - \mathbf{F}\widehat{\boldsymbol{\beta}}\) and \(\mathring{\mathbf{C}}\) is the correlation matrix (equal to \(\mathbf{R}\) or \(\mathbf{R}_\alpha\)). The sum of squares \(S^2\) can be expressed as \(S^2 = \mathbf{y}^\top \mathring{\mathbf{B}} \mathbf{y}\) where \(\mathring{\mathbf{B}} := \sigma^2 \mathbf{B}\) is a scaled version of the Bending Energy matrix \(\mathbf{B}\).
Kriging
\(-2 \ell_{\texttt{marg}}(\boldsymbol{\theta}) = \log \lvert\mathbf{R}\rvert + \log\lvert \mathbf{F}^\top \mathbf{R}^{-1}\mathbf{F}\rvert + (n - p + 2a - 2) \log S^2\)

NuggetKriging
\(-2 \ell_{\texttt{marg}}(\boldsymbol{\theta}, \, \alpha) = \log \lvert\mathbf{R}_\alpha\rvert + \log\lvert \mathbf{F}^\top \mathbf{R}_\alpha^{-1}\mathbf{F}\rvert + (n - p + 2a - 2) \log S^2\)

NoiseKriging
not used
It can be interesting to compare this table with the table of profile log-likelihoods.
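As a hedged numerical sketch (plain NumPy, not libKriging's implementation; the kernel, data, and variable names are illustrative), the Kriging row of the table and the bending-energy identity \(S^2 = \mathbf{y}^\top \mathring{\mathbf{B}} \mathbf{y}\) can be checked as follows:

```python
import numpy as np

# Illustrative check of the Kriging marginal log-likelihood
#   -2 l_marg = log|R| + log|F' R^{-1} F| + (n - p + 2a - 2) log S^2
# and of the identity S^2 = y' Bcirc y, where the scaled bending energy
# matrix is Bcirc = R^{-1} - R^{-1} F (F' R^{-1} F)^{-1} F' R^{-1}.
rng = np.random.default_rng(0)
n, p, a = 8, 1, 1.0
x = np.linspace(0.0, 1.0, n)
theta = 0.3
R = np.exp(-np.abs(x[:, None] - x[None, :]) / theta)   # toy 1-D kernel
F = np.ones((n, p))                                    # constant trend
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(n)

Ri = np.linalg.inv(R)
FtRiF = F.T @ Ri @ F
beta_hat = np.linalg.solve(FtRiF, F.T @ Ri @ y)        # GLS trend estimate
e = y - F @ beta_hat                                   # GLS residuals
S2 = float(e @ Ri @ e)

Bcirc = Ri - Ri @ F @ np.linalg.solve(FtRiF, F.T @ Ri)  # sigma^2 * B
m2ll = (np.linalg.slogdet(R)[1] + np.linalg.slogdet(FtRiF)[1]
        + (n - p + 2 * a - 2) * np.log(S2))
```

Minimizing \(-2\ell_{\texttt{marg}}\) over \(\boldsymbol{\theta}\) (here a scalar) would give the marginal posterior mode for a flat \(\pi(\boldsymbol{\theta})\).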
Reference prior for the correlation parameters [not implemented yet]
For the case when no nugget or noise is used, Berger, De Oliveira, and Sansó define the reference joint prior for \(\boldsymbol{\theta}\) and \(\sigma^2\) in relation to the integrated likelihood where only the trend parameter \(\boldsymbol{\beta}\) is marginalized out, that is \(p(\mathbf{y} \, \vert \, \boldsymbol{\theta}, \, \sigma^2) = \int p(\mathbf{y} \, \vert \, \boldsymbol{\theta}, \, \sigma^2, \, \boldsymbol{\beta}) \, \text{d}\boldsymbol{\beta}\), and they show that it has the form
\[
\pi_{\texttt{ref}}(\boldsymbol{\theta}, \, \sigma^2) = \frac{\pi_{\texttt{ref}}(\boldsymbol{\theta})}{\sigma^2}
\]
where \(\pi_{\texttt{ref}}(\boldsymbol{\theta})\) no longer depends on \(\sigma^2\).
We now give some hints on the derivation and the computation of the reference prior. Let \(\mathbf{I}^\star(\boldsymbol{\theta},\,\sigma^2)\) be the \((d+1) \times (d+1)\) Fisher information matrix based on the marginal log-likelihood \(\ell_{\texttt{marg}}(\boldsymbol{\theta},\,\sigma^2) = \log L_{\texttt{marg}}(\boldsymbol{\theta},\,\sigma^2)\).
One can show that this information matrix can be expressed by using the \(n \times n\) symmetric matrices \(\mathbf{N}_k := \mathbf{L}^{-1} \left[\partial_{\theta_k} \mathbf{R}\right] \mathbf{L}^{-\top}\) where \(\mathbf{L}\) is the lower Cholesky root of the correlation matrix, according to
By multiplying by \(\sigma^2\) both the last row and the last column of \(\mathbf{I}^\star(\boldsymbol{\theta}, \, \sigma^2)\) corresponding to \(\sigma^2\), we get a new \((d+1) \times (d+1)\) matrix, say \(\mathbf{I}^\star(\boldsymbol{\theta})\), which no longer depends on \(\sigma^2\), the notation \(\mathbf{I}^\star(\boldsymbol{\theta})\) being consistent with Gu et al. [GWB18]. Then \(\pi_{\texttt{ref}}(\boldsymbol{\theta}) = \left\lvert \mathbf{I}^\star(\boldsymbol{\theta}) \right\rvert^{1/2}\).
Note that the determinant can be expressed as
where \(\mathring{\mathbf{u}} := \sigma^2 \mathbf{u}\). See Gu [Gu16] for details.
Note The information matrix takes the blocks in the order: “\(\boldsymbol{\theta}\) then \(\sigma^2\)”, while the opposite order is used in Gu [Gu16].
The reference prior suffers from a high computational cost. Indeed, in order to get the value of the prior density one needs the derivatives of the correlation matrix \(\mathbf{R}\), and in order to use the derivatives of the prior to find the posterior mode, the second-order derivatives of \(\mathbf{R}\) are required. An alternative is the following so-called Jointly Robust prior.
The “Jointly Robust” prior of Gu
Gu [Gu19] defines an easily computed prior called the Jointly Robust (JR) prior. This prior is implemented in the R package RobustGaSP. In the nugget case the prior is defined, with some obvious abuse of notation, by
\[
\pi_{\texttt{JR}}(\boldsymbol{\theta}, \, \sigma^2, \, \boldsymbol{\beta}, \, \alpha) = \frac{\pi_{\texttt{JR}}(\boldsymbol{\theta}, \, \alpha)}{\sigma^2}
\]
where as above \(\alpha := \sigma^2 / (\sigma^2 + \tau^2)\), so that the nugget variance ratio \(\eta := \tau^2 / \sigma^2\) of Gu [Gu19] is \(\eta = (1 - \alpha) / \alpha\). The JR prior corresponds to
\[
\pi_{\texttt{JR}}(\boldsymbol{\theta}, \, \alpha) \propto t^{a_{\texttt{JR}}} \, \exp\{- b_{\texttt{JR}} \, t\}, \qquad t := \sum_{\ell = 1}^{d} \frac{C_\ell}{\theta_\ell} + \eta,
\]
where \(a_{\texttt{JR}} > -(d + 1)\) and \(b_{\texttt{JR}} > 0\) are two hyperparameters, and \(C_\ell\) is proportional to the range \(r_\ell\) of the column \(\ell\) in \(\mathbf{X}\).
The values of \(a_{\texttt{JR}}\) and \(b_{\texttt{JR}}\) are chosen as
Note that, as opposed to the objective prior described above, the JR prior does not depend on the specific kernel chosen for the GP. However, the integration w.r.t. \(\sigma^2\) and \(\boldsymbol{\beta}\) is the same as for the reference prior, which means that the marginal likelihood is the same as for the reference prior, corresponding to \(a = 1\) in the prior (1).
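The robustness mechanism of the JR prior can be sketched as follows (plain NumPy, not RobustGaSP or libKriging code; the function, the hyperparameter values, and the choice \(C_\ell \propto r_\ell\) follow my reading of Gu [Gu19] and should be treated as assumptions):

```python
import numpy as np

# Hedged sketch: JR log-prior up to an additive constant, using the form
#   pi_JR(theta, eta) ∝ t^{a_JR} exp(-b_JR * t),  t = sum_l C_l/theta_l + eta.
# Hyperparameter values below are assumed RobustGaSP-style defaults.
def jr_log_prior(theta, eta, C, a_jr, b_jr):
    t = np.sum(C / theta) + eta
    return a_jr * np.log(t) - b_jr * t

n, d = 20, 2
r = np.array([1.0, 2.0])            # assumed input ranges per column of X
C = n ** (-1.0 / d) * r             # assumed: C_l proportional to r_l
a_jr = 0.2
b_jr = n ** (-1.0 / d) * (a_jr + d)

lp_mid = jr_log_prior(np.array([0.5, 0.5]), 0.1, C, a_jr, b_jr)
lp_tiny = jr_log_prior(np.array([1e-8, 1e-8]), 0.1, C, a_jr, b_jr)
# As theta -> 0, t -> infinity and the log-prior diverges to -infinity,
# which penalizes the degenerate estimate theta_hat = 0.
```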
Caution The parameter \(a_{\texttt{JR}}\) is denoted by \(a\) in Gu [Gu19] and in the code of libKriging. It differs from the exponent \(a\) of \(\sigma^{-2}\) used above.