(SecBayes)=
# Bayesian marginal analysis

## Motivation and general form of prior
{cite:authors}`BergerAtAl_ObjectiveBayesSpatial` have shown that the
ML estimation of Kriging models often gives estimated ranges
$\widehat{\theta}_k = 0$ or $\widehat{\theta}_k = \infty$, leading to
poor predictions. Although finite positive bounds can be imposed in
the optimization to address this issue, the bounds are quite
arbitrary. {cite:authors}`BergerAtAl_ObjectiveBayesSpatial` have shown
that one can instead replace the ML estimates by the marginal
posterior mode in a Bayesian analysis. Provided that suitable priors
are used, it can be shown that the estimated ranges will be both
finite and positive: $0 < \widehat{\theta}_k < \infty$.

**Note**.  In **libKriging** the Bayesian approach will be used only to provide
  alternatives to the ML estimation of the range or correlation
  parameters.  The Bayesian inference on these parameters will not be
  achieved. Rather than the profile likelihood, a so-called
  *marginal likelihood* will be used.

In this section we switch to a Bayesian style of notations. The vector
of parameters is formed by three blocks: the vector $\bs{\theta}$ of
correlation ranges, the GP variance $\sigma^2$ and the vector
$\bs{\beta}$ of trend parameters. A major concern is the elicitation
of the prior density $\pi(\bs{\theta}, \, \sigma^2, \,\bs{\beta})$.

## Objective priors of Gu et al

A natural idea is that the prior should not provide information about
$\bs{\beta}$, implying the use of *improper* probability
densities. With the factorization

$$
  \pi(\bs{\theta}, \, \sigma^2, \,\bs{\beta}) =
  \pi(\bs{\beta} \, \vert \, \bs{\theta}, \,\sigma^2) \times
  \pi(\bs{\theta}, \, \sigma^2),
$$

a further assumption is that the trend parameter vector $\bs{\beta}$
is a priori independent of the covariance parameters $\bs{\theta}$ and
$\sigma^2$, and that the prior for $\bs{\beta}$ is an improper prior
with constant density

$$
   \pi(\bs{\beta} \, \vert \, \bs{\theta}, \sigma^2)
        = \pi(\bs{\beta}) \propto 1.
$$

Then the problem boils down to the choice of the joint prior
$\pi(\bs{\theta}, \, \sigma^2)$. 

In the case where no nugget or noise is used, an interesting choice is

$$ 
  \tag{1}
  \pi(\bs{\theta}, \, \sigma^2) =
    \frac{\pi(\bs{\theta})}{(\sigma^2)^a}
$$

with $a >0$. With this specific form the result of the integration of
the likelihood or of the posterior density with respect to $\sigma^2$
and $\bs{\beta}$ is then known in closed form.  

## Fit: Bayesian marginal analysis

In the `Kriging` case, the *marginal likelihood*
a.k.a. *integrated likelihood* for $\bs{\theta}$ is obtained by
marginalizing the GP variance $\sigma^2$ and the trend parameter
vector $\bs{\beta}$ out of the likelihood according to

$$
  L_{\texttt{marg}}(\bs{\theta};\,\m{y}) := 
  p(\m{y} \, \vert \, \bs{\theta}) \propto \int
  p(\m{y} \, \vert \,\bs{\theta}, \, \sigma^2, \, \bs{\beta}) \,
  \frac{1}{\sigma^{2a}} \,
  \text{d}\sigma^2\,
  \text{d}\bs{\beta},
$$

where $p(\m{y} \, \vert \,\bs{\theta}, \, \sigma^2, \,
\bs{\beta})$ is the likelihood $L(\bs{\theta}, \,
\sigma^2, \, \bs{\beta};\, \m{y})$. We get a closed
expression given in the [table below](TabMarglik).  Now for a prior
having the form (1), the marginal posterior factorizes as

$$
    p_{\texttt{marg}}(\bs{\theta}\,\vert \,\m{y}) 
	\propto \pi(\bs{\theta}) \times  L_{\texttt{marg}}(\bs{\theta};\,\m{y}).
$$

In the `NuggetKriging` case, the same approach can be used, but the
parameter used for the nugget is not marginalized out so it remains an
argument of the marginal likelihood. In **libKriging** the nugget
parameter is taken as $\alpha := \sigma^2 / (\sigma^2 + \tau^2)$ where
$\tau^2$ is the nugget variance. We then have the factorization

$$
    p_{\texttt{marg}}(\bs{\theta}, \, \alpha \,\vert \,\m{y}) 
	\propto \pi(\bs{\theta},\,\alpha) \times  L_{\texttt{marg}}(\bs{\theta},\,\alpha;\,\m{y}).
$$


**Note** The marginal likelihood differs from the frequentist notion
  attached to this name. But it also differs from the marginal
  likelihood as often used in the GP community e.g., in
  {cite:t}`RasmussenWilliams_GaussianProcesses` where the
  marginalization is for the values $\bs{\zeta}$ of the
  unobserved GP hence is nothing but the likelihood descrided in [this
  section](SecMLE).

(TabMarglik)=
## Table of marginal likelihood functions

The following table gives the marginal log-likelihood for the
different forms of Kriging models. The sum of squares $S^2$ is given
by $S^2 := \m{e}^\top \mathring{\m{C}}^{-1} \m{e}$
where $\m{e}:= \m{y} -
\m{F}\widehat{\bs{\beta}}$ and $\mathring{\m{C}}$ is
the correlation matrix (equal to $\m{R}$ or
$\m{R}_\alpha$). The sum of squares $S^2$ can be expressed as
$S^2 = \m{y}^\top \mathring{\m{B}} \m{y}$ where
$\mathring{\m{B}} := \sigma^2 \m{B}$ is a scaled version ot
the [Bending Energy matrix](SecBending) $\m{B}$.


|   |   |
|:--|:--|
| `"Kriging"` | $-2 \ell_{\texttt{marg}}(\bs{\theta}) = \log \lvert\m{R}\rvert + \log\lvert \m{F}^\top \m{R}^{-1}\m{F}\rvert + (n - p + 2a - 2) \log S^2$  |
| `"NuggetKriging"` | $-2 \ell_{\texttt{marg}}(\bs{\theta}, \, \alpha) = \log \lvert\m{R}_\alpha\rvert + \log\lvert \m{F}^\top \m{R}_\alpha^{-1}\m{F}\rvert + (n - p + 2a -2) \log S^2$ |
| `"NoiseKriging"` | *not used*  |


It can be interesting to compare this table with the [table of profile
log-likelihoods](TabProflik). 


## Reference prior for the correlation parameters [not implemented yet]

For the case when no nugget or noise is used,
{cite:authors}`BergerAtAl_ObjectiveBayesSpatial` define the reference joint
prior for $\bs{\theta}$ and $\sigma^2$ in relation to the
integrated likelihood where only the trend parameter
$\bs{\beta}$ is marginalized out, that is 
$p(\m{y} \, \vert \, \bs{\theta}, \, \sigma^2) \, = \int
p(\m{y} \, \vert \, \bs{\theta}, \, \sigma^2, \, \bs{\beta}) \,
\text{d}\bs{\beta}$ and they show that it has the form

$$
  \pi_{\texttt{ref}}(\bs{\theta},\, \sigma^2) %% = \left|\m{I}^\star(\sigma^2,\, \bs{\theta})\right|^{1/2}
  = \frac{\pi_{\texttt{ref}}(\bs{\theta})}{\sigma^2}
$$

where $\pi_{\texttt{ref}}(\bs{\theta})$ no longer depends on $\sigma^2$.

We now give some hints on the derivation and the computation of the
reference prior. Let
$\m{I}^\star(\bs{\theta},\,\sigma^2)$ be the $(d+1)
\times (d+1)$ Fisher information matrix based on the marginal
log-likelihood $\ell_{\texttt{marg}}(\bs{\theta},\,\sigma^2) =
\log L_{\texttt{marg}}(\bs{\theta},\,\sigma^2)$

$$
  \m{I}^\star(\bs{\theta},\, \sigma^2) := 
  \begin{bmatrix}
    -\mathbb{E}\left\{\frac{\partial^2}{\partial \bs{\theta}\partial \bs{\theta}^\top}
      \,
      \ell_{\texttt{marg}}(\bs{\theta}, \,\sigma^2 )\right\}
    & -\mathbb{E}\left\{\frac{\partial^2}{\partial \sigma^2\partial \bs{\theta}^\top}
      \,
      \ell_{\texttt{marg}}(\bs{\theta}, \,\sigma^2 )\right\}\\
    -\mathbb{E}\left\{\frac{\partial^2}{\partial \sigma^2\partial \bs{\theta}}\,
      \ell_{\texttt{marg}}(\bs{\theta}, \,\sigma^2 )\right\}
    & -\mathbb{E}\left\{\frac{\partial^2 }{\partial \sigma^2\partial \sigma^2}\,
      \ell_{\texttt{marg}}(\bs{\theta}, \,\sigma^2 )\right\}
    \rule{0pt}{1.5em}
  \end{bmatrix} =:
  \begin{bmatrix}
    \m{H} & \m{u}^\top\\
    \m{u} & b
  \end{bmatrix}.
$$

One can show that this information matrix can be expressed by using
the $n \times n$ symmetric matrices $\m{N}_k := \m{L}^{-1}
\left[\partial_{\theta_k} \m{R}\right] \m{L}^{-\top}$ where
$\m{L}$ is the lower Cholesky root of the correlation matrix
according to

$$
  \m{H} = \frac{1}{2}\,
  \begin{bmatrix}
    \text{tr}(\m{N}_1\m{N}_1)   & \text{tr}(\m{N}_1\m{N}_2)
    & \dots & \text{tr}(\m{N}_1\m{N}_p) \\
    \text{tr}(\m{N}_2\m{N}_1)   & \text{tr}(\m{N}_2\m{N}_2)
    & \dots & \text{tr}(\m{N}_2\m{N}_p) \\
    \vdots  & \vdots  & &  \vdots \\
    \text{tr}(\m{N}_p\m{N}_1)   & \text{tr}(\m{N}_p\m{N}_2)
    & \dots & \text{tr}(\m{N}_p\m{N}_p) 
  \end{bmatrix}, \quad
  \m{u} = \frac{1}{2 \sigma^2}\,
  \begin{bmatrix}
    \text{tr}(\m{N}_1)\\
    \text{tr}(\m{N}_2) \\
    \vdots\\
    \text{tr}(\m{N}_p) 
  \end{bmatrix}, \qquad
  b = \frac{n - p}{2 \sigma^4}.
$$

By multiplying by $\sigma^2$ both the last row and the last column of
$\m{I}^\star(\bs{\theta}, \, \sigma^2)$ corresponding to
$\sigma^2$, we get a new $(d+1) \times (d+1)$ matrix say
$\m{I}^\star(\bs{\theta})$ which no longer depends on
$\sigma^2$, the notation $\m{I}^\star(\bs{\theta})$ being
consistent with {cite:t}`GuEtAl_RobusGaSp`. Then
$\pi_{\texttt{ref}}(\bs{\theta}) = \left|
\m{I}^\star(\bs{\theta}) \right|^{1/2}$.

Note that the determinant expresses as

$$
  \left| \m{I}^\star(\bs{\theta}) \right| 
  = |\m{H}| \times
  \left|n -p  - \mathring{\m{u}}^\top \m{H}^{-1} 
  \mathring{\m{u}} \right| 
$$

where $\mathring{\m{u}} := \sigma^2 \m{u}$. See {cite:t}`Gu_Phd`
for details.

**Note**   The information matrix takes the blocks in the order:
  "$\bs{\theta}$ *then* $\sigma^2$", while the opposite order is used
  in {cite:t}`Gu_Phd`.

The reference prior suffers from its high computational cost. Indeed,
in order to get the value of the prior density one needs the
derivatives of the correlation matrix $\m{R}$ and in order to use
the derivatives of the prior to find the posterior mode, the second
order derivatives of $\m{R}$ are required. An alternative is
the following so-called *Jointly robust* prior.

(SecJointlyrobust)=
## The "Jointly Robust"  prior of Gu

{cite:t}`Gu_JointlyRobust` defines an easily computed prior called the
*Jointly Robust* (JR) prior. This prior is implemented in the R
package **RobustGaSP**. In the nugget case the prior is defined with some
obvious abuse of notation by

$$
  \pi_{\texttt{JR}}(\bs{\theta},\, \sigma^2, \, \alpha)  \propto
  \frac{\pi_{\texttt{JR}}(\bs{\theta}, \, \alpha)}{\sigma^2}
$$

where as above $\alpha := \sigma^2 / (\sigma^2 + \tau^2)$ so that the
nugget variance ratio $\eta := \tau^2 / \sigma^2$ of
{cite:t}`Gu_JointlyRobust` is $\eta = (1 - \alpha) / \alpha$. The JR
prior corresponds to

$$
  \pi_{\texttt{JR}}(\bs{\theta}, \, \alpha)  \propto t^{a_{\texttt{JR}}}
  \exp\{ -b_{\texttt{JR}}t\} \qquad
  t :=  \frac{1 - \alpha}{\alpha} + \sum_{\ell= 1}^d \frac{C_\ell}{\theta_\ell},
$$

where $a_{\texttt{JR}}> -(d + 1)$ and $b_{\texttt{JR}} >0$ are two
hyperparameters and $C_\ell$ is proportional to the range $r_\ell$ of
the column $\ell$ in $\m{X}$

$$
  C_\ell = n^{-1/d} \times r_\ell, \qquad r_\ell :=
  \max_i\{X_{i\ell}\} -\min_i\{X_{i\ell}\}.
$$

The values of $a_{\texttt{JR}}$ and $b_{\texttt{JR}}$ are chosen as

$$
  a_{\texttt{JR}} := 0.2, \qquad b_{\texttt{JR}} := 
  n^{-1/d} \times (a_{\texttt{JR}} + d).
$$

Note that as opposed to the objective prior described above, the JR
prior does not depend on the specific kernel chosen for the
GP. However the integration w.r.t.  $\sigma^2$ and $\bs{\beta}$ is the
same as for the reference prior, which means that the marginal
likelihood is the same as for the reference prior above corresponding
to $a = 1$ in the prior (1) above.

**Caution** The parameter $a_{\texttt{JR}}$ is denoted by $a$ in
{cite:t}`Gu_JointlyRobust` and in the code of **libKriging**. It
differs from the exponent $a$ of $\sigma^{-2}$ used above.