Leave-one-out

Consider \(n\) observations \(y_i\) from a Kriging model, corresponding to the “Kriging” case with no nugget or noise. For \(i = 1, \dots, n\), let \(\widehat{y}_{i|-i}\) be the prediction of \(y_i\) based on the vector \(\mathbf{y}_{-i}\) obtained by omitting observation \(i\) from \(\mathbf{y}\). The vector of leave-one-out (LOO) predictions is defined by

\[ \widehat{\mathbf{y}}_{\texttt{LOO}} := [ \widehat{y}_{1|-1}, \dots, \, \widehat{y}_{n|-n} ]^\top, \]

and the leave-one-out Sum of Squared Errors criterion is defined by

\[ \texttt{SSE}_{\texttt{LOO}} := \sum_{i=1}^n \{ y_i - \widehat{y}_{i|-i} \}^2 = \| \mathbf{y} - \widehat{\mathbf{y}}_{\texttt{LOO}} \|^2. \]

It can be shown that

\[ \mathbf{y} - \widehat{\mathbf{y}}_{\texttt{LOO}} = \mathbf{D}_{\mathbf{B}}^{-1}\mathbf{B}\,\mathbf{y} \]

where \(\mathbf{B}\) is the Bending Energy Matrix (BEM) and \(\mathbf{D}_{\mathbf{B}}\) is the diagonal matrix with the same diagonal as \(\mathbf{B}\).
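
As a numerical check of this identity, here is a minimal Python sketch assuming a one-dimensional exponential covariance kernel, a constant trend, and the usual expression \(\mathbf{B} = \mathbf{K}^{-1} - \mathbf{K}^{-1}\mathbf{F}\,[\mathbf{F}^\top \mathbf{K}^{-1}\mathbf{F}]^{-1}\mathbf{F}^\top \mathbf{K}^{-1}\) of the BEM for a covariance matrix \(\mathbf{K}\) and trend matrix \(\mathbf{F}\); the helper names (`exp_kernel`, `bem`, `krig_mean`) are illustrative, not part of any package API.

```python
import numpy as np

def exp_kernel(x1, x2, sigma2=1.0, theta=0.2):
    """Exponential covariance k(x, x') = sigma2 * exp(-|x - x'| / theta)."""
    return sigma2 * np.exp(-np.abs(x1[:, None] - x2[None, :]) / theta)

def bem(K, F):
    """Bending energy matrix B = K^{-1} - K^{-1} F [F' K^{-1} F]^{-1} F' K^{-1}."""
    Ki = np.linalg.inv(K)
    KiF = Ki @ F
    return Ki - KiF @ np.linalg.solve(F.T @ KiF, KiF.T)

def krig_mean(x_new, f_new, x, y, F, sigma2, theta):
    """Universal Kriging mean at x_new, with the GLS estimate of the trend."""
    Ki = np.linalg.inv(exp_kernel(x, x, sigma2, theta))
    k = exp_kernel(x_new, x, sigma2, theta)
    beta = np.linalg.solve(F.T @ Ki @ F, F.T @ Ki @ y)
    return (f_new @ beta + k @ Ki @ (y - F @ beta)).item()

n = 12
x = np.linspace(0.0, 1.0, n)
y = np.sin(6.0 * x) + 0.2 * np.cos(15.0 * x)
F = np.ones((n, 1))                          # constant trend

# Closed form: y - y_LOO = D_B^{-1} B y, i.e. (B y)_i / B_{ii} componentwise.
B = bem(exp_kernel(x, x), F)
e_closed = (B @ y) / np.diag(B)

# Brute force: refit without observation i and predict y_i.
e_brute = np.array([
    y[i] - krig_mean(x[[i]], F[[i]],
                     np.delete(x, i), np.delete(y, i),
                     np.delete(F, i, axis=0), 1.0, 0.2)
    for i in range(n)
])

print(np.allclose(e_closed, e_brute))        # True
```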

Minimizing \(\texttt{SSE}_{\texttt{LOO}}\) with respect to the covariance parameters \(\theta_\ell\) yields estimates of these parameters, as sketched below. Note that, similarly to the profile likelihood, the criterion \(\texttt{SSE}_{\texttt{LOO}}\) does not depend on the vector \(\boldsymbol{\beta}\) of trend parameters.
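
A sketch of this estimation under the same assumptions as above, reusing `exp_kernel` and `bem` and optimizing a single range parameter \(\theta\); `scipy.optimize.minimize_scalar` is just one possible choice of optimizer, not necessarily the one used in practice.

```python
from scipy.optimize import minimize_scalar

def sse_loo(theta, x, y, F):
    """SSE_LOO(theta) = || D_B^{-1} B y ||^2, computed with sigma2 = 1:
    D_B^{-1} B is invariant to a rescaling of the covariance, so neither
    sigma^2 nor beta enters the criterion."""
    B = bem(exp_kernel(x, x, 1.0, theta), F)
    e = (B @ y) / np.diag(B)
    return float(e @ e)

res = minimize_scalar(sse_loo, bounds=(0.01, 2.0), args=(x, y, F),
                      method="bounded")
theta_hat = res.x
```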

An estimate of the GP variance \(\sigma^2\) is given by

\[ \widehat{\sigma}^2_{\texttt{LOO}} = \frac{1}{n} \, \mathbf{y}^\top \mathring{\mathbf{B}} \mathbf{D}_{\mathring{\mathbf{B}}}^{-1} \mathring{\mathbf{B}} \mathbf{y} \]

where \(\mathring{\mathbf{B}}:= \sigma^2 \mathbf{B}\) does not depend on \(\sigma^2\) and \(\mathbf{D}_{\mathring{\mathbf{B}}}\) is the diagonal matrix having the same diagonal as \(\mathring{\mathbf{B}}\).

The LOO estimation can be preferable to maximum-likelihood estimation when the covariance kernel is misspecified; see Bachoc [Bac12], who provides many details on the criterion \(\texttt{SSE}_{\texttt{LOO}}\), including its derivatives.