Trend estimation
Generalized Least Squares
Using \(n\) given observations \(y_i\), we can estimate the trend at the inputs \(\m{x}_i\). For that aim we must find an estimate \(\widehat{\bs{\beta}}\) of the unknown vector \(\bs{\beta}\). When no nugget or noise is used, the GP part comes as the difference \(\widehat{\bs{\zeta}} = \m{y} - \m{F}\widehat{\bs{\beta}}\). When instead a nugget or a noise is present a further step is needed to separate the smooth GP part from the nugget or noise in \(\m{y} - \m{F}\widehat{\bs{\beta}}\).
If the covariance parameters are known, the estimate \(\widehat{\bs{\beta}}\) can be obtained by using General Least Squares (GLS); this estimate is also the Maximum Likelihood estimate. The computations related to GLS can rely on the Cholesky and the QR decompositions of matrices as now detailed.
The "Kriging" case
In the "Kriging" case, we have \(\m{C} = \sigma^2 \m{R}\) where
\(\m{R}\) is the correlation matrix depending on \(\bs{\theta}\). If the
correlation matrix \(\m{R}\) is known, then the ML estimate of
\(\bs{\beta}\) and its covariance are given by
Moreover the ML estimate \(\widehat{\sigma}^2\) is available as well.
In practice we can use the Cholesky decomposition \(\m{R} = \m{L}\m{L}^\top\) where \(\m{L}\) is a \(n \times n\) lower triangular matrix with positive diagonal elements. By left-multiplying the relation \(\m{y} = \m{F}\bs{\beta} + \bs{\zeta}\) by \(\m{L}^{-1}\), we get
where the “dagged” symbols indicate a left multiplication by \(\m{L}^{-1}\) e.g., \(\m{y}^\dagger=\m{L}^{-1}\m{y}\). We get a standard linear regression with i.i.d. Gaussian errors \(\bs{\zeta}_i^\dagger\) having zero mean and variance \(\sigma^2\). So the ML estimates \(\widehat{\bs{\beta}}\) and \(\widehat{\sigma}^2\) come by Ordinary Least Squares. Using \(\widehat{\bs{\zeta}} = \m{y} - \m{F}\widehat{\bs{\beta}}\) and \(\bs{\zeta}^\dagger := \m{L}^{-1}\widehat{\bs{\zeta}}\) we have
Note that \(\widehat{\sigma}^2_{\texttt{ML}}\) is a biased estimate of \(\sigma^2\). An alternative unbiased estimate can be obtained by using \(n-p\) instead of \(n\) as the denominator: this is the so-called Restricted Maximum Likelihood (REML) estimate.
The computations rely on the so-called “thin” or “economical” QR decomposition of the transformed trend matrix \(\m{F}^\dagger\)
where \(\m{Q}_{\m{F}^\dagger}\) is a \(n \times p\) orthogonal matrix and \(\m{R}_{\m{F}^\dagger}\) is a \(p \times p\) upper triangular matrix. The orthogonality means that \(\m{Q}_{\m{F}^{\dagger}}^\top\m{Q}_{\m{F}^\dagger}= \m{I}_p\). The estimate \(\widehat{\bs{\beta}}\) comes by solving the triangular system \(\m{R}_{\m{F}^\dagger}\bs{\beta} = \m{Q}_{\m{F}^\dagger}^\top \m{y}^\dagger\), and the covariance of the estimate is \(\textsf{Cov}(\widehat{\bs{\beta}}) = \m{R}_{\m{F}^\dagger}^{-1} \m{R}_{\m{F}^\dagger}^{-\top}\)
Following a popular linear regression trick, one can further use the QR decomposition of the matrix \(\m{F}^\dagger_+\) obtained by adding a new column \(\m{y}^\dagger\) to \(\m{F}^\dagger\)
Then the \(p+1\) column of \(\m{Q}_{\m{F}^\dagger_+}\) contains the vector of residuals \(\widehat{\bs{\zeta}}^\dagger = \m{y}^\dagger - \m{F}^\dagger \widehat{\bs{\beta}}\) in its first \(p\) elements and the residual sum of squares is given by the square of the element \(R_{\m{F}^\dagger_+}[p + 1, p +1]\). See Lange [Lan10].
Kriging(noise = "nugget") and Kriging(noise = <variance vector>)
When a nugget or noise term is used, the estimate of \(\bs{\beta}\) can
be obtained as above provided that the covariance matrix is that of
the non-trend component hence includes the nugget or noise variance in
its diagonal. In the Kriging(noise = "nugget") case the GLS will provide an
estimate of the variance \(\nu^2 = \sigma^2 + \tau^2\) but the ML
estimate of \(\sigma^2\) can only be obtained by using a numerical
optimization providing the ML estimate of \(\alpha\) from which the
estimate of \(\sigma^2\) is found.
The Bending Energy Matrix
Since \(\widehat{\bs{\beta}}\) is a linear function of \(\m{y}\) we have
where the \(n \times n\) matrix \(\m{B}\) called the Bending Energy Matrix (BEM) is given by
The \(n \times n\) matrix \(\m{B}\) is such that \(\m{B}\m{F} = \m{0}\) which means that the columns of \(\m{F}\) are eigenvectors of \(\m{B}\) with eigenvalue \(0\). If \(\m{C}\) is positive definite and \(\m{F}\) has full column rank as assumed, then \(\m{B}\) has rank \(n- p\).
In the special case where no trend is used i.e., \(p=0\) the bending energy matrix can consistently be defined as \(\m{B} := \m{C}^{-1}\), the trend matrix \(\m{F}\) then being a matrix with zero columns and the vector \(\bs{\beta}\) being of length zero.
The BEM matrix is closely related to smoothing since the trend and GP component of \(\m{y}\) are given by
The matrix \(\m{I}_n - \m{C}\m{B}\) is the matrix of the orthogonal projection on the linear space spanned by the columns of \(\m{F}\) in \(\mathbb{R}^n\) equipped with the inner product \(\langle\m{z},\,\m{z}'\rangle_{\m{C}^{-1}} := \m{z}^\top \m{C}^{-1}\m{z}'\).
Note The BEM does not depend on the specific basis used to define the linear space of trend functions. It also depends on the kernel only through the reduced kernel related to the trend linear space, see Pronzato [Pro19]. So the eigen-decomposition of the BEM provides useful insights into the model used such as the so-called Principal Kriging Functions
The BEM \(\m{B}\) can be related to the matrices \(\m{C}\) and \(\m{F}\) by a block inversion
where the inverse exists provided that \(\m{F}\) has full column rank, the kernel being assumed to be definite positive.
The relation can be derived by using the so-called kernel shift functions \(\m{x} \mapsto C(\m{x}, \, \m{x}_i)\) to represent the GP component of \(y(\m{x})\) in the Kriging mean function
In the case where the model has no nugget or noise, using the \(n\) observations \(y_i\) we can find the \(n + p\) unknown coefficients \(\alpha_i\) and \(\beta_k\) by imposing the orthogonality constraints \(\m{F}^\top\bs{\alpha} = \m{0}_p\), leading to the linear system
see Mardia et al. [MKGL96].
It turns out that the trend part of the solution is then identical to the GLS estimate \(\widehat{\bs{\beta}}\).
If \(n^\star\) “new” inputs \(\m{x}^\star_j\) are given in a matrix \(\m{X}^\star\), then with \(\m{C}^\star := \m{C}(\m{X}^\star, \, \m{X})\) and \(\m{F}^\star :=\m{F}(\m{X}^\star)\) the prediction writes in blocks form as