# Trend estimation (SecGLS)= ## Generalized Least Squares Using $n$ given observations $y_i$, we can estimate the trend at the inputs $\m{x}_i$. For that aim we must find an estimate $\widehat{\bs{\beta}}$ of the unknown vector $\bs{\beta}$. When no nugget or noise is used, the GP part comes as the difference $\widehat{\bs{\zeta}} = \m{y} - \m{F}\widehat{\bs{\beta}}$. When instead a nugget or a noise is present a further step is needed to separate the smooth GP part from the nugget or noise in $\m{y} - \m{F}\widehat{\bs{\beta}}$. If the covariance parameters are known, the estimate $\widehat{\bs{\beta}}$ can be obtained by using General Least Squares (GLS); this estimate is also the Maximum Likelihood estimate. The computations related to GLS can rely on the Cholesky and the QR decompositions of matrices as now detailed. ### The `"Kriging"` case In the `"Kriging"` case, we have $\m{C} = \sigma^2 \m{R}$ where $\m{R}$ is the correlation matrix depending on $\bs{\theta}$. If the correlation matrix $\m{R}$ is known, then the ML estimate of $\bs{\beta}$ and its covariance are given by $$ \widehat{\bs{\beta}} = \left[\m{F}^\top \m{R}^{-1} \m{F}\right]^{-1} \m{F}^\top \m{R}^{-1}\m{y}, \qquad \textsf{Cov}(\widehat{\bs{\beta}}) = \sigma^2 [\m{F}^\top \m{R}^{-1}\m{F}]^{-1}. $$ Moreover the ML estimate $\widehat{\sigma}^2$ is available as well. In practice we can use the Cholesky decomposition $\m{R} = \m{L}\m{L}^\top$ where $\m{L}$ is a $n \times n$ lower triangular matrix with positive diagonal elements. By left-multiplying the relation $\m{y} = \m{F}\bs{\beta} + \bs{\zeta}$ by $\m{L}^{-1}$, we get $$ \m{y}^\dagger = \m{F}^\dagger\bs{\beta} + \bs{\zeta}^\dagger $$ where the "dagged" symbols indicate a left multiplication by $\m{L}^{-1}$ e.g., $\m{y}^\dagger=\m{L}^{-1}\m{y}$. We get a standard linear regression with i.i.d. Gaussian errors $\bs{\zeta}_i^\dagger$ having zero mean and variance $\sigma^2$. So the ML estimates $\widehat{\bs{\beta}}$ and $\widehat{\sigma}^2$ come by Ordinary Least Squares. Using $\widehat{\bs{\zeta}} = \m{y} - \m{F}\widehat{\bs{\beta}}$ and $\bs{\zeta}^\dagger := \m{L}^{-1}\widehat{\bs{\zeta}}$ we have $$ \widehat{\sigma}^2_{\texttt{ML}} = \frac{1}{n} \,S^2, \quad\text{with}\quad S^2 := \widehat{\bs{\zeta}}^{\dagger\top}\widehat{\bs{\zeta}}^\dagger = \widehat{\bs{\zeta}}^\top\m{R}^{-1}\widehat{\bs{\zeta}}. $$ Note that $\widehat{\sigma}^2_{\texttt{ML}}$ is a biased estimate of $\sigma^2$. An alternative unbiased estimate can be obtained by using $n-p$ instead of $n$ as the denominator: this is the so-called *Restricted Maximum Likelihood* (REML) estimate. The computations rely on the so-called "thin" or "economical" QR decomposition of the transformed trend matrix $\m{F}^\dagger$ $$ \m{F}^\dagger = \m{Q}_{\m{F}^\dagger} \m{R}_{\m{F}^\dagger} $$ where $\m{Q}_{\m{F}^\dagger}$ is a $n \times p$ orthogonal matrix and $\m{R}_{\m{F}^\dagger}$ is a $p \times p$ upper triangular matrix. The orthogonality means that $\m{Q}_{\m{F}^{\dagger}}^\top\m{Q}_{\m{F}^\dagger}= \m{I}_p$. The estimate $\widehat{\bs{\beta}}$ comes by solving the triangular system $\m{R}_{\m{F}^\dagger}\bs{\beta} = \m{Q}_{\m{F}^\dagger}^\top \m{y}^\dagger$, and the covariance of the estimate is $\textsf{Cov}(\widehat{\bs{\beta}}) = \m{R}_{\m{F}^\dagger}^{-1} \m{R}_{\m{F}^\dagger}^{-\top}$ Following a popular linear regression trick, one can further use the QR decomposition of the matrix $\m{F}^\dagger_+$ obtained by adding a new column $\m{y}^\dagger$ to $\m{F}^\dagger$ $$ \m{F}^\dagger_+ := \left[ \m{F}^\dagger \, \vert \, \m{y}^\dagger \right] = \m{Q}_{\m{F}^\dagger_+}\m{R}_{\m{F}^\dagger_+}. $$ Then the $p+1$ column of $\m{Q}_{\m{F}^\dagger_+}$ contains the vector of residuals $\widehat{\bs{\zeta}}^\dagger = \m{y}^\dagger - \m{F}^\dagger \widehat{\bs{\beta}}$ in its first $p$ elements and the residual sum of squares is given by the square of the element $R_{\m{F}^\dagger_+}[p + 1, p +1]$. See {cite:t}`Lange_Numerical`. ### `"NuggetKriging"` and `"NoiseKriging"` When a nugget or noise term is used, the estimate of $\bs{\beta}$ can be obtained as above provided that the covariance matrix is that of the non-trend component hence includes the nugget or noise variance in its diagonal. In the `NuggetKriging` case the GLS will provide an estimate of the variance $\nu^2 = \sigma^2 + \tau^2$ but the ML estimate of $\sigma^2$ can only be obtained by using a numerical optimization providing the ML estimate of $\alpha$ from which the estimate of $\sigma^2$ is found. (SecBending)= ## The Bending Energy Matrix Since $\widehat{\bs{\beta}}$ is a linear function of $\m{y}$ we have $$ [\m{y} - \m{F}\widehat{\bs{\beta}}]^\top \m{C}^{-1} [\m{y} - \m{F}\widehat{\bs{\beta}}] = \m{y}^\top \m{B} \m{y} $$ where the $n \times n$ matrix $\m{B}$ called the *Bending Energy Matrix* (BEM) is given by $$ \m{B} = \m{C}^{-1} - \m{C}^{-1}\m{F} \left[\m{F}^\top \m{C}^{-1} \m{F} \right]^{-1} \m{F}^\top\m{C}^{-1}. $$ The $n \times n$ matrix $\m{B}$ is such that $\m{B}\m{F} = \m{0}$ which means that the columns of $\m{F}$ are eigenvectors of $\m{B}$ with eigenvalue $0$. If $\m{C}$ is positive definite and $\m{F}$ has full column rank as assumed, then $\m{B}$ has rank $n- p$. In the special case where no trend is used i.e., $p=0$ the bending energy matrix can consistently be defined as $\m{B} := \m{C}^{-1}$, the trend matrix $\m{F}$ then being a matrix with zero columns and the vector $\bs{\beta}$ being of length zero. The BEM matrix is closely related to smoothing since the trend and GP component of $\m{y}$ are given by $$ \m{y} = \underset{\text{trend}} {\underbrace{\widehat{\bs{\mu}}}} + \underset{\text{GP}} {\underbrace{\widehat{\bs{\eta}}}} = [\m{I}_n - \m{C}\m{B}] \, \m{y} + \m{C}\m{B} \, \m{y}. $$ The matrix $\m{I}_n - \m{C}\m{B}$ is the matrix of the orthogonal projection on the linear space spanned by the columns of $\m{F}$ in $\mathbb{R}^n$ equipped with the inner product $\langle\m{z},\,\m{z}'\rangle_{\m{C}^{-1}} := \m{z}^\top \m{C}^{-1}\m{z}'$. **Note** The BEM does not depend on the specific basis used to define the linear space of trend functions. It also depends on the kernel only through the *reduced kernel* related to the trend linear space, see {cite:t}`Pronzato_Sens`. So the eigen-decomposition of the BEM provides useful insights into the model used such as the so-called *Principal Kriging Functions* The BEM $\m{B}$ can be related to the matrices $\m{C}$ and $\m{F}$ by a block inversion $$ \begin{bmatrix} \m{C} & \m{F}\\ \m{F}^\top & \m{0} \end{bmatrix}^{-1} = \begin{bmatrix} \m{B} & \m{U}\\ \m{U}^\top & \m{V} \end{bmatrix} \qquad \text{with } \left\{ \begin{aligned} \m{V} &:= - [\m{F}^\top\m{C}^{-1}\m{F}]^{-1}\\ \m{U} &:= - \m{C}^{-1}\m{F}\m{V} \end{aligned} \right. $$ where the inverse exists provided that $\m{F}$ has full column rank, the kernel being assumed to be definite positive. The relation can be derived by using the so-called *kernel shift* functions $\m{x} \mapsto C(\m{x}, \, \m{x}_i)$ to represent the GP component of $y(\m{x})$ in the Kriging mean function $$ h(\m{x}) = \underset{\text{GP}} {\underbrace{\sum_{i=1}^n \alpha_i \, C(\m{x}_i, \, \m{x})}} + \underset{\text{trend}} {\underbrace{\sum_{k=1}^p \beta_k f_k(\m{x})}}. $$ In the case where the model has no nugget or noise, using the $n$ observations $y_i$ we can find the $n + p$ unknown coefficients $\alpha_i$ and $\beta_k$ by imposing the orthogonality constraints $\m{F}^\top\bs{\alpha} = \m{0}_p$, leading to the linear system $$ \begin{bmatrix} \m{C} & \m{F}\\ \m{F}^\top & \m{0} \end{bmatrix} \begin{bmatrix} \bs{\alpha}\\ \bs{\beta} \end{bmatrix} = \begin{bmatrix} \m{y}\\ \m{0} \end{bmatrix}, $$ see {cite:t}`MardiaEtAl_KrigingSplines`. It turns out that the trend part of the solution is then identical to the GLS estimate $\widehat{\bs{\beta}}$. If $n^\star$ "new" inputs $\m{x}^\star_j$ are given in a matrix $\m{X}^\star$, then with $\m{C}^\star := \m{C}(\m{X}^\star, \, \m{X})$ and $\m{F}^\star :=\m{F}(\m{X}^\star)$ the prediction writes in blocks form as $$ \widehat{\m{y}}^\star = \begin{bmatrix} \m{C}^\star & \m{F}^\star \end{bmatrix} \begin{bmatrix} \widehat{\bs{\alpha}} \\ \widehat{\bs{\beta}} \end{bmatrix} = \begin{bmatrix} \m{C}^\star & \m{F}^\star \end{bmatrix} \begin{bmatrix} \m{B} & \m{U}\\ \m{U}^\top & \m{V} \end{bmatrix} \begin{bmatrix} \m{y} \\ \m{0} \end{bmatrix}. $$