Maximum likelihood
General form of the likelihood
The general form of the likelihood is
where \(\bs{\psi}\) is the vector of covariance parameters which depend on the specific Kriging model used, see the section Parameters. The notation \(|\m{C}|\) is for the determinant of the matrix \(\m{C}\).
Profile likelihood
In the ML framework it turns out that at least the ML estimate \(\widehat{\bs{\beta}}\) of the trend coefficient vector can be computed by GLS as exposed in Section Generalized Least Squares. Moreover the GLS step can provide an estimate of the variance for the non-trend part component i.e., the difference between the response and the trend part. See Roustant et al. [RGD12].
This allows the maximization of a profile likelihood function \(L_{\texttt{prof}}\) depending on a smaller number of parameters. In practice the log-likelihood \(\ell := \log L\) and the log-profile likelihood \(\ell_{\texttt{prof}} := \log L_{\texttt{prof}}\) are used. The profile log-likelihood functions are detailed and summarized in the Table below.
Remind that if we replace \(\bs{\beta}\) by its estimate \(\widehat{\bs{\beta}}\) in the sum of squares used in the log-likelihood, we get a quadratic form of \(\m{y}\)
where \(\m{B}\) is the Bending Energy Matrix (BEM).
Kriging(noise = NULL)
In the "Kriging" case where \(\m{C} = \sigma^2 \,
\m{R}(\bs{\theta})\), both the ML estimates
\(\widehat{\bs{\beta}}\) and \(\widehat{\bs{\sigma}}^2\)
are provided by GLS. So these parameters are “concentrated out of the
likelihood” and we can use the profile likelihood function depending
on \(\bs{\theta}\) only \(L_{\texttt{prof}}(\bs{\theta})
:= L(\bs{\theta}, \, \widehat{\sigma}^2,\,
\widehat{\bs{\beta}})\) where both \(\widehat{\sigma}^2\) and
\(\widehat{\bs{\beta}}\) depend on \(\bs{\theta}\).
Kriging(noise = "nugget")
In the Kriging(noise = "nugget") case, beside the vector \(\bs{\theta}\) of
correlation ranges and instead of the couple of parameters
\([\sigma^2, \, \tau^2]\) or \([\sigma^2, \, \alpha]\) we can use the couple
\([\nu^2,\, \alpha]\) defined by
and which can be named the total variance and the variance ratio. The covariance matrix used in the likelihood is then
where \(\m{R}_\alpha\) is a correlation matrix. As for the Kriging case the ML estimate \(\widehat{\nu}^2\) can be obtained by GLS as \(\widehat{\nu}^2 = S^2/n\). Therefore we can use a profile likelihood function depending on the correlation ranges \(\bs{\theta}\) and the variance ratio \(\alpha\), namely \(L_{\texttt{prof}}(\bs{\theta},\,\alpha) := L(\bs{\theta}, \, \widehat{\nu}^2,\, \widehat{\bs{\beta}})\).
Kriging(noise = <variance vector>)
The covariance matrix to be used in the likelihood is
where the noise variances \(\tau_i^2\) are known. In this case the parameter \(\sigma^2\) can no longer be concentrated out and the profile likelihood to be maximized is a function of \(\bs{\theta}\) and \(\sigma^2\) with only the trend parameter being concentrated out \(L_{\texttt{prof}}(\bs{\theta},\,\sigma^2) := L(\bs{\theta}, \, \sigma^2, \, \widehat{\bs{\beta}})= L(\bs{\psi}, \, \widehat{\bs{\beta}})\).
Table
The following table gives the profile log-likelihood for the different forms of Kriging models. The sum of squares \(S^2\) is given by \(S^2 = \m{e}^\top \mathring{\m{C}}^{-1} \m{e}\) where \(\m{e}:= \m{y} - \m{F}\widehat{\bs{\beta}}\) is the estimated non-trend component and \(\mathring{\m{C}}\) is the correlation matrix (equal to \(\m{R}\) or \(\m{R}_\alpha\)).
|
\(-2 \ell_{\texttt{prof}}(\bs{\theta}) = \log \lvert\m{R}\rvert + n \log S^2\) |
|
\(-2 \ell_{\texttt{prof}}(\bs{\theta}, \, \alpha) = \log \lvert\m{R}_\alpha\rvert + n \log S^2\) |
|
\(-2 \ell_{\texttt{prof}}(\bs{\theta}, \, \sigma^2) = \log \lvert\m{C}\rvert + \m{e}^\top \m{C}^{-1}\m{e}\) |
Note that \(\widehat{\bs{\beta}}\) and \(\m{e}\) depend on the covariance parameters as do the correlation or covariance matrix. The profile log-likelihoods are given up to additive constants. The sum of squares \(S^2\) can be expressed as \(S^2 = \m{y}^\top \mathring{\m{B}} \m{y}\) where \(\mathring{\m{B}} := \sigma^2 \m{B}\) is a scaled version ot the Bending Energy matrix \(\m{B}\).
Derivatives w.r.t. the parameters
In the three cases, the symbolic derivatives of the log-profile likelihood w.r.t. the parameters can be obtained by chain rule hence be used in the optimization routine.