Maximum likelihood
General form of the likelihood
The general form of the likelihood is
where \(\boldsymbol{\psi}\) is the vector of covariance parameters which depend on the specific Kriging model used, see the section Parameters. The notation \(\mathbf{C}\) is for the determinant of the matrix \(\mathbf{C}\).
Profile likelihood
In the ML framework it turns out that at least the ML estimate \(\widehat{\boldsymbol{\beta}}\) of the trend coefficient vector can be computed by GLS as exposed in Section Generalized Least Squares. Moreover the GLS step can provide an estimate of the variance for the nontrend part component i.e., the difference between the response and the trend part. See Roustant et al. [RGD12].
This allows the maximization of a profile likelihood function \(L_{\texttt{prof}}\) depending on a smaller number of parameters. In practice the loglikelihood \(\ell := \log L\) and the logprofile likelihood \(\ell_{\texttt{prof}} := \log L_{\texttt{prof}}\) are used. The profile loglikelihood functions are detailed and summarized in the Table below.
Remind that if we replace \(\boldsymbol{\beta}\) by its estimate \(\widehat{\boldsymbol{\beta}}\) in the sum of squares used in the loglikelihood, we get a quadratic form of \(\mathbf{y}\)
where \(\mathbf{B}\) is the Bending Energy Matrix (BEM).
"Kriging"
In the "Kriging"
case where \(\mathbf{C} = \sigma^2 \,
\mathbf{R}(\boldsymbol{\theta})\), both the ML estimates
\(\widehat{\boldsymbol{\beta}}\) and \(\widehat{\boldsymbol{\sigma}}^2\)
are provided by GLS. So these parameters are “concentrated out of the
likelihood” and we can use the profile likelihood function depending
on \(\boldsymbol{\theta}\) only \(L_{\texttt{prof}}(\boldsymbol{\theta})
:= L(\boldsymbol{\theta}, \, \widehat{\sigma}^2,\,
\widehat{\boldsymbol{\beta}})\) where both \(\widehat{\sigma}^2\) and
\(\widehat{\boldsymbol{\beta}}\) depend on \(\boldsymbol{\theta}\).
"NuggetKriging"
In the "NuggetKriging"
case, beside the vector \(\boldsymbol{\theta}\) of
correlation ranges and instead of the couple of parameters
\([\sigma^2, \, \tau^2]\) or \([\sigma^2, \, \alpha]\) we can use the couple
\([\nu^2,\, \alpha]\) defined by
and which can be named the total variance and the variance ratio. The covariance matrix used in the likelihood is then
where \(\mathbf{R}_\alpha\) is a correlation matrix. As for the Kriging case the ML estimate \(\widehat{\nu}^2\) can be obtained by GLS as \(\widehat{\nu}^2 = S^2/n\). Therefore we can use a profile likelihood function depending on the correlation ranges \(\boldsymbol{\theta}\) and the variance ratio \(\alpha\), namely \(L_{\texttt{prof}}(\boldsymbol{\theta},\,\alpha) := L(\boldsymbol{\theta}, \, \widehat{\nu}^2,\, \widehat{\boldsymbol{\beta}})\).
"NoiseKriging"
The covariance matrix to be used in the likelihood is
where the noise variances \(\tau_i^2\) are known. In this case the parameter \(\sigma^2\) can no longer be concentrated out and the profile likelihood to be maximized is a function of \(\boldsymbol{\theta}\) and \(\sigma^2\) with only the trend parameter being concentrated out \(L_{\texttt{prof}}(\boldsymbol{\theta},\,\sigma^2) := L(\boldsymbol{\theta}, \, \widehat{\boldsymbol{\beta}})\).
Table
The following table gives the profile loglikelihood for the different forms of Kriging models. The sum of squares \(S^2\) is given by \(S^2 = \mathbf{e}^\top \mathring{\mathbf{C}}^{1} \mathbf{e}\) where \(\mathbf{e}:= \mathbf{y}  \mathbf{F}\widehat{\boldsymbol{\beta}}\) is the estimated nontrend component and \(\mathring{\mathbf{C}}\) is the correlation matrix (equal to \(\mathbf{R}\) or \(\mathbf{R}_\alpha\)).

\(2 \ell_{\texttt{prof}}(\boldsymbol{\theta}) = \log \lvert\mathbf{R}\rvert + n \log S^2\) 

\(2 \ell_{\texttt{prof}}(\boldsymbol{\theta}, \, \alpha) = \log \lvert\mathbf{R}_\alpha\rvert + n \log S^2\) 

\(2 \ell_{\texttt{prof}}(\boldsymbol{\theta}, \, \sigma^2) = \log \lvert\mathbf{C}\rvert + \mathbf{e}^\top \mathbf{C}^{1}\mathbf{e}\) 
Note that \(\widehat{\boldsymbol{\beta}}\) and \(\mathbf{e}\) depend on the covariance parameters as do the correlation or covariance matrix. The profile loglikelihood are given up to additive constants. The sum of squares \(S^2\) can be expressed as \(S^2 = \mathbf{y}^\top \mathring{\mathbf{B}} \mathbf{y}\) where \(\mathring{\mathbf{B}} := \sigma^2 \mathbf{B}\) is a scaled version ot the Bending Energy matrix \(\mathbf{B}\).
Derivatives w.r.t. the parameters
In the three cases, the symbolic derivatives of the logprofile likelihood w.r.t. the parameters can be obtained by chain rule hence be used in the optimization routine.