A GP defines a prior over functions. After observing some function values, it can be converted into a posterior over functions.

A Gaussian process is a stochastic process β€” a collection of random variables indexed by time or space β€” such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

A GP is a stochastic process where every point \(\mathbf{x} \in \mathbb{R}^d\) is assigned a random variable \(f(\mathbf{x})\) and where the joint distribution of a finite number of these variables \(p(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))\) is Gaussian:

\begin{equation} p(\mathbf{f} \lvert \mathbf{X}) = \mathcal{N}(\mathbf{f} \lvert \boldsymbol\mu, \mathbf{K}) \tag{1}\label{eq1} \end{equation}

In Equation \((1)\), \(\mathbf{f} = (f(\mathbf{x}_1),…,f(\mathbf{x}_N))\), \(\boldsymbol\mu = (m(\mathbf{x}_1),…,m(\mathbf{x}_N))\) and \(K_{ij} = \kappa(\mathbf{x}_i,\mathbf{x}_j)\). \(m\) is the mean function, and it is common to use \(m(\mathbf{x}) = 0\) since GPs are flexible enough to model the mean arbitrarily well. \(\kappa\) is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (smoothness, …) is defined by \(\mathbf{K}\). If points \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are considered to be similar by the kernel, the function values at these points, \(f(\mathbf{x}_i)\) and \(f(\mathbf{x}_j)\), can be expected to be similar too.
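
As a concrete illustration, here is a minimal sketch in NumPy (the kernel choice, function names and parameter values are my own, not fixed by the text above) that builds \(\mathbf{K}\) with a squared exponential kernel and draws sample functions from the prior in Equation \((1)\):

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    """Squared exponential (RBF) kernel ΞΊ(x_i, x_j) = Οƒ_fΒ² exp(-||x_i - x_j||Β² / (2 lΒ²))."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 / length_scale**2 * sqdist)

# Finite set of inputs at which the prior p(f | X) = N(f | 0, K) is evaluated
X = np.linspace(-5, 5, 100).reshape(-1, 1)
K = rbf_kernel(X, X)

# Draw three sample functions from the prior (mean function m(x) = 0);
# the small jitter keeps the covariance numerically positive definite
samples = np.random.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)), size=3)
```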

A GP prior \(p(\mathbf{f} \lvert \mathbf{X})\) can be converted into a GP posterior \(p(\mathbf{f} \lvert \mathbf{X},\mathbf{y})\) after having observed some data \(\mathbf{y}\). The posterior can then be used to make predictions \(\mathbf{f}_*\) given new input \(\mathbf{X}_*\):

\begin{align*} p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{X},\mathbf{y}) &= \int{p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{f})p(\mathbf{f} \lvert \mathbf{X},\mathbf{y})}\ d\mathbf{f} \\
&= \mathcal{N}(\mathbf{f}_* \lvert \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)\tag{2}\label{eq2} \end{align*}

Equation \((2)\) is the posterior predictive distribution, which is also a Gaussian, with mean \(\boldsymbol{\mu}_*\) and covariance matrix \(\boldsymbol{\Sigma}_*\). By definition of the GP, the joint distribution of observed data \(\mathbf{y}\) and predictions \(\mathbf{f}_*\) is

$$
\begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N} \left(\boldsymbol{0}, \begin{pmatrix}\mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix}\right)\tag{3}\label{eq3}
$$

With \(N\) training data and \(N_*\) new input data, \(\mathbf{K}_y = \kappa(\mathbf{X},\mathbf{X}) + \sigma_y^2\mathbf{I} = \mathbf{K} + \sigma_y^2\mathbf{I}\) is \(N \times N\), \(\mathbf{K}_* = \kappa(\mathbf{X},\mathbf{X}_*)\) is \(N \times N_*\) and \(\mathbf{K}_{**} = \kappa(\mathbf{X}_*,\mathbf{X}_*)\) is \(N_* \times N_*\). \(\sigma_y^2\) is the noise term in the diagonal of \(\mathbf{K_y}\). It is set to zero if training targets are noise-free and to a value greater than zero if observations are noisy. The mean is set to \(\boldsymbol{0}\) for notational simplicity. The sufficient statistics of the posterior predictive distribution, \(\boldsymbol{\mu}_*\) and \(\boldsymbol{\Sigma}_*\), can be computed with

\begin{align*} \boldsymbol{\mu}_* &= \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y}\tag{4}\label{eq4} \\
\boldsymbol{\Sigma}_* &= \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_*\tag{5}\label{eq5} \end{align*}
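
A minimal sketch of Equations \((4)\) and \((5)\), reusing `rbf_kernel` and `X` from the snippet above (the helper name and example data are mine; a more robust implementation would solve with a Cholesky factorization instead of forming \(\mathbf{K}_y^{-1}\) explicitly):

```python
def posterior_predictive(X_s, X_train, y_train, length_scale=1.0, sigma_f=1.0, sigma_y=1e-8):
    """Compute the sufficient statistics ΞΌ_* and Ξ£_* of the posterior predictive
    distribution, following Equations (4) and (5)."""
    K_y = rbf_kernel(X_train, X_train, length_scale, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_s, length_scale, sigma_f)    # K_*,  N x N_*
    K_ss = rbf_kernel(X_s, X_s, length_scale, sigma_f)       # K_**, N_* x N_*
    K_y_inv = np.linalg.inv(K_y)

    mu_s = K_s.T @ K_y_inv @ y_train                 # Equation (4)
    cov_s = K_ss - K_s.T @ K_y_inv @ K_s             # Equation (5)
    return mu_s, cov_s

# Example usage with a handful of noisy observations of sin(x)
X_train = np.array([[-4.0], [-2.0], [0.0], [2.0], [4.0]])
y_train = np.sin(X_train).ravel() + 0.1 * np.random.randn(len(X_train))
mu_s, cov_s = posterior_predictive(X, X_train, y_train, sigma_y=0.1)
```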

The optimal hyperparameters of the GP (e.g. the length scales of a squared exponential kernel) are obtained by maximising the *log marginal likelihood* given by

\[ \log p(\mathbf{y} \lvert \mathbf{X}) = \log \mathcal{N}(\mathbf{y} \lvert \boldsymbol{0},\mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} -\frac{1}{2} \log \lvert\mathbf{K}_y\rvert -\frac{N}{2} \log(2\pi) \tag{6} \]
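
As a sketch of how this is typically done (building on the snippets above; a numerically stable variant would work with a Cholesky factor of \(\mathbf{K}_y\) rather than its inverse), the negative log marginal likelihood can be handed to a standard optimizer such as `scipy.optimize.minimize`:

```python
from scipy.optimize import minimize

def nll_fn(X_train, y_train, sigma_y):
    """Return the negative log marginal likelihood -log p(y | X) as a function of
    theta = [length_scale, sigma_f], following Equation (6)."""
    y = y_train.ravel()
    N = len(X_train)

    def nll(theta):
        K_y = (rbf_kernel(X_train, X_train, length_scale=theta[0], sigma_f=theta[1])
               + sigma_y**2 * np.eye(N))
        return (0.5 * y @ np.linalg.inv(K_y) @ y
                + 0.5 * np.linalg.slogdet(K_y)[1]    # log |K_y|, safer than log(det(K_y))
                + 0.5 * N * np.log(2 * np.pi))
    return nll

# Maximise the log marginal likelihood w.r.t. length scale and signal std. dev.
res = minimize(nll_fn(X_train, y_train, sigma_y=0.1), x0=[1.0, 1.0],
               bounds=((1e-5, None), (1e-5, None)), method='L-BFGS-B')
length_scale_opt, sigma_f_opt = res.x
```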

Nice resources

Advantages

Disadvantages

GPs are a Β§nonparametric_method.


Key advantages summarized from https://mlss2011.comp.nus.edu.sg/uploads/Site/lect1gp.pdf: (1) With GPs you save yourself the grid search for the kernel parameters; GPs offer a likelihood that can be maximized to determine the kernel parameters.

(2) The same applies to regularization, if you need it.

(3) You get a framework for feature selection via automatic relevance determination (see the sketch below).
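
For (3), a minimal sketch (my own naming, by analogy with the `rbf_kernel` above) of an ARD squared exponential kernel with one length scale per input dimension; after fitting, dimensions whose learned length scale becomes very large have little effect on the covariance and are effectively switched off:

```python
def rbf_ard_kernel(X1, X2, length_scales, sigma_f=1.0):
    """Squared exponential kernel with automatic relevance determination (ARD):
    each input dimension gets its own length scale. Learning a large length scale
    for a dimension effectively removes that feature from the model."""
    X1s = X1 / length_scales       # scale every dimension by its own length scale
    X2s = X2 / length_scales
    sqdist = np.sum(X1s**2, 1).reshape(-1, 1) + np.sum(X2s**2, 1) - 2 * X1s @ X2s.T
    return sigma_f**2 * np.exp(-0.5 * sqdist)
```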