A GP defines a prior over functions. After observing some function values, it can be converted into a posterior over functions.

A Gaussian process is a stochastic process — a collection of random variables indexed by time or space — such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

A GP is a stochastic process where any point $$\mathbf{x} \in \mathbb{R}^d$$ is assigned a random variable $$f(\mathbf{x})$$ and where the joint distribution of a finite number of these variables $$p(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))$$ is itself Gaussian:

$$p(\mathbf{f} \lvert \mathbf{X}) = \mathcal{N}(\mathbf{f} \lvert \boldsymbol\mu, \mathbf{K})\tag{1}\label{eq1}$$

In Equation $$(1)$$, $$\mathbf{f} = (f(\mathbf{x}_1),\ldots,f(\mathbf{x}_N))$$, $$\boldsymbol\mu = (m(\mathbf{x}_1),\ldots,m(\mathbf{x}_N))$$ and $$K_{ij} = \kappa(\mathbf{x}_i,\mathbf{x}_j)$$. $$m$$ is the mean function; it is common to use $$m(\mathbf{x}) = 0$$ since GPs are flexible enough to model the mean arbitrarily well. $$\kappa$$ is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (smoothness, ...) is determined by $$\mathbf{K}$$. If points $$\mathbf{x}_i$$ and $$\mathbf{x}_j$$ are considered similar by the kernel, the function values at these points, $$f(\mathbf{x}_i)$$ and $$f(\mathbf{x}_j)$$, can be expected to be similar too.
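To make this concrete, here is a minimal NumPy sketch that evaluates the prior of Equation $$(1)$$ at a finite set of points and draws function samples from it. It uses the squared exponential kernel $$\kappa(\mathbf{x}_i,\mathbf{x}_j) = \sigma_f^2 \exp\left(-\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2l^2}\right)$$, a common choice; the kernel choice and the parameters `l` (length scale) and `sigma_f` (signal standard deviation) are illustrative assumptions, not fixed by anything above.

```python
import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    """Squared exponential kernel: covariance matrix between the rows
    of X1 (m x d) and the rows of X2 (n x d)."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

# Finite set of input points at which the prior is evaluated.
X = np.arange(-5, 5, 0.2).reshape(-1, 1)

# Mean vector and covariance matrix of Equation (1), with m(x) = 0.
mu = np.zeros(X.shape[0])
K = kernel(X, X)

# Draw three sample functions from the GP prior.
samples = np.random.multivariate_normal(mu, K, 3)
```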

A GP prior $$p(\mathbf{f} \lvert \mathbf{X})$$ can be converted into a GP posterior $$p(\mathbf{f} \lvert \mathbf{X},\mathbf{y})$$ after having observed some data $$\mathbf{y}$$. The posterior can then be used to make predictions $$\mathbf{f}_*$$ given new input $$\mathbf{X}_*$$:

\begin{align*} p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{X},\mathbf{y}) &= \int{p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{f})p(\mathbf{f} \lvert \mathbf{X},\mathbf{y})}\ d\mathbf{f} \\
&= \mathcal{N}(\mathbf{f}_* \lvert \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)\tag{2}\label{eq2} \end{align*}

Equation $$(2)$$ is the posterior predictive distribution, which is also a Gaussian with mean $$\boldsymbol{\mu}_*$$ and covariance matrix $$\boldsymbol{\Sigma}_*$$. By definition of the GP, the joint distribution of observed data $$\mathbf{y}$$ and predictions $$\mathbf{f}_*$$ is

$$\begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N}\left(\boldsymbol{0}, \begin{pmatrix}\mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix}\right)\tag{3}\label{eq3}$$

With $$N$$ training data and $$N_*$$ new input data, $$\mathbf{K}_y = \kappa(\mathbf{X},\mathbf{X}) + \sigma_y^2\mathbf{I} = \mathbf{K} + \sigma_y^2\mathbf{I}$$ is $$N \times N$$, $$\mathbf{K}_* = \kappa(\mathbf{X},\mathbf{X}_*)$$ is $$N \times N_*$$ and $$\mathbf{K}_{**} = \kappa(\mathbf{X}_*,\mathbf{X}_*)$$ is $$N_* \times N_*$$. $$\sigma_y^2$$ is the noise term on the diagonal of $$\mathbf{K}_y$$. It is set to zero if training targets are noise-free and to a value greater than zero if observations are noisy. The mean is set to $$\boldsymbol{0}$$ for notational simplicity. The sufficient statistics of the posterior predictive distribution, $$\boldsymbol{\mu}_*$$ and $$\boldsymbol{\Sigma}_*$$, can be computed with

\begin{align*} \boldsymbol{\mu}_* &= \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y}\tag{4}\label{eq4} \\
\boldsymbol{\Sigma}_* &= \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_*\tag{5}\label{eq5} \end{align*}
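
Equations $$(4)$$ and $$(5)$$ translate directly into code. The following sketch reuses the `kernel` function from the prior example above; it inverts $$\mathbf{K}_y$$ explicitly for readability, although solving a linear system via a Cholesky decomposition would be numerically preferable.

```python
from numpy.linalg import inv

def posterior(X_s, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
    """Sufficient statistics of the posterior predictive distribution
    at new inputs X_s, given training data X_train, Y_train."""
    K_y = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X_s, l, sigma_f)      # K_*
    K_ss = kernel(X_s, X_s, l, sigma_f)         # K_**
    K_y_inv = inv(K_y)

    mu_s = K_s.T.dot(K_y_inv).dot(Y_train)      # Equation (4)
    cov_s = K_ss - K_s.T.dot(K_y_inv).dot(K_s)  # Equation (5)
    return mu_s, cov_s
```

The default `sigma_y=1e-8` is a small jitter value that keeps $$\mathbf{K}_y$$ well-conditioned in the noise-free case; for noisy observations it would be set to the noise standard deviation.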



The optimal hyperparameters of the GP (e.g. the length scale of a squared exponential kernel) are obtained by maximising the *log marginal likelihood*, given by

$$\log p(\mathbf{y} \lvert \mathbf{X}) = \log \mathcal{N}(\mathbf{y} \lvert \boldsymbol{0},\mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2} \log \lvert\mathbf{K}_y\rvert - \frac{N}{2} \log(2\pi)\tag{6}\label{eq6}$$
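
As a sketch of how this optimisation might look in practice, the following minimises the negative of Equation $$(6)$$ with respect to the kernel parameters $$l$$ and $$\sigma_f$$ using `scipy.optimize.minimize`, reusing `kernel`, `inv` and `np` from the examples above. The toy training data and the fixed noise level `sigma_y` are illustrative assumptions, not part of the derivation.

```python
from scipy.optimize import minimize

def nll_fn(X_train, Y_train, sigma_y=0.1):
    """Returns a function computing the negative log marginal likelihood,
    Equation (6), for kernel parameters theta = [l, sigma_f]."""
    Y = Y_train.ravel()
    def nll(theta):
        l, sigma_f = theta
        K_y = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
        # np.linalg.slogdet would be more stable than det for larger N.
        return 0.5 * Y.dot(inv(K_y)).dot(Y) + \
               0.5 * np.log(np.linalg.det(K_y)) + \
               0.5 * len(X_train) * np.log(2 * np.pi)
    return nll

# Hypothetical training data, for demonstration only.
X_train = np.array([[-4.0], [-3.0], [-2.0], [-1.0], [1.0]])
Y_train = np.sin(X_train)

# Minimise the negative log marginal likelihood w.r.t. l and sigma_f.
res = minimize(nll_fn(X_train, Y_train), x0=[1.0, 1.0],
               bounds=((1e-5, None), (1e-5, None)), method='L-BFGS-B')
l_opt, sigma_f_opt = res.x
```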