A GP defines a prior over functions. After observing some function values, it can be converted into a posterior over functions.

A Gaussian process is a stochastic process β€” a collection of random variables indexed by time or space β€” such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

A GP is a stochastic process where every point \(\mathbf{x} \in \mathbb{R}^d\) is assigned a random variable \(f(\mathbf{x})\) and where the joint distribution of a finite number of these variables \(p(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))\) is Gaussian:

\begin{equation} p(\mathbf{f} \lvert \mathbf{X}) = \mathcal{N}(\mathbf{f} \lvert \boldsymbol\mu, \mathbf{K}) \tag{1}\label{eq1} \end{equation}

In Equation \((1)\), \(\mathbf{f} = (f(\mathbf{x}_1),…,f(\mathbf{x}_N))\), \(\boldsymbol\mu = (m(\mathbf{x}_1),…,m(\mathbf{x}_N))\) and \(K_{ij} = \kappa(\mathbf{x}_i,\mathbf{x}_j)\). \(m\) is the mean function, and it is common to use \(m(\mathbf{x}) = 0\) since GPs are flexible enough to model the mean arbitrarily well. \(\kappa\) is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (smoothness, …) is defined by \(\mathbf{K}\). If points \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are considered to be similar by the kernel, the function values at these points, \(f(\mathbf{x}_i)\) and \(f(\mathbf{x}_j)\), can be expected to be similar too.
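
As a concrete illustration, here is a minimal sketch in NumPy (the kernel choice, function names and parameter values are my own, not fixed by the text above) that builds \(\mathbf{K}\) with a squared exponential kernel and draws sample functions from the prior in Equation \((1)\):

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    """Squared exponential (RBF) kernel ΞΊ(x_i, x_j) = Οƒ_fΒ² exp(-||x_i - x_j||Β² / (2 lΒ²))."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 / length_scale**2 * sqdist)

# Finite set of inputs at which the prior p(f | X) = N(f | 0, K) is evaluated
X = np.linspace(-5, 5, 100).reshape(-1, 1)
K = rbf_kernel(X, X)

# Draw three sample functions from the prior (mean function m(x) = 0);
# the small jitter keeps the covariance numerically positive definite
samples = np.random.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)), size=3)
```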

A GP prior \(p(\mathbf{f} \lvert \mathbf{X})\) can be converted into a GP posterior \(p(\mathbf{f} \lvert \mathbf{X},\mathbf{y})\) after having observed some data \(\mathbf{y}\). The posterior can then be used to make predictions \(\mathbf{f}_*\) given new input \(\mathbf{X}_*\):

\begin{align*} p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{X},\mathbf{y}) &= \int{p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{f})p(\mathbf{f} \lvert \mathbf{X},\mathbf{y})}\ d\mathbf{f} \\
&= \mathcal{N}(\mathbf{f}_* \lvert \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)\tag{2}\label{eq2} \end{align*}

Equation \((2)\) is the posterior predictive distribution, which is also a Gaussian, with mean \(\boldsymbol{\mu}_*\) and covariance matrix \(\boldsymbol{\Sigma}_*\). By definition of the GP, the joint distribution of observed data \(\mathbf{y}\) and predictions \(\mathbf{f}_*\) is

$$
\begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N} \left(\boldsymbol{0}, \begin{pmatrix}\mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix}\right)\tag{3}\label{eq3}
$$

With \(N\) training data and \(N_*\) new input data, \(\mathbf{K}_y = \kappa(\mathbf{X},\mathbf{X}) + \sigma_y^2\mathbf{I} = \mathbf{K} + \sigma_y^2\mathbf{I}\) is \(N \times N\), \(\mathbf{K}_* = \kappa(\mathbf{X},\mathbf{X}_*)\) is \(N \times N_*\) and \(\mathbf{K}_{**} = \kappa(\mathbf{X}_*,\mathbf{X}_*)\) is \(N_* \times N_*\). \(\sigma_y^2\) is the noise term in the diagonal of \(\mathbf{K_y}\). It is set to zero if training targets are noise-free and to a value greater than zero if observations are noisy. The mean is set to \(\boldsymbol{0}\) for notational simplicity. The sufficient statistics of the posterior predictive distribution, \(\boldsymbol{\mu}_*\) and \(\boldsymbol{\Sigma}_*\), can be computed with

\begin{align*} \boldsymbol{\mu}_* &= \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y}\tag{4}\label{eq4} \\
\boldsymbol{\Sigma}_* &= \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_*\tag{5}\label{eq5} \end{align*}
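
A minimal sketch of Equations \((4)\) and \((5)\), reusing `rbf_kernel` and `X` from the snippet above (the helper name and example data are mine; a more robust implementation would solve with a Cholesky factorization instead of forming \(\mathbf{K}_y^{-1}\) explicitly):

```python
def posterior_predictive(X_s, X_train, y_train, length_scale=1.0, sigma_f=1.0, sigma_y=1e-8):
    """Compute the sufficient statistics ΞΌ_* and Ξ£_* of the posterior predictive
    distribution, following Equations (4) and (5)."""
    K_y = rbf_kernel(X_train, X_train, length_scale, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_s, length_scale, sigma_f)    # K_*,  N x N_*
    K_ss = rbf_kernel(X_s, X_s, length_scale, sigma_f)       # K_**, N_* x N_*
    K_y_inv = np.linalg.inv(K_y)

    mu_s = K_s.T @ K_y_inv @ y_train                 # Equation (4)
    cov_s = K_ss - K_s.T @ K_y_inv @ K_s             # Equation (5)
    return mu_s, cov_s

# Example usage with a handful of noisy observations of sin(x)
X_train = np.array([[-4.0], [-2.0], [0.0], [2.0], [4.0]])
y_train = np.sin(X_train).ravel() + 0.1 * np.random.randn(len(X_train))
mu_s, cov_s = posterior_predictive(X, X_train, y_train, sigma_y=0.1)
```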

The optimal hyperparameters of the GP (e.g. the length scales of a squared exponential kernel) are obtained by maximising the *log marginal likelihood* given by

\[ \log p(\mathbf{y} \lvert \mathbf{X}) = \log \mathcal{N}(\mathbf{y} \lvert \boldsymbol{0},\mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} -\frac{1}{2} \log \lvert\mathbf{K}_y\rvert -\frac{N}{2} \log(2\pi) \tag{6} \]
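
As a sketch of how this is typically done (building on the snippets above; a numerically stable variant would work with a Cholesky factor of \(\mathbf{K}_y\) rather than its inverse), the negative log marginal likelihood can be handed to a standard optimizer such as `scipy.optimize.minimize`:

```python
from scipy.optimize import minimize

def nll_fn(X_train, y_train, sigma_y):
    """Return the negative log marginal likelihood -log p(y | X) as a function of
    theta = [length_scale, sigma_f], following Equation (6)."""
    y = y_train.ravel()
    N = len(X_train)

    def nll(theta):
        K_y = (rbf_kernel(X_train, X_train, length_scale=theta[0], sigma_f=theta[1])
               + sigma_y**2 * np.eye(N))
        return (0.5 * y @ np.linalg.inv(K_y) @ y
                + 0.5 * np.linalg.slogdet(K_y)[1]    # log |K_y|, safer than log(det(K_y))
                + 0.5 * N * np.log(2 * np.pi))
    return nll

# Maximise the log marginal likelihood w.r.t. length scale and signal std. dev.
res = minimize(nll_fn(X_train, y_train, sigma_y=0.1), x0=[1.0, 1.0],
               bounds=((1e-5, None), (1e-5, None)), method='L-BFGS-B')
length_scale_opt, sigma_f_opt = res.x
```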

Nice resources

Advantages

Disadvantages

GPs are a Β§nonparametric_method.


Key advantages summarized from https://mlss2011.comp.nus.edu.sg/uploads/Site/lect1gp.pdf: (1) With GPs you save yourself the grid search for the kernel parameters; GPs offer a likelihood that can be maximized to determine the kernel parameters.

(2) The same applies to regularization, if you need it.

(3) You get a framework for feature selection via automatic relevance determination (see the sketch below).
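
For (3), a minimal sketch (my own naming, by analogy with the `rbf_kernel` above) of an ARD squared exponential kernel with one length scale per input dimension; after fitting, dimensions whose learned length scale becomes very large have little effect on the covariance and are effectively switched off:

```python
def rbf_ard_kernel(X1, X2, length_scales, sigma_f=1.0):
    """Squared exponential kernel with automatic relevance determination (ARD):
    each input dimension gets its own length scale. Learning a large length scale
    for a dimension effectively removes that feature from the model."""
    X1s = X1 / length_scales       # scale every dimension by its own length scale
    X2s = X2 / length_scales
    sqdist = np.sum(X1s**2, 1).reshape(-1, 1) + np.sum(X2s**2, 1) - 2 * X1s @ X2s.T
    return sigma_f**2 * np.exp(-0.5 * sqdist)
```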