BOLFI 🔗

“Bayesian Optimization for Likelihood-Free Inference of Simulator-Based Statistical Models” (2016) by Michael U. Gutmann and Jukka Corander.

Available as 🔗 presentation slides or as a 🔗 blogpost.

Problem 🔗

• Topic: Statistical inference for models where:
• The likelihood $$p_{\mathbf{y}|\mathbf{\theta}}$$ is intractable 🔮 (e.g. analytical form is unknown/costly to compute).
• $$\implies$$ inference with likelihood $$\mathcal{L}(\mathbf{\theta}) = p_{\mathbf{y}|\mathbf{\theta}}(\mathbf{y}_o | \mathbf{\theta})$$ is not possible.
• Simulating data 🤖 from the model is possible.
• Using a simulator-based statistical model (implicit/generative model), i.e. a family of pdfs $$\left\{p_{\mathbf{y} \mid \mathbf{\theta}} \right\}_{\mathbf{\theta}}$$ of unknown analytical form which allows for exact sampling of data $$\mathbf{y} \sim p(\mathbf{y}|\mathbf{\theta})$$
• Inference principle: Find parameter values $$\theta$$ for which there is a small distance between
• the simulated data $$\mathbf{y}$$, and
• the observed data $$\mathbf{y}_o$$.

Other assumptions 🔗

• Only a small number of parameters are of interest ($$\theta$$ is low-dimensional)
• Data generating process can be very complex

Existing methods 🔗

For likelihood-free inference with simulator-based models, the basic idea is to identify model parameters by finding values which yield simulated data that resemble the observed data.

• “Approximate Bayesian computation” (ABC), originated in genetics 🔬
• “Indirect inference”, originated in economics 📈
• “Synthetic likelihood”, originated in ecology 🌲

Conventional ABC 🔗

• “Bayesian forward modeling”, i.e. likelihood-free rejection sampling (LFRS)
• Let $$\mathbf{y}_o$$ be the observed data. For many iterations:
1. Sample $$\theta$$ from proposal distribution $$q(\theta)$$.
2. Sample $$\mathbf{y}|\theta$$ according to the data model.
3. Compute distance $$d(\mathbf{y}, \mathbf{y}_o)$$
4. Keep if $$d(\mathbf{y}, \mathbf{y}_o) \leq \epsilon$$; discard otherwise.
• Different $$q(\theta)$$ for different algorithms
• If $$\epsilon$$ is small enough, kept samples are samples from an approximate posterior
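
The loop above can be sketched for a toy model in which the data are Gaussian draws and the distance compares sample means. The simulator, the uniform proposal $$q(\theta)$$, and all numerical choices below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulator: data are 20 draws from N(theta, 1).
def simulate(theta, n=20):
    return rng.normal(theta, 1.0, size=n)

# Distance between datasets: absolute difference of sample means.
def distance(y, y_obs):
    return abs(y.mean() - y_obs.mean())

y_obs = rng.normal(2.0, 1.0, size=20)   # "observed" data, true theta = 2

def lfrs(y_obs, n_iter=5000, eps=0.1):
    kept = []
    for _ in range(n_iter):
        theta = rng.uniform(-5.0, 5.0)   # 1. sample theta from the proposal q
        y = simulate(theta)              # 2. simulate data given theta
        if distance(y, y_obs) <= eps:    # 3.-4. keep only close samples
            kept.append(theta)
    return np.array(kept)

posterior_samples = lfrs(y_obs)
```

The kept values of $$\theta$$ concentrate around the true parameter; shrinking `eps` tightens the approximation but rejects ever more samples.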

Implicit likelihood approximation 🔗

There is an implicit likelihood approximation going on here.

• Compute the likelihood (probability of generating data like $$\mathbf{y}_o$$ given hypothesis $$\theta$$) empirically
• Proportion of kept samples:
• $$L(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(d\left(\mathbf{y}^{(i)}, \mathbf{y}_o\right) \leq \epsilon\right)$$
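
A minimal sketch of this empirical estimate on a toy Gaussian-mean model (the simulator, distance, and grid of $$\theta$$ values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy simulator: 20 draws from N(theta, 1); distance compares sample means.
def simulate(theta, n=20):
    return rng.normal(theta, 1.0, size=n)

y_obs = rng.normal(2.0, 1.0, size=20)   # "observed" data, true theta = 2

def empirical_likelihood(theta, eps=0.2, n_sims=200):
    # L(theta) ~= (1/N) sum 1(d(y_i, y_obs) <= eps): proportion of kept samples
    dists = np.array([abs(simulate(theta).mean() - y_obs.mean())
                      for _ in range(n_sims)])
    return float((dists <= eps).mean())

thetas = np.linspace(0.0, 4.0, 9)
L_hat = [empirical_likelihood(t) for t in thetas]
```

The resulting curve peaks near the true parameter, but every grid point costs `n_sims` fresh simulations, which is the inefficiency BOLFI targets.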

Illustration of rejection sampling (Lintusaari et al., 2016)

Downsides 🔗

Inefficiencies of rejection sampling (Florent Leclercq, 2018)

$$L(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(d\left(\mathbf{y}^{(i)}, \mathbf{y}_o\right) \leq \epsilon\right)$$

1. Rejects most samples when $$\epsilon$$ is small.
2. Does not make assumptions about the shape of $$L(\theta)$$.
  1. The dependence of $$L$$ on $$\theta$$ enters only through the samples simulated at that $$\theta$$; moving $$\theta$$ slightly does not constrain the likelihood to be similar, i.e. no smoothness of $$L(\theta)$$ is assumed. This gives the algorithm a lot of freedom, but makes it expensive.
3. Does not use all information available.
  1. For example, one could stop early after many rejections and conclude with high certainty that the likelihood is low.
4. Aims at equal accuracy for all parameters.
  1. The computational effort $$N$$ does not depend on $$\theta$$.
  2. One could instead prioritize, e.g., the modal area.

BOLFI 🔗

Advantages of BOLFI (Florent Leclercq, 2018)

1. Does not reject samples; learns from them (i.e. builds a statistical model of the distances w.r.t. the parameters).
2. Models the distances, assuming the average distance is smooth.
3. Uses Bayes’ theorem to update the proposal of new points.
4. Prioritizes parameter regions with small distances.

Regressing effective likelihood 🔗

• Data are tuples $$(\theta_i, d_i)$$, where $$d_i = d(\mathbf{y}_\theta^{(i)}, \mathbf{y}_o)$$
• Model conditional distribution $$d \mid \theta$$
• Approximate likelihood function for some choice of $$\epsilon$$:

• Use the probability, under the estimated model, that the distance is smaller than $$\epsilon$$.

• $$\hat{L}(\theta) \propto \hat{P}(d\leq \epsilon \mid \theta)$$
• The choice of $$\epsilon$$ is delayed until the end, after the model has been learned.
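
Given a Gaussian model for the log distance at each $$\theta$$, with posterior mean $$\mu(\theta)$$ and variance $$v(\theta)$$, the approximate likelihood is a normal CDF evaluated at $$\log \epsilon$$. The $$\mu$$ and $$v$$ below are hypothetical stand-ins, not a fitted model:

```python
from math import erf, sqrt, log

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Hypothetical posterior summaries for log d | theta (stand-ins for a fitted model):
def mu(theta):
    return log(0.1 + (theta - 2.0) ** 2)   # mean log-distance, smallest near theta = 2

def v(theta):
    return 0.25                            # constant predictive variance (toy choice)

def L_hat(theta, eps=0.5):
    # P(d <= eps | theta) = P(log d <= log eps | theta) under the Gaussian model
    return norm_cdf((log(eps) - mu(theta)) / sqrt(v(theta)))

thetas = [0.0, 1.0, 2.0, 3.0, 4.0]
vals = [L_hat(t) for t in thetas]
```

Because `eps` enters only through `L_hat`, it can indeed be chosen, and changed, after the model has been learned, without any new simulations.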

• Fit a log Gaussian process (GP) to regress how the parameters affect the distances and use Bayesian optimization.
• Squared exponential covariance function
• Log transform because distances are non-negative
• Approach is not restricted to GPs
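
A self-contained sketch of GP regression on log distances with a squared exponential covariance (toy data and fixed hyperparameters; the paper learns these from the data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (theta_i, d_i) training pairs: distances are noisy and, in this
# toy setup, tend to be smallest near theta = 2.
X = rng.uniform(0.0, 4.0, size=40)
d = 0.1 + (X - 2.0) ** 2 + rng.gamma(2.0, 0.1, size=40)
y = np.log(d)                        # log transform since distances are non-negative

def sqexp(a, b, ell=0.5, sf2=1.0):
    # Squared exponential covariance: sf2 * exp(-(a - b)^2 / (2 ell^2))
    return sf2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

sn2 = 0.05                           # observation noise variance (toy choice)
K = sqexp(X, X) + sn2 * np.eye(len(X))
alpha = np.linalg.solve(K, y - y.mean())

grid = np.linspace(0.0, 4.0, 81)
Ks = sqexp(grid, X)
mean = y.mean() + Ks @ alpha                                        # posterior mean
var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)         # posterior variance
```

The posterior mean dips where simulated distances are small, and the posterior variance grows away from past samples; both feed into the acquisition function below.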

Data acquisition 🔗

• Use Bayesian optimization to prioritize regions of $$\theta$$ where $$d$$ tends to be small.
• Sample $$\theta$$ from an adaptively constructed proposal distribution built from an acquisition function, e.g. the lower confidence bound.
• $$\mathcal{A}_{t}(\theta)=\underbrace{\mu_{t}(\theta)}_{\text {post mean }}-\sqrt{\underbrace{\eta_{t}^{2}}_{\text {weight }} \underbrace{v_t(\theta)}_{\text {post var }}}$$, $$t$$: no. of samples acquired
• “We used the simple heuristic that $$\theta_{t+1}$$ is sampled from a Gaussian with diagonal covariance and mean equal to the minimizer of the acquisition function. The standard deviations were determined by finding the end-points of the interval where the acquisition function was within a certain (relative) tolerance.”
• Approach is not restricted to this acquisition function.
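
A sketch of the lower confidence bound on a grid, with hypothetical stand-ins for the GP posterior mean $$\mu_t$$ and variance $$v_t$$:

```python
import numpy as np

# Hypothetical GP posterior summaries over a theta grid (stand-ins for mu_t, v_t):
grid = np.linspace(0.0, 4.0, 201)
mu_t = 0.1 + (grid - 2.0) ** 2          # posterior mean of the distance, smallest at 2
v_t = 0.05 + 0.5 * (grid - 0.5) ** 2    # posterior variance, larger far from past samples
eta2 = 4.0                              # exploration weight eta_t^2

# Lower confidence bound: A_t(theta) = mu_t(theta) - sqrt(eta_t^2 * v_t(theta))
acq = mu_t - np.sqrt(eta2 * v_t)
theta_next = float(grid[acq.argmin()])
```

In this toy setup the minimizer lands between the posterior-mean minimum at $$\theta = 2$$ (pure exploitation) and the high-variance region toward $$\theta = 4$$ (pure exploration), which is exactly the trade-off the acquisition function encodes.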

Bayesian optimization in action 🔗

An animation of Bayesian optimization (Florent Leclercq, 2018)

Close the loop 🔗

High-level mechanism of Bayesian optimization (Florent Leclercq, 2018)

• Exploration: where the uncertainty is high
• Exploitation: where posterior mean is smallest
• Use Bayes’ theorem to update the model in light of new data
• As opposed to usual applications of Bayesian optimization, here the objective function is highly stochastic.

Results 🔗

• Roughly equal results with 1000x fewer simulations
• Monte Carlo ABC: 4.5 days with 300 cores
• BOLFI: 90 minutes with seven cores
• Data of bacterial infections in child care centers.
• Data generating process is defined via a latent continuous-time Markov chain and an observation model.
• “Developed by Numminen et al. (2013) to infer transmission dynamics of bacterial infections in day care centers.”

Figure 12 🔗

• Computational cost is shown on a log scale: 6 corresponds to $$10^6$$ simulations, 3 to $$10^3$$.
• Blue line: the reference, i.e. the posterior mean inferred at the very end of the standard PMC-ABC run.
• Red line: the best model-based posterior.
• Confidence intervals are wider than with the conventional approach.

• Inference is more efficient, enabling far more comprehensive data analysis.
• Enables inference for e.g. models of evolution where simulating a single data set takes 12-24 hours.
• Enables investigation of model identifiability, or the influence of distance measure on likelihood function.
• We get an explicit construction of an approximate likelihood function, whereas conventional ABC only gives an implicit one.

Further research 🔗

• Distance measures
• Acquisition function
• Efficient high-dimensional inference
• Leveraging further advances from Bayesian optimization

Conclusion 🔗

BOLFI combines …

1. Statistical modeling (GP regression) of the distance between observed and simulated data
2. Decision making under uncertainty (Bayesian optimization).

…to increase efficiency of inference by several orders of magnitude.

Bibliography

Lintusaari, J., Gutmann, M. U., Dutta, R., Kaski, S., & Corander, J. (2016), Fundamentals and Recent Developments in Approximate Bayesian Computation, Systematic Biology.

Leclercq, F. (2018). Bayesian optimisation for likelihood-free cosmological inference. Retrieved from http://www.florent-leclercq.eu/talks/2018-10-22_IHP.pdf. (Accessed on 11/03/2019. Recording at https://www.youtube.com/watch?v=orDbPZFd7Gk&t=41s.).