“I can’t believe it’s not Bayesian” - Chelsea Finn, ICML 2019 meta-learning workshop

*This post is also available as Reveal.JS slides 📺 (press the right arrow key \(\rightarrow\) to advance).*

## Agenda 🔗

- Meta-learning: why, what, and how?
- Using Bayesian principles in meta-learning.
- A few examples, fast and furious edition:
- Model-agnostic meta-learning (MAML) as hierarchical Bayes
- Neural Process (NP)
- Deep meta-learning GPs

## Why meta-learning? 🔗

- It is more
**human/animal-like 👪🐕**: humans can learn from a rich ensemble of partially related tasks, extracting shared information from them and applying that on new tasks with few samples (Lake et al., 2016)- Learning-to-learn has been studied in cognitive science (Lake et al., 2015) and psychology (Hospedales et al., 2020), (Griffiths et al., 2019),

- It seeks to address
**data-hungry**💸 supervised deep learning.**Data efficiency**using prior knowledge transferred from related tasks.

- Successful applications in few-shot image recognition, data efficient reinforcement learning (RL), and neural architecture search (NAS).
- EfficientNet’s, current SoTA beating ResNet, are found via NAS.

## What is meta-learning? 🔗

- Actually, you already know it—it’s a broad framework that encompasses a commonplace machine learning (ML) practice:
**hyperparameter search**.

### But really, what is it? 🔗

- Difficult to define, as it has been used in different ways, but a good start is:

The salient characteristic of contemporary neural-network meta-learning is an explicitly defined meta-level objective, and end-to-end optimization of the inner algorithm with respect to this objective (Hospedales et al., 2020)

### Conventional vs meta-learning 🔗

- a): Single dataset \(\mathcal{D} = (x, y)_{i=1}^n, y_i = f_{\theta, \color{blue}{ \omega}}(x_i )\)
- Learn \(f_{\theta, \color{blue}{ \omega}}:\mathcal{X} \rightarrow \mathcal{Y}\) that is
*explicitly parameterized*by \(\theta\) (e.g. neural network weights) and*implicitly pre-specified*by \(\color{blue}{\text{a fixed } \omega}\) (e.g. learning rate, optimizer). - Optimize for \(\theta^* = \arg \min_\theta \mathcal{L}_\theta(\mathcal{D}; \theta, \color{blue}{\omega})\).

- Learn \(f_{\theta, \color{blue}{ \omega}}:\mathcal{X} \rightarrow \mathcal{Y}\) that is
- b): Dataset of datasets \(\{\mathcal{D}_t \}_{t=1}^{3}\) from task distribution \(p(\mathcal{T}), \mathcal{T} = {\mathcal{D}, \mathcal{L}}\).
- Objective: \(\color{blue}{\text{A learnable }\omega}\) is optimized over \(p(\mathcal{T})\), i.e. \(\displaystyle \color{blue}{\omega^*} = \min _{\color{blue}{\omega}} \underset{\tau \sim p(\mathcal{T})}{\mathbb{E}} \mathcal{L}(\mathcal{D} ; \color{blue}{\omega})\).
- Meta-knowledge \(\color{blue}{\omega}\) specifies
**“how to learn”**\(\theta\). - E.g. Shared meta-knowledge \(\color{blue}{\omega}\) can encode the family of sine functions (everything but phase and amplitude) while \(\theta\) encodes the phase and amplitude.

### Two phases of meta-learning methods 🔗

- Generally split into two phases (Hospedales et al., 2020):
- Meta-training on \(\mathscr{D}_{\text {source}}=\left\{\left(\mathcal{D}_{\text {source}}^{\text {train}}, \mathcal{D}_{\text {source}}^{\text {val}}\right)^{(i)}\right\}_{i=1}^{M}\) entails \(\omega^{*}=\arg \max _{\omega} \log p\left(\omega | \mathscr{D}_{\text {source }}\right)\)
- Meta-testing (online adaptation) on \(\mathscr{D}_{\text {target}}=\left\{\left(\mathcal{D}_{\text {target}}^{\text {train}}, \mathcal{D}_{\text {target}}^{\text {test}}\right)^{(i)}\right\}_{i=1}^{Q}\) entails \(\theta^{*(i)}=\arg \max _{\theta} \log p\left(\theta | \omega^{*}, \mathcal{D}_{\text {target}}^{\text {train}}(i)\right)\). Evaluation done on \(\mathcal{D}_{\text {target}}^{\text {test}}\).

## MAML: Model-agnostic meta-learning 🔗

- MAML (Finn et al., 2017) is a simple and popular approach of two phases:
- During meta-training (bold line), learn a good weight initialization \(\color{blue}{\omega^*}\) for fine-tuning on target tasks.
- During meta-testing (dashed lines), find the optimal \(\theta^*_i\) for each new task \(i\).

- Good performance
- MAML substantially outperformed other approaches on few-shot image classification (Omniglot, MiniImageNet2), and improved adaptability of RL agent.

#### A little side-note 🔗

*Feature reuse*, not*rapid learning*, is the dominant component in MAML.

#### A little side-note: “Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML” 🔗

*Feature reuse*, not*rapid learning*, is the dominant component in MAML.- Used CCA and CKA to study the learnt representations.
- Led to a simplified algorithm, ANIL, which almost completely removes the inner optimization loop with no reduction in performance.
- Benchmark datasets (e.g. Omniglot, MiniImageNet2) are
**artifically segmented**from the same dataset, hence it might be very easy to reuse features. Interesting to consider less similar tasks (e.g. Meta-dataset, Triantafillou et al. 2019).

### Concrete examples of \(\color{blue}{\omega}\) and \(\theta\) 🔗

Conventional ML: \(\color{blue}{\text{fixed }\omega}\).

Shared meta-knowledge \(\color{blue}{\omega}\) | Task-specific \(\theta\) | |
---|---|---|

NN | Hyperparameters (e.g. learning rate, weight initialization scheme, optimizer, architecture design) | Network weights |

Meta-learning: \(\color{blue}{\text{learnt }\omega}\) from \(\mathcal{D}_{\mathrm{source}}\).

Shared meta-knowledge \(\color{blue}{\omega}\) | Task-specific \(\theta\) | |
---|---|---|

Hyperopt | Hyperparameters (e.g. learning rate) | Network weights |

MAML | Network weights (initialization learnt from \(\mathcal{D}_{\mathrm{source}}\)) | Network weights (tuned on \(\mathcal{D}_{\mathrm{target}}\)) |

NP | Network weights | Aggregated target context [latent] representation |

Meta-GP | Deep mean/kernel function parameters | None (a GP is fit on \(\mathcal{D}_{\mathrm{target}}^{\mathrm{train}}\)) |

## A \(\color{blue}{\omega}\) by any other name… 🌹 🔗

- \(\color{blue}{\omega}\) is shared/task-general parameters that work well across different tasks.
- Examples of parameterized task-general components include a metric space (Vinyals et al., 2016), an RNN (Duan et al., 2016), a memory-augmented NN (Santoro et al., 2016).

- Again, meta-knowledge \(\color{blue}{\omega}\) specifies
**“how to learn”**\(\theta\). - \(\color{blue}{\omega}\) is a starting inductive bias for new tasks from old tasks; equivalent to ‘learning a prior’.
- \(\implies\) There is a loose analogy between any meta-learning approach and hierarchical Bayesian inference (Griffiths et al., 2019).
- Bayesian inference generically indicates how a learner should combine data with a prior distribution over hypotheses
- Hierarchical Bayesian inference for meta-learning learns that prior through experience (data from related tasks).

## Bayesian meta-learning 🔗

- Opens opportunities to (Griffiths et al., 2019):
- Translate cognitive science insights, which has focused on hierarchical Bayesian models (HBMs), to ML.
- Use probabilistic generative models from Bayesian deep learning toolbox for meta-learning.

- Useful for safety-critical few-shot learning (e.g. medical imaging), active learning, and exploration in RL (à la Marc Deisenroth’s talk on probabilistic RL).

## MAML as a HBM 🔗

- Grant et al. (2018) (Grant et al., 2018) show that:
- The few steps of gradient descent by the task-specific learners result in \(\theta^*\), which is an approximation to the Bayesian estimate of \(\theta\) for that task, with a prior that depends on the initial parameterization \(\color{blue}{\omega^*}\) .
- \(\implies\) Learning \(\color{blue}{\omega}\) is equivalent to learning a prior.

## LVM + amortized VI for meta-learning 🔗

- Standard VAE optimizes ELBO \(p(\mathbf{x}) \geq \text{ELBO} = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})} \left[ \log p_\theta(\mathbf{x}\vert\mathbf{z}) \right] - D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}))\)
- Neural Processes (Garnelo et al., 2018), Versa (Jonathan Gordon et al., 2019) etc. take inspiration from VAE for meta-learning by treating \(\theta\) as \(\mathbf{z}\).
- \(\log p\left(\mathbf{Y}_{T} \mid \mathbf{X}_{T}, \mathbf{X}_{C}, \mathbf{Y}_{C}\right) \geq \mathbb{E}_{q\left(\mathbf{z} \mid \mathbf{s}_{T}\right)}\left[\log p\left(\mathbf{Y}_{T} \mid \mathbf{X}_{T}, \mathbf{r}_{C}, \mathbf{z}\right)\right] - D_{\mathrm{KL}}\left(q\left(\mathbf{z} \mid \mathbf{s}_{C}, \mathbf{s}_{T}\right) \| q\left(\mathbf{z} \mid \mathbf{s}_{C}\right)\right)\)

## GPs for meta-learning 🔗

- Use a neural network for the mean or kernel function (Fortuin & R\“atsch, 2019)

## What else can it be used for? 🔗

- RL agent trains on some mazes and is tested on unseen mazes generated by the same process (Duan et al., 2016), (Mishra et al., 2017)

### Many methods and applications 🔗

We can situate NP, MAML, and meta-learning deep mean/kernel GP within the meta-learning taxonomy across 3 independent axes (Hospedales et al., 2020).

## Conclusion 🔗

- The core idea of meta-learning is to optimize a model over a distribution of learning tasks \(p(\mathcal{T})\), rather than just a single task, with the goal of generalizing to other tasks from \(p(\mathcal{T})\).

## Recap of \(\color{blue}{\omega}\) and \(\theta\) 🔗

Conventional ML has fixed/pre-specified meta-knowledge \(\color{blue}{\omega}\).

Shared meta-knowledge \(\color{blue}{\omega}\) | Task-specific \(\theta\) | |
---|---|---|

NN | Hyperparameters (e.g. learning rate, weight initialization scheme, optimizer, architecture design) | Network weights |

Meta-learning challenges this assumption by learning a prior on \(\color{blue}{\omega}\) from \(\mathcal{D}_{\mathrm{source}}\)

Shared meta-knowledge \(\color{blue}{\omega}\) | Task-specific \(\theta\) | |
---|---|---|

Hyperopt | Hyperparameters (e.g. learning rate) | Network weights |

MAML | Network weights (initialization learnt from \(\mathcal{D}_{\mathrm{source}}\)) | Network weights (tuned on \(\mathcal{D}_{\mathrm{target}}\)) |

NP | Network weights | Aggregated target context [latent] representation |

Meta-GP | Deep mean/kernel function parameters | None (a GP is fit on \(\mathcal{D}_{\mathrm{target}}^{\mathrm{train}}\)) \(\bigstar\) |

\(\bigstar\) In GPs, while the data is used at test time, no optimization is really done since GPs are nonparametric, i.e. there’s no (or infinite) parameters to explicitly learn for meta-test time.

# Bibliography

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016), *Building machines that learn and think like people*, CoRR. ↩

Lake, B., Salakhutdinov, R., & Tenenbaum, J. (2015), [*Human-level concept learning through probabilistic program induction*](), Science. ↩

Hospedales, T., Antoniou, A., Micaelli, P., & Storkey, A. (2020), *Meta-learning in neural networks: a survey*, CoRR. ↩

Griffiths, T. L., Callaway, F., Chang, M. B., Grant, E., Krueger, P. M., & Lieder, F. (2019), [*Doing more with less: meta-reasoning and meta-learning in humans and machines*](), Current Opinion in Behavioral Sciences. ↩

Finn, C., Abbeel, P., & Levine, S., *Model-agnostic meta-learning for fast adaptation of deep networks*, In , International Conference on Machine Learning (ICML) (pp. 1126–1135) (2017). : . ↩

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., & others, , *Matching networks for one shot learning*, In , Advances in neural information processing systems (pp. 3630–3638) (2016). : . ↩

Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016), *Rl$^2$: fast reinforcement learning via slow reinforcement learning*, CoRR. ↩

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., & Lillicrap, T., *Meta-learning with memory-augmented neural networks*, In , International Conference on Machine Learning (ICML) (pp. 1842–1850) (2016). : . ↩

Grant, E., Finn, C., Levine, S., Darrell, T., & Griffiths, T. (2018), *Recasting Gradient-Based Meta-Learning As Hierarchical Bayes*, CoRR. ↩

Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S. M. A., & Teh, Y. W. (2018), [*Neural Processes*](), arXiv preprint: 1807.01622. ↩

Gordon, J., Bronskill, J., Bauer, M., Nowozin, S., & Turner, R. (2019), *Meta-learning probabilistic inference for prediction*, . ↩

Fortuin, V., & R\“atsch, Gunnar (2019), *Deep mean functions for meta-learning in gaussian processes*, CoRR. ↩

Mishra, N., Rohaninejad, M., Chen, X., & Abbeel, P., *A simple neural attentive meta-learner*, In , (pp. ) (2017} # booktitle # {Workshop on Meta-Learning, NeurIPS). : . ↩