“I can’t believe it’s not Bayesian” - Chelsea Finn, ICML 2019 meta-learning workshop

This post is also available as Reveal.JS slides 📺 (press the right arrow key \(\rightarrow\) to advance).

Agenda

Why meta-learning?

What is meta-learning?

Figure 1: Hyperparameter searches (image source)

But really, what is it?

The salient characteristic of contemporary neural-network meta-learning is an explicitly defined meta-level objective, and end-to-end optimization of the inner algorithm with respect to this objective (Hospedales et al., 2020).
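
One common way to make this definition concrete is as a bilevel optimization problem, roughly following the formulation in Hospedales et al. (2020). The number of source tasks \(M\) and the loss names \(\mathcal{L}^{\mathrm{meta}}\) and \(\mathcal{L}^{\mathrm{task}}\) are notation assumed here for illustration:

\[
\color{blue}{\omega^*} = \arg\min_{\color{blue}{\omega}} \sum_{i=1}^{M} \mathcal{L}^{\mathrm{meta}}\left(\theta^{*(i)}(\color{blue}{\omega}), \color{blue}{\omega}, \mathcal{D}_{\mathrm{source}}^{\mathrm{val}\,(i)}\right)
\quad \text{s.t.} \quad
\theta^{*(i)}(\color{blue}{\omega}) = \arg\min_{\theta} \mathcal{L}^{\mathrm{task}}\left(\theta, \color{blue}{\omega}, \mathcal{D}_{\mathrm{source}}^{\mathrm{train}\,(i)}\right)
\]

The outer problem is the meta-level objective; the inner problem is the "inner algorithm" that is optimized end to end through its dependence on \(\color{blue}{\omega}\).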

Conventional vs meta-learning

Figure 2: Conventional vs meta-learning for 1D function regression (image source)

Two phases of meta-learning methods

Figure 3: “3-way-2-shot” (few-shot) classification. Each source task’s train set contains 3 classes of 2 examples each (Vinyals et al., 2016). Image modified from Borealis AI blogpost.

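
To illustrate what an "N-way-K-shot" task looks like in practice, here is a minimal sketch of sampling one such task from a pool of labelled examples. The function name `sample_episode`, the `n_query` parameter, and the toy `pool` are hypothetical and chosen for this example only; real few-shot benchmarks typically also keep the classes used at meta-train and meta-test time disjoint.

```python
import random
from collections import defaultdict

def sample_episode(labelled_examples, n_way=3, k_shot=2, n_query=1, rng=None):
    """Sample one N-way-K-shot task from (example, label) pairs.

    Returns (train_set, test_set): the train set holds n_way * k_shot
    examples, the test set holds n_way * n_query examples.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for example, label in labelled_examples:
        by_class[label].append(example)

    # Only classes with enough examples for both splits are eligible.
    eligible = [c for c, xs in by_class.items() if len(xs) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)

    train_set, test_set = [], []
    for episode_label, cls in enumerate(classes):
        # Labels are re-indexed to 0..n_way-1 within each task.
        picked = rng.sample(by_class[cls], k_shot + n_query)
        train_set += [(x, episode_label) for x in picked[:k_shot]]
        test_set += [(x, episode_label) for x in picked[k_shot:]]
    return train_set, test_set

# Toy pool (hypothetical): 5 classes with 4 examples each.
pool = [(f"img_{c}_{i}", c) for c in "abcde" for i in range(4)]
train_set, test_set = sample_episode(pool, rng=random.Random(0))
```

With the defaults, each call yields a "3-way-2-shot" train set of 6 examples, matching the figure.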

MAML: Model-agnostic meta-learning

Figure 4: Visualized MAML (image modified from BAIR blogpost).

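
The two nested loops of MAML can be sketched on a deliberately tiny problem: a scalar model \(\hat{y} = w x\) fit to tasks of the form \(y = a x\). This is an illustrative sketch under assumptions, not the setup from Finn et al. (2017): the task distribution, the step sizes `alpha` and `beta`, and all function names are made up for this example. Because the model is scalar and the loss is quadratic, the second-order term of the meta-gradient can be written out by hand.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Mean squared error of the model y_hat = w * x, and its gradient in w."""
    residual = w * x - y
    return np.mean(residual ** 2), 2.0 * np.mean(x * residual)

def inner_update(omega, x_train, y_train, alpha):
    """Inner loop: one gradient step from the shared init omega to task-specific theta."""
    _, grad = loss_and_grad(omega, x_train, y_train)
    return omega - alpha * grad

def maml_meta_gradient(omega, task, alpha):
    """Gradient of the post-adaptation validation loss with respect to omega."""
    x_train, y_train, x_val, y_val = task
    theta = inner_update(omega, x_train, y_train, alpha)
    val_loss, grad_theta = loss_and_grad(theta, x_val, y_val)
    # Chain rule through the inner step: d(theta)/d(omega) = 1 - alpha * Hessian,
    # where the Hessian of the train loss in w is 2 * mean(x^2).
    hessian = 2.0 * np.mean(x_train ** 2)
    return val_loss, grad_theta * (1.0 - alpha * hessian)

def sample_task(rng, n=10):
    """Hypothetical task distribution: lines through the origin with slope in [1, 3]."""
    slope = rng.uniform(1.0, 3.0)
    x = rng.uniform(-1.0, 1.0, size=2 * n)
    y = slope * x
    return x[:n], y[:n], x[n:], y[n:]

def meta_train(steps=500, tasks_per_step=5, alpha=0.1, beta=0.05, seed=0):
    """Outer loop: gradient descent on omega, averaged over a batch of tasks."""
    rng = np.random.default_rng(seed)
    omega = 0.0
    for _ in range(steps):
        grads = [maml_meta_gradient(omega, sample_task(rng), alpha)[1]
                 for _ in range(tasks_per_step)]
        omega -= beta * np.mean(grads)
    return omega

omega_star = meta_train()
```

In this toy setting the learnt initialization should settle near the middle of the slope range, since that is the point from which one inner step adapts best on average.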

A little side-note

A little side-note: “Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML”

Figure 5: Rapid learning entails efficient but significant change from \(\color{blue}{\omega^*}\) to \(\theta^*\); feature reuse is where \(\color{blue}{\omega^*}\) already provides high quality representations. Figure 1 from the paper.

Concrete examples of \(\color{blue}{\omega}\) and \(\theta\)

Conventional ML: \(\color{blue}{\text{fixed }\omega}\).

| | Shared meta-knowledge \(\color{blue}{\omega}\) | Task-specific \(\theta\) |
|---|---|---|
| NN | Hyperparameters (e.g. learning rate, weight initialization scheme, optimizer, architecture design) | Network weights |

Meta-learning: \(\color{blue}{\text{learnt }\omega}\) from \(\mathcal{D}_{\mathrm{source}}\).

| | Shared meta-knowledge \(\color{blue}{\omega}\) | Task-specific \(\theta\) |
|---|---|---|
| Hyperopt | Hyperparameters (e.g. learning rate) | Network weights |
| MAML | Network weights (initialization learnt from \(\mathcal{D}_{\mathrm{source}}\)) | Network weights (tuned on \(\mathcal{D}_{\mathrm{target}}\)) |
| NP | Network weights | Aggregated target context [latent] representation |
| Meta-GP | Deep mean/kernel function parameters | None (a GP is fit on \(\mathcal{D}_{\mathrm{target}}^{\mathrm{train}}\)) |

A \(\color{blue}{\omega}\) by any other name… 🌹

Bayesian meta-learning

MAML as a HBM

Figure 6: MAML and its corresponding probabilistic graphical model. Figure 2 from Griffiths et al. (2019).

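
As a rough sketch of the hierarchical Bayesian reading, in the spirit of Grant et al. (2018): the shared \(\color{blue}{\omega}\) parameterizes a prior over each task's \(\theta_i\), and the marginal likelihood of the source tasks integrates the task-specific parameters out. The task index \(i\), the number of tasks \(M\), and the step size \(\alpha\) are notation assumed here:

\[
p\left(\mathcal{D}_{\mathrm{source}} \mid \color{blue}{\omega}\right) = \prod_{i=1}^{M} \int p\left(\mathcal{D}^{(i)} \mid \theta_i\right) p\left(\theta_i \mid \color{blue}{\omega}\right) \mathrm{d}\theta_i
\]

Under this view, MAML can be read as replacing each integral with a point estimate obtained by one gradient step from \(\color{blue}{\omega}\), evaluated on held-out data:

\[
-\log p\left(\mathcal{D}_{\mathrm{source}} \mid \color{blue}{\omega}\right) \approx \sum_{i=1}^{M} -\log p\left(\mathcal{D}_{\mathrm{source}}^{\mathrm{val}\,(i)} \mid \hat{\theta}_i\right),
\qquad
\hat{\theta}_i = \color{blue}{\omega} + \alpha \nabla_{\omega} \log p\left(\mathcal{D}_{\mathrm{source}}^{\mathrm{train}\,(i)} \mid \color{blue}{\omega}\right)
\]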

LVM + amortized VI for meta-learning

Figure 7: Same at meta-train \(\left(\mathcal{D}_{\text {source}}^{\text {train}}, \mathcal{D}_{\text {source}}^{\text {val}}\right)\) and meta-test \(\left(\mathcal{D}_{\text {target}}^{\text {train}}, \mathcal{D}_{\text {target}}^{\text {test}}\right)\) time.

Figure 8: In an NP, meta-parameters \(\color{blue}{\omega}\) are the weights of the encoder and decoder NNs.

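
The split between \(\color{blue}{\omega}\) (encoder and decoder weights) and \(\theta\) (the aggregated context representation) can be sketched as a forward pass in NumPy. This is a simplified, assumption-laden sketch: it shows only a deterministic path with mean aggregation (closer to a conditional NP than to the latent-variable model), the weights are random rather than meta-trained, and the layer sizes and function names are invented for illustration.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random MLP weights; in an NP these would be meta-learnt (omega)."""
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    """Apply the MLP with tanh activations on all but the last layer."""
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)
    return h

def encode_context(encoder, x_ctx, y_ctx):
    """Task-specific theta: a permutation-invariant summary of the context set."""
    pairs = np.concatenate([x_ctx, y_ctx], axis=-1)   # (n_ctx, dx + dy)
    return mlp(encoder, pairs).mean(axis=0)           # (r_dim,)

def predict(decoder, r, x_tgt):
    """Decode a predictive mean and std at each target input, conditioned on r."""
    r_tiled = np.broadcast_to(r, (x_tgt.shape[0], r.shape[0]))
    out = mlp(decoder, np.concatenate([r_tiled, x_tgt], axis=-1))
    mean, raw_std = out[:, :1], out[:, 1:]
    return mean, 0.01 + np.log1p(np.exp(raw_std))     # softplus keeps std positive

rng = np.random.default_rng(0)
encoder = init_mlp(rng, [2, 16, 8])       # (x, y) -> per-point representation
decoder = init_mlp(rng, [8 + 1, 16, 2])   # (r, x_target) -> (mean, raw_std)

x_ctx = rng.uniform(-1.0, 1.0, (5, 1))
y_ctx = np.sin(3.0 * x_ctx)
x_tgt = np.linspace(-1.0, 1.0, 7).reshape(-1, 1)

r = encode_context(encoder, x_ctx, y_ctx)
mean, std = predict(decoder, r, x_tgt)
```

Note that adapting to a new task requires no gradient steps here: only `r` changes, while the weights stay fixed.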

GPs for meta-learning

Figure 9: Figure 1 from Fortuin et al. (2019). Corresponding GPflow code in purple.

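
The figure refers to GPflow code; as a library-free illustration of the same idea, here is a minimal NumPy sketch of GP regression with a non-zero prior mean function. The fixed sine `mean_function` is a hypothetical stand-in for a deep mean function whose parameters would be meta-learnt from \(\mathcal{D}_{\mathrm{source}}\), and the kernel hyperparameters are arbitrary assumptions.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.5, variance=1.0):
    """Squared-exponential kernel between two 1D input arrays."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def mean_function(x):
    """Stand-in for a meta-learnt deep mean function (hypothetical: a fixed sine)."""
    return np.sin(x)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-4):
    """GP regression posterior mean with a non-zero prior mean function."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = rbf_kernel(x_test, x_train)
    # Condition on residuals from the prior mean, then add the mean back.
    alpha = np.linalg.solve(K, y_train - mean_function(x_train))
    return mean_function(x_test) + K_star @ alpha

x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train) + 0.3   # target task: a shifted version of the prior mean

near = gp_posterior_mean(x_train, y_train, x_train)           # at the training inputs
far = gp_posterior_mean(x_train, y_train, np.array([10.0]))   # far from any data
```

Near the target data the posterior follows the observations, while far from it the prediction falls back to the mean function, which is where meta-learnt knowledge enters. Fitting to the target task is a linear solve rather than a gradient-based optimization.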

What else can it be used for?

Many methods and applications

We can situate NPs, MAML, and meta-learned deep mean/kernel GPs within the meta-learning taxonomy along 3 independent axes (Hospedales et al., 2020).

Figure 10: Taxonomy modified from Figure 1 in Hospedales et al. (2020).

Conclusion

Recap of \(\color{blue}{\omega}\) and \(\theta\)

Conventional ML has fixed/pre-specified meta-knowledge \(\color{blue}{\omega}\).

| | Shared meta-knowledge \(\color{blue}{\omega}\) | Task-specific \(\theta\) |
|---|---|---|
| NN | Hyperparameters (e.g. learning rate, weight initialization scheme, optimizer, architecture design) | Network weights |

Meta-learning challenges this assumption by learning a prior on \(\color{blue}{\omega}\) from \(\mathcal{D}_{\mathrm{source}}\).

| | Shared meta-knowledge \(\color{blue}{\omega}\) | Task-specific \(\theta\) |
|---|---|---|
| Hyperopt | Hyperparameters (e.g. learning rate) | Network weights |
| MAML | Network weights (initialization learnt from \(\mathcal{D}_{\mathrm{source}}\)) | Network weights (tuned on \(\mathcal{D}_{\mathrm{target}}\)) |
| NP | Network weights | Aggregated target context [latent] representation |
| Meta-GP | Deep mean/kernel function parameters | None (a GP is fit on \(\mathcal{D}_{\mathrm{target}}^{\mathrm{train}}\)) \(\bigstar\) |

\(\bigstar\) In GPs, the data is still used at test time, but no optimization is really performed: GPs are nonparametric, i.e. there are no (or infinitely many) parameters to learn explicitly at meta-test time.

Bibliography

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016), Building machines that learn and think like people, CoRR.

Lake, B., Salakhutdinov, R., & Tenenbaum, J. (2015), Human-level concept learning through probabilistic program induction, Science.

Hospedales, T., Antoniou, A., Micaelli, P., & Storkey, A. (2020), Meta-learning in neural networks: a survey, CoRR.

Griffiths, T. L., Callaway, F., Chang, M. B., Grant, E., Krueger, P. M., & Lieder, F. (2019), Doing more with less: meta-reasoning and meta-learning in humans and machines, Current Opinion in Behavioral Sciences.

Finn, C., Abbeel, P., & Levine, S. (2017), Model-agnostic meta-learning for fast adaptation of deep networks, In International Conference on Machine Learning (ICML) (pp. 1126–1135).

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016), Matching networks for one shot learning, In Advances in Neural Information Processing Systems (pp. 3630–3638).

Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016), RL\(^2\): fast reinforcement learning via slow reinforcement learning, CoRR.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., & Lillicrap, T. (2016), Meta-learning with memory-augmented neural networks, In International Conference on Machine Learning (ICML) (pp. 1842–1850).

Grant, E., Finn, C., Levine, S., Darrell, T., & Griffiths, T. (2018), Recasting gradient-based meta-learning as hierarchical Bayes, CoRR.

Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S. M. A., & Teh, Y. W. (2018), Neural processes, arXiv preprint arXiv:1807.01622.

Gordon, J., Bronskill, J., Bauer, M., Nowozin, S., & Turner, R. (2019), Meta-learning probabilistic inference for prediction.

Fortuin, V., & Rätsch, G. (2019), Deep mean functions for meta-learning in Gaussian processes, CoRR.

Mishra, N., Rohaninejad, M., Chen, X., & Abbeel, P. (2017), A simple neural attentive meta-learner, In Workshop on Meta-Learning, NeurIPS.