To further improve the propagation of training signal, we use deep residual networks [24, 25] with batch normalization [31] and weight normalization [2, 54] in s and t. As described in Appendix E we introduce and use a novel variant of batch normalization which is based on a running average over recent minibatches, and is thus more robust when training with very small minibatches.

Probabilistic models in general can also benefit from batch normalization techniques as applied in this paper.

(Dinh et al., 2017)
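A minimal sketch of the kind of batch-normalization variant described in the quote above: statistics are normalized against a running average over recent minibatches rather than the current minibatch alone, which makes very small minibatches less noisy. The class name RunningBatchNorm, the momentum value, and the exact update rule are assumptions for illustration, not the authors' published implementation; the log-det-Jacobian contribution that a flow layer would also need is omitted.

```python
import torch
import torch.nn as nn

class RunningBatchNorm(nn.Module):
    """Hypothetical batch-norm variant: normalize with a running average of
    recent minibatch statistics (assumed momentum and update rule)."""

    def __init__(self, num_features, momentum=0.05, eps=1e-5):
        super().__init__()
        self.momentum, self.eps = momentum, eps
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        # x: (batch, num_features)
        if self.training:
            with torch.no_grad():
                # Blend the current minibatch statistics into the running averages.
                mean = x.mean(dim=0)
                var = x.var(dim=0, unbiased=False)
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)
        # Normalize with the running averages, so a single tiny minibatch
        # cannot dominate the normalization statistics.
        x_hat = (x - self.running_mean) / torch.sqrt(self.running_var + self.eps)
        return self.gamma * x_hat + self.beta
```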

We also found that using weight normalisation [18] within every s and t function was crucial for successful training of large models.

(Korshunova et al., 2018)
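A short sketch of how weight normalization might be applied inside every s and t network of a coupling layer, using PyTorch's torch.nn.utils.weight_norm. The helper name make_st_net, the layer widths, and the ReLU activations are assumptions for illustration, not the architecture used in the cited papers.

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

def make_st_net(in_dim, hidden_dim, out_dim):
    """Hypothetical scale/translation network for a coupling layer, with
    weight normalization wrapped around every linear layer."""
    return nn.Sequential(
        weight_norm(nn.Linear(in_dim, hidden_dim)),
        nn.ReLU(),
        weight_norm(nn.Linear(hidden_dim, hidden_dim)),
        nn.ReLU(),
        # Final layer emits the s and t outputs jointly (assumed convention).
        weight_norm(nn.Linear(hidden_dim, 2 * out_dim)),
    )
```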

(Papamakarios et al., 2019)

Bibliography

Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using Real NVP. In ICLR.

Korshunova, I., Degrave, J., Huszár, F., Gal, Y., Gretton, A., & Dambre, J. (2018). BRUNO: A deep recurrent model for exchangeable data. In NeurIPS.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2019). Normalizing flows for probabilistic modeling and inference. CoRR.