To further improve the propagation of the training signal, we use deep residual networks [24, 25] with batch normalization and weight normalization [2, 54] in s and t. As described in Appendix E, we introduce and use a novel variant of batch normalization based on a running average over recent minibatches, which is more robust when training with very small minibatches.
Probabilistic models in general can also benefit from the batch normalization techniques applied in this paper.
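A minimal sketch of the idea behind this batch normalization variant: normalize each minibatch with running statistics accumulated over recent minibatches rather than with the current batch's statistics alone, so that a tiny batch contributes only a small correction to the estimates. The class name, the momentum value, and the exact update rule below are illustrative assumptions, not the implementation described in Appendix E.

```python
import numpy as np

class RunningBatchNorm:
    """Illustrative batch norm using a running average over recent
    minibatches (hypothetical sketch; see the paper's Appendix E for
    the actual variant).  Using running rather than per-batch statistics
    reduces the noise introduced by very small minibatches."""

    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.momentum = momentum          # weight given to past statistics
        self.eps = eps
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)

    def __call__(self, x):
        # x: (batch, dim).  Fold the current minibatch into the running
        # statistics, then normalize with the running averages.
        m = self.momentum
        self.running_mean = m * self.running_mean + (1 - m) * x.mean(axis=0)
        self.running_var = m * self.running_var + (1 - m) * x.var(axis=0)
        return (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
```

With momentum 0.9, each update moves the statistics only 10% of the way toward the current batch, so even a batch of two or four samples cannot destabilize the normalization.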
We also found that using weight normalization within every s and t function was crucial for successfully training large models.
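For reference, weight normalization [54] reparameterizes each weight vector as w = g · v/‖v‖, decoupling its scale g from its direction v. The sketch below shows this reparameterization for a dense layer; the function name and shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np

def weight_norm_dense(x, v, g, b):
    """Dense layer with weight normalization: each row of the effective
    weight matrix is g[i] * v[i] / ||v[i]||, so its norm equals g[i]
    regardless of the scale of v (illustrative sketch)."""
    # v: (out, in) direction parameters; g: (out,) scales; b: (out,) biases.
    w = g[:, None] * v / np.linalg.norm(v, axis=1, keepdims=True)
    return x @ w.T + b
```

Because the norm of each effective weight row is controlled by a single learned scalar g, gradient updates to v change only the direction of the weights, which tends to condition optimization better in deep s and t networks.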
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using Real NVP. In ICLR. ↩
Korshunova, I., Degrave, J., Huszár, F., Gal, Y., Gretton, A., & Dambre, J. (2018). BRUNO: A deep recurrent model for exchangeable data. In NeurIPS. ↩
Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2019). Normalizing flows for probabilistic modeling and inference. CoRR. ↩