To further improve the propagation of training signal, we use deep residual networks [24, 25] with batch normalization [31] and weight normalization [2, 54] in s and t. As described in Appendix E, we introduce and use a novel variant of batch normalization which is based on a running average over recent minibatches, and is thus more robust when training with very small minibatches.
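A minimal sketch of what such a running-average batch normalization could look like is given below. The exponential-moving-average form, the momentum value, and the module interface are assumptions for illustration; the paper's exact formulation is in its Appendix E and is not reproduced here. When such a layer is used inside a flow, its rescaling also has to be accounted for in the log-determinant of the Jacobian.

```python
import torch
import torch.nn as nn

class RunningAverageBatchNorm(nn.Module):
    """Batch normalization that normalizes with statistics smoothed over
    recent minibatches instead of per-batch statistics alone.

    Hypothetical sketch: momentum, initialization, and the 2D input
    assumption (batch, features) are illustrative choices."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.momentum = momentum
        self.eps = eps
        # Learnable affine parameters, as in standard batch norm.
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        # Running statistics accumulated over recent minibatches.
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            batch_mean = x.mean(dim=0)
            batch_var = x.var(dim=0, unbiased=False)
            # Blend current-batch statistics with the stored running average;
            # gradients flow only through the current-batch term. This is
            # less noisy than raw batch statistics when minibatches are tiny.
            mean = (1 - self.momentum) * self.running_mean + self.momentum * batch_mean
            var = (1 - self.momentum) * self.running_var + self.momentum * batch_var
            with torch.no_grad():
                self.running_mean.copy_(mean)
                self.running_var.copy_(var)
        else:
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```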

…

Probabilistic models in general can also benefit from batch normalization techniques as applied in this paper.

(Dinh et al., 2017)

We also found that using weight normalisation [18] within every s and t function was crucial for successful training of large models.
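As an illustration, the sketch below applies weight normalization to every linear layer of a small network that could serve as an s or t function in an affine coupling layer. It relies on PyTorch's `torch.nn.utils.weight_norm` utility; the depth, width, activation, and input split are placeholder assumptions, not the architectures used in either paper (Real NVP, for instance, uses convolutional residual networks for s and t).

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

def make_st_net(in_features, hidden_features, out_features):
    """Illustrative s/t network for an affine coupling layer, with weight
    normalization applied to every linear layer. Depth, width, and
    activation are placeholder choices."""
    return nn.Sequential(
        weight_norm(nn.Linear(in_features, hidden_features)),
        nn.ReLU(),
        weight_norm(nn.Linear(hidden_features, hidden_features)),
        nn.ReLU(),
        weight_norm(nn.Linear(hidden_features, out_features)),
    )

# Example: scale (s) and translation (t) networks for one coupling layer
# that transforms half of a 784-dimensional input conditioned on the other half.
s_net = make_st_net(392, 512, 392)
t_net = make_st_net(392, 512, 392)
```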

(Korshunova et al., 2018)

(Papamakarios et al., 2019)

Bibliography

Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using Real NVP. In ICLR.

Korshunova, I., Degrave, J., Huszár, F., Gal, Y., Gretton, A., & Dambre, J. (2018). BRUNO: A deep recurrent model for exchangeable data. In NeurIPS.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2019). Normalizing flows for probabilistic modeling and inference. CoRR.