To further improve the propagation of the training signal, we use deep residual networks [24, 25] with batch normalization and weight normalization [2, 54] in s and t. As described in Appendix E, we introduce and use a novel variant of batch normalization based on a running average over recent minibatches, which is more robust when training with very small minibatches.
Probabilistic models in general can also benefit from the batch normalization techniques applied in this paper.
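A minimal sketch of the idea behind this batch normalization variant: normalize each minibatch with running statistics accumulated over recent minibatches rather than with the current batch's statistics alone, so that a tiny batch contributes only a small correction to the estimates. The class name, the momentum value, and the exact update rule below are illustrative assumptions, not the implementation described in Appendix E.

```python
import numpy as np

class RunningBatchNorm:
    """Illustrative batch norm using a running average over recent
    minibatches (hypothetical sketch; see the paper's Appendix E for
    the actual variant).  Using running rather than per-batch statistics
    reduces the noise introduced by very small minibatches."""

    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.momentum = momentum          # weight given to past statistics
        self.eps = eps
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)

    def __call__(self, x):
        # x: (batch, dim).  Fold the current minibatch into the running
        # statistics, then normalize with the running averages.
        m = self.momentum
        self.running_mean = m * self.running_mean + (1 - m) * x.mean(axis=0)
        self.running_var = m * self.running_var + (1 - m) * x.var(axis=0)
        return (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
```

With momentum 0.9, each update moves the statistics only 10% of the way toward the current batch, so even a batch of two or four samples cannot destabilize the normalization.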
We also found that using weight normalization within every s and t function was crucial for successfully training large models.
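For reference, weight normalization [54] reparameterizes each weight vector as w = g · v/‖v‖, decoupling its scale g from its direction v. The sketch below shows this reparameterization for a dense layer; the function name and shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np

def weight_norm_dense(x, v, g, b):
    """Dense layer with weight normalization: each row of the effective
    weight matrix is g[i] * v[i] / ||v[i]||, so its norm equals g[i]
    regardless of the scale of v (illustrative sketch)."""
    # v: (out, in) direction parameters; g: (out,) scales; b: (out,) biases.
    w = g[:, None] * v / np.linalg.norm(v, axis=1, keepdims=True)
    return x @ w.T + b
```

Because the norm of each effective weight row is controlled by a single learned scalar g, gradient updates to v change only the direction of the weights, which tends to condition optimization better in deep s and t networks.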
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using Real NVP. In ICLR. ↩
Korshunova, I., Degrave, J., Huszár, F., Gal, Y., Gretton, A., & Dambre, J. (2018). BRUNO: A deep recurrent model for exchangeable data. In NeurIPS. ↩
Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2019). Normalizing flows for probabilistic modeling and inference. CoRR. ↩