According to the Jane Street Tech Blog post "L2 Regularization and Batch Norm":
When used together with batch normalization in a convolutional neural net with typical architectures, an L2 objective penalty no longer has its original regularizing effect. Instead it becomes essentially equivalent to an adaptive adjustment of the learning rate!
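With batch norm, the layer's output does not depend on the scale of w, so scaling w by a factor of λ causes the gradients to scale by a factor of 1/λ. Measured relative to the size of w, each update is then 1/λ^2 as large, which effectively scales the learning rate of w by a factor of 1/λ^2. Because the norm of w tends to grow during training, this effective learning rate decays greatly over time.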
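The 1/λ gradient scaling can be checked numerically. The following is a minimal sketch, not the blog's code: it assumes a single linear unit followed by a simplified batch norm (no epsilon, no learned scale or shift) and a squared-error loss, and it uses finite differences for the gradient. The names `loss`, `num_grad`, and `lam` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))   # toy batch: 64 examples, 5 features
t = rng.normal(size=64)        # arbitrary regression targets
w = rng.normal(size=5)         # weights of the layer feeding batch norm

def loss(w):
    z = X @ w
    # Simplified batch norm: standardize over the batch (assumption: no eps, no affine).
    z_hat = (z - z.mean()) / z.std()
    return np.mean((z_hat - t) ** 2)

def num_grad(f, w, h=1e-6):
    # Central finite-difference gradient of f at w.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

lam = 3.0                      # scale factor applied to w (the λ of the text)
g = num_grad(loss, w)
g_scaled = num_grad(loss, lam * w)

print("loss unchanged by scaling:", np.isclose(loss(w), loss(lam * w)))
print("gradient ratio g / g_scaled:", g / g_scaled)   # each entry should be close to lam
print("w . g (should be near 0):", w @ g)
```

The last line reflects a related property of scale-invariant losses: the gradient is orthogonal to w, so a plain gradient step can only increase the norm of w.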
An L2 penalty, however, keeps the norm of w from growing without bound, which prevents the effective learning rate from decaying.
In summary, an L2 penalty or weight decay on any layers preceding batch normalization layers, rather than functioning as a direct regularizer preventing overfitting of the layer weights, instead takes on a role as the sole control on the weight scale of that layer. This prevents the gradients and therefore the “effective” learning rate for that layer from decaying over time, making weight decay essentially equivalent to a form of adaptive learning rate scaling for those layers.
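The effect described in the summary can be illustrated with a small simulation. This is a rough sketch under stated assumptions: the same toy setup of one linear unit followed by a simplified batch norm, minibatch SGD with finite-difference gradients, and hypothetical values for the learning rate and the penalty coefficient `weight_decay`. It tracks the norm of w with and without the L2 penalty and reports lr / ||w||^2 as a proxy for the effective learning rate.

```python
import numpy as np

data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(256, 5))   # toy dataset
t = data_rng.normal(size=256)
w0 = data_rng.normal(size=5)         # shared initial weights

def loss(w, idx):
    z = X[idx] @ w
    # Simplified batch norm over the minibatch (assumption: no eps, no affine).
    z_hat = (z - z.mean()) / z.std()
    return np.mean((z_hat - t[idx]) ** 2)

def num_grad(w, idx, h=1e-6):
    # Central finite-difference gradient of the minibatch loss.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (loss(w + e, idx) - loss(w - e, idx)) / (2 * h)
    return g

def run(weight_decay, lr=0.1, steps=500, batch=32, seed=1):
    # SGD on loss + (weight_decay / 2) * ||w||^2; returns ||w|| after every step.
    rng = np.random.default_rng(seed)   # same minibatch sequence for every run
    w = w0.copy()
    norms = [np.linalg.norm(w)]
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch, replace=False)
        g = num_grad(w, idx)
        w = w - lr * (g + weight_decay * w)
        norms.append(np.linalg.norm(w))
    return np.array(norms)

lr = 0.1
norms_no_decay = run(weight_decay=0.0, lr=lr)
norms_decay = run(weight_decay=0.1, lr=lr)

for name, norms in [("no penalty", norms_no_decay), ("L2 penalty", norms_decay)]:
    print(f"{name}: ||w|| {norms[0]:.3f} -> {norms[-1]:.3f}, "
          f"effective lr {lr / norms[0]**2:.4f} -> {lr / norms[-1]**2:.4f}")
```

Without the penalty, the norm only drifts upward and the effective learning rate shrinks. With the penalty, the norm settles at a scale set by the balance between the shrinkage and the gradient steps, so the effective learning rate does not keep decaying.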