The Kullback–Leibler divergence \(D_{\text{KL}}(P \parallel Q)\) of \(Q\) from \(P\) is an asymmetric distance measure: it quantifies how much information is (expected to be) lost when the distribution \(Q\) is used to approximate \(P\).

\[D_{\text{KL}}(P\parallel Q) =\mathbb{E}_{x\sim P(x)} \log\frac{P(x)}{Q(x)}\]

In other words, it is the relative entropy of \(P\) with respect to \(Q\), or how different \(Q\) is from the perspective of \(P\), since it is the cross entropy between \(P\) and \(Q\) (i.e. \(\mathrm{H}(P,Q)=-\mathbb{E}_{p}[\log q]\)) minus the entropy of \(P\).

\begin{align}
D_{\text{KL}}(P\parallel Q) &= -\sum_{x\in\mathcal{X}} p(x)\log q(x) - \Big(-\sum_{x\in\mathcal{X}} p(x)\log p(x)\Big)\\
&= \mathrm{H}(P,Q) - \mathrm{H}(P)
\end{align}
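
A small numerical check of this identity (a sketch of my own, not from the post), using two discrete distributions on the same support:

```python
import numpy as np

# Two discrete distributions on the same support.
p = np.array([0.36, 0.48, 0.16])
q = np.array([1/3, 1/3, 1/3])

# Direct definition: D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x)).
kl_pq = np.sum(p * np.log(p / q))

# Cross entropy H(P, Q) minus entropy H(P).
cross_entropy = -np.sum(p * np.log(q))
entropy_p = -np.sum(p * np.log(p))

print(kl_pq)                      # ~0.0853 nats
print(cross_entropy - entropy_p)  # same value: H(P,Q) - H(P) == D_KL(P||Q)
```

(For reference, `scipy.stats.entropy(p, q)` computes the same quantity in nats.)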

When the target distribution \(P\) is fixed, minimizing the cross entropy \(\mathrm{H}(P,Q)\) over \(Q\) is equivalent to minimizing the KL divergence, since the entropy \(\mathrm{H}(P)\) is a constant that does not depend on \(Q\).
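
Written out using the identity above (the \(\mathrm{H}(P)\) term drops out of the argmin):

\[\arg\min_{Q}\,\mathrm{H}(P,Q) = \arg\min_{Q}\,\big(D_{\text{KL}}(P\parallel Q) + \mathrm{H}(P)\big) = \arg\min_{Q}\,D_{\text{KL}}(P\parallel Q)\]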

In the limit, as the number of samples N goes to infinity, maximizing likelihood is equivalent to minimizing the forward KL divergence \(D_{\text{KL}}(P\parallel Q)\) (as derived by wiseodd, ⭐Colin Raffel's GANs and Divergence Minimization, and ⭐Wasserstein GAN · Depth First Learning).
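
A quick sketch of the argument (writing \(Q_\theta\) for a model distribution with parameters \(\theta\), and \(x_i \sim P\) for the samples): the average log-likelihood converges to \(\mathbb{E}_{x\sim P}[\log q_\theta(x)] = -\mathrm{H}(P,Q_\theta)\), and \(\mathrm{H}(P)\) does not depend on \(\theta\).

\begin{align}
\arg\max_\theta \frac{1}{N}\sum_{i=1}^{N}\log q_\theta(x_i)
&\;\xrightarrow{N\to\infty}\; \arg\max_\theta \mathbb{E}_{x\sim P}[\log q_\theta(x)]\\
&= \arg\min_\theta \mathrm{H}(P,Q_\theta)\\
&= \arg\min_\theta D_{\text{KL}}(P\parallel Q_\theta)
\end{align}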

Forward vs Reverse KL

In mean-field variational Bayes, we typically minimize the reverse KL divergence \(D_{\text{KL}}(Q\parallel P)\) (explained well in Eric Jang: A Beginner’s Guide to Variational Methods: Mean-Field Approximation).
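
Concretely, in the ELBO setting referenced in the figure below (standard latent-variable notation, not taken from the post): since \(\log p(x)\) is fixed with respect to \(q\), maximizing the ELBO is the same as minimizing the reverse KL from the variational posterior \(q(z\mid x)\) to the true posterior \(p(z\mid x)\).

\[\log p(x) = \underbrace{\mathbb{E}_{q(z\mid x)}\big[\log p(x,z) - \log q(z\mid x)\big]}_{\text{ELBO}} + D_{\text{KL}}\big(q(z\mid x)\parallel p(z\mid x)\big)\]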

Figure 1: From the “Loss Function: ELBO” heading in Lilian Weng’s VAE post

Minimizing forward KL stretches \(Q\) to cover all of \(P\) (mean-seeking, zero-avoiding), while minimizing reverse KL squeezes \(Q\) under \(P\), concentrating it on a single mode (mode-seeking, zero-forcing).
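
A minimal numerical sketch of this behavior (my own illustration, not from any of the linked posts): fit a single Gaussian \(Q\) to a bimodal mixture \(P\) on a discretized grid by brute-force search over the Gaussian's mean and standard deviation, once under forward KL and once under reverse KL. The forward-KL fit spreads over both modes, while the reverse-KL fit locks onto one of them.

```python
import numpy as np

# Discretized support and a bimodal target P (mixture of two Gaussians).
x = np.linspace(-6, 6, 1201)

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def normalize(p):
    # Turn density values on the grid into a discrete distribution.
    return p / p.sum()

P = normalize(0.5 * gaussian(x, -2.0, 0.6) + 0.5 * gaussian(x, 2.0, 0.6))

def kl(p, q, eps=1e-12):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), clipped for numerical stability.
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * np.log(p / q))

# Brute-force search over a single-Gaussian family Q(mu, sigma).
mus = np.linspace(-3, 3, 61)
sigmas = np.linspace(0.3, 3.0, 28)

def best_fit(direction):
    best = (None, None, np.inf)
    for mu in mus:
        for sigma in sigmas:
            Q = normalize(gaussian(x, mu, sigma))
            d = kl(P, Q) if direction == "forward" else kl(Q, P)
            if d < best[2]:
                best = (mu, sigma, d)
    return best

print("forward KL fit:", best_fit("forward"))  # wide Q centered between the modes
print("reverse KL fit:", best_fit("reverse"))  # narrow Q sitting on one mode
```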

Image from GAN Tutorial (Goodfellow, 2016)

Great resources

Bibliography

Goodfellow, I. (2016). NIPS 2016 tutorial: Generative adversarial networks. arXiv:1701.00160.