The Kullback–Leibler divergence \(D_{KL}(P \parallel Q)\) of \(Q\) from \(P\) measures how much information is (expected to be) lost when the distribution \(Q\) is used to approximate \(P\). It is an asymmetric measure of dissimilarity rather than a true distance: in general \(D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)\).

\[D_{\text{KL}}(P\parallel Q) =\mathbb{E}_{x\sim P(x)} \log\frac{P(x)}{Q(x)}\]

In other words, it is the relative entropy of \(P\) with respect to \(Q\), or how different \(Q\) is from the perspective of \(P\): the cross entropy between \(P\) and \(Q\) (i.e. \(\mathrm{H}(P,Q) = -\mathbb{E}_{x\sim P}[\log Q(x)]\)) minus the entropy of \(P\).

\begin{align}
D_{\text{KL}}(P\parallel Q) &= -\sum_{x\in\mathcal{X}} p(x)\log q(x) - \Big(-\sum_{x\in\mathcal{X}} p(x)\log p(x)\Big)\\
&= \mathrm{H}(P,Q) - \mathrm{H}(P)
\end{align}

When the target distribution \(P\) is fixed, minimizing the cross entropy \(\mathrm{H}(P,Q)\) is equivalent to minimizing the KL divergence, since the two differ only by \(\mathrm{H}(P)\), which does not depend on \(Q\).
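
A quick numerical check of that identity on a small discrete example (a minimal sketch; the distributions below are arbitrary illustrative values):

```python
import numpy as np

# Two discrete distributions over the same 3-point support (illustrative values).
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])

cross_entropy = -np.sum(p * np.log(q))     # H(P, Q)
entropy_p     = -np.sum(p * np.log(p))     # H(P)
kl_direct     = np.sum(p * np.log(p / q))  # D_KL(P || Q)

# D_KL(P || Q) = H(P, Q) - H(P); both expressions give ~1.1513 here.
assert np.isclose(kl_direct, cross_entropy - entropy_p)
```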

In the limit as the number of samples \(N\) goes to infinity, maximizing the likelihood is equivalent to minimizing the forward KL divergence \(D_{\text{KL}}(P \parallel Q_\theta)\) (as derived by wiseodd, ⭐Colin Raffel’s GANs and Divergence Minimization, and ⭐Wasserstein GAN · Depth First Learning).
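
A sketch of the standard argument, writing \(q_\theta\) for the model density and \(x_1,\dots,x_N \sim P\) for the data; the limit step is the law of large numbers:

\begin{align}
\arg\max_\theta \frac{1}{N}\sum_{i=1}^{N}\log q_\theta(x_i)
&\xrightarrow{N\to\infty} \arg\max_\theta \mathbb{E}_{x\sim P}\left[\log q_\theta(x)\right] \\
&= \arg\min_\theta \mathrm{H}(P, Q_\theta) \\
&= \arg\min_\theta D_{\text{KL}}(P\parallel Q_\theta),
\end{align}

where the last step uses \(D_{\text{KL}}(P\parallel Q_\theta) = \mathrm{H}(P,Q_\theta) - \mathrm{H}(P)\) and the fact that \(\mathrm{H}(P)\) does not depend on \(\theta\).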

Forward vs Reverse KL 🔗

In mean-field variational Bayes, we typically use reverse KL (explained well in Eric Jang: A Beginner’s Guide to Variational Methods: Mean-Field Approximation).
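
The reason is the standard decomposition of the log evidence: for any approximate posterior \(q(z\mid x)\),

\[\log p(x) = \underbrace{\mathbb{E}_{z\sim q(z\mid x)}\left[\log \frac{p(x,z)}{q(z\mid x)}\right]}_{\text{ELBO}} + D_{\text{KL}}\big(q(z\mid x)\parallel p(z\mid x)\big),\]

and since \(\log p(x)\) is fixed with respect to \(q\), maximizing the ELBO is exactly minimizing the reverse KL \(D_{\text{KL}}(q(z\mid x)\parallel p(z\mid x))\).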

Figure 1: From “Loss Function: ELBO” heading in Lilian Weng’s VAE post

Minimizing forward KL stretches \(Q\) over \(P\): it heavily penalizes regions where \(P > 0\) but \(Q \approx 0\), so \(Q\) spreads to cover all of \(P\)'s mass. Minimizing reverse KL squeezes \(Q\) under \(P\): it penalizes \(Q\) for placing mass where \(P \approx 0\), so \(Q\) tends to collapse onto a single mode.

Image from GAN Tutorial (Goodfellow, 2016)
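
A minimal numerical sketch of that behavior (the bimodal target, grid discretization, and optimizer below are illustrative choices, not from the tutorial): fit a single Gaussian \(Q\) to a bimodal \(P\), once under each divergence.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Discretize the real line so both divergences reduce to finite sums.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

# Target P: a bimodal mixture. Q_theta: a single Gaussian parameterized by (mean, log-std).
p = 0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7)
p /= p.sum() * dx

def q(theta):
    mu, log_sigma = theta
    dens = norm.pdf(x, mu, np.exp(log_sigma))
    return dens / (dens.sum() * dx)

eps = 1e-12  # avoid log(0)

def forward_kl(theta):   # D_KL(P || Q): penalizes Q ~ 0 where P > 0  ->  mass-covering
    return np.sum(p * (np.log(p + eps) - np.log(q(theta) + eps))) * dx

def reverse_kl(theta):   # D_KL(Q || P): penalizes Q > 0 where P ~ 0  ->  mode-seeking
    qt = q(theta)
    return np.sum(qt * (np.log(qt + eps) - np.log(p + eps))) * dx

init = np.array([0.5, 0.0])  # start slightly off-center so reverse KL picks a mode
fwd = minimize(forward_kl, init, method="Nelder-Mead").x
rev = minimize(reverse_kl, init, method="Nelder-Mead").x

print("forward KL fit: mu=%.2f sigma=%.2f" % (fwd[0], np.exp(fwd[1])))  # wide Q covering both modes
print("reverse KL fit: mu=%.2f sigma=%.2f" % (rev[0], np.exp(rev[1])))  # narrow Q on the mode nearer the init
```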

Great resources 🔗

Bibliography

Goodfellow, I. (2016). NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160. https://arxiv.org/abs/1701.00160