The Kullback–Leibler divergence $$D_{KL}(P \parallel Q)$$ of $$Q$$ from $$P$$ is an asymmetric measure of dissimilarity: it quantifies how much information is (expected to be) lost when the distribution $$Q$$ is used to approximate $$P$$. Because it is asymmetric and does not satisfy the triangle inequality, it is not a true distance metric.

$D_{\text{KL}}(P\parallel Q) =\mathbb{E}_{x\sim P(x)} \log\frac{P(x)}{Q(x)}$

In other words, it is the relative entropy of $$Q$$ with respect to $$P$$, i.e. how different $$Q$$ is from the perspective of $$P$$: it equals the cross entropy between $$P$$ and $$Q$$ (i.e. $$H(P,Q)=-\operatorname {E} _{p}[\log q]$$) minus the entropy of $$P$$.

\begin{align} D_{\text{KL}}(P\parallel Q) &= -\sum _{x\in {\mathcal {X}}}p(x)\log q(x) - (-\sum _{x\in {\mathcal {X}}}p(x)\log p(x))\\
&=\mathrm {H} (P,Q)-\mathrm {H} (P) \\
\end{align}
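As a quick numerical check of this decomposition (and of the asymmetry mentioned above), here is a minimal NumPy sketch; the two discrete distributions are arbitrary examples:

```python
import numpy as np

# Two discrete distributions over the same three-element support (arbitrary examples).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

def kl(p, q):
    """Forward KL divergence D_KL(p || q), in nats."""
    return float(np.sum(p * np.log(p / q)))

def cross_entropy(p, q):
    """H(p, q) = -E_p[log q]."""
    return float(-np.sum(p * np.log(q)))

def entropy(p):
    """H(p) = -E_p[log p]."""
    return float(-np.sum(p * np.log(p)))

# The decomposition D_KL(P || Q) = H(P, Q) - H(P) holds numerically:
assert np.isclose(kl(p, q), cross_entropy(p, q) - entropy(p))

# Asymmetry: D_KL(P || Q) != D_KL(Q || P) in general.
print(kl(p, q), kl(q, p))
```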

When the target distribution $$P$$ is fixed, minimizing the cross entropy $$H(P,Q)$$ over $$Q$$ is equivalent to minimizing the KL divergence, since the two differ only by the constant $$H(P)$$.

In the limit as the number of samples $$N$$ goes to infinity, maximizing likelihood is equivalent to minimizing the forward KL divergence $$D_{\text{KL}}(P\parallel Q)$$ (as derived by wiseodd, ⭐Colin Raffel’s GANs and Divergence Minimization, and ⭐Wasserstein GAN · Depth First Learning).
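The key step in that derivation is that the average negative log-likelihood of samples drawn from $$P$$ converges to the cross entropy $$H(P, Q)$$, which is $$D_{\text{KL}}(P\parallel Q)$$ plus the $$Q$$-independent constant $$H(P)$$. A small simulation sketch (the distributions and sample size are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# True data distribution P and a candidate model Q over 3 categories (arbitrary examples).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Draw N samples from P and compute the average negative log-likelihood under Q.
n = 200_000
samples = rng.choice(3, size=n, p=p)
avg_nll = -np.mean(np.log(q[samples]))

# As N -> infinity, the average NLL converges to the cross entropy H(P, Q)
# = D_KL(P || Q) + H(P). Since H(P) does not depend on Q, maximizing
# likelihood minimizes the forward KL divergence.
cross_entropy = float(-np.sum(p * np.log(q)))
print(avg_nll, cross_entropy)  # close for large N
```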

## Forward vs Reverse KL

In mean-field variational Bayes, we typically use the reverse KL $$D_{\text{KL}}(Q\parallel P)$$, because the expectation is taken under the tractable approximation $$Q$$ rather than the intractable posterior $$P$$ (explained well in Eric Jang: A Beginner’s Guide to Variational Methods: Mean-Field Approximation). Reverse KL is mode-seeking: it prefers a $$Q$$ that concentrates on a single mode of $$P$$, whereas forward KL is mass-covering.
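The mode-seeking vs mass-covering behavior can be seen in a small discretized sketch: fitting a single fixed-width Gaussian bump (a crude stand-in for a mean-field $$Q$$) to a bimodal target $$P$$ by grid search over the mean. All distributions, grids, and widths here are illustrative assumptions:

```python
import numpy as np

# Bimodal target P on a grid: two well-separated Gaussian bumps (illustrative example).
x = np.linspace(-6, 6, 241)

def bump(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

p = 0.5 * bump(x, -3.0, 0.5) + 0.5 * bump(x, 3.0, 0.5)
p /= p.sum()

def kl(a, b, eps=1e-12):
    """Discrete KL divergence D_KL(a || b), with eps to avoid log(0)."""
    return float(np.sum(a * np.log((a + eps) / (b + eps))))

# Approximating family: a single unit-width Gaussian, parameterized by its mean.
means = np.linspace(-5, 5, 101)
candidates = [bump(x, m, 1.0) for m in means]
candidates = [c / c.sum() for c in candidates]

forward_best = means[np.argmin([kl(p, q) for q in candidates])]  # D_KL(P || Q)
reverse_best = means[np.argmin([kl(q, p) for q in candidates])]  # D_KL(Q || P)

# Forward KL is mass-covering: the best single Gaussian sits between the modes.
# Reverse KL is mode-seeking: it locks onto one of the two modes.
print(forward_best, reverse_best)
```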