The KullbackāLeibler divergence $$D_{KL}(P \parallel Q)$$ of $$Q$$ from $$P$$ is an asymmetric distance measure because it measures how much information is (expected to be) lost if the distribution $$Q$$ is used to approximate $$P$$.

$D_{\text{KL}}(P\parallel Q) =\mathbb{E}_{x\sim P(x)} \log\frac{P(x)}{Q(x)}$

In other words, it is the relative entropy of $$P$$ with respect to $$Q$$, or how different $$Q$$ is from the perspective of $$P$$, since it's the cross entropy between $$P$$ and $$Q$$ (i.e. $$H(P,Q)=-\operatorname{E}_{P}[\log Q]$$) minus the entropy of $$P$$.

\begin{align}
D_{\text{KL}}(P\parallel Q) &= -\sum_{x\in\mathcal{X}} p(x)\log q(x) - \left(-\sum_{x\in\mathcal{X}} p(x)\log p(x)\right)\\
&= \mathrm{H}(P,Q) - \mathrm{H}(P)
\end{align}
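The identity above is easy to check numerically. Below is a minimal sketch on a small discrete support; the distributions `P` and `Q` are made-up examples, not from the original text:

```python
import math

# Two example distributions over the same discrete support {0, 1, 2}.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) * log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """H(P) = -sum_x p(x) * log p(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

# The identity from the derivation above: D_KL(P || Q) = H(P, Q) - H(P).
assert abs(kl_divergence(P, Q) - (cross_entropy(P, Q) - entropy(P))) < 1e-12

# Asymmetry: D_KL(P || Q) != D_KL(Q || P) in general.
print(kl_divergence(P, Q), kl_divergence(Q, P))
```

Note the `px > 0` guard: terms with $$p(x)=0$$ contribute $$0 \cdot \log 0 = 0$$ by convention.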

When the target distribution $$P$$ is fixed, the entropy term $$\mathrm{H}(P)$$ is a constant, so minimizing the cross entropy is equivalent to minimizing the KL divergence.

In the limit, as the sample size $$N$$ goes to infinity, maximizing likelihood is equivalent to minimizing the forward KL divergence (as derived by wiseodd, ⭐Colin Raffel's GANs and Divergence Minimization, and ⭐Wasserstein GAN · Depth First Learning).
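This equivalence already holds exactly at finite $$N$$ for the empirical distribution: the average negative log-likelihood of any model $$q$$ equals $$\mathrm{H}(\hat{p}) + D_{KL}(\hat{p} \parallel q)$$, where $$\hat{p}$$ is the empirical distribution, so the two objectives differ by a constant. A minimal sketch with made-up data (the sample counts and model `q` are illustrative assumptions):

```python
import math
from collections import Counter

# Hypothetical samples from an unknown discrete distribution.
samples = [0] * 50 + [1] * 30 + [2] * 20

def avg_neg_log_likelihood(q, data):
    """Average NLL of model q on the data."""
    return -sum(math.log(q[x]) for x in data) / len(data)

def kl(p, q):
    """D_KL(p || q) for dict-valued discrete distributions."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Empirical distribution of the data.
n = len(samples)
p_emp = {x: c / n for x, c in Counter(samples).items()}
h_emp = -sum(px * math.log(px) for px in p_emp.values())

# For any model q: NLL(q) = H(p_emp) + D_KL(p_emp || q),
# so minimizing NLL (maximum likelihood) minimizes the forward KL.
q = {0: 0.4, 1: 0.4, 2: 0.2}
assert abs(avg_neg_log_likelihood(q, samples) - (h_emp + kl(p_emp, q))) < 1e-9
```

As $$N \to \infty$$ the empirical distribution converges to the data distribution $$P$$, giving the statement in the text.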

## Forward vs Reverse KL

In mean-field variational Bayes, we typically use reverse KL (explained well in Eric Jang: A Beginner’s Guide to Variational Methods: Mean-Field Approximation).

Minimizing forward KL, $$D_{KL}(P \parallel Q)$$, stretches $$Q$$ to cover all of $$P$$ (mass-covering), while minimizing reverse KL, $$D_{KL}(Q \parallel P)$$, squeezes $$Q$$ under $$P$$ (mode-seeking).
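This mode-covering vs mode-seeking behavior can be seen in a toy sketch (all distributions and grid parameters below are illustrative assumptions): fit a single narrow Gaussian $$Q$$ to a bimodal target $$P$$ by brute-force search over its mean, under each objective.

```python
import math

# Discretize distributions on a grid.
xs = [i * 0.1 for i in range(-80, 81)]

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normalize(ws):
    z = sum(ws)
    return [w / z for w in ws]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Bimodal target P: equal-weight mixture of Gaussians at -3 and +3.
P = normalize([0.5 * gauss(x, -3, 0.6) + 0.5 * gauss(x, 3, 0.6) for x in xs])

def q_of(m):
    """Candidate approximation: a single Gaussian with mean m, sigma 1."""
    return normalize([gauss(x, m, 1.0) for x in xs])

candidates = [m * 0.1 for m in range(-50, 51)]
mu_fwd = min(candidates, key=lambda m: kl(P, q_of(m)))   # forward KL
mu_rev = min(candidates, key=lambda m: kl(q_of(m), P))   # reverse KL

print(mu_fwd)  # near 0: forward KL spreads Q across both modes
print(mu_rev)  # near -3 or +3: reverse KL locks onto a single mode
```

The forward-KL fit sits between the modes (it must put mass wherever $$P$$ does), while the reverse-KL fit collapses onto one mode (it is only penalized where $$Q$$ puts mass).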

*Image from the GAN Tutorial (Goodfellow, 2016).*

# Bibliography

Goodfellow, I. (2016). NIPS 2016 tutorial: Generative adversarial networks. arXiv:1701.00160.