## Information Inequality, Kullback-Leibler Divergence

We begin with an inequality for probability density functions, sometimes referred to as the information inequality. We use this inequality to prove that the Kullback-Leibler divergence, a 'distance' between two probability measures, is always nonnegative.

Proposition 1.

$\displaystyle\int_{-\infty}^{\infty}f(x)\ln g(x)dx \leq\int_{-\infty}^{\infty}f(x)\ln f(x)dx$

for all measurable functions $f, g: \mathbb{R} \rightarrow [0,\infty)$ satisfying $\int_{-\infty}^{\infty}f(x)dx = \int_{-\infty}^{\infty}g(x)dx = 1$.

Proof. Let $X$ be a random variable with probability density function $f(x)$. Consider the random variable $\ln(\frac{f(X)}{g(X)})$. We will show that

$\displaystyle\mathbb{E}\left[\ln\left(\frac{f(X)}{g(X)}\right)\right]=\int_{-\infty}^{\infty}f(x)\ln\left(\frac{f(x)}{g(x)}\right)dx\geq 0$

Since $x \mapsto \ln x$ is concave, Jensen’s inequality gives

$\begin{array}{lcl}\displaystyle\mathbb{E}\left[\ln\left(\frac{g(X)}{f(X)}\right)\right]\leq\ln\mathbb{E}\left[\frac{g(X)}{f(X)}\right]=\ln\left(\int_{-\infty}^{\infty}\frac{g(x)}{f(x)}f(x)dx\right)&=&\ln\left(\int_{-\infty}^{\infty}g(x)dx\right)\\&=&\ln 1\\&=&0\end{array}$

Observing that $\mathbb{E}[\ln(\frac{g(X)}{f(X)})]=-\mathbb{E}[\ln(\frac{f(X)}{g(X)})]$ completes the proof. $\Box$
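As a sanity check, the inequality can be verified numerically for a pair of example densities. The choice of $f = \text{N}(0,1)$ and $g = \text{N}(1,4)$ below is arbitrary, and the quadrature helper is our own sketch, not a library routine:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def integrate(h, lo=-20.0, hi=20.0, n=100_000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(h(lo + (i + 0.5) * dx) for i in range(n)) * dx

f = lambda x: normal_pdf(x, 0.0, 1.0)  # f is the N(0, 1) density
g = lambda x: normal_pdf(x, 1.0, 2.0)  # g is the N(1, 4) density

lhs = integrate(lambda x: f(x) * math.log(g(x)))  # ∫ f ln g
rhs = integrate(lambda x: f(x) * math.log(f(x)))  # ∫ f ln f
print(lhs <= rhs)  # True: the information inequality holds
```

Both integrals have closed forms for Gaussians, so the quadrature can be checked exactly: $\int f\ln f = -\frac{1}{2}(\ln(2\pi)+1)$ here.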

The information inequality is used to show that the Kullback-Leibler divergence between two distributions $P, Q$ of continuous random variables with probability density functions $f$ and $g$, respectively, defined by

$\displaystyle D_{KL}(P\parallel Q):=\int_{-\infty}^{\infty}\ln\left(\frac{f(x)}{g(x)}\right)f(x)dx$

is nonnegative. Note that the Kullback-Leibler divergence can be generalized to the case where $P,Q$ are probability measures on a measurable space $(X,\mathcal{F})$ and $P$ is absolutely continuous with respect to $Q$. In that case, we define

$\displaystyle D_{KL}(P\parallel Q):=\int_{X}\ln\left(\frac{dP}{dQ}\right)dP,$

where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative of $P$ with respect to $Q$. To see that $D_{KL}(\cdot \parallel \cdot)$ is still nonnegative, we apply Jensen’s inequality to the convex function $x\mapsto x\ln(x)$ to obtain

$\begin{array}{lcl}\displaystyle\int_{X}\ln\left(\frac{dP}{dQ}\right)dP=\int_{X}\ln\left(\frac{dP}{dQ}\right)\frac{dP}{dQ}dQ&\geq&\ln\left(\int_{X}\frac{dP}{dQ}dQ\right)\left(\int_{X}\frac{dP}{dQ}dQ\right)\\&=&\ln\left(P(X)\right)P(X)\\&=&\ln(1)\cdot 1\\&=&0\end{array}$

Note that, although the KL-divergence can be intuitively thought of as a 'distance' between two probability measures, it does not define a metric, since it need not be symmetric. For a simple counterexample, let $P$ and $Q$ be distributions on $\{1,2\}$ with $P(\{i\}) = p_{i}$ and $Q(\{i\}) = q_{i}$, where $p_{1} = p_{2} = \frac{1}{2}$ and $q_{1} = \frac{1}{3}, q_{2} = \frac{2}{3}$. Then

$\displaystyle D_{KL}(P \parallel Q)=\frac{1}{2}\left[\ln\left(\frac{3}{2}\right)+\ln\left(\frac{3}{4}\right)\right]\approx 0.059,$

while

$\displaystyle D_{KL}(Q\parallel P)=\frac{1}{3}\ln\left(\frac{2}{3}\right)+\frac{2}{3}\ln\left(\frac{4}{3}\right)\approx 0.057.$
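The two numerical values can be reproduced directly (the helper `kl` below is our own, not a library function):

```python
import math

def kl(a, b):
    """Discrete KL divergence D(a || b) = sum_i a_i ln(a_i / b_i)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

p = [1 / 2, 1 / 2]  # P
q = [1 / 3, 2 / 3]  # Q

print(round(kl(p, q), 4))  # 0.0589
print(round(kl(q, p), 4))  # 0.0566
```

The two outputs differ, confirming the asymmetry.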

We now apply the information inequality to prove some results concerning maximum entropy. For a continuous random variable $X$ with probability density function $f$, define the entropy of $X$ to be

$\displaystyle H(X) := -\int_{-\infty}^{\infty}f(x)\ln f(x)dx$

We also use the notation $H(f)=H(X)$.

We now prove that the $\text{Unif}(a,b)$ distribution has maximum entropy among all probability density functions supported on $(a,b)$. For any pdf $f$ supported on $(a,b)$ and $g(x):=(b-a)^{-1}\mathbf{1}_{[a,b]}(x)$, we have by the information inequality that

$\begin{array}{lcl}\displaystyle H(f)\leq-\int_{a}^{b}f(x)\ln g(x)dx=-\int_{a}^{b}f(x)\ln\left(\frac{1}{b-a}\right)dx&=&\ln(b-a)\int_{a}^{b}f(x)dx\\&=&\ln(b-a)\end{array}$

Taking $f = g$ completes the proof.
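Numerically, any other density on $(a,b)$ should have entropy at most $\ln(b-a)$. Below we check a triangular density on $(0,2)$, an arbitrary competitor of our choosing, against the uniform bound $\ln 2 \approx 0.693$:

```python
import math

def entropy(pdf, lo, hi, n=100_000):
    """H(f) = -integral of f ln f, via the midpoint rule (0 ln 0 := 0)."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        fx = pdf(lo + (i + 0.5) * dx)
        if fx > 0:
            total += fx * math.log(fx)
    return -total * dx

a, b = 0.0, 2.0
tri = lambda x: x if x < 1 else 2 - x  # triangular density on (0, 2)

h_tri = entropy(tri, a, b)
print(h_tri <= math.log(b - a))  # True: bounded by the uniform's ln(b - a)
```

For the symmetric triangular density on an interval of width $2$, the entropy is exactly $\frac{1}{2}$, comfortably below $\ln 2$.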

We now prove that $\text{N}(\mu,\sigma^{2})$ has maximum entropy among all probability density functions $f$ on $\mathbb{R}$ that have mean $\mu$ and variance $\sigma^{2}$. If $g$ is the density function of a $\text{N}(\mu,\sigma^{2})$ random variable, then

$\begin{array}{lcl}\displaystyle H(f)\leq-\int_{-\infty}^{\infty}f(x)\ln\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\right)dx&=&\displaystyle-\int_{-\infty}^{\infty}f(x)\ln\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)dx+\int_{-\infty}^{\infty}f(x)\frac{(x-\mu)^{2}}{2\sigma^{2}}dx\\&=&\dfrac{\ln(2\pi\sigma^{2})+1}{2}\end{array}$

since if $X$ has pdf $f$, then

$\displaystyle\sigma^{2}=\text{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^{2}]=\mathbb{E}[(X-\mu)^{2}]=\int_{-\infty}^{\infty}f(x)(x-\mu)^{2}dx$

Taking $f=g$ shows that this upper bound is attained.
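A closed-form comparison makes the Gaussian bound concrete. The Laplace density with scale $\beta = \sigma/\sqrt{2}$ has variance $2\beta^{2} = \sigma^{2}$ and entropy $1 + \ln(2\beta)$, a standard formula; it falls below the Gaussian's $\frac{\ln(2\pi\sigma^{2})+1}{2}$:

```python
import math

sigma = 1.0
beta = sigma / math.sqrt(2)  # Laplace scale chosen so Var = 2*beta^2 = sigma^2

h_normal = 0.5 * (math.log(2 * math.pi * sigma ** 2) + 1)  # entropy of N(mu, sigma^2)
h_laplace = 1 + math.log(2 * beta)                         # entropy of matching Laplace
print(h_laplace <= h_normal)  # True: the Gaussian maximizes entropy at fixed variance
```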