Information Inequality, Kullback-Leibler Divergence

We begin with an inequality for probability density functions, sometimes referred to as the information inequality. We use this inequality to prove that the Kullback-Leibler divergence, a ‘distance’ between two probability measures, is always nonnegative.

Proposition 1. 

\displaystyle\int_{-\infty}^{\infty}f(x)\ln g(x)dx \leq\int_{-\infty}^{\infty}f(x)\ln f(x)dx

for all measurable functions f, g: \mathbb{R} \rightarrow [0,\infty) satisfying \int_{-\infty}^{\infty}f(x)dx = \int_{-\infty}^{\infty}g(x)dx = 1.

Proof. Let X be a random variable with probability density function f(x). Consider the random variable \ln(\frac{f(X)}{g(X)}). We will show that

\displaystyle\mathbb{E}\left[\ln\left(\frac{f(X)}{g(X)}\right)\right]=\int_{-\infty}^{\infty}f(x)\ln\left(\frac{f(x)}{g(x)}\right)dx\geq 0

Since x \mapsto \ln x is concave, Jensen’s inequality gives

\begin{array}{lcl}\displaystyle\mathbb{E}\left[\ln\left(\frac{g(X)}{f(X)}\right)\right]\leq\ln\mathbb{E}\left[\frac{g(X)}{f(X)}\right]=\ln\left(\int_{-\infty}^{\infty}\frac{g(x)}{f(x)}f(x)dx\right)&=&\ln\left(\int_{-\infty}^{\infty}g(x)dx\right)\\&=&\ln 1\\&=&0\end{array}

Observing that \mathbb{E}[\ln(\frac{g(X)}{f(X)})]=-\mathbb{E}[\ln(\frac{f(X)}{g(X)})] completes the proof. \Box
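As a quick numerical sanity check of Proposition 1 (an illustration added here, not part of the proof), one can pick two explicit densities and compare the two integrals. The sketch below assumes NumPy and SciPy are available; it takes f to be the standard normal density and g a Student-t density, and truncates the integrals to a range where both densities are numerically safe.

# Numerical check of the information inequality: the integral of
# f*ln(g) should not exceed the integral of f*ln(f).
import numpy as np
from scipy import stats
from scipy.integrate import quad

f = stats.norm(0, 1).pdf   # f: standard normal density
g = stats.t(df=3).pdf      # g: Student-t density with 3 degrees of freedom

# Truncate to [-20, 20]; the omitted tails are numerically negligible.
lhs, _ = quad(lambda x: f(x) * np.log(g(x)), -20, 20)
rhs, _ = quad(lambda x: f(x) * np.log(f(x)), -20, 20)
print(lhs, rhs, lhs <= rhs)   # expect True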

The information inequality shows that the Kullback-Leibler divergence between two distributions P, Q of continuous random variables with probability density functions f and g, respectively, defined by

\displaystyle D_{KL}(P\parallel Q):=\int_{-\infty}^{\infty}\ln\left(\frac{f(x)}{g(x)}\right)f(x)dx

is nonnegative. Note that the Kullback-Leibler divergence can be generalized to the case where P,Q are probability measures on a measurable space (X,\mathcal{F}) and P is absolutely continuous with respect to Q. In that case, we define

\displaystyle D_{KL}(P\parallel Q):=\int_{X}\ln\left(\frac{dP}{dQ}\right)dP,

where \frac{dP}{dQ} is the Radon-Nikodym derivative of P with respect to Q. To see that D_{KL}(\cdot \parallel \cdot) is still nonnegative, we apply Jensen’s inequality to the convex function x\mapsto x\ln(x) to obtain

\begin{array}{lcl}\displaystyle\int_{X}\ln\left(\frac{dP}{dQ}\right)dP=\int_{X}\ln\left(\frac{dP}{dQ}\right)\frac{dP}{dQ}dQ&\geq&\ln\left(\int_{X}\frac{dP}{dQ}dQ\right)\left(\int_{X}\frac{dP}{dQ}dQ\right)\\&=&\ln\left(P(X)\right)P(X)\\&=&\ln(1)\\&=&0\end{array}
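To make this concrete (an illustration added here, not part of the original argument): when P = \text{N}(\mu_{1},\sigma_{1}^{2}) and Q = \text{N}(\mu_{2},\sigma_{2}^{2}), the divergence has the standard closed form

\displaystyle D_{KL}(P\parallel Q)=\ln\left(\frac{\sigma_{2}}{\sigma_{1}}\right)+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2},

and the sketch below (assuming NumPy and SciPy) checks it against direct numerical integration of \int f\ln(f/g)dx; both come out equal and nonnegative.

# Closed-form KL divergence between two normal densities versus
# direct numerical integration of f*ln(f/g).
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu1, s1 = 0.0, 1.0
mu2, s2 = 1.0, 2.0
f = stats.norm(mu1, s1).pdf
g = stats.norm(mu2, s2).pdf

closed_form = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
# Truncate to [-30, 30]; the omitted tails are numerically negligible.
numerical, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -30, 30)
print(closed_form, numerical)   # approximately 0.4431 for both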

Note that, although the KL-divergence can be intuitively thought of as a ‘distance’ between two probability measures, it does not define a metric since it need not be symmetric. For a simple counterexample, let P and Q be distributions on a two-point set (so that the integral in the definition becomes the sum \sum_{i}p_{i}\ln(p_{i}/q_{i})) with p_{1} = p_{2} = \frac{1}{2} and q_{1} = \frac{1}{3}, q_{2} = \frac{2}{3}. Then

\displaystyle D_{KL}(P \parallel Q)=\frac{1}{2}\left[\ln\left(\frac{3}{2}\right)+\ln\left(\frac{3}{4}\right)\right]\approx .059,

while

\displaystyle D_{KL}(Q\parallel P)=\frac{1}{3}\ln\left(\frac{2}{3}\right)+\frac{2}{3}\ln\left(\frac{4}{3}\right)\approx .057.
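These two values are easy to reproduce; the short check below (plain Python, added for illustration) computes the discrete divergence in both directions.

# Discrete KL divergence: sum of p_i * ln(p_i / q_i).
from math import log

def kl(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

p = [1/2, 1/2]
q = [1/3, 2/3]
print(round(kl(p, q), 3))   # 0.059
print(round(kl(q, p), 3))   # 0.057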

We now apply the information inequality to prove some results concerning maximum entropy. For a continuous random variable X with probability density function f, define the (differential) entropy of X to be

\displaystyle H(X) := -\int_{-\infty}^{\infty}f(x)\ln f(x)dx

We also use the notation H(f)=H(X).

We now prove that the \text{Unif}(a,b) distribution has maximum entropy among all probability density functions supported on (a,b). Let f be any pdf supported on (a,b) and let g(x):=(b-a)^{-1}\mathbf{1}_{[a,b]}(x). By the information inequality,

\begin{array}{lcl}\displaystyle H(f)\leq-\int_{a}^{b}f(x)\ln g(x)dx=-\int_{a}^{b}f(x)\ln\left(\frac{1}{b-a}\right)dx&=&\ln(b-a)\int_{a}^{b}f(x)dx\\&=&\ln(b-a)\end{array}

Taking f = g shows that the bound \ln(b-a) is attained, which completes the proof.
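A quick numerical illustration of this bound (added here; it assumes NumPy and SciPy are available): for a few densities supported on (0,1), the entropy stays at or below \ln(1-0)=0, the value attained by \text{Unif}(0,1).

# Differential entropy H(f) = -integral of f*ln(f) for densities on (0, 1);
# every value should be <= ln(b - a) = ln(1) = 0.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def entropy(pdf, a=0.0, b=1.0):
    return -quad(lambda x: pdf(x) * np.log(pdf(x)), a, b)[0]

print(entropy(stats.uniform(0, 1).pdf))   # 0.0, the maximum
print(entropy(stats.beta(2, 2).pdf))      # approximately -0.125
print(entropy(stats.beta(5, 1).pdf))      # approximately -0.809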

We now prove that \text{N}(\mu,\sigma^{2}) has maximum entropy among all probability density functions f on \mathbb{R} that have mean \mu and variance \sigma^{2}. If g is the density function of a \text{N}(\mu,\sigma^{2}) random variable, then

\begin{array}{lcl}\displaystyle H(f)\leq-\int_{-\infty}^{\infty}f(x)\ln\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\right)dx&=&\displaystyle-\int_{-\infty}^{\infty}f(x)\ln\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)dx+\int_{-\infty}^{\infty}f(x)\frac{(x-\mu)^{2}}{2\sigma^{2}}dx\\&=&\dfrac{\ln(2\pi\sigma^{2})+1}{2}\end{array}

since if X has pdf f, then

\displaystyle\sigma^{2}=\text{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^{2}]=\mathbb{E}[(X-\mu)^{2}]=\int_{-\infty}^{\infty}f(x)(x-\mu)^{2}dx

Taking f=g shows that this upper bound is attained.
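The same kind of numerical check works here (again an added illustration, assuming NumPy and SciPy): any density with mean \mu and variance \sigma^{2} should have entropy at most \frac{\ln(2\pi\sigma^{2})+1}{2}. Below, a Laplace density and a uniform density are scaled to have mean 0 and variance 1 and compared against the Gaussian bound.

# Entropies of mean-0, variance-1 densities versus the bound (ln(2*pi*sigma^2) + 1) / 2.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def entropy(pdf, a, b):
    # numerically integrate -f*ln(f) over [a, b], chosen so that f > 0 there
    return -quad(lambda x: pdf(x) * np.log(pdf(x)), a, b)[0]

bound = (np.log(2 * np.pi) + 1) / 2                       # sigma^2 = 1, approx 1.419
laplace = stats.laplace(scale=1 / np.sqrt(2)).pdf         # variance 2*scale^2 = 1
uniform = stats.uniform(-np.sqrt(3), 2 * np.sqrt(3)).pdf  # Unif(-sqrt(3), sqrt(3)), variance 1

print(bound)
print(entropy(stats.norm(0, 1).pdf, -30, 30))             # attains the bound
print(entropy(laplace, -30, 30))                          # approx 1.347, below the bound
print(entropy(uniform, -np.sqrt(3), np.sqrt(3)))          # approx 1.242, below the bound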


Responses to Information Inequality, Kullback-Leibler Divergence

  1. Anonymous says:

    You’re using Jensen’s inequality the wrong way around! You’ll need to rethink your first proof…

    • matthewhr says:

      Thank you for your comment. I take it that you are referring to the proof of Proposition 1. I reviewed my proof, and I’m not seeing the mistake. f(x) = ln(x) is a concave function (compute the second derivative), so Jensen’s inequality is reversed.

  2. Anonymous says:

    I think there is a subtle problem with the first proof: g(x)/f(x) * f(x) = g(x) only on the set {f(x) != 0}, and the complementary set {f(x) = 0} is not necessarily dx-measure zero (although it is measure zero with respect to the probability measure corresponding to f(x)). On the set {f(x) = 0}, we have instead g(x)/f(x) * f(x) = 0. As such,

    int g(x)/f(x) * f(x) dx = int_{f(x) != 0} g(x) dx

    which in general is only equal to

    int g(x) dx

    when {f(x) = 0} is dx-measure zero.

    However, I agree with your proof of the more general case, which can easily be specialised to this case here.
