Information Inequality, Kullback-Leibler Divergence

We begin with an inequality for probability density functions, sometimes referred to as the information inequality. We use this inequality to prove that the Kullback-Leibler divergence, a ‘distance’ between two probability measures, is always nonnegative.

Proposition 1. 

\displaystyle\int_{-\infty}^{\infty}f(x)\ln g(x)dx \leq\int_{-\infty}^{\infty}f(x)\ln f(x)dx

for all measurable functions f, g: \mathbb{R} \rightarrow [0,\infty) satisfying \int_{-\infty}^{\infty}f(x)dx = \int_{-\infty}^{\infty}g(x)dx = 1.

Proof. Let X be a random variable with probability density function f(x). Consider the random variable \ln(\frac{f(X)}{g(X)}). We will show that

\displaystyle\mathbb{E}\left[\ln\left(\frac{f(X)}{g(X)}\right)\right]=\int_{-\infty}^{\infty}f(x)\ln\left(\frac{f(x)}{g(x)}\right)dx\geq 0

Since x \mapsto \ln x is concave, Jensen’s inequality gives

\begin{array}{lcl}\displaystyle\mathbb{E}\left[\ln\left(\frac{g(X)}{f(X)}\right)\right]\leq\ln\mathbb{E}\left[\frac{g(X)}{f(X)}\right]=\ln\left(\int_{-\infty}^{\infty}\frac{g(x)}{f(x)}f(x)dx\right)&=&\ln\left(\int_{-\infty}^{\infty}g(x)dx\right)\\&=&\ln 1\\&=&0\end{array}

Observing that \mathbb{E}[\ln(\frac{g(X)}{f(X)})]=-\mathbb{E}[\ln(\frac{f(X)}{g(X)})] completes the proof. \Box
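As a quick numerical sanity check of Proposition 1 (an illustration added here, not part of the proof), one can pick two explicit densities and compare the two integrals. The sketch below assumes NumPy and SciPy are available; it takes f to be the standard normal density and g a Student-t density, and truncates the integrals to a range where both densities are numerically safe.

# Numerical check of the information inequality: the integral of
# f*ln(g) should not exceed the integral of f*ln(f).
import numpy as np
from scipy import stats
from scipy.integrate import quad

f = stats.norm(0, 1).pdf   # f: standard normal density
g = stats.t(df=3).pdf      # g: Student-t density with 3 degrees of freedom

# Truncate to [-20, 20]; the omitted tails are numerically negligible.
lhs, _ = quad(lambda x: f(x) * np.log(g(x)), -20, 20)
rhs, _ = quad(lambda x: f(x) * np.log(f(x)), -20, 20)
print(lhs, rhs, lhs <= rhs)   # expect True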

The information inequality shows that the Kullback-Leibler divergence between two distributions P, Q of continuous random variables with probability density functions f and g, respectively, defined by

\displaystyle D_{KL}(P\parallel Q):=\int_{-\infty}^{\infty}\ln\left(\frac{f(x)}{g(x)}\right)f(x)dx

is nonnegative. Note that the Kullback-Leibler divergence can be generalized to the case where P,Q are probability measures on a measurable space (X,\mathcal{F}) and P is absolutely continuous with respect to Q. In that case, we define

\displaystyle D_{KL}(P\parallel Q):=\int_{X}\ln\left(\frac{dP}{dQ}\right)dP,

where \frac{dP}{dQ} is the Radon-Nikodym derivative of P with respect to Q. To see that D_{KL}(\cdot \parallel \cdot) is still nonnegative, we apply Jensen’s inequality to the convex function x\mapsto x\ln(x) to obtain

\begin{array}{lcl}\displaystyle\int_{X}\ln\left(\frac{dP}{dQ}\right)dP=\int_{X}\ln\left(\frac{dP}{dQ}\right)\frac{dP}{dQ}dQ&\geq&\ln\left(\int_{X}\frac{dP}{dQ}dQ\right)\left(\int_{X}\frac{dP}{dQ}dQ\right)\\&=&\ln\left(P(X)\right)P(X)\\&=&\ln(1)\\&=&0\end{array}
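To make this concrete (an illustration added here, not part of the original argument): when P = \text{N}(\mu_{1},\sigma_{1}^{2}) and Q = \text{N}(\mu_{2},\sigma_{2}^{2}), the divergence has the standard closed form

\displaystyle D_{KL}(P\parallel Q)=\ln\left(\frac{\sigma_{2}}{\sigma_{1}}\right)+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2},

and the sketch below (assuming NumPy and SciPy) checks it against direct numerical integration of \int f\ln(f/g)dx; both come out equal and nonnegative.

# Closed-form KL divergence between two normal densities versus
# direct numerical integration of f*ln(f/g).
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu1, s1 = 0.0, 1.0
mu2, s2 = 1.0, 2.0
f = stats.norm(mu1, s1).pdf
g = stats.norm(mu2, s2).pdf

closed_form = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
# Truncate to [-30, 30]; the omitted tails are numerically negligible.
numerical, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -30, 30)
print(closed_form, numerical)   # approximately 0.4431 for both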

Note that, although the KL-divergence can be intuitively thought of as a ‘distance’ between two probability measures, it does not define a metric since it need not be symmetric. For a simple counterexample, let P and Q be distributions on a two-point set (so that the integral in the definition becomes the sum \sum_{i}p_{i}\ln(p_{i}/q_{i})) with p_{1} = p_{2} = \frac{1}{2} and q_{1} = \frac{1}{3}, q_{2} = \frac{2}{3}. Then

\displaystyle D_{KL}(P \parallel Q)=\frac{1}{2}\left[\ln\left(\frac{3}{2}\right)+\ln\left(\frac{3}{4}\right)\right]\approx .059,

while

\displaystyle D_{KL}(Q\parallel P)=\frac{1}{3}\ln\left(\frac{2}{3}\right)+\frac{2}{3}\ln\left(\frac{4}{3}\right)\approx .057.
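These two values are easy to reproduce; the short check below (plain Python, added for illustration) computes the discrete divergence in both directions.

# Discrete KL divergence: sum of p_i * ln(p_i / q_i).
from math import log

def kl(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

p = [1/2, 1/2]
q = [1/3, 2/3]
print(round(kl(p, q), 3))   # 0.059
print(round(kl(q, p), 3))   # 0.057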

We now apply the information inequality to prove some results concerning maximum entropy. For a continuous random variable X with probability density function f, define the (differential) entropy of X to be

\displaystyle H(X) := -\int_{-\infty}^{\infty}f(x)\ln f(x)dx

We also use the notation H(f)=H(X).

We now prove that the \text{Unif}(a,b) distribution has maximum entropy among all probability density functions supported on (a,b). Let f be any pdf supported on (a,b) and let g(x):=(b-a)^{-1}\mathbf{1}_{[a,b]}(x). By the information inequality,

\begin{array}{lcl}\displaystyle H(f)\leq-\int_{a}^{b}f(x)\ln g(x)dx=-\int_{a}^{b}f(x)\ln\left(\frac{1}{b-a}\right)dx&=&\ln(b-a)\int_{a}^{b}f(x)dx\\&=&\ln(b-a)\end{array}

Taking f = g shows that the bound \ln(b-a) is attained, which completes the proof.
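A quick numerical illustration of this bound (added here; it assumes NumPy and SciPy are available): for a few densities supported on (0,1), the entropy stays at or below \ln(1-0)=0, the value attained by \text{Unif}(0,1).

# Differential entropy H(f) = -integral of f*ln(f) for densities on (0, 1);
# every value should be <= ln(b - a) = ln(1) = 0.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def entropy(pdf, a=0.0, b=1.0):
    return -quad(lambda x: pdf(x) * np.log(pdf(x)), a, b)[0]

print(entropy(stats.uniform(0, 1).pdf))   # 0.0, the maximum
print(entropy(stats.beta(2, 2).pdf))      # approximately -0.125
print(entropy(stats.beta(5, 1).pdf))      # approximately -0.809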

We now prove that \text{N}(\mu,\sigma^{2}) has maximum entropy among all probability density functions f on \mathbb{R} that have mean \mu and variance \sigma^{2}. If g is the density function of a \text{N}(\mu,\sigma^{2}) random variable, then

\begin{array}{lcl}\displaystyle H(f)\leq-\int_{-\infty}^{\infty}f(x)\ln\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\right)dx&=&\displaystyle-\int_{-\infty}^{\infty}f(x)\ln\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)dx+\int_{-\infty}^{\infty}f(x)\frac{(x-\mu)^{2}}{2\sigma^{2}}dx\\&=&\dfrac{\ln(2\pi\sigma^{2})+1}{2}\end{array}

since if X has pdf f, then

\displaystyle\sigma^{2}=\text{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^{2}]=\mathbb{E}[(X-\mu)^{2}]=\int_{-\infty}^{\infty}f(x)(x-\mu)^{2}dx

Taking f=g shows that this upper bound is attained.
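The same kind of numerical check works here (again an added illustration, assuming NumPy and SciPy): any density with mean \mu and variance \sigma^{2} should have entropy at most \frac{\ln(2\pi\sigma^{2})+1}{2}. Below, a Laplace density and a uniform density are scaled to have mean 0 and variance 1 and compared against the Gaussian bound.

# Entropies of mean-0, variance-1 densities versus the bound (ln(2*pi*sigma^2) + 1) / 2.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def entropy(pdf, a, b):
    # numerically integrate -f*ln(f) over [a, b], chosen so that f > 0 there
    return -quad(lambda x: pdf(x) * np.log(pdf(x)), a, b)[0]

bound = (np.log(2 * np.pi) + 1) / 2                       # sigma^2 = 1, approx 1.419
laplace = stats.laplace(scale=1 / np.sqrt(2)).pdf         # variance 2*scale^2 = 1
uniform = stats.uniform(-np.sqrt(3), 2 * np.sqrt(3)).pdf  # Unif(-sqrt(3), sqrt(3)), variance 1

print(bound)
print(entropy(stats.norm(0, 1).pdf, -30, 30))             # attains the bound
print(entropy(laplace, -30, 30))                          # approx 1.347, below the bound
print(entropy(uniform, -np.sqrt(3), np.sqrt(3)))          # approx 1.242, below the bound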


Responses to Information Inequality, Kullback-Leibler Divergence

  1. Anonymous says:

    You’re using Jensen’s inequality the wrong way around! You’ll need to rethink your first proof…

    • matthewhr says:

      Thank you for your comment. I take it that you are referring to the proof of Proposition 1. I reviewed my proof, and I’m not seeing the mistake. f(x) = ln(x) is a concave function (compute the second derivative), so Jensen’s inequality is reversed.

  2. Anonymous says:

    I think there is a subtle problem with the first proof: g(x)/f(x) * f(x) = g(x) only on the set {f(x) != 0}, and the complementary set {f(x) = 0} is not necessarily dx-measure zero (although it is measure zero with respect to the probability measure corresponding to f(x)). On the set {f(x) = 0}, we have instead g(x)/f(x) * f(x) = 0. As such,

    int g(x)/f(x) * f(x) dx = int_{f(x) != 0} g(x) dx

    which in general is only equal to

    int g(x) dx

    when {f(x) = 0} is dx-measure zero.

    However, I agree with your proof of the more general case, which can easily be specialised to this case here.
