We begin with an inequality for probability density functions, sometimes referred to as the information inequality. We use this inequality to prove that the Kullback-Leibler divergence, a ‘distance’ between two probability measures, is always nonnegative.
Proposition 1.

$$\int f(x) \ln \frac{g(x)}{f(x)}\,dx \le 0$$

for all nonnegative measurable functions $f$ and $g$ satisfying $\int f(x)\,dx = \int g(x)\,dx = 1$.
Proof. Let $X$ be a random variable with probability density function $f$. Consider the random variable $\ln \frac{g(X)}{f(X)}$. We will show that

$$E\left[\ln \frac{g(X)}{f(X)}\right] \le 0.$$
Since $\ln$ is concave, Jensen’s inequality gives

$$E\left[\ln \frac{g(X)}{f(X)}\right] \le \ln E\left[\frac{g(X)}{f(X)}\right] = \ln \int \frac{g(x)}{f(x)}\,f(x)\,dx.$$
Observing that $\int \frac{g(x)}{f(x)}\,f(x)\,dx = \int g(x)\,dx = 1$ and $\ln 1 = 0$ completes the proof.
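As a quick numerical sanity check (not part of the argument above), the information inequality can be verified for a concrete pair of densities; the exponential densities below are purely illustrative choices:

```python
import math

# Check that  ∫ f ln f dx >= ∫ f ln g dx  (equivalent to Proposition 1)
# for f = Exponential(1) and g = Exponential(2), two arbitrary densities.

def f(x):
    return math.exp(-x)                # Exponential(rate 1) density

def g(x):
    return 2.0 * math.exp(-2.0 * x)    # Exponential(rate 2) density

def riemann(h, a, b, n=200_000):
    """Midpoint Riemann sum of h over [a, b]."""
    dx = (b - a) / n
    return sum(h(a + (i + 0.5) * dx) for i in range(n)) * dx

# Truncate at x = 40; the exponential tails beyond that are negligible.
lhs = riemann(lambda x: f(x) * math.log(f(x)), 0.0, 40.0)
rhs = riemann(lambda x: f(x) * math.log(g(x)), 0.0, 40.0)

print(lhs, rhs)   # analytically: lhs = -1, rhs = ln 2 - 2 ≈ -1.307
assert lhs >= rhs
```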
The information inequality is used to show that the Kullback-Leibler divergence between two distributions of continuous random variables with probability density functions $f$ and $g$, respectively, defined by

$$D(f \| g) = \int f(x) \ln \frac{f(x)}{g(x)}\,dx,$$
is nonnegative. Note that the Kullback-Leibler divergence can be generalized to the case where $P$ and $Q$ are probability measures on a measurable space $(\mathcal{X}, \mathcal{A})$ and $P$ is absolutely continuous with respect to $Q$. In this case, we define

$$D(P \| Q) = \int_{\mathcal{X}} \ln \frac{dP}{dQ}\,dP,$$
where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative of $P$ with respect to $Q$. To see that $D(P \| Q)$ is still nonnegative, we apply Jensen’s inequality to the convex function $\varphi(x) = x \ln x$ to obtain

$$D(P \| Q) = \int_{\mathcal{X}} \varphi\!\left(\frac{dP}{dQ}\right) dQ \ge \varphi\!\left(\int_{\mathcal{X}} \frac{dP}{dQ}\,dQ\right) = \varphi(1) = 0.$$
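When $P$ and $Q$ live on a finite set, the Radon-Nikodym derivative reduces to the ratio of point masses, and the divergence becomes the familiar sum $\sum_i p_i \ln(p_i/q_i)$. A small sketch (with two arbitrarily chosen distributions) checks nonnegativity and that the divergence vanishes when $P = Q$:

```python
import math

# Discrete case: for measures P, Q on a finite set with masses p_i, q_i > 0,
# dP/dQ is the ratio p_i / q_i, so  D(P || Q) = sum_i p_i * ln(p_i / q_i).
# The two distributions below are arbitrary illustrative choices.

def kl(p, q):
    """Kullback-Leibler divergence between two finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

print(kl(p, q), kl(q, p))   # both nonnegative
assert kl(p, q) >= 0.0 and kl(q, p) >= 0.0
assert kl(p, p) == 0.0      # zero divergence when P = Q
```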
Note that, although the KL-divergence can be intuitively thought of as a ‘distance’ between two probability measures, it does not define a metric since it need not be symmetric. For a simple counterexample, let $f$ be the density of the uniform distribution on $[0,1]$ and $g$ the density of the uniform distribution on $[0,2]$. Then

$$D(f \| g) = \int_0^1 \ln \frac{1}{1/2}\,dx = \ln 2,$$
while

$$D(g \| f) = \int_0^2 \frac{1}{2} \ln \frac{1/2}{f(x)}\,dx = \infty,$$

since $f$ vanishes on $(1, 2]$.
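The finite direction of this asymmetry can be checked with a crude midpoint quadrature (a numerical sketch, not a rigorous computation):

```python
import math

# Asymmetry check: f = Uniform(0,1) density, g = Uniform(0,2) density.
# D(f || g) is finite (= ln 2), while D(g || f) is infinite because
# f vanishes on (1, 2] where g still puts mass.

def f(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def g(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

def kl(p, q, a, b, n=100_000):
    """Midpoint sum for ∫ p ln(p/q) dx over [a, b] (assumes q > 0 where p > 0)."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        if p(x) > 0.0:
            total += p(x) * math.log(p(x) / q(x)) * dx
    return total

d_fg = kl(f, g, 0.0, 2.0)
print(d_fg)   # ≈ ln 2 ≈ 0.6931
# kl(g, f, ...) would divide by f(x) = 0 on (1, 2]: D(g || f) = ∞.
```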
We now apply the information inequality to prove some results concerning maximal entropy. Define the entropy of a probability density function $f$ to be

$$h(f) = -\int f(x) \ln f(x)\,dx.$$
We also use the notation $h(X) = h(f)$ when $X$ is a random variable with probability density function $f$.
We now prove that the uniform distribution on $[a,b]$ has maximum entropy among all probability density functions supported on $[a,b]$. For any pdf $f$ supported on $[a,b]$ and the uniform density $g(x) = \frac{1}{b-a}$ on $[a,b]$, we have by the information inequality that

$$h(f) = -\int_a^b f(x) \ln f(x)\,dx \le -\int_a^b f(x) \ln g(x)\,dx = \ln(b-a).$$
Taking $f = g$ completes the proof.
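To illustrate numerically, one can compare the uniform density on $[0,2]$, whose entropy is $\ln 2$, against any other density on the same interval; the triangular density below is a hypothetical choice whose entropy works out to $1/2$:

```python
import math

# Entropy comparison on [0, 2]: the uniform density attains ln 2 ≈ 0.693,
# while a triangular density on the same interval falls short (h = 0.5).

def entropy(f, a, b, n=200_000):
    """Midpoint sum for h(f) = -∫ f ln f dx over [a, b], with 0 ln 0 = 0."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        fx = f(x)
        if fx > 0.0:
            total -= fx * math.log(fx) * dx
    return total

uniform = lambda x: 0.5                                 # Uniform(0, 2)
triangular = lambda x: x if x <= 1.0 else 2.0 - x       # peak 1 at x = 1

h_uni = entropy(uniform, 0.0, 2.0)
h_tri = entropy(triangular, 0.0, 2.0)
print(h_uni, h_tri)   # ≈ 0.693 and ≈ 0.5
assert h_tri <= h_uni
```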
We now prove that the normal distribution $N(\mu, \sigma^2)$ has maximum entropy among all probability density functions on $\mathbb{R}$ that have mean $\mu$ and variance $\sigma^2$. If $f$ is the density function of such a random variable and $g$ is the density of $N(\mu, \sigma^2)$, then

$$h(f) \le -\int f(x) \ln g(x)\,dx = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2},$$
since if $X$ has pdf $f$ with mean $\mu$ and variance $\sigma^2$, then

$$-\int f(x) \ln g(x)\,dx = \int f(x)\left[\frac{1}{2}\ln(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\right] dx = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2}.$$
Taking $f = g$ shows that this upper bound is attained.
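As a numerical illustration, the standard normal can be compared with a Laplace density matched to the same mean and variance (a hypothetical competitor; its entropy is $1 + \frac{1}{2}\ln 2 \approx 1.347$, below the Gaussian bound $\frac{1}{2}\ln(2\pi) + \frac{1}{2} \approx 1.419$):

```python
import math

# Entropy comparison for mean 0, variance 1: N(0,1) versus a
# Laplace(0, 1/sqrt(2)) density scaled to have the same variance.

def entropy(f, a, b, n=400_000):
    """Midpoint sum for h(f) = -∫ f ln f dx over [a, b]."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        fx = f(x)
        if fx > 0.0:
            total -= fx * math.log(fx) * dx
    return total

def gaussian(x):                       # N(0, 1) density
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def laplace(x):                        # Laplace with mean 0, variance 1
    b = 1.0 / math.sqrt(2.0)
    return math.exp(-abs(x) / b) / (2.0 * b)

# Truncate at ±20; both tails are negligible there.
h_gauss = entropy(gaussian, -20.0, 20.0)
h_lap = entropy(laplace, -20.0, 20.0)
print(h_gauss, h_lap)   # ≈ 1.419 and ≈ 1.347
assert h_lap <= h_gauss
```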
You’re using Jensen’s inequality the wrong way around! You’ll need to rethink your first proof…
Thank you for your comment. I take it that you are referring to the proof of Proposition 1. I reviewed my proof, and I’m not seeing the mistake. f(x) = ln(x) is a concave function (compute the second derivative), so Jensen’s inequality is reversed.
I think there is a subtle problem with the first proof: g(x)/f(x) * f(x) = g(x) only on the set {f(x) != 0}, and the complement {f(x) = 0} is not necessarily dx-measure zero (although it is measure zero with respect to the probability measure corresponding to f(x)). On the set {f(x) = 0}, we have instead g(x)/f(x) * f(x) = 0. As such,

int g(x)/f(x) * f(x) dx = int_{f(x) != 0} g(x) dx,

which in general is only equal to

int g(x) dx

when g vanishes dx-almost everywhere on {f(x) = 0} (in particular, when {f(x) = 0} is dx-measure zero).
However, I agree with your proof of the more general case, which can easily be specialised to this case here.