We begin with an inequality for probability density functions, sometimes referred to as the information inequality. We use this inequality to prove that the Kullback-Leibler divergence, a `distance’ between two probability measures, is always nonnegative.
The inequality states that if $f$ is a probability density function, then
$$\int f(x) \log f(x)\,dx \ge \int f(x) \log g(x)\,dx$$
for all nonnegative measurable functions $g$ satisfying $\int g(x)\,dx \le 1$.
Proof. Let $X$ be a random variable with probability density function $f$. Consider the random variable $Y = g(X)/f(X)$. We will show that
$$\mathbb{E}[\log Y] = \int f(x)\log \frac{g(x)}{f(x)}\,dx \le 0,$$
which is exactly the stated inequality. Since $\log$ is concave, Jensen's inequality gives
$$\mathbb{E}[\log Y] \le \log \mathbb{E}[Y].$$
Observing that
$$\mathbb{E}[Y] = \int \frac{g(x)}{f(x)}\,f(x)\,dx = \int g(x)\,dx \le 1,$$
so that $\log \mathbb{E}[Y] \le 0$, completes the proof.
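As a quick numerical sanity check, the following sketch compares the two integrals for an illustrative choice of $f$ (the standard normal density) and $g$ (a Laplace density); the particular pair is my own choice, and any $f$, $g$ with $\int g \le 1$ would do.

```python
# Numerical sanity check of the information inequality (illustrative sketch;
# the specific densities f and g are assumptions, not from the text).
import numpy as np

x = np.linspace(-20, 20, 200001)
dx = x[1] - x[0]

f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
g = 0.5 * np.exp(-np.abs(x))                 # Laplace(0, 1) density, integrates to 1

lhs = np.sum(f * np.log(f)) * dx             # int f log f  (about -1.419)
rhs = np.sum(f * np.log(g)) * dx             # int f log g  (about -1.491)

print(f"int f log f = {lhs:.4f}")
print(f"int f log g = {rhs:.4f}")
assert lhs >= rhs                            # the information inequality
```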
The information inequality is used to show that the Kullback-Leibler divergence between two distributions of continuous random variables with probability density functions $f$ and $g$, respectively, defined by
$$D(f\|g) = \int f(x) \log \frac{f(x)}{g(x)}\,dx,$$
is nonnegative: indeed, $D(f\|g) = \int f(x)\log f(x)\,dx - \int f(x)\log g(x)\,dx \ge 0$ by the information inequality. Note that the Kullback-Leibler divergence can be generalized to the case where $P$ and $Q$ are probability measures on a measurable space $(\Omega, \mathcal{F})$ and $P$ is absolutely continuous with respect to $Q$. In this case, we define
$$D(P\|Q) = \int_\Omega \log \frac{dP}{dQ}\,dP,$$
where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative of $P$ with respect to $Q$. To see that $D(P\|Q)$ is still nonnegative, we apply Jensen's inequality to the convex function $\varphi(x) = x\log x$ to obtain
$$D(P\|Q) = \int_\Omega \varphi\!\left(\frac{dP}{dQ}\right) dQ \ge \varphi\!\left(\int_\Omega \frac{dP}{dQ}\,dQ\right) = \varphi(1) = 0.$$
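To make the measure-theoretic definition concrete, here is a small sketch for two probability measures on a three-point space (the point masses are arbitrary choices for illustration), where the Radon-Nikodym derivative $dP/dQ$ is just the ratio of point masses.

```python
# D(P||Q) for measures on a finite set: dP/dQ is the ratio of point masses
# (the masses below are arbitrary choices for illustration).
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # point masses of P
q = np.array([0.25, 0.25, 0.5])  # point masses of Q; q > 0, so P << Q

dPdQ = p / q                      # Radon-Nikodym derivative at each point
kl = np.sum(p * np.log(dPdQ))     # int log(dP/dQ) dP

print(f"D(P||Q) = {kl:.4f}")      # nonnegative, as shown above
```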
Note that, although the KL-divergence can be intuitively thought of as a `distance' between two probability measures, it does not define a metric since it need not be symmetric. For a simple counterexample, let $f(x) = e^{-x}$ and $g(x) = 2e^{-2x}$ for $x > 0$ be the densities of the $\mathrm{Exp}(1)$ and $\mathrm{Exp}(2)$ distributions. Then
$$D(f\|g) = \int_0^\infty e^{-x}(x - \log 2)\,dx = 1 - \log 2, \qquad D(g\|f) = \int_0^\infty 2e^{-2x}(\log 2 - x)\,dx = \log 2 - \tfrac{1}{2},$$
so $D(f\|g) \ne D(g\|f)$.
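This asymmetry is easy to confirm numerically; the sketch below approximates both divergences for the two exponential densities by a Riemann sum.

```python
# Numerical check of the asymmetry of the KL divergence for Exp(1) vs Exp(2).
import numpy as np

x = np.linspace(1e-6, 50, 500001)
dx = x[1] - x[0]

f = np.exp(-x)           # Exp(1) density
g = 2 * np.exp(-2 * x)   # Exp(2) density

kl_fg = np.sum(f * np.log(f / g)) * dx   # about 1 - log 2   ~ 0.307
kl_gf = np.sum(g * np.log(g / f)) * dx   # about log 2 - 1/2 ~ 0.193

print(f"D(f||g) = {kl_fg:.4f}, D(g||f) = {kl_gf:.4f}")
```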
We now apply the information inequality to prove some results concerning maximal entropy. Define the entropy of a probability density function $f$ to be
$$h(f) = -\int f(x)\log f(x)\,dx.$$
We also use the notation $h(X) = h(f)$ when $X$ is a random variable with probability density function $f$.
We now prove that the uniform distribution on an interval $[a,b]$ has maximum entropy among all probability density functions supported on $[a,b]$. For any pdf $f$ supported on $[a,b]$ and $g = \frac{1}{b-a}\mathbf{1}_{[a,b]}$, we have by the information inequality that
$$h(f) = -\int f(x)\log f(x)\,dx \le -\int f(x)\log g(x)\,dx = \log(b-a).$$
Taking $f = g$, for which $h(g) = \log(b-a)$, completes the proof.
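For a concrete illustration, the sketch below takes $[a,b] = [0,2]$ and an arbitrarily chosen non-uniform density on that interval, and compares its entropy with the maximum value $\log(b-a)$.

```python
# Entropy of a non-uniform density on [0, 2] versus the maximum log(b - a)
# attained by the uniform density (the interval and density are my choices).
import numpy as np

a, b = 0.0, 2.0
x = np.linspace(a + 1e-9, b, 200001)
dx = x[1] - x[0]

f = x / 2.0                          # f(x) = x/2 integrates to 1 on [0, 2]
h_f = -np.sum(f * np.log(f)) * dx    # entropy of f, about 0.5

print(f"h(f)       = {h_f:.4f}")
print(f"log(b - a) = {np.log(b - a):.4f}")   # about 0.693, the maximum
```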
We now prove that the normal distribution $N(\mu,\sigma^2)$ has maximum entropy among all probability density functions on $\mathbb{R}$ that have mean $\mu$ and variance $\sigma^2$. Let $g$ denote the $N(\mu,\sigma^2)$ density. If $f$ is the density function of a random variable with mean $\mu$ and variance $\sigma^2$, then, by the information inequality,
$$h(f) \le -\int f(x)\log g(x)\,dx = \tfrac{1}{2}\log(2\pi e\sigma^2),$$
since if $X$ has pdf $f$, then
$$-\int f(x)\log g(x)\,dx = \int f(x)\left(\tfrac{1}{2}\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\right)dx = \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{\mathbb{E}[(X-\mu)^2]}{2\sigma^2} = \tfrac{1}{2}\log(2\pi e\sigma^2).$$
Taking $f = g$ shows that this upper bound is attained, since $h(g) = \tfrac{1}{2}\log(2\pi e\sigma^2)$.
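As an illustration, the sketch below compares the entropy of a Laplace density matched to have mean $\mu = 0$ and variance $\sigma^2 = 1$ (an arbitrary comparison choice) with the maximal value $\tfrac{1}{2}\log(2\pi e\sigma^2)$ attained by $N(0,1)$.

```python
# Entropy of a Laplace density with mean 0 and variance 1, compared with the
# maximum (1/2) log(2*pi*e*sigma^2) attained by N(0, 1).
import numpy as np

mu, sigma = 0.0, 1.0
x = np.linspace(-40, 40, 800001)
dx = x[1] - x[0]

scale = sigma / np.sqrt(2)                         # Laplace variance is 2*scale^2
f = np.exp(-np.abs(x - mu) / scale) / (2 * scale)  # same mean and variance as N(0, 1)

h_f = -np.sum(f * np.log(f)) * dx                  # about 1.347
h_max = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # about 1.419

print(f"h(Laplace) = {h_f:.4f}")
print(f"h(Normal)  = {h_max:.4f}")
```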