## Inverse Jensen Inequality

Lately, I’ve been on a bit of a convexity kick. It’s one of many subjects that I never studied in detail as a student, despite being acquainted with its importance. Not constrained by the curricula of offered courses, I have the freedom to indulge my mathematical curiosity. As one might sense from my recent posts, this curiosity is currently gorging itself on the basic theory of convex functions.

I also really like inequalities. So it is not surprising that, where convexity and the study of inequalities meet, I would want to write about Jensen’s inequality. Indeed, this is the case.

Jensen’s inequality (in its discrete form) says that given real numbers $a_{1},\cdots,a_{n}\in I$, a convex function $f: I\rightarrow\mathbb{R}$, and weights $\lambda_{1},\cdots,\lambda_{n}\in [0,1]$ with $\sum_{k=1}^{n}\lambda_{k}=1$ (i.e., the coefficients of a convex combination),

$\displaystyle f\left(\sum_{k=1}^{n}\lambda_{k}a_{k}\right)\leq\sum_{k=1}^{n}\lambda_{k}f(a_{k})$
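As a quick sanity check, here is a small Python sketch of the discrete inequality using $f(x)=e^{x}$ with randomly generated points and weights (the specific choices are mine, purely for illustration):

```python
import math
import random

random.seed(0)

# Spot-check Jensen's inequality for the convex function f(x) = e^x.
f = math.exp
n = 5
a = [random.uniform(-2.0, 2.0) for _ in range(n)]

raw = [random.random() for _ in range(n)]
lam = [r / sum(raw) for r in raw]  # nonnegative weights summing to 1

lhs = f(sum(l * x for l, x in zip(lam, a)))  # f of the convex combination
rhs = sum(l * f(x) for l, x in zip(lam, a))  # convex combination of f-values
assert lhs <= rhs
```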

Note that if $f$ is concave, the inequality is simply reversed. Jensen’s inequality is well-known to beginning math students, especially if they’ve learned a bit of probability theory, but less well-known (my sample size is 1: myself) are upper bounds for the quantity

$\displaystyle\sum_{k=1}^{n}\lambda_{k}f(x_{k})-f\left(\sum_{k=1}^{n}\lambda_{k}x_{k}\right)\geq 0,\indent\forall(x_{1},\cdots,x_{n})\in I^{n}$

viewed as a function on the cube $I^{n}$. Today’s post is intended as a brief exposition of such bounds. We begin by stating the main theorem in the elementary theory of such bounds, then prove a version of Lagrange’s mean value theorem valid for nondifferentiable functions, which we will need in the proof of the preceding theorem, and we conclude by deriving some well-known inequalities, including an estimate due to Khanin.

Theorem 1. Let $f:[a,b]\rightarrow\mathbb{R}$ be a convex function and let $[m_{1},M_{1}],\cdots,[m_{n},M_{n}]$ be compact subintervals of $[a,b]$. Given $\lambda_{1},\cdots,\lambda_{n}\in[0,1]$ satisfying $\sum_{k=1}^{n}\lambda_{k}=1$, the function $E:\mathbb{R}^{n}\rightarrow\mathbb{R}$ defined by

$\displaystyle E(x_{1},\cdots,x_{n}):=\sum_{k=1}^{n}\lambda_{k}f(x_{k})-f\left(\sum_{k=1}^{n}\lambda_{k}x_{k}\right)$

attains its supremum at a vertex of $\Omega:=\prod_{k=1}^{n}[m_{k},M_{k}]$.
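Before proving the theorem, one can test it numerically. The sketch below uses hypothetical choices of mine ($f(x)=x^{2}$, three weights, three subintervals) and checks that the maximum of $E$ over a fine grid of $\Omega$ never exceeds its maximum over the $2^{n}$ vertices:

```python
import itertools

# E(x) = sum_k λ_k f(x_k) - f(sum_k λ_k x_k) for a convex f.
# Theorem 1: the sup of E over the box Ω is attained at a vertex.
f = lambda x: x * x          # convex on all of R
lam = [0.2, 0.3, 0.5]        # weights summing to 1
boxes = [(-1.0, 2.0), (0.0, 1.0), (-2.0, 0.5)]  # intervals [m_k, M_k]

def E(xs):
    avg = sum(l * x for l, x in zip(lam, xs))
    return sum(l * f(x) for l, x in zip(lam, xs)) - f(avg)

# Maximum over a fine grid of the box...
grid = [[m + (M - m) * i / 40 for i in range(41)] for m, M in boxes]
grid_max = max(E(xs) for xs in itertools.product(*grid))

# ...is attained (up to rounding) among the 2^3 vertices.
vertex_max = max(E(xs) for xs in itertools.product(*boxes))
assert grid_max <= vertex_max + 1e-12
```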

The following lemma generalizes Lagrange’s mean value theorem for real-valued functions on an interval $I$ by replacing the derivative with the weaker quantities

$\displaystyle\underline{D}f(a)=\liminf_{x\rightarrow a}\dfrac{f(x)-f(a)}{x-a}\text{ and }\overline{D}f(a)=\limsup_{x\rightarrow a}\dfrac{f(x)-f(a)}{x-a}$,

respectively called the lower and upper derivatives of $f$ at the point $a$, which are defined everywhere on $I$ (possibly as extended real numbers).

Lemma 2. Let $h: [a,b]\rightarrow\mathbb{R}$ be a continuous function. Then there exists a point $c\in (a,b)$ such that

$\displaystyle\underline{D}h(c)\leq\dfrac{h(b)-h(a)}{b-a}\leq\overline{D}h(c)$

Proof. Define a function $H: [a,b]\rightarrow\mathbb{R}$ by

$\displaystyle H(x):=h(x)-\dfrac{h(b)-h(a)}{b-a}(x-a),\indent\forall x\in[a,b]$

Observe that

$\displaystyle H(a)=h(a)=h(b)-\dfrac{h(b)-h(a)}{b-a}(b-a)=H(b)$

$H$ is continuous, being the difference of continuous functions, so by Weierstrass’ theorem, $H$ attains its infimum and supremum on $[a,b]$. Suppose $H$ attains its supremum at some point $c\in (a,b)$. Then for $\delta>0$ small,

$\displaystyle\dfrac{H(c+\delta)-H(c)}{\delta}\leq 0\text{ and }\dfrac{H(c-\delta)-H(c)}{-\delta}\geq 0$,

which implies that $\underline{D}H(c)\leq 0\leq\overline{D}H(c)$. If $H$ attains its infimum at $c \in(a,b)$, then reversing the preceding inequalities yields the same result. In either case, we have

$\begin{array}{lcl} \displaystyle 0\geq\underline{D}H(c)=\liminf_{x\rightarrow c}\dfrac{H(x)-H(c)}{x-c}&=&\displaystyle\liminf_{x\rightarrow c}\dfrac{h(x)-\frac{h(b)-h(a)}{b-a}(x-a)-h(c)+\frac{h(b)-h(a)}{b-a}(c-a)}{x-c}\\&=&\displaystyle\liminf_{x\rightarrow c}\dfrac{h(x)-h(c)-\frac{h(b)-h(a)}{b-a}(x-c)}{x-c}\\&=&\displaystyle\underline{D}h(c)-\dfrac{h(b)-h(a)}{b-a}\end{array}$

so that $\underline{D}h(c)\leq\frac{h(b)-h(a)}{b-a}$; similarly, $\overline{D}H(c)\geq 0$ yields $\overline{D}h(c)\geq\frac{h(b)-h(a)}{b-a}$. If $H$ attains its supremum and infimum only at the endpoints, then since $H(a)=H(b)$, $H$ is necessarily constant, and the conclusion of the lemma is immediate. $\Box$
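To see the lemma in action for a nondifferentiable function, take $h(x)=|x|$ on $[-1,2]$: the chord slope is $1/3$, and at $c=0$ the lower and upper derivatives are $-1$ and $1$, which bracket it. A numerical sketch (the example is mine):

```python
# Lemma 2 for h(x) = |x| on [-1, 2]; h is not differentiable at c = 0.
h = abs
a, b, c = -1.0, 2.0, 0.0
slope = (h(b) - h(a)) / (b - a)  # chord slope (h(2) - h(-1)) / 3 = 1/3

# Approximate the lower/upper derivatives at c by two-sided difference
# quotients at ever-smaller step sizes.
steps = [s * 10.0 ** (-k) for k in range(1, 8) for s in (1, -1)]
quotients = [(h(c + t) - h(c)) / t for t in steps]
lower_D, upper_D = min(quotients), max(quotients)  # -1 and +1 here

assert lower_D <= slope <= upper_D
```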

We now use the lemma to prove the theorem.

Proof. Fix compact subintervals $[m_{1},M_{1}],\cdots,[m_{n},M_{n}]$. Since $f$ is convex, it is continuous, so $E$ is continuous; since $\Omega:=\prod_{k=1}^{n}[m_{k},M_{k}]$ is compact, $E$ attains its supremum and infimum on $\Omega$. By induction on the coordinates, it suffices to show that

$\displaystyle E(x_{1},\cdots,x_{n})\leq \max\left\{E(x_{1},\cdots,m_{k},\cdots,x_{n}),E(x_{1},\cdots,M_{k},\cdots,x_{n})\right\}$

for all $(x_{1},\cdots,x_{n})\in\Omega$ and $1\leq k\leq n$.

Assume the contrary: that there exists $(x_{1},\cdots,x_{n})\in\Omega$ and $k\in\left\{1,\cdots,n\right\}$ such that

$\displaystyle E(x_{1},\cdots,x_{n})>\max\left\{E(x_{1},\cdots,m_{k},\cdots,x_{n}),E(x_{1},\cdots,M_{k},\cdots,x_{n})\right\}$

Note that this assumption forces $\lambda_{k}\in(0,1)$ and $x_{k}\in(m_{k},M_{k})$. By symmetry, we may assume that $k=1$. Consider the function

$\displaystyle h:[m_{1},M_{1}]\rightarrow\mathbb{R},\indent h(x):=E(x,x_{2},\cdots,x_{n})\ \forall x\in [m_{1},M_{1}]$

By the lemma applied to $h|_{[m_{1},x_{1}]}$, there exists $\xi\in (m_{1},x_{1})$ such that

$\displaystyle\dfrac{h(x_{1})-h(m_{1})}{x_{1}-m_{1}}\leq \overline{D}h(\xi)\Longleftrightarrow h(x_{1})-h(m_{1})\leq\overline{D}h(\xi)(x_{1}-m_{1})$

Since $h(x_{1})>h(m_{1})$, it follows that $\overline{D}h(\xi)>0$ and therefore

$\begin{array}{lcl}\displaystyle 0&<&\displaystyle\limsup_{x\rightarrow \xi}\dfrac{h(x)-h(\xi)}{x-\xi}=\limsup_{x\rightarrow\xi}\dfrac{\lambda_{1}\left[f(x)-f(\xi)\right]-f\left(\lambda_{1}x+\sum_{k=2}^{n}\lambda_{k}x_{k}\right)+f\left(\lambda_{1}\xi+\sum_{k=2}^{n}\lambda_{k}x_{k}\right)}{x-\xi}\\&\Longrightarrow&\displaystyle\lambda_{1}\limsup_{x\rightarrow\xi}\dfrac{f(x)-f(\xi)}{x-\xi}>\limsup_{x\rightarrow\xi}\dfrac{f\left(\lambda_{1}x+\sum_{k=2}^{n}\lambda_{k}x_{k}\right)-f\left(\lambda_{1}\xi+\sum_{k=2}^{n}\lambda_{k}x_{k}\right)}{x-\xi}\\&\Longrightarrow&\displaystyle\overline{D}f(\xi)>\limsup_{x\rightarrow\xi}\dfrac{f\left(\lambda_{1}x+\sum_{k=2}^{n}\lambda_{k}x_{k}\right)-f\left(\lambda_{1}\xi+\sum_{k=2}^{n}\lambda_{k}x_{k}\right)}{\lambda_{1}x-\lambda_{1}\xi}\\&\Longrightarrow&\displaystyle\overline{D}f(\xi)>\overline{D}f\left(\lambda_{1}\xi+\lambda_{2}x_{2}+\cdots+\lambda_{n}x_{n}\right)\end{array}$

By the convexity of $f$, $\overline{D}f$ is nondecreasing on $(a,b)$, and therefore $\xi>\lambda_{1}\xi+\lambda_{2}x_{2}+\cdots+\lambda_{n}x_{n}$. Thus,

$\displaystyle\xi=\sum_{k=1}^{n}\lambda_{k}\xi>\lambda_{1}\xi+\sum_{k=2}^{n}\lambda_{k}x_{k}\Longleftrightarrow\xi>\dfrac{\sum_{k=2}^{n}\lambda_{k}x_{k}}{\sum_{k=2}^{n}\lambda_{k}}$

We apply Lemma 2 to $h|_{[x_{1},M_{1}]}$ and the same reasoning to obtain $\eta\in (x_{1},M_{1})$ such that $\underline{D}h(\eta)<0$, whence

$\displaystyle\underline{D}f(\eta)<\underline{D}f(\lambda_{1}\eta+\lambda_{2}x_{2}+\cdots+\lambda_{n}x_{n})\Longrightarrow \eta<\dfrac{\sum_{k=2}^{n}\lambda_{k}x_{k}}{\sum_{k=2}^{n}\lambda_{k}}$

since $\underline{D}f$ is likewise nondecreasing.

But then $\xi>\dfrac{\sum_{k=2}^{n}\lambda_{k}x_{k}}{\sum_{k=2}^{n}\lambda_{k}}>\eta$ while $\eta>x_{1}>\xi$, a contradiction. $\Box$

Proposition 3. If we impose the additional hypothesis that $f$ is differentiable on an interval $[a,b]$, then we have the inequality

$\displaystyle\sum_{k=1}^{n}\lambda_{k}f(x_{k})-f\left(\sum_{k=1}^{n}\lambda_{k}x_{k}\right)\leq\dfrac{(b-a)(f'(b)-f'(a))}{4}$

Proof. By Theorem 1,

$\displaystyle E(x_{1},\cdots,x_{n}):=\sum_{k=1}^{n}\lambda_{k}f(x_{k})-f\left(\sum_{k=1}^{n}\lambda_{k}x_{k}\right)$

attains its maximum at a point $(c_{1},\cdots,c_{n})\in\left\{a,b\right\}^{n}$. Let $I\subset\left\{1,\cdots,n\right\}$ denote the subset of indices $k$ such that $c_{k}=a$. Then

$\begin{array}{lcl}\displaystyle\sum_{k=1}^{n}\lambda_{k}f(x_{k})-f\left(\sum_{k=1}^{n}\lambda_{k}x_{k}\right)&\leq&\displaystyle\sum_{k\in I}\lambda_{k}f(a)+\sum_{k\in I^{c}}\lambda_{k}f(b)-f\left(\sum_{k\in I}\lambda_{k}a+\sum_{k\in I^{c}}\lambda_{k}b\right)\\[.9em]&=&\displaystyle f(a)\underbrace{\sum_{k\in I}\lambda_{k}}_{:=s}+f(b)\sum_{k\in I^{c}}\lambda_{k}-f\left(a\sum_{k\in I}\lambda_{k}+b\sum_{k\in I^{c}}\lambda_{k}\right)\\[.9em]&=&\displaystyle sf(a)+(1-s)f(b)-f(sa+(1-s)b)\end{array}$

Since $f$ is convex and differentiable, we have

$\displaystyle f(x)\geq f(t)+(x-t)f'(t),\indent\forall x,t\in [a,b]$

In particular,

$\displaystyle f(sa+(1-s)b)\geq f(a)+(1-s)(b-a)f'(a),\indent f(sa+(1-s)b)\geq f(b)-s(b-a)f'(b)$

Hence,

$\begin{array}{lcl} \displaystyle sf(a)+(1-s)f(b)-f(sa+(1-s)b)&=&\displaystyle s\left[f(a)-f(sa+(1-s)b)\right]+(1-s)\left[f(b)-f(sa+(1-s)b)\right]\\&\leq&\displaystyle -s(1-s)(b-a)f'(a)+(1-s)s(b-a)f'(b)\\&=&\displaystyle s(1-s)(b-a)[f'(b)-f'(a)]\end{array}$

The conclusion follows from noting that $s(1-s)\leq \frac{1}{4}$ for $s\in [0,1]$. $\Box$
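Numerically, with $f(x)=e^{x}$ on $[0,1]$ (so $f'=f$), the bound becomes $(b-a)(f'(b)-f'(a))/4=(e-1)/4$, and it holds for random data. A sketch with arbitrary choices of mine:

```python
import math
import random

random.seed(1)

# Check: Jensen gap ≤ (b - a)(f'(b) - f'(a)) / 4 for f = exp on [0, 1].
f = fprime = math.exp
a, b, n = 0.0, 1.0, 6
bound = (b - a) * (fprime(b) - fprime(a)) / 4  # = (e - 1) / 4

for _ in range(100):
    xs = [random.uniform(a, b) for _ in range(n)]
    raw = [random.random() for _ in range(n)]
    lam = [r / sum(raw) for r in raw]
    gap = sum(l * f(x) for l, x in zip(lam, xs)) \
          - f(sum(l * x for l, x in zip(lam, xs)))
    assert -1e-12 <= gap <= bound  # nonnegative by Jensen, bounded above
```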

Theorem 1 also yields the following simple corollary.

If $f: [a,b]\rightarrow\mathbb{R}$ is a convex function, then

$\displaystyle\dfrac{f(a)+f(b)}{2}-f\left(\dfrac{a+b}{2}\right)\geq\dfrac{f(c)+f(d)}{2}-f\left(\dfrac{c+d}{2}\right)$

for all $a\leq c\leq d \leq b$.

Proof. We apply Theorem 1 with $m_{1}=m_{2}=a$ and $M_{1}=M_{2}=b$ and $\lambda_{1}=\lambda_{2}=\frac{1}{2}$. The function

$\displaystyle E(x_{1},x_{2}):=\dfrac{f(x_{1})+f(x_{2})}{2}-f\left(\dfrac{x_{1}+x_{2}}{2}\right)$

attains its supremum at a vertex of the cube $[a,b]^{2}$. If the supremum is attained at $(a,a)$ or $(b,b)$, then the supremum is $0$; since $E\geq 0$ everywhere by Jensen’s inequality, $E$ is then identically zero and the conclusion is immediate. If the supremum is attained at $(a,b)$ or $(b,a)$, then since $(c,d)\in[a,b]^{2}$ by hypothesis, $E(c,d)\leq E(a,b)$, which is precisely the claimed inequality. $\Box$
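A quick numerical check of the corollary with $f(x)=x^{2}$, for which the Jensen gap of a pair is $(u-v)^{2}/4$; the endpoints and the random subinterval are my own illustrative choices:

```python
import random

random.seed(2)

f = lambda x: x * x  # convex; here E(u, v) = (u - v)^2 / 4
a, b = -1.0, 3.0

def E(u, v):
    return (f(u) + f(v)) / 2 - f((u + v) / 2)

for _ in range(100):
    c, d = sorted(random.uniform(a, b) for _ in range(2))
    # The gap over [a, b] dominates the gap over any subinterval [c, d].
    assert E(a, b) >= E(c, d) - 1e-12
```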

We can use Theorem 1 to prove an upper bound on the error in Jensen’s inequality, which Niculescu and Persson attribute to L.G. Khanin.

Proposition 4. Let $p>1$, $x_{1},\cdots,x_{n}\in[0,M]$, and $\lambda_{1},\cdots,\lambda_{n}\in[0,1]$ satisfying $\sum_{k=1}^{n}\lambda_{k}=1$. Then

$\displaystyle\sum_{k=1}^{n}\lambda_{k}x_{k}^{p}\leq\left(\sum_{k=1}^{n}\lambda_{k}x_{k}\right)^{p}+(p-1)p^{\frac{p}{1-p}}M^{p}$

Proof. Fix $p>1$ and define $E:[0,M]^{n}\rightarrow\mathbb{R}$ by

$\displaystyle E(x_{1},\cdots,x_{n}):=\sum_{k=1}^{n}\lambda_{k}x_{k}^{p}-\left(\sum_{k=1}^{n}\lambda_{k}x_{k}\right)^{p},\indent\forall(x_{1},\cdots,x_{n})\in[0,M]^{n}$

By Theorem 1, $E$ attains its supremum at a point $(c_{1},\cdots,c_{n})\in\left\{0,M\right\}^{n}$. Let $I\subset\left\{1,\cdots,n\right\}$ denote the subset of indices $k$ such that $c_{k}=M$. Then

$\begin{array}{lcl}\displaystyle E(x_{1},\cdots,x_{n})\leq E(c_{1},\cdots,c_{n})=\sum_{k\in I}\lambda_{k}c_{k}^{p}-\left(\sum_{k\in I}\lambda_{k}c_{k}\right)^{p}&=&\displaystyle\sum_{k\in I}\lambda_{k}M^{p}-\left(\sum_{k\in I}\lambda_{k}M\right)^{p}\\[.9em]&=&\displaystyle M^{p}\underbrace{\sum_{k\in I}\lambda_{k}}_{:=s}-M^{p}\left(\sum_{k\in I}\lambda_{k}\right)^{p}\\[.9em]&\leq&\displaystyle M^{p}\sup_{t\in [0,1]}\left(t-t^{p}\right)\end{array}$

We can use calculus to compute $\sup_{t\in [0,1]}\left(t-t^{p}\right)$. Indeed, define a function $g:[0,1]\rightarrow\mathbb{R}$ by $g(t):=t-t^{p}$. Then

$\displaystyle g'(t)=1-pt^{p-1}=0\Longleftrightarrow t=p^{\frac{1}{1-p}}$,

$g'(t)>0$ for $t\in [0,p^{\frac{1}{1-p}})$, and $g'(t)<0$ for $t\in (p^{\frac{1}{1-p}},1]$. Hence, $t=p^{\frac{1}{1-p}}$ maximizes $g$, and we conclude that

$\displaystyle E(x_{1},\cdots,x_{n})\leq M^{p}\left(p^{\frac{1}{1-p}}-p^{\frac{p}{1-p}}\right)=M^{p}p^{\frac{p}{1-p}}\left(p-1\right)$

$\Box$
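Finally, a numerical sketch of Proposition 4 with the arbitrary choices $p=2.5$, $M=1.5$ (mine, for illustration), which also checks the calculus step that $t=p^{1/(1-p)}$ maximizes $g(t)=t-t^{p}$ on $[0,1]$:

```python
import random

random.seed(3)

p, M, n = 2.5, 1.5, 5
const = (p - 1) * p ** (p / (1 - p)) * M ** p  # Khanin's additive constant

# The calculus step: t* = p^(1/(1-p)) maximizes g(t) = t - t^p on [0, 1].
t_star = p ** (1 / (1 - p))
g = lambda t: t - t ** p
assert all(g(t_star) >= g(i / 1000) - 1e-9 for i in range(1001))

# The inequality itself, for random points and weights.
for _ in range(200):
    xs = [random.uniform(0, M) for _ in range(n)]
    raw = [random.random() for _ in range(n)]
    lam = [r / sum(raw) for r in raw]
    lhs = sum(l * x ** p for l, x in zip(lam, xs))
    rhs = sum(l * x for l, x in zip(lam, xs)) ** p + const
    assert lhs <= rhs
```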