## Rao-Blackwell Theorem

Suppose we have a collection of probability measures $\left\{\mathbb{P}_{\theta}:\theta\in\Theta\right\}$ indexed by some set $\Theta$. We call $\theta$ the parameter and $\Theta$ the parameter space. For example, we can take $\Theta := (0,\infty)$ and let $\left\{\mathbb{P}_{\theta}\right\}$ be the collection of exponential distributions with parameter $\theta$.

We first prove an elementary result for conditional expectation that is sometimes called the smoothing lemma.

Lemma 1. Let $X$ be an $L^{1}$ random variable on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, and suppose that $\mathcal{F}_{1}\subset\mathcal{F}_{2}$ are sub-$\sigma$-algebras of $\mathcal{F}$. Then

$\displaystyle\mathbb{E}\left[\mathbb{E}\left[X\mid\mathcal{F}_{2}\right]\mid\mathcal{F}_{1}\right]=\mathbb{E}\left[\mathbb{E}\left[X\mid\mathcal{F}_{1}\right]\mid\mathcal{F}_{2}\right]=\mathbb{E}\left[X\mid\mathcal{F}_{1}\right]$

Proof. For any $A\in\mathcal{F}_{1}$,

$\begin{array}{lcl}\displaystyle\int_{A}\mathbb{E}[\mathbb{E}[X\mid\mathcal{F}_{2}]\mid\mathcal{F}_{1}]d\mathbb{P}=\int_{A}\mathbb{E}[X\mid\mathcal{F}_{2}]d\mathbb{P}&=&\displaystyle\int_{A}Xd\mathbb{P}\\[0.9em]&=&\displaystyle\int_{A}\mathbb{E}[X\mid\mathcal{F}_{1}]d\mathbb{P}\\[0.9em]&=&\displaystyle\int_{A}\mathbb{E}[\mathbb{E}[X\mid\mathcal{F}_{1}]\mid\mathcal{F}_{2}]d\mathbb{P}\end{array},$

since $A\in\mathcal{F}_{2}$. $\Box$
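The smoothing lemma is easy to check concretely on a finite probability space, where conditional expectation with respect to a $\sigma$-algebra generated by a partition is just block-averaging. The sketch below (the space, partitions, and values are illustrative choices, not from the text) verifies both equalities for nested partitions of a uniform eight-point space:

```python
import numpy as np

# Check of the smoothing lemma on a finite probability space.
# Omega = {0,...,7} with uniform probability; F1 and F2 are generated by
# nested partitions, so F1 is a sub-sigma-algebra of F2.
rng = np.random.default_rng(0)
X = rng.normal(size=8)                       # an arbitrary random variable on Omega
coarse = [range(0, 4), range(4, 8)]          # partition generating F1
fine = [range(0, 2), range(2, 4), range(4, 6), range(6, 8)]  # generates F2

def cond_exp(values, partition):
    """Conditional expectation w.r.t. the sigma-algebra a partition generates:
    replace each value by the average over its block (uniform measure)."""
    out = np.empty_like(values)
    for block in partition:
        idx = list(block)
        out[idx] = values[idx].mean()
    return out

lhs = cond_exp(cond_exp(X, fine), coarse)    # E[ E[X|F2] | F1 ]
mid = cond_exp(cond_exp(X, coarse), fine)    # E[ E[X|F1] | F2 ]
rhs = cond_exp(X, coarse)                    # E[X|F1]
assert np.allclose(lhs, rhs) and np.allclose(mid, rhs)
```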

Using Lemma 1, we can obtain the conditional analogue of the computational formula for the variance of a random variable.

Lemma 2. If $Y \in L^{2}$ and $\mathcal{G}\subset\mathcal{F}$ is a sub-$\sigma$-algebra, then

$\displaystyle \mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]=\mathbb{E}[Y^{2}]-\mathbb{E}[(\mathbb{E}[Y\mid\mathcal{G}])^{2}]$

Proof. Expanding the square and applying the tower property,

$\begin{array}{lcl}\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]&=&\displaystyle\mathbb{E}\left[Y^{2}-2Y\mathbb{E}[Y\mid\mathcal{G}]+(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]\\&=&\displaystyle\mathbb{E}\left[Y^{2}\right]-2\mathbb{E}\left[\mathbb{E}\left[Y\mathbb{E}[Y\mid\mathcal{G}]\mid\mathcal{G}\right]\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]\\&=&\displaystyle\mathbb{E}[Y^{2}]-2\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]\\&=&\displaystyle\mathbb{E}[Y^{2}]-\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]\end{array}$

where we use that $\mathbb{E}[Y\mid\mathcal{G}]$ is, by definition, $\mathcal{G}$-measurable. $\Box$
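Lemma 2 can be sanity-checked by Monte Carlo in a case where the conditional expectation is explicit. In the sketch below (an illustrative setup, not from the text) we take $Y=Z_{1}+Z_{2}$ with $Z_{1},Z_{2}$ independent standard normals and $\mathcal{G}=\sigma(Z_{1})$, so that $\mathbb{E}[Y\mid\mathcal{G}]=Z_{1}$:

```python
import numpy as np

# Monte Carlo check of Lemma 2 with Y = Z1 + Z2, G = sigma(Z1), so E[Y|G] = Z1.
rng = np.random.default_rng(1)
n = 1_000_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y = z1 + z2
cond = z1                                    # E[Y | G] = Z1 in this setup

lhs = np.mean((y - cond) ** 2)               # E[(Y - E[Y|G])^2]
rhs = np.mean(y ** 2) - np.mean(cond ** 2)   # E[Y^2] - E[(E[Y|G])^2]
assert abs(lhs - rhs) < 0.02                 # agree up to Monte Carlo error
```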

We define the conditional variance of an $L^{2}$ random variable $X$ with respect to a sub-$\sigma$-algebra by

$\displaystyle\text{Var}(X\mid\mathcal{G}):=\mathbb{E}\left[(X-\mathbb{E}[X\mid\mathcal{G}])^{2}\mid\mathcal{G}\right]$

Note that $\mathbb{E}[X^{2}]<\infty$ implies that $\mathbb{E}[(\mathbb{E}[X\mid\mathcal{G}])^{2}]<\infty$ by conditional Jensen’s inequality, so conditional variance is well-defined.

Our last lemma before the main result of this post, the Rao-Blackwell theorem, is a lower bound for the approximation of an $L^{2}$ random variable $Y$ by an $L^{2}$ random variable $X$ that is measurable with respect to a given sub-$\sigma$-algebra.

Lemma 3. Let $X,Y$ be random variables with finite variance, let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$, and suppose that $X$ is $\mathcal{G}$-measurable. Then

$\displaystyle\mathbb{E}\left[(Y-X)^{2}\right] = \mathbb{E}\left[\text{Var}(Y\mid\mathcal{G})\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]\geq\mathbb{E}[\text{Var}(Y\mid\mathcal{G})],$

where equality holds if and only if $X=\mathbb{E}[Y\mid\mathcal{G}]$ a.s.

Proof. We add $0$ and expand the quadratic to obtain

$\begin{array}{lcl}\displaystyle\mathbb{E}\left[(Y-X)^{2}\right]&=&\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}+2(Y-\mathbb{E}[Y\mid\mathcal{G}])(\mathbb{E}[Y\mid\mathcal{G}]-X)+(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]\\[0.3em]&=&\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]+2\mathbb{E}\left[\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])(\mathbb{E}[Y\mid\mathcal{G}]-X)\mid\mathcal{G}\right]\right]\\[0.3em]&=&\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]+2\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)\mathbb{E}[Y-\mathbb{E}[Y\mid\mathcal{G}]\mid\mathcal{G}]\right]\\[0.3em]&=&\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]\end{array}$

since $\mathbb{E}\left[Y-\mathbb{E}[Y\mid\mathcal{G}]\mid\mathcal{G}\right]=0$ almost surely, by the smoothing lemma applied with $\mathcal{F}_{1}=\mathcal{F}_{2}=\mathcal{G}$. The equality condition is immediate. $\Box$

Lemma 3 is really the “best approximation” property of orthogonal projections in Hilbert space theory translated into the language of probability theory.
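Lemma 3 admits the same kind of numerical check. With $Y=Z_{1}+Z_{2}$ and $\mathcal{G}=\sigma(Z_{1})$ (illustrative choices throughout), we have $\mathbb{E}[Y\mid\mathcal{G}]=Z_{1}$ and $\text{Var}(Y\mid\mathcal{G})=1$; any $\mathcal{G}$-measurable competitor is a function of $Z_{1}$, and the sketch below tries $X=2Z_{1}$:

```python
import numpy as np

# Numerical check of Lemma 3 with Y = Z1 + Z2 and G = sigma(Z1), so that
# E[Y|G] = Z1 and Var(Y|G) = 1.  We try the G-measurable competitor X = 2*Z1.
rng = np.random.default_rng(2)
n = 1_000_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y = z1 + z2
x = 2 * z1                                   # a G-measurable competitor

lhs = np.mean((y - x) ** 2)                  # E[(Y - X)^2]
rhs = 1.0 + np.mean((z1 - x) ** 2)           # E[Var(Y|G)] + E[(E[Y|G] - X)^2]
assert abs(lhs - rhs) < 0.02                 # the identity, up to Monte Carlo error
assert lhs >= 1.0 - 0.02                     # the lower bound E[Var(Y|G)] = 1
```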

Recall that we say a random variable $\hat{\theta}$ is an unbiased estimator of a parameter $\theta$ if $\mathbb{E}_{\theta}[\hat{\theta}]=\theta$ for every $\theta\in\Theta$, where $\theta$ represents the “unknown parameter.” If we have a sample $X$, a statistic $T = T(X)$ is said to be sufficient if the conditional distribution of $X$ given $T=t$ does not depend on $\theta$. Intuitively, once we observe the sample $X$ and compute the sufficient statistic $T(X)$, the original data contain no additional information about the unknown parameter $\theta$.

An important result in statistical theory for determining whether a statistic is sufficient is the Fisher-Neyman factorization theorem, which we will not prove. A special case of the factorization theorem says that a statistic $\phi(X)$ of a sample $X=(X_{1},\cdots,X_{n})$ with parameter $\theta$ is sufficient if the joint density function $f_{X,\theta}$ with parameter $\theta$ can be factored as

$\displaystyle f_{X,\theta}(x)=h(x)g(\phi(x),\theta)$

where $h,g$ are Borel-measurable functions.

We use the factorization theorem to show that the sample mean of independent random normal variables $X_{1},\cdots,X_{n}$ with unknown mean $\theta$ and variance $1$ is sufficient. Indeed, by independence, the joint density function of the $X_{j}$ is

$\displaystyle f_{X_{1},\cdots,X_{n}}(x_{1},\cdots,x_{n})=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x_{j}-\theta)^{2}}{2}\right)=(2\pi)^{-\frac{n}{2}}\exp\left(-\dfrac{1}{2}\sum_{j=1}^{n}(x_{j}-\theta)^{2}\right)$

Since the sum of i.i.d. $\text{N}(\theta,1)$ random variables is distributed $\text{N}(n\theta,n)$, we have by the scaling properties of the normal distribution that $\overline{X}:=\frac{1}{n}\sum_{j=1}^{n}X_{j}\sim\text{N}(\theta,n^{-1})$. It follows that the density function of $X\mid\overline{X}$, where $X := (X_{1},\cdots,X_{n})$, is

$\begin{array}{lcl}\displaystyle\dfrac{(2\pi)^{-\frac{n}{2}}\exp\left(-\sum_{j=1}^{n}\dfrac{(x_{j}-\theta)^{2}}{2}\right)}{(2\pi n^{-1})^{-\frac{1}{2}}\exp\left(-\dfrac{(\overline{x}-\theta)^{2}}{2n^{-1}}\right)}&=&\displaystyle\dfrac{1}{(\sqrt{2\pi})^{n-1}\sqrt{n}}\exp\left(-\sum_{j=1}^{n}\dfrac{(x_{j}-\theta)^{2}}{2}+\dfrac{n(\overline{x}-\theta)^{2}}{2}\right)\\&=&\displaystyle\dfrac{1}{(\sqrt{2\pi})^{n-1}\sqrt{n}}\exp\left(-\sum_{j=1}^{n}\dfrac{x_{j}^{2}-2x_{j}\theta+\theta^{2}}{2}+\dfrac{n(\overline{x}^{2}-2\overline{x}\theta+\theta^{2})}{2}\right)\\&=&\displaystyle\dfrac{1}{(\sqrt{2\pi})^{n-1}\sqrt{n}}\exp\left(-\dfrac{(\sum_{j=1}^{n}x_{j}^{2})-2n\overline{x}\theta+n\theta^{2}-n\overline{x}^{2}+2n\overline{x}\theta-n\theta^{2}}{2}\right)\\&=&\displaystyle\dfrac{1}{(\sqrt{2\pi})^{n-1}\sqrt{n}}\exp\left(-\dfrac{\sum_{j=1}^{n}(x_{j}^{2}-\overline{x}^{2})}{2}\right)\end{array}$

This last expression is evidently independent of $\theta$, which shows that $\overline{X}$ is sufficient.
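The cancellation of $\theta$ in the computation above comes down to the identity $\sum_{j=1}^{n}(x_{j}-\theta)^{2}-n(\overline{x}-\theta)^{2}=\sum_{j=1}^{n}x_{j}^{2}-n\overline{x}^{2}$, valid for every $\theta$. A quick numerical confirmation (with arbitrary illustrative data):

```python
import numpy as np

# The theta-cancellation driving the sufficiency computation:
# sum_j (x_j - theta)^2 - n*(xbar - theta)^2 == sum_j x_j^2 - n*xbar^2
# for every theta, so the conditional density is free of theta.
rng = np.random.default_rng(3)
x = rng.normal(size=10)
xbar = x.mean()
n = x.size
for theta in (-2.0, 0.0, 1.5):
    lhs = np.sum((x - theta) ** 2) - n * (xbar - theta) ** 2
    rhs = np.sum(x ** 2) - n * xbar ** 2
    assert np.isclose(lhs, rhs)
```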

If we start with an estimator $Y$, a sufficient statistic allows us to obtain an estimator $\mathbb{E}[Y\mid T]$, known as the Rao-Blackwell estimator, whose expected squared loss is no larger than that of the original estimator $Y$. This result is the Rao-Blackwell theorem, which we state and prove now.

Theorem 4. (Rao-Blackwell) Suppose that $T$ is a sufficient statistic for $\theta\in\Theta$, and suppose that $Y$ is an unbiased estimator of $\theta$ such that

$\displaystyle\mathbb{E}_{\theta}\left[(Y-\theta)^{2}\right]<\infty\quad\forall\,\theta\in\Theta$

Then $\mathbb{E}[Y\mid T]$ is an unbiased estimator of $\theta$ and

$\displaystyle\mathbb{E}_{\theta}\left[(Y-\theta)^{2}\right]=\mathbb{E}_{\theta}\left[(Y-\mathbb{E}[Y\mid T])^{2}\right]+\mathbb{E}_{\theta}\left[(\mathbb{E}[Y\mid T]-\theta)^{2}\right]$

Proof. Since $T$ is sufficient, $\mathbb{E}[Y\mid T]$ does not depend on $\theta$ and is therefore a genuine estimator. That it is unbiased is immediate from the tower property of conditional expectation:

$\displaystyle\mathbb{E}_{\theta}\left[\mathbb{E}[Y\mid T]\right]=\mathbb{E}_{\theta}[Y]=\theta$

The equality in the statement of the theorem follows from application of the preceding lemmas. We have by Lemma 3 that

$\begin{array}{lcl}\displaystyle\mathbb{E}_{\theta}\left[(Y-\theta)^{2}\right]&=&\displaystyle\mathbb{E}_{\theta}\left[\mathbb{E}_{\theta}\left[(Y-\mathbb{E}[Y\mid T])^{2}\mid T\right]\right]+\mathbb{E}_{\theta}\left[(\mathbb{E}[Y\mid T]-\theta)^{2}\right]\\[.3em]&=&\displaystyle\mathbb{E}_{\theta}\left[(Y-\mathbb{E}[Y\mid T])^{2}\right]+\mathbb{E}_{\theta}\left[(\mathbb{E}[Y\mid T]-\theta)^{2}\right]\end{array}$

Noting that both terms are nonnegative completes the proof. $\Box$
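The normal-mean example above makes the variance reduction concrete: $Y=X_{1}$ is unbiased for $\theta$ with mean squared error $1$, while the Rao-Blackwell estimator $\mathbb{E}[X_{1}\mid\overline{X}]=\overline{X}$ (an equality that follows from exchangeability of the $X_{j}$) has mean squared error $1/n$. A Monte Carlo sketch, where the particular $\theta$, $n$, and replication count are arbitrary illustrative choices:

```python
import numpy as np

# Rao-Blackwellization in the normal-mean example: Y = X_1 is unbiased for
# theta with MSE 1; T = Xbar is sufficient, and E[X_1 | Xbar] = Xbar by
# exchangeability, with MSE 1/n.
rng = np.random.default_rng(4)
theta, n, reps = 0.7, 10, 200_000
X = rng.normal(loc=theta, scale=1.0, size=(reps, n))

naive = X[:, 0]                              # Y = X_1
rb = X.mean(axis=1)                          # E[Y | T] = Xbar

mse_naive = np.mean((naive - theta) ** 2)    # should be close to 1
mse_rb = np.mean((rb - theta) ** 2)          # should be close to 1/n
assert abs(mse_naive - 1.0) < 0.02
assert abs(mse_rb - 1.0 / n) < 0.005
assert mse_rb < mse_naive                    # the Rao-Blackwell improvement
```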

A useful consequence of the Rao-Blackwell theorem is that we can restrict our search for minimum-variance unbiased estimators (MVUEs) to sufficient statistics.