Rao-Blackwell Theorem

Suppose we have a collection of probability measures \left\{\mathbb{P}_{\theta}:\theta\in\Theta\right\} indexed by some set \Theta. We call \theta the parameter and \Theta the parameter space. For example, we can take \Theta := (0,\infty) and let \left\{\mathbb{P}_{\theta}\right\} be the collection of exponential distributions with parameter \theta.
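As a concrete illustration, here is a minimal Python sketch of such a family, under the assumption that \theta is the rate parameter (so scipy's scale equals 1/\theta); the parameterization and the particular values of \theta are illustrative choices only.

from scipy.stats import expon

# Hypothetical sketch: the family {P_theta : theta in (0, infinity)} of exponential
# distributions, assuming theta is the rate (scipy parameterizes by scale = 1/theta).
family = {theta: expon(scale=1.0 / theta) for theta in (0.5, 1.0, 2.0)}
for theta, P_theta in family.items():
    print(theta, P_theta.mean())  # the mean of an Exp(theta) distribution is 1/theta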

We first prove an elementary result for conditional expectation that is sometimes called the smoothing lemma.

Lemma 1. Let X be an L^{1} random variable on a probability space (\Omega,\mathcal{F},\mathbb{P}), and suppose that \mathcal{F}_{1}\subset\mathcal{F}_{2} are sub-\sigma-algebras of \mathcal{F}. Then

\displaystyle\mathbb{E}\left[\mathbb{E}\left[X\mid\mathcal{F}_{2}\right]\mid\mathcal{F}_{1}\right]=\mathbb{E}\left[\mathbb{E}\left[X\mid\mathcal{F}_{1}\right]\mid\mathcal{F}_{2}\right]=\mathbb{E}\left[X\mid\mathcal{F}_{1}\right]

Proof. For any A\in\mathcal{F}_{1},

\begin{array}{lcl}\displaystyle\int_{A}\mathbb{E}[\mathbb{E}[X\mid\mathcal{F}_{2}]\mid\mathcal{F}_{1}]d\mathbb{P}=\int_{A}\mathbb{E}[X\mid\mathcal{F}_{2}]d\mathbb{P}&=&\displaystyle\int_{A}Xd\mathbb{P}\\[0.9em]&=&\displaystyle\int_{A}\mathbb{E}[X\mid\mathcal{F}_{1}]d\mathbb{P}\\[0.9em]&=&\displaystyle\int_{A}\mathbb{E}[\mathbb{E}[X\mid\mathcal{F}_{1}]\mid\mathcal{F}_{2}]d\mathbb{P}\end{array},

since A\in\mathcal{F}_{1}\subset\mathcal{F}_{2}. \Box
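As a sanity check, the following Python sketch verifies the smoothing identity by simulation in a simple discrete setting; the random variables and \sigma-algebras are illustrative choices of ours: X = D_{1}+D_{2} for two dice, \mathcal{F}_{2}=\sigma(D_{1}), and \mathcal{F}_{1}=\sigma(\text{parity of }D_{1})\subset\mathcal{F}_{2}.

import numpy as np

rng = np.random.default_rng(0)
n = 10**6
d1 = rng.integers(1, 7, n)        # first die
d2 = rng.integers(1, 7, n)        # second die
x = d1 + d2

def cond_exp(z, labels):
    # Monte Carlo conditional expectation of z given the sigma-algebra
    # generated by the discrete random variable `labels`
    out = np.empty_like(z, dtype=float)
    for v in np.unique(labels):
        mask = labels == v
        out[mask] = z[mask].mean()
    return out

e_x_f2 = cond_exp(x, d1)          # E[X | F_2], F_2 = sigma(D_1)
lhs = cond_exp(e_x_f2, d1 % 2)    # E[ E[X | F_2] | F_1 ], F_1 = sigma(parity of D_1)
rhs = cond_exp(x, d1 % 2)         # E[X | F_1]
print(np.max(np.abs(lhs - rhs)))  # essentially zero: the empirical averages nest exactly as in the lemma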

Using Lemma 1, we can obtain the conditional analogue of the computational formula for the variance of a random variable.

Lemma 2. If Y \in L^{2} and \mathcal{G}\subset\mathcal{F} is a sub-\sigma-algebra, then 

\displaystyle \mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]=\mathbb{E}[Y^{2}]-\mathbb{E}[(\mathbb{E}[Y\mid\mathcal{G}])^{2}]

Proof. Expanding the quadratic,

\begin{array}{lcl}\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]&=&\displaystyle\mathbb{E}\left[Y^{2}-2Y\mathbb{E}[Y\mid\mathcal{G}]+(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]\\&=&\displaystyle\mathbb{E}\left[Y^{2}\right]-2\mathbb{E}\left[\mathbb{E}\left[Y\mathbb{E}[Y\mid\mathcal{G}]\mid\mathcal{G}\right]\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]\\&=&\displaystyle\mathbb{E}[Y^{2}]-2\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]\\&=&\displaystyle\mathbb{E}[Y^{2}]-\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]\end{array}

where we use that \mathbb{E}[Y\mid\mathcal{G}] is, by definition, \mathcal{G}-measurable, so it can be pulled out of the inner conditional expectation. \Box
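A quick numerical sanity check of Lemma 2 in Python, for the illustrative (hypothetically chosen) case Y = Z + W with Z, W independent standard normals and \mathcal{G}=\sigma(Z), so that \mathbb{E}[Y\mid\mathcal{G}]=Z and both sides should be close to 1:

import numpy as np

rng = np.random.default_rng(1)
n = 10**6
z = rng.standard_normal(n)
w = rng.standard_normal(n)
y = z + w                                    # Y = Z + W
e_y_g = z                                    # E[Y | sigma(Z)] = Z, since W is independent of Z with mean 0
lhs = np.mean((y - e_y_g) ** 2)              # E[(Y - E[Y|G])^2]
rhs = np.mean(y ** 2) - np.mean(e_y_g ** 2)  # E[Y^2] - E[(E[Y|G])^2]
print(lhs, rhs)                              # both approximately 1 = Var(W)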

We define the conditional variance of an L^{2} random variable X with respect to a sub-\sigma-algebra \mathcal{G}\subset\mathcal{F} by

\displaystyle\text{Var}(X\mid\mathcal{G}):=\mathbb{E}\left[(X-\mathbb{E}[X\mid\mathcal{G}])^{2}\mid\mathcal{G}\right]

Note that \mathbb{E}[X^{2}]<\infty implies that \mathbb{E}[(\mathbb{E}[X\mid\mathcal{G}])^{2}]<\infty by conditional Jensen’s inequality, so conditional variance is well-defined.

Our last lemma before the main result of this post, the Rao-Blackwell theorem, is a lower bound for the mean-square approximation of an L^{2} random variable Y by a \mathcal{G}-measurable L^{2} random variable X.

Lemma 3. Let X,Y be random variables with finite variance, let \mathcal{G} be a sub-\sigma-algebra of \mathcal{F}, and suppose that X is \mathcal{G}-measurable. Then

\displaystyle\mathbb{E}\left[(Y-X)^{2}\right] = \mathbb{E}\left[\text{Var}(Y\mid\mathcal{G})\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]\geq\mathbb{E}[\text{Var}(Y\mid\mathcal{G})],

where equality holds if and only if X=\mathbb{E}[Y\mid\mathcal{G}] a.s.

Proof. We add zero and expand the quadratic to obtain

\begin{array}{lcl}\displaystyle\mathbb{E}\left[(Y-X)^{2}\right]&=&\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}+2(Y-\mathbb{E}[Y\mid\mathcal{G}])(\mathbb{E}[Y\mid\mathcal{G}]-X)+(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]\\[0.3em]&=&\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]+2\mathbb{E}\left[\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])(\mathbb{E}[Y\mid\mathcal{G}]-X)\mid\mathcal{G}\right]\right]\\[0.3em]&=&\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]+2\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)\mathbb{E}[Y-\mathbb{E}[Y\mid\mathcal{G}]\mid\mathcal{G}]\right]\\[0.3em]&=&\displaystyle\mathbb{E}\left[(Y-\mathbb{E}[Y\mid\mathcal{G}])^{2}\right]+\mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]\end{array}

since \mathbb{E}[Y-\mathbb{E}[Y\mid\mathcal{G}]\mid\mathcal{G}]=0. For the equality condition, note that equality holds if and only if \mathbb{E}\left[(\mathbb{E}[Y\mid\mathcal{G}]-X)^{2}\right]=0, which holds if and only if X=\mathbb{E}[Y\mid\mathcal{G}] a.s. \Box

Lemma 3 is really the “best approximation” property of orthogonal projections in Hilbert space theory translated into the language of probability theory.
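The following Python sketch illustrates this best-approximation property by simulation; the model Y = X^{2} + W with X, W independent standard normals, and the competing estimators, are hypothetical choices made only for illustration. Here \mathbb{E}[Y\mid X]=X^{2}, and it attains the smallest mean squared error among the candidates, roughly \mathbb{E}[\text{Var}(Y\mid X)]=\text{Var}(W)=1.

import numpy as np

rng = np.random.default_rng(2)
n = 10**6
x = rng.standard_normal(n)
w = rng.standard_normal(n)
y = x ** 2 + w                           # Y = X^2 + W, so E[Y | X] = X^2

candidates = {
    "E[Y|X] = X^2": x ** 2,
    "constant E[Y] = 1": np.ones(n),
    "|X|": np.abs(x),
    "2*X^2": 2 * x ** 2,
}
for name, g_of_x in candidates.items():
    mse = np.mean((y - g_of_x) ** 2)     # E[(Y - g(X))^2]
    print(f"{name:20s} {mse:.3f}")
# the conditional expectation achieves the minimum, approximately E[Var(Y|X)] = 1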

Recall that we say a random variable \hat{\theta} is an unbiased estimator of a parameter \theta if \mathbb{E}_{\theta}[\hat{\theta}]=\theta for every \theta\in\Theta, where \theta represents the “unknown parameter.” If we have a sample Y, a statistic T = T(Y) is said to be sufficient if the conditional distribution of Y given T=t does not depend on \theta. Intuitively, once we observe the sample Y and compute the sufficient statistic T(Y), the original data contain no additional information about the unknown parameter \theta.

An important result in statistical theory for determining whether a statistic is sufficient is the Fisher-Neyman factorization theorem, which we will not prove. A special case of the factorization theorem says that a statistic \phi(X) of a sample X=(X_{1},\cdots,X_{n}) with parameter \theta is sufficient if the joint density function f_{X,\theta} can be factored as

\displaystyle f_{X,\theta}(x)=h(x)g(\phi(x),\theta)

where h and g are nonnegative Borel-measurable functions.

To illustrate, we show directly from the definition that the sample mean of independent normal random variables X_{1},\cdots,X_{n} with unknown mean \theta and variance 1 is sufficient; the factorization theorem gives the same conclusion with less work, as noted below. Indeed, by independence, the joint density function of the X_{j} is

\displaystyle f_{X_{1},\cdots,X_{n}}(x_{1},\cdots,x_{n})=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x_{j}-\theta)^{2}}{2}\right)=(2\pi)^{-\frac{n}{2}}\exp\left(-\dfrac{1}{2}\sum_{j=1}^{n}(x_{j}-\theta)^{2}\right)

Since the sum of i.i.d. \text{N}(\theta,1) random variables is distributed \text{N}(n\theta,n), we have by the scaling properties of the normal distribution that \overline{X}:=\frac{1}{n}\sum_{j=1}^{n}X_{j}\sim\text{N}(\theta,n^{-1}). It follows that the conditional density of X given \overline{X}, where X := (X_{1},\cdots,X_{n}), is

\begin{array}{lcl}\displaystyle\dfrac{(2\pi)^{-\frac{n}{2}}\exp\left(-\sum_{j=1}^{n}\dfrac{(x_{j}-\theta)^{2}}{2}\right)}{(2\pi n^{-1})^{-\frac{1}{2}}\exp\left(-\dfrac{n(\overline{x}-\theta)^{2}}{2}\right)}&=&\displaystyle\dfrac{1}{(\sqrt{2\pi})^{n-1}\sqrt{n}}\exp\left(-\sum_{j=1}^{n}\dfrac{(x_{j}-\theta)^{2}}{2}+\dfrac{n(\overline{x}-\theta)^{2}}{2}\right)\\[1em]&=&\displaystyle\dfrac{1}{(\sqrt{2\pi})^{n-1}\sqrt{n}}\exp\left(-\sum_{j=1}^{n}\dfrac{x_{j}^{2}-2x_{j}\theta+\theta^{2}}{2}+\dfrac{n(\overline{x}^{2}-2\overline{x}\theta+\theta^{2})}{2}\right)\\[1em]&=&\displaystyle\dfrac{1}{(\sqrt{2\pi})^{n-1}\sqrt{n}}\exp\left(-\dfrac{(\sum_{j=1}^{n}x_{j}^{2})-2n\overline{x}\theta+n\theta^{2}-n\overline{x}^{2}+2n\overline{x}\theta-n\theta^{2}}{2}\right)\\[1em]&=&\displaystyle\dfrac{1}{(\sqrt{2\pi})^{n-1}\sqrt{n}}\exp\left(-\dfrac{\sum_{j=1}^{n}(x_{j}^{2}-\overline{x}^{2})}{2}\right)\end{array}

This last expression is evidently independent of \theta, which shows that \overline{X} is sufficient.
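Alternatively, the factorization theorem yields the same conclusion with less computation. Expanding the exponent of the joint density gives

\displaystyle f_{X,\theta}(x)=(2\pi)^{-\frac{n}{2}}\exp\left(-\dfrac{1}{2}\sum_{j=1}^{n}x_{j}^{2}\right)\exp\left(n\theta\overline{x}-\dfrac{n\theta^{2}}{2}\right),

so taking h(x)=(2\pi)^{-\frac{n}{2}}\exp\left(-\frac{1}{2}\sum_{j=1}^{n}x_{j}^{2}\right) and g(\overline{x},\theta)=\exp\left(n\theta\overline{x}-\frac{n\theta^{2}}{2}\right) exhibits the required factorization with \phi(x)=\overline{x}.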

If we start with an unbiased estimator Y, a sufficient statistic T allows us to form a new estimator \mathbb{E}[Y\mid T], known as the Rao-Blackwell estimator, whose expected squared error is no greater than that of the original estimator Y. This result is the Rao-Blackwell theorem, which we state and prove now.

Theorem 4. (Rao-Blackwell) Suppose that T is a sufficient statistic for \theta in \Theta, and suppose that Y is an unbiased estimator of \theta such that

\displaystyle\mathbb{E}_{\theta}\left[(Y-\theta)^{2}\right]<\infty\quad\text{for all }\theta\in\Theta

Then \mathbb{E}[Y\mid T] is an unbiased estimator of \theta and

\displaystyle\mathbb{E}_{\theta}\left[(Y-\theta)^{2}\right]=\mathbb{E}_{\theta}\left[(Y-\mathbb{E}[Y\mid T])^{2}\right]+\mathbb{E}_{\theta}\left[(\mathbb{E}[Y\mid T]-\theta)^{2}\right]

Proof. Since T is sufficient, the conditional expectation \mathbb{E}[Y\mid T] does not depend on \theta, so it is a genuine estimator. That it is an unbiased estimator of \theta is immediate from the smoothing lemma (the tower property of conditional expectation):

\displaystyle\mathbb{E}_{\theta}\left[\mathbb{E}[Y\mid T]\right]=\mathbb{E}_{\theta}[Y]=\theta

The equality in the statement of the theorem follows from the preceding lemmas. Applying Lemma 3 with \mathcal{G}=\sigma(T) and the (constant, hence \mathcal{G}-measurable) random variable X=\theta, and then the smoothing lemma, we have

\begin{array}{lcl}\displaystyle\mathbb{E}_{\theta}\left[(Y-\theta)^{2}\right]&=&\displaystyle\mathbb{E}_{\theta}\left[\mathbb{E}_{\theta}\left[(Y-\mathbb{E}[Y\mid T])^{2}\mid T\right]\right]+\mathbb{E}_{\theta}\left[(\mathbb{E}[Y\mid T]-\theta)^{2}\right]\\[.3em]&=&\displaystyle\mathbb{E}_{\theta}\left[(Y-\mathbb{E}[Y\mid T])^{2}\right]+\mathbb{E}_{\theta}\left[(\mathbb{E}[Y\mid T]-\theta)^{2}\right]\end{array}

Since both terms on the right-hand side are nonnegative, in particular \mathbb{E}_{\theta}\left[(\mathbb{E}[Y\mid T]-\theta)^{2}\right]\leq\mathbb{E}_{\theta}\left[(Y-\theta)^{2}\right], which completes the proof. \Box

A useful consequence of the Rao-Blackwell theorem is that we can restrict our search for minimum-variance unbiased estimators (MVUEs) to estimators that are functions of a sufficient statistic.
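To see the theorem in action, here is a minimal Python simulation sketch for the \text{N}(\theta,1) example above; the sample size, true \theta, and number of replications are arbitrary illustrative choices. The naive unbiased estimator Y = X_{1} is replaced by \mathbb{E}[Y\mid\overline{X}], which by exchangeability equals \overline{X}.

import numpy as np

rng = np.random.default_rng(3)
theta_true, n, reps = 2.0, 20, 10**5
samples = rng.normal(theta_true, 1.0, size=(reps, n))

naive = samples[:, 0]                  # Y = X_1, unbiased with variance 1
rao_blackwell = samples.mean(axis=1)   # E[Y | X-bar] = X-bar, unbiased with variance 1/n

for name, est in [("Y = X_1", naive), ("E[Y | X-bar]", rao_blackwell)]:
    bias = est.mean() - theta_true
    mse = np.mean((est - theta_true) ** 2)
    print(f"{name:14s} bias {bias:+.4f}   MSE {mse:.4f}")
# both estimators are unbiased (up to simulation error), but the
# Rao-Blackwellized estimator has MSE close to 1/n = 0.05 instead of 1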
