Maximum Likelihood Distribution: Poisson Distribution

Last Friday, a coworker and I were chatting about our current projects. Since her work inspired me to write this post, I obviously found it more interesting than what was I doing. Plus, I am still stuck on how best to approach mine. And so I will save the details for another day.

My coworker was trying to estimate the distribution of a certain metric for a class of orders processed by the warehouse on a given day of the week. The distribution is discrete, taking integer values greater than or equal to 4. It’s also very right-skewed–it has a drawn-out right tail. Learning of these properties, I thought the Poisson distribution might be a good fit. For a review of the basics of the Poisson distribution, check out the Wikipedia page.

The Poisson distribution is specified by a single real parameter \lambda>0. So to fit a distribution to a sample, we need an estimator for \lambda. One (popular) option is the maximum likelihood estimator \hat{\lambda}, which is informally the value of \lambda that is most “likely” to give the observed sample. Given i.i.d. random variables X_{1},\cdots,X_{n}, the maximum likelihood estimator, assuming it exists, is the random variable \hat{\lambda} which maximizes the likelihood function

\displaystyle\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})=\prod_{k=1}^{n}f(X_{k};\lambda)

Suppose X_{1},\cdots,X_{n} are i.i.d. \text{Pois}(\lambda) random variables, where \lambda>0. The likelihood function of the observations is

\displaystyle\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})=\prod_{k=1}^{n}\dfrac{\lambda^{X_{k}}}{X_{k}!}e^{-\lambda}=\dfrac{\lambda^{\sum_{k=1}^{n}X_{k}}}{\prod_{k=1}^{n}X_{k}!}e^{-n\lambda}

To find the critical points \hat{\lambda}, we take the natural logarithm and then the partial derivative with respect to \lambda. Setting the derivative equal to 0, we see that

\begin{array}{lcl}\displaystyle0=\dfrac{\partial}{\partial\lambda}\log\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})&=&\displaystyle\log(\hat{\lambda})\sum_{k=1}^{n}X_{k}-n\hat{\lambda}-\sum_{k=1}^{n}\log(X_{k}!)\\[.9 em]&=&\displaystyle\dfrac{\sum_{k=1}^{n}X_{k}}{\hat{\lambda}}-n,\end{array}

which implies that


Since \frac{\partial}{\partial\lambda}\log\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})>0, for \lambda<\hat{\lambda}, and <0, for \lambda>\hat{\lambda}, we conclude that \hat{\lambda} maximizes the likelihood function.

The maximum likelihood estimator \hat{\lambda} is unbiased, meaning its expectation equals \lambda, since the linearity of the expectation implies that


Furthermore, \hat{\lambda} is efficient, which means that the estimator’s variance achieves the lower bound for the variance of any estimator of \hat{\lambda}. To formulate efficiency, we need the notion of the Fisher information of a random variable. We define the Fisher information \mathcal{I}(\theta) of X with unknown parameter \theta by 

\displaystyle\mathcal{I}(\theta):=\mathbb{E}\left[\left(\dfrac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\mid\theta\right]

If the log-likelihood function is twice differentiable with respect to \theta, then

\begin{array}{lcl}\displaystyle-\mathbb{E}\left[\dfrac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right]=-\mathbb{E}\left[\dfrac{\partial}{\partial\theta}\dfrac{\frac{\partial}{\partial\theta}f(X;\theta)}{f(X;\theta)}\right]&=&\displaystyle-\mathbb{E}\left[\dfrac{\frac{\partial^{2}}{\partial\theta^{2}}f(X;\theta)}{f(X;\theta)}-\dfrac{\left(\frac{\partial}{\partial\theta}f(X;\theta)\right)^{2}}{\left(f(X;\theta)\right)^{2}}\right]\\[2 em]&=&\displaystyle\dfrac{\partial^{2}}{\partial\theta^{2}}\int f(x;\theta)dx+\mathbb{E}\left[\left(\dfrac{\frac{\partial}{\partial\theta}f(X;\theta)}{f(X;\theta)}\right)^{2}\right] \\[2 em]&=&\displaystyle\dfrac{\partial^{2}}{\partial\theta^{2}}[1]+\mathbb{E}\left[\left(\dfrac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right]\\[2 em]&=&\displaystyle\mathcal{I}(\theta)\end{array}

An important result in statistics called the Cramér–Rao bound relates the Fisher information of a random variable to the variance of an estimator \hat{\theta} of the unknown parameter \theta through the inequality


Returning to my claim that \hat{\lambda} is efficient, observe that

To see this, observe that the Fisher information of \hat{\lambda} is


A fortiori, \hat{\lambda} is a minimum-variance unbiased estimator

I have no idea whether the Poisson distribution is a good fit for the data mentioned in the introduction. Laughably, I have not looked at the actual data yet!

This entry was posted in math.OC, math.PR, math.ST and tagged , , , , , . Bookmark the permalink.

2 Responses to Maximum Likelihood Distribution: Poisson Distribution

  1. Anonymous says:


  2. pinkfloyda says:

    Thanks for the article, really helps my homework, :-), by the way, the maximum likelihood is only to be used after you have decided to use a certain distribution. But how to pick a distribution to fit data well is tricky.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s