## Maximum Likelihood Distribution: Poisson Distribution

Last Friday, a coworker and I were chatting about our current projects. Since her work inspired me to write this post, I obviously found it more interesting than what I was doing. Plus, I am still stuck on how best to approach mine, so I will save those details for another day.

My coworker was trying to estimate the distribution of a certain metric for a class of orders processed by the warehouse on a given day of the week. The distribution is discrete, taking integer values greater than or equal to $4$. It’s also very right-skewed, with a drawn-out right tail. Learning of these properties, I thought the Poisson distribution might be a good fit. For a review of the basics of the Poisson distribution, check out the Wikipedia page.

The Poisson distribution is specified by a single real parameter $\lambda>0$. So to fit a distribution to a sample, we need an estimator for $\lambda$. One (popular) option is the maximum likelihood estimator $\hat{\lambda}$, which is informally the value of $\lambda$ that is most “likely” to give the observed sample. Given i.i.d. random variables $X_{1},\cdots,X_{n}$, the maximum likelihood estimator, assuming it exists, is the random variable $\hat{\lambda}$ which maximizes the likelihood function

$\displaystyle\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})=\prod_{k=1}^{n}f(X_{k};\lambda),$

where $f(\,\cdot\,;\lambda)$ denotes the common probability mass (or density) function of the $X_{k}$.

Suppose $X_{1},\cdots,X_{n}$ are i.i.d. $\text{Pois}(\lambda)$ random variables, where $\lambda>0$. The likelihood function of the observations is

$\displaystyle\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})=\prod_{k=1}^{n}\dfrac{\lambda^{X_{k}}}{X_{k}!}e^{-\lambda}=\dfrac{\lambda^{\sum_{k=1}^{n}X_{k}}}{\prod_{k=1}^{n}X_{k}!}e^{-n\lambda}$

To find the critical points $\hat{\lambda}$, we first take the natural logarithm, which turns the product into a sum:

$\displaystyle\log\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})=\log(\lambda)\sum_{k=1}^{n}X_{k}-n\lambda-\sum_{k=1}^{n}\log(X_{k}!)$

Taking the partial derivative with respect to $\lambda$ and setting it equal to $0$ at $\lambda=\hat{\lambda}$, we see that

$\displaystyle 0=\dfrac{\partial}{\partial\lambda}\log\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})\bigg|_{\lambda=\hat{\lambda}}=\dfrac{\sum_{k=1}^{n}X_{k}}{\hat{\lambda}}-n,$

which implies that

$\displaystyle\hat{\lambda}=\dfrac{1}{n}\sum_{k=1}^{n}X_{k}$

Since $\frac{\partial}{\partial\lambda}\log\mathcal{L}(\lambda\mid X_{1},\cdots,X_{n})>0$ for $\lambda<\hat{\lambda}$ and $<0$ for $\lambda>\hat{\lambda}$, we conclude that $\hat{\lambda}$ maximizes the likelihood function.
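As a quick sanity check of the derivation (my own sketch, with made-up parameters), we can draw a Poisson sample and compare the closed-form MLE, the sample mean, against a brute-force maximization of the log-likelihood over a grid of candidate values:

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 6.5
sample = rng.poisson(lam_true, size=10_000)

# Closed-form maximum likelihood estimate: the sample mean.
lam_hat = sample.mean()

# Brute force: evaluate the log-likelihood on a grid of candidate lambdas.
# log L(lambda) = sum(X_k) * log(lambda) - n * lambda, dropping the
# lambda-free term sum(log(X_k!)), which does not affect the argmax.
grid = np.linspace(0.1, 20, 4000)
log_lik = sample.sum() * np.log(grid) - len(sample) * grid
lam_grid = grid[np.argmax(log_lik)]

print(lam_hat, lam_grid)  # both should be close to lam_true = 6.5
```

The two estimates agree up to the grid spacing, as the derivation predicts.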

The maximum likelihood estimator $\hat{\lambda}$ is unbiased, meaning its expectation equals $\lambda$, since the linearity of the expectation implies that

$\displaystyle\mathbb{E}[\hat{\lambda}]=\mathbb{E}\left[\dfrac{1}{n}\sum_{k=1}^{n}X_{k}\right]=\dfrac{1}{n}\sum_{k=1}^{n}\mathbb{E}[X_{k}]=\dfrac{1}{n}\sum_{k=1}^{n}\lambda=\lambda$
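Unbiasedness is easy to check empirically as well (again a sketch with arbitrary parameters): averaging $\hat{\lambda}$ over many independent samples should recover $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, trials = 4.2, 50, 20_000

# Each row is one sample of size n; each row mean is one realization
# of the MLE lambda-hat.
estimates = rng.poisson(lam, size=(trials, n)).mean(axis=1)

print(estimates.mean())  # should be close to lam = 4.2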

Furthermore, $\hat{\lambda}$ is efficient, which means that its variance attains the lower bound on the variance of any unbiased estimator of $\lambda$. To formulate efficiency, we need the notion of the Fisher information of a random variable. We define the Fisher information $\mathcal{I}(\theta)$ of $X$ with unknown parameter $\theta$ by

$\displaystyle\mathcal{I}(\theta):=\mathbb{E}\left[\left(\dfrac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\mid\theta\right]$

If the log-likelihood function is twice differentiable with respect to $\theta$, and differentiation and integration may be interchanged, then

$\begin{array}{lcl}\displaystyle-\mathbb{E}\left[\dfrac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right]=-\mathbb{E}\left[\dfrac{\partial}{\partial\theta}\dfrac{\frac{\partial}{\partial\theta}f(X;\theta)}{f(X;\theta)}\right]&=&\displaystyle-\mathbb{E}\left[\dfrac{\frac{\partial^{2}}{\partial\theta^{2}}f(X;\theta)}{f(X;\theta)}-\dfrac{\left(\frac{\partial}{\partial\theta}f(X;\theta)\right)^{2}}{\left(f(X;\theta)\right)^{2}}\right]\\[2 em]&=&\displaystyle-\dfrac{\partial^{2}}{\partial\theta^{2}}\int f(x;\theta)\,dx+\mathbb{E}\left[\left(\dfrac{\frac{\partial}{\partial\theta}f(X;\theta)}{f(X;\theta)}\right)^{2}\right] \\[2 em]&=&\displaystyle-\dfrac{\partial^{2}}{\partial\theta^{2}}[1]+\mathbb{E}\left[\left(\dfrac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right]\\[2 em]&=&\displaystyle\mathcal{I}(\theta)\end{array}$
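A Monte Carlo check (my own sketch) makes the equality concrete for a single $\text{Pois}(\lambda)$ observation, where $\frac{\partial}{\partial\lambda}\log f(X;\lambda)=X/\lambda-1$ and $\frac{\partial^{2}}{\partial\lambda^{2}}\log f(X;\lambda)=-X/\lambda^{2}$, so both expressions should come out to $1/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 3.0
x = rng.poisson(lam, size=1_000_000)

# E[(d/dlambda log f)^2], the definition of the Fisher information.
info_score = np.mean((x / lam - 1.0) ** 2)

# -E[d^2/dlambda^2 log f], the curvature form derived above.
info_curv = -np.mean(-x / lam**2)

print(info_score, info_curv, 1 / lam)  # all close to 1/3
```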

An important result in statistics called the Cramér–Rao bound relates the Fisher information of a random variable to the variance of any unbiased estimator $\hat{\theta}$ of the unknown parameter $\theta$ through the inequality

$\displaystyle\text{Var}(\hat{\theta})\geq\dfrac{1}{\mathcal{I}(\theta)}$

An estimator is called efficient when it attains this bound with equality.

Returning to my claim that $\hat{\lambda}$ is efficient: the Fisher information of the sample, computed from the distribution of $\sum_{k=1}^{n}X_{k}\sim\text{Pois}(n\lambda)$, is

$\begin{array}{lcl}\displaystyle\mathcal{I}(\lambda)=-\mathbb{E}\left[\dfrac{\partial^{2}}{\partial\lambda^{2}}\log\left(\dfrac{(n\lambda)^{X}}{X!}e^{-n\lambda}\right)\right]=\displaystyle-\mathbb{E}\left[\dfrac{\partial}{\partial\lambda}\left[\dfrac{X}{\lambda}-n\right]\right]=\displaystyle\dfrac{1}{\lambda^{2}}\mathbb{E}\left[X\right]=\displaystyle\dfrac{n\lambda}{\lambda^{2}}=\displaystyle\dfrac{1}{\text{Var}(\hat{\lambda})}\end{array}$

In particular, $\hat{\lambda}$ is a minimum-variance unbiased estimator of $\lambda$.
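The efficiency claim can also be verified numerically (a sketch with arbitrary parameters): across many samples, the variance of $\hat{\lambda}$ should match the Cramér–Rao bound $1/\mathcal{I}(\lambda)=\lambda/n$.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, trials = 5.0, 100, 50_000

# Each row mean is one realization of the MLE lambda-hat.
estimates = rng.poisson(lam, size=(trials, n)).mean(axis=1)

# Empirical variance of the MLE vs. the Cramer-Rao bound lambda/n.
print(estimates.var(), lam / n)  # both approximately 0.05
```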

I have no idea whether the Poisson distribution is a good fit for the data mentioned in the introduction. Laughably, I have not looked at the actual data yet!