## Brief Observation on Outliers

“Outlier” is a good example of a word that is not a term of art but nevertheless has a well-understood meaning or connotation: something set apart from the rest of the pack by a measurable or observable attribute. While connotations are useful as intuition guiding the development of mathematical theory, they are not useful when we get down to the business of rigorous proof. So, what is an outlier in mathematical statistics? If we have a sample of $n$ observations $X_{1},\ldots,X_{n}$, when should we call an observation $X_{i}$ an outlier?

Some authors define an observation $X_{i}$ to be an outlier if it is at least $k$ standard deviations away from the sample mean, where $k$ is some author-dependent integer.

$\displaystyle\left|X_{i}-\overline{X}\right|\geq k\cdot s_{X}$, where $\displaystyle\overline{X}=\dfrac{1}{n}\sum_{j=1}^{n}X_{j}\text{ and }s_{X}=\left(\dfrac{1}{n-1}\sum_{j=1}^{n}\left(X_{j}-\overline{X}\right)^{2}\right)^{\frac{1}{2}}$
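To make the definition concrete, here is a small Python sketch (the function name and signature are ours, not standard):

```python
import statistics

def is_outlier(sample, i, k=3):
    """Return True if sample[i] lies at least k sample standard
    deviations from the sample mean -- the definition above."""
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # unbiased: divides by n - 1
    return abs(sample[i] - xbar) >= k * s
```

Note that `statistics.stdev` is the unbiased (divide-by-$n-1$) estimator, matching $s_X$ above.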

A problem with this definition of outlier is that the sample mean and the (unbiased) sample standard deviation are themselves sensitive to outliers. This sensitivity causes trouble in small samples, where observations that are intuitively “outliers” fail to be outliers under our mathematical definition. Consider the following example. We have a sample of annual incomes (measured in \$ thousands) for five individuals:

$\displaystyle X_{1}=35, X_{2}=50, X_{3}=87, X_{4}=200, X_{5}=2000000$

Bill Gates or some other Forbes-list billionaire happened to be in the sample, which explains the fifth observation. The sample mean and sample standard deviation are respectively given by

$\displaystyle\overline{X}=400074.4\text{ and }s_{X}=894385.6$

Suppose we define an outlier to be an observation at least five standard deviations away from the sample mean. Since five times the sample standard deviation is $\displaystyle 4471928$, we see that there are no outliers in the sample. Even if we relax the definition to at least three standard deviations away from the sample mean, there are still no outliers, as

$\displaystyle 3\cdot s_{X}=2683157$

Our definition has failed us in this example.
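The arithmetic of the example can be checked directly (a quick verification in Python, using the unbiased standard deviation as above):

```python
import statistics

incomes = [35, 50, 87, 200, 2_000_000]  # $ thousands
xbar = statistics.mean(incomes)
s = statistics.stdev(incomes)           # unbiased: divides by n - 1
print(xbar, s)                          # approximately 400074.4 and 894385.6

# The billionaire's deviation never reaches even the 3-sigma threshold:
max_dev = max(abs(x - xbar) for x in incomes)
print(max_dev < 3 * s)                  # True: no outliers by the definition
```

The largest deviation, about $1599926$, falls well short of $3\cdot s_{X}\approx 2683157$, confirming the failure.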

More generally, we can ask the following question: given some fixed choice of $k$, how large a sample would we need to actually observe an outlier? If the observations are all identical, then the sample standard deviation is zero and no observation can be an outlier, so we may assume otherwise. Suppose $X_{i}$ is an outlier; then by definition

$\displaystyle\left|X_{i}-\overline{X}\right|\geq k\cdot s_{X}=\dfrac{k}{\sqrt{n-1}}\left(\sum_{j=1}^{n}(X_{j}-\overline{X})^{2}\right)^{\frac{1}{2}}$

Squaring both sides and rearranging, we see that

$\displaystyle n\geq 1+k^{2}\sum_{j=1}^{n}\dfrac{\left(X_{j}-\overline{X}\right)^{2}}{\left(X_{i}-\overline{X}\right)^{2}}=1+k^{2}+k^{2}\sum_{{j=1}\atop{j\neq i}}^{n}\dfrac{\left(X_{j}-\overline{X}\right)^{2}}{\left(X_{i}-\overline{X}\right)^{2}}\geq 1+k^{2}$
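The bound $n\geq 1+k^{2}$ can be probed empirically. The random search below (our illustration, not a proof) tries to find an outlier in samples of every size up to $k^{2}+1$ for $k=3$; by the bound, every attempt must fail:

```python
import random
import statistics

random.seed(0)
k = 3
found_outlier = False
# For each n <= k**2 + 1, draw many random samples and look for an
# observation at least k sample standard deviations from the mean.
for n in range(2, k**2 + 2):              # n = 2, ..., k**2 + 1 = 10
    for _ in range(1000):
        xs = [random.uniform(0, 1) for _ in range(n)]
        xbar = statistics.mean(xs)
        s = statistics.stdev(xs)
        if max(abs(x - xbar) for x in xs) >= k * s:
            found_outlier = True
print(found_outlier)                      # False: the bound forbids it
```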

Actually, you need at least $k^{2}+2$ observations. To see this, suppose the equality $n=k^{2}+1$ held. Then the final inequality above is an equality, which forces $\sum_{j\neq i}\left(X_{j}-\overline{X}\right)^{2}=0$; that is, the remaining $k^{2}$ observations must equal the sample mean. Without loss of generality, we may assume that $X_{1}=\cdots=X_{k^{2}}=\overline{X}$ and $\left|X_{k^{2}+1}-\overline{X}\right|\geq k\cdot s_{X}$. Observe that

$\displaystyle\overline{X}=\dfrac{1}{k^{2}+1}\left(X_{1}+\cdots+X_{k^{2}+1}\right)=\dfrac{1}{k^{2}+1}\left(k^{2}\overline{X}+X_{k^{2}+1}\right)\Rightarrow\overline{X}=X_{k^{2}+1}$

But then all observations are equal, so the sample variance is zero, which contradicts our assumption that the observations are not all identical.
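The bound $n\geq k^{2}+2$ is in fact attained. One construction (ours, added for illustration): take $k^{2}+1$ observations equal to $0$ and a single observation equal to $1$. A short computation gives $\left|X_{n}-\overline{X}\right|/s_{X}=(k^{2}+1)/\sqrt{k^{2}+2}>k$, so the lone $1$ is an outlier. Checking this numerically for $k=3$:

```python
import statistics

k = 3
xs = [0.0] * (k**2 + 1) + [1.0]   # n = k**2 + 2 = 11 observations
xbar = statistics.mean(xs)
s = statistics.stdev(xs)
ratio = abs(xs[-1] - xbar) / s
# ratio = (k**2 + 1) / sqrt(k**2 + 2) ~ 3.015 > k = 3,
# so the last observation is an outlier in a sample of size k**2 + 2.
print(ratio)
```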