Brief Observation on Outliers

“Outlier” is a good example of a word which is not a term of art but nevertheless has a well-understood meaning or connotation: something set apart from the rest of the pack by a measurable or observable attribute. While connotations are useful as intuition guiding the development of mathematical theory, they are not useful when we get down to the business of rigorous proof. So, what is an outlier in mathematical statistics? If we have a sample of n observations X_{1},\ldots,X_{n}, when should we call an observation X_{i} an outlier?

Some authors define an observation X_{i} to be an outlier if it is at least k standard deviations away from the sample mean, where k is some author-dependent integer.

\displaystyle\left|X_{i}-\overline{X}\right|\geq k\cdot s_{X}, where \displaystyle\overline{X}=\dfrac{1}{n}\sum_{j=1}^{n}X_{j}\text{ and }s_{X}=\left(\dfrac{1}{n-1}\sum_{j=1}^{n}\left(X_{j}-\overline{X}\right)^{2}\right)^{\frac{1}{2}}
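To make the rule concrete, here is a minimal Python sketch of the k-standard-deviation definition. The function name k_sigma_outliers is my own choice for illustration, and I am assuming the standard deviation uses the n-1 denominator, which is what statistics.stdev computes.

```python
import statistics


def k_sigma_outliers(xs, k):
    """Return the observations at least k sample standard deviations from the mean.

    Hypothetical helper for illustration; needs at least two observations.
    """
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)  # sample standard deviation, n-1 denominator
    if s == 0:
        # All observations are identical, so nothing can be an outlier.
        return []
    return [x for x in xs if abs(x - mean) >= k * s]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(k_sigma_outliers(sample, 2))
print(k_sigma_outliers(sample, 3))
```

In this toy sample the value 100 is flagged when k=2 but not when k=3, which already hints at how much the verdict depends on the choice of k and on the sample size.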

A problem with this definition of outlier is that the sample mean and the sample standard deviation (computed with the n-1 denominator) are themselves sensitive to outliers. This sensitivity leads to a problem in small samples, where observations that are intuitively “outliers” are not outliers by our mathematical definition. Consider the following example. We have a sample of annual incomes (measured in thousands of dollars) for five individuals:

\displaystyle X_{1}=35, X_{2}=50, X_{3}=87, X_{4}=200, X_{5}=2000000

Bill Gates or some other Forbes-list billionaire happened to be in the sample, which explains the fifth observation. The sample mean and sample standard deviation are respectively given by

\displaystyle\overline{X}=400074.4\text{ and }s_{X}=894385.6

Suppose we define an outlier to be an observation at least five standard deviations away from the sample mean. The largest deviation from the sample mean in our sample is \displaystyle\left|X_{5}-\overline{X}\right|=1599925.6. Since five times the sample standard deviation is \displaystyle 4471928, we see that there are no outliers in the sample. Even if we redefine an outlier to be an observation at least three standard deviations away from the sample mean, there are still no outliers as

\displaystyle 3\cdot s_{X}=2683157

Our definition has failed us in this example.
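The arithmetic in this example is easy to check by machine. The following is a short Python sketch using only the standard library; the variable names are my own, and statistics.stdev is assumed to match the n-1 definition of s_{X} used here.

```python
import statistics

# Annual incomes in thousands of dollars, as in the example above.
incomes = [35, 50, 87, 200, 2_000_000]

mean = statistics.mean(incomes)
s = statistics.stdev(incomes)  # sample standard deviation, n-1 denominator

# Largest distance from the sample mean, in units of s.
z_max = max(abs(x - mean) for x in incomes) / s

# Observations flagged under the 3- and 5-standard-deviation rules.
flagged_by_k = {
    k: [x for x in incomes if abs(x - mean) >= k * s]
    for k in (3, 5)
}

print("mean:", mean)
print("s:", round(s, 1))
print("largest deviation in units of s:", round(z_max, 3))
print("flagged:", flagged_by_k)
```

The billionaire sits only about 1.79 sample standard deviations from the mean, so neither threshold flags anything.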

More generally, we can ask the following question: given some fixed choice of k, how large a sample would we need to actually observe an outlier? If the observations are identical, then the sample standard deviation is zero and there are no outliers. So we may assume otherwise. Suppose that the observation X_{i} is an outlier. By hypothesis, we have the inequality

\displaystyle\left|X_{i}-\overline{X}\right|\geq k\cdot s_{X}=\dfrac{k}{\sqrt{n-1}}\left(\sum_{j=1}^{n}(X_{j}-\overline{X})^{2}\right)^{\frac{1}{2}}

Squaring both sides and rearranging, we see that

\displaystyle n\geq 1+k^{2}\sum_{j=1}^{n}\dfrac{\left(X_{j}-\overline{X}\right)^{2}}{\left(X_{i}-\overline{X}\right)^{2}}=1+k^{2}+k^{2}\sum_{{j=1}\atop{j\neq i}}^{n}\dfrac{\left(X_{j}-\overline{X}\right)^{2}}{\left(X_{i}-\overline{X}\right)^{2}}\geq 1+k^{2}

Actually, we need at least k^{2}+2 observations. To see this, suppose that the equality n=k^{2}+1 held. Then the sum over j\neq i in the inequality above must vanish, so the other k^{2} observations must all be equal to the sample mean. Without loss of generality, we may assume that X_{1}=\cdots=X_{k^{2}}=\overline{X} and \left|X_{k^{2}+1}-\overline{X}\right|=k\cdot s_{X}. Observe that

\displaystyle\overline{X}=\dfrac{1}{k^{2}+1}\left(X_{1}+\cdots+X_{k^{2}+1}\right)=\dfrac{1}{k^{2}+1}\left(k^{2}\overline{X}+X_{k^{2}+1}\right)\Rightarrow\overline{X}=X_{k^{2}+1}

But then every observation is equal to the sample mean, so the sample variance is zero, which contradicts our assumption that the observations are not all identical.
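As a numerical sanity check on the threshold k^{2}+2, here is a small Python sketch. It looks at only one family of samples, namely n-1 zeros and a single one, so it illustrates the bound rather than proving it. The helpers spike_sample and max_z_score are hypothetical names introduced for this illustration.

```python
import statistics


def spike_sample(n):
    """Hypothetical test sample: n-1 identical observations and one spike."""
    return [0.0] * (n - 1) + [1.0]


def max_z_score(xs):
    """Largest distance from the sample mean, in units of s (n-1 denominator)."""
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)
    return max(abs(x - mean) for x in xs) / s


for k in (2, 3, 5):
    for n in (k * k + 1, k * k + 2):
        z = max_z_score(spike_sample(n))
        print(f"k={k}, n={n}: max z = {z:.4f}, outlier present: {z >= k}")
```

For each k tried, the spike fails to qualify as an outlier with k^{2}+1 observations and qualifies with k^{2}+2, so for this family of samples the threshold is attained.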
