“Outlier” is a good example of a word which is not a term of art but nevertheless has a well-understood meaning or connotation: something set apart from the rest of the pack by a measurable or observable attribute. While connotations are useful as intuition guiding the development of mathematical theory, they are not useful when we get down to the business of rigorous proof. So, what is an outlier in mathematical statistics? If we have a sample of observations $x_1, \ldots, x_n$, when should we call an observation $x_i$ an outlier?
Some authors define an observation $x_i$ to be an outlier if it is at least $k$ standard deviations away from the sample mean, where $k$ is some author-dependent integer. In symbols, $x_i$ is an outlier when

$$|x_i - \bar{x}| \ge k s,$$

where $\bar{x} = \frac{1}{n} \sum_{j=1}^n x_j$ is the sample mean and $s = \sqrt{\frac{1}{n-1} \sum_{j=1}^n (x_j - \bar{x})^2}$ is the (unbiased) sample standard deviation.
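This rule is straightforward to express in code. Here is a minimal Python sketch of the $k$-standard-deviation definition; the function name and the toy sample are my own illustrative choices:

```python
import statistics

def outliers(sample, k=3):
    """Return the observations at least k sample standard deviations
    from the sample mean (unbiased estimator, denominator n - 1)."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # unbiased sample standard deviation
    return [x for x in sample if abs(x - mean) >= k * s]

print(outliers([1, 2, 3, 100], k=1))  # prints [100]
print(outliers([1, 2, 3, 100], k=3))  # prints []
```

Note that the same observation is or is not an outlier depending on the author's choice of $k$: with $k = 1$ the value $100$ is flagged, while with $k = 3$ nothing is.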
A problem with this definition of outlier is that the sample mean and the (unbiased) sample standard deviation are themselves sensitive to outliers. This sensitivity causes trouble in small samples, where observations that are intuitively “outliers” fail to be outliers under our mathematical definition. Consider the following example. We have a sample of annual incomes (measured in thousands of dollars) for five individuals, say

$$(x_1, x_2, x_3, x_4, x_5) = (40,\ 45,\ 50,\ 60,\ 1{,}000{,}000).$$

Bill Gates or some other Forbes-list billionaire happened to be in the sample, which explains the fifth observation. The sample mean and sample standard deviation are respectively given by

$$\bar{x} = 200{,}039 \qquad \text{and} \qquad s \approx 447{,}192.$$

Suppose we define an outlier to be an observation at least five standard deviations away from the sample mean. Since five times the sample standard deviation is $5s \approx 2{,}235{,}959$, while even the billionaire's income deviates from the mean by only $|x_5 - \bar{x}| = 799{,}961$, we see that there are no outliers in the sample. Even if we redefine an outlier to be three standard deviations away from the sample mean, there are still no outliers, as $3s \approx 1{,}341{,}575$ still exceeds every deviation in the sample.
Our definition has failed us in this example.
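The arithmetic of the income example can be checked numerically. A short Python sketch, using an illustrative five-point income sample in which four people earn modest incomes and one earns a billion dollars (figures in thousands of dollars, chosen for the example rather than taken from real data):

```python
import statistics

# Illustrative incomes in thousands of dollars; the last entry is a
# billion-dollar income standing in for the Forbes-list billionaire.
incomes = [40, 45, 50, 60, 1_000_000]

mean = statistics.mean(incomes)  # 200,039
s = statistics.stdev(incomes)    # about 447,192

# The largest deviation belongs to the billionaire, yet it is smaller
# than 3s, so neither the 5-sd nor the 3-sd rule flags anything.
largest_dev = max(abs(x - mean) for x in incomes)
assert largest_dev < 3 * s < 5 * s
```

Whatever the exact figures, the billionaire drags both the mean and the standard deviation up so far that no observation can clear even the three-standard-deviation bar.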
More generally, we can ask the following question: given some fixed choice of $k$, how large of a sample would we need to actually observe an outlier? If the observations are all identical, then the sample standard deviation is zero and there are no outliers, so we may assume otherwise; in particular, $s > 0$. Suppose now that some $x_i$ is an outlier. By hypothesis, we have the inequality

$$k s \le |x_i - \bar{x}| \le \sqrt{\sum_{j=1}^n (x_j - \bar{x})^2} = \sqrt{n - 1}\, s,$$

where the middle inequality holds because a single squared deviation can be no larger than the sum of all the squared deviations. Squaring both sides, dividing by $s^2 > 0$, and rearranging, we see that

$$n \ge k^2 + 1.$$

Actually, you need at least $k^2 + 2$ observations. To see this, suppose that the equality $n = k^2 + 1$ held. Then both inequalities in the display above must be equalities; in particular, $(x_i - \bar{x})^2 = \sum_{j=1}^n (x_j - \bar{x})^2$, so the other $n - 1$ observations must be equal to the sample mean. Without loss of generality, we may assume that $i = 1$ and $x_2 = \cdots = x_n = \bar{x}$. Observe that

$$n \bar{x} = \sum_{j=1}^n x_j = x_1 + (n - 1) \bar{x}, \qquad \text{and hence} \qquad x_1 = \bar{x}.$$
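The bound $|x_i - \bar{x}| \le \sqrt{n - 1}\, s$ can also be spot-checked empirically. The following Python sketch (random Gaussian samples of my own choosing) verifies that the largest deviation never exceeds the bound:

```python
import math
import random
import statistics

random.seed(0)
for _ in range(1_000):
    n = random.randint(2, 30)
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)
    largest_dev = max(abs(x - mean) for x in sample)
    # No observation can sit more than sqrt(n - 1) sample standard
    # deviations from the sample mean (tiny tolerance for rounding).
    assert largest_dev <= math.sqrt(n - 1) * s + 1e-9
```

A thousand random samples are, of course, no substitute for the one-line proof above, but failing to find a counterexample is reassuring.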
But then the sample variance is zero, which contradicts our hypothesis.
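Conversely, $k^2 + 2$ observations really do suffice. One construction (my own illustration, not part of the argument above) is a “spike” sample with a single nonzero value: for one $1$ and $n - 1$ zeros, a direct computation shows the spike sits exactly $(n - 1)/\sqrt{n}$ sample standard deviations from the mean, which is at least $k$ precisely when $n \ge k^2 + 2$. A Python sketch:

```python
import math
import statistics

def spike_ratio(n):
    """How many sample standard deviations the spike in [1, 0, ..., 0]
    (one 1 and n - 1 zeros) sits from the sample mean."""
    sample = [1.0] + [0.0] * (n - 1)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)
    return abs(sample[0] - mean) / s  # equals (n - 1) / sqrt(n)

k = 2
assert spike_ratio(k**2 + 1) < k   # n = 5: 4/sqrt(5) ~ 1.79, no 2-sd outlier
assert spike_ratio(k**2 + 2) >= k  # n = 6: 5/sqrt(6) ~ 2.04, outlier achieved
```

So for $k = 2$ the minimal sample size is exactly $6 = k^2 + 2$: five observations can never contain a two-standard-deviation outlier, while six can.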