Thursday, January 10, 2013

How to find an outlier


How do we know when a data point is an outlier?  Take a look at the figure below.  It represents 15 data points that were gathered in some experiment.  Would you say that the left-most point is an outlier? 


Maybe the instrument that collected this data point had a malfunction, or maybe the subject who produced the data did not follow the instructions.  If we have no information other than the data itself, how would we decide?

When we say a data point is an outlier, we are saying that it is unlikely to have been generated by the same process that generated the rest of our data.  For example, if we assume that our data were generated by a random process with a Gaussian distribution, then there is only about a 0.13% chance of collecting a data point that is more than 3 standard deviations above the mean (about 0.27% in either direction).  So what we need to do is estimate the standard deviation of the underlying process that generated the data.  Here I will review two approaches, and then show how successful they are in labeling outliers.
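As a quick sanity check on those numbers, here is a small Python snippet (my own illustration; the post itself contains no code) that computes the Gaussian tail probabilities using only the standard library:

from math import erfc, sqrt

# P(Z > 3) for a standard Gaussian: the chance of landing more than
# 3 standard deviations above the mean.
p_one_tail = 0.5 * erfc(3 / sqrt(2))
print(p_one_tail)      # ~0.00135, i.e. about 0.13%
print(2 * p_one_tail)  # ~0.0027, about 0.27% for either direction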

Median Absolute Deviation (MAD)
Hampel (1974) suggested that we begin by finding the median of the data set.


Next, we make a new data set consisting of the distance (a positive number) between each data point and the median.  Finally, we find the median of this new data set.  That is, we compute the following:

MAD = b × median( abs(x - median(x)) )

If we set b=1.4826, then MAD is an estimate of the standard deviation of our data set, assuming that the true underlying data came from a Gaussian distribution.  For our data set above, here is the estimate of the standard deviation, centered on the median:



Based on the MAD estimate of the standard deviation, we would say that the left-most data point is indeed more than 3 estimated standard deviations (MADs) from our estimate of the mean (the median).
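To make the recipe concrete, here is a minimal Python sketch of the MAD estimate (the function name mad_sigma and the use of numpy are my own; the post itself shows no code):

import numpy as np

def mad_sigma(x, b=1.4826):
    # MAD-based estimate of the standard deviation, centered on the median.
    x = np.asarray(x, dtype=float)
    med = np.median(x)                      # robust estimate of the center
    return b * np.median(np.abs(x - med))   # b = 1.4826 makes this consistent for Gaussian data

A point would then be flagged when its distance from the median exceeds 3 times mad_sigma(x), which is exactly the rule described next.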

So a typical approach is to label as ‘outlier’ any data point that is farther than 3 times the MAD (our estimate of the standard deviation) from the median of the data.  That is, compute the following for each data point:

abs(x - median(x)) / MAD

Label as ‘outlier’ the data points for which this measure is greater than 3.  But how good is this method?  To check it, I did the following experiment.  I generated data sets drawn from a normal distribution with a constant mean and standard deviation, and then computed the probability of a false positive; that is, I computed how likely it was that a point labeled as an outlier by MAD was in fact less than 3 standard deviations from the mean.  Here is the resulting probability, plotted as a function of the data size:


The above plot shows that when the data set is small (say 10 data points), about 20% of the data points that the algorithm picks as outliers are in fact within 3 standard deviations of the mean.  As the data set grows larger, the probability of false positives declines and the algorithm does better.  But even for a data set of size 20, there is a better than 15% chance that the ‘bad’ data point is in fact not bad.
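Here is a sketch of both the labeling rule and a Monte Carlo check along the lines of the experiment described above (the names mad_outliers and false_positive_rate, the trial count, and the choice of a standard normal are my own assumptions; the check is written generically so it can be reused for MDM below):

import numpy as np

def mad_outliers(x, threshold=3.0, b=1.4826):
    # Flag points farther than `threshold` MADs from the median of x.
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = b * np.median(np.abs(x - med))
    return np.abs(x - med) / mad > threshold

def false_positive_rate(outlier_fn, n, trials=20000, rng=None):
    # Of all points flagged by outlier_fn, what fraction actually lie
    # within 3 true standard deviations of the true mean?
    rng = np.random.default_rng() if rng is None else rng
    flagged = false_pos = 0
    for _ in range(trials):
        x = rng.standard_normal(n)                  # true mean 0, true SD 1
        hits = outlier_fn(x)
        flagged += hits.sum()
        false_pos += (hits & (np.abs(x) < 3)).sum()
    return false_pos / max(flagged, 1)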


Median Deviation of the Medians (MDM)

Rousseeuw and Croux (1993) suggested a method that, as we will see, is better.  For each data point x_i, we find its distance to every other data point and take the median of those distances.  Doing this for all data points gives us n medians; we then find the median of this new data set:

MDM = c × median_i( median_j( abs(x_i - x_j) ) )

If we set c=1.1926, then MDM is a robust estimate of the standard deviation of the data set, assuming that the true underlying data came from a Gaussian distribution.  For our data set above, here is the estimate of the standard deviation:


To check how this method compares with MAD, I generated data sets drawn from a normal distribution with a constant mean and standard deviation, and then computed the probability of a false positive; that is, I computed how likely it was that a point labeled as an outlier by MDM was in fact less than 3 standard deviations from the mean.  Here is the resulting probability, plotted as a function of the data size:


The above plot shows that regardless of the size of the data set (here ranging from 6 data points to 20), a data point that MDM labels as an outlier has about a 9% chance of being a false positive, i.e., not an outlier.  For small data sets, MDM is two to three times better than MAD.
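For comparison, here is a direct O(n^2) Python sketch of the median-of-pairwise-medians estimate as described above, plus the corresponding labeling rule (function names are mine; note that Rousseeuw and Croux's finite-sample Sn uses high and low medians rather than the plain median, so the small-sample behavior may differ slightly from theirs):

import numpy as np

def mdm_sigma(x, c=1.1926):
    # For each point x_i, take the median distance to every other point x_j,
    # then take the median of those n per-point medians.
    x = np.asarray(x, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :])               # n x n pairwise distances
    per_point = np.array([np.median(np.delete(row, i))    # drop the zero self-distance
                          for i, row in enumerate(diffs)])
    return c * np.median(per_point)

def mdm_outliers(x, threshold=3.0):
    # Flag points farther than `threshold` MDM-estimated SDs from the median.
    x = np.asarray(x, dtype=float)
    return np.abs(x - np.median(x)) / mdm_sigma(x) > threshold

Reusing false_positive_rate and mad_outliers from the earlier sketch, something like the following loop runs the comparison for small sample sizes (exact numbers will vary from run to run):

for n in (6, 10, 15, 20):
    print(n, false_positive_rate(mad_outliers, n), false_positive_rate(mdm_outliers, n))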

References
Hampel FR (1974) The influence curve and its role in robust estimation. Journal of the American Statistical Association 69:383-393.
Rousseeuw PJ, Croux C (1993) Alternatives to the median absolute deviation. Journal of the American Statistical Association 88:1273-1283.


2 comments:

  1. But aren't the outliers the data points that are most likely to yield interesting results when they are closely examined?

    I thought that most scientific discoveries came when someone said, "Hmmm. Now that's funny."

  2. This was very useful. Thanks for posting it. I am curious to know how these two methods compare in terms of sensitivity (false negatives).
    Answering Michael Turner: you are absolutely correct, the point of detecting outliers is not always to exclude them. It all depends on the dataset. After closely examining them (and ruling out technical reasons), some can turn out to be real biological outliers. Anyhow, it is still useful to be able to detect the outliers in your data using these methods.
