How do we know when a data point is an outlier? Take a look at the figure below. It represents 15 data points that were
gathered in some experiment. Would you
say that the left-most point is an outlier?
Maybe the instrument that collected this data point had a
malfunction, or maybe the subject that produced that data did not follow the
instructions. If we have no information other than the data itself, how would we decide?
When we say a data point is an outlier, we are saying that it is unlikely to have been generated by the same process that generated the rest of our data. For example, if we assume that our data was generated by a random process with a Gaussian distribution, then there is only about a 0.13% chance that we would collect a data point that is more than 3 standard deviations above the mean. So what we need to do is estimate the standard deviation of the underlying process that generated the data. Here I will review two approaches, and then show how successful they are at labeling outliers.
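As a quick check of that 0.13% figure, here is a one-line computation in Python with SciPy (the tooling choice is mine, not the post's):

```python
from scipy.stats import norm

# Probability that a Gaussian sample lands more than 3 standard
# deviations above the mean (one tail).
print(f"{1 - norm.cdf(3):.4%}")  # prints 0.1350%
```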
Median Absolute Deviation (MAD)
Hampel (1974) suggested that we begin by finding the median of the data set. Next, we make a new data set consisting of the distance (a positive number) between each data point and the median. Finally, we find the median of this new data set. That is, we compute the following:

MAD = b · median( |x − median(x)| )
If we set b = 1.4826, then MAD is an estimate of the standard deviation of our data set, assuming that the true underlying data came from a Gaussian distribution. For our data set above, the figure shows this estimate of the standard deviation, centered on the median.
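As a concrete illustration, here is a minimal sketch of the MAD computation in Python with NumPy (the function name and defaults are my own, not from the post):

```python
import numpy as np

def mad(x, b=1.4826):
    """MAD estimate of the standard deviation: the median of the
    absolute deviations from the median, scaled by b so that it is
    consistent for Gaussian data."""
    x = np.asarray(x, dtype=float)
    return b * np.median(np.abs(x - np.median(x)))
```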
Based on the MAD estimate of the standard deviation, we would say that the left-most data point is indeed more than 3 estimated standard deviations (MADs) from our estimate of the mean (the median).
So a typical approach is to label as ‘outlier’ any data point that lies more than 3 MADs (estimated standard deviations) from the median of the data. That is, compute the following for each data point:

|x − median(x)| / MAD

Label as ‘outlier’ the data points for which this measure gives a number greater than 3.
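In code, the labeling rule might look like this (a sketch building on the mad() helper above; the threshold of 3 follows the post, the example data is made up):

```python
def mad_outliers(x, threshold=3.0):
    """Flag points whose absolute deviation from the median exceeds
    `threshold` times the MAD estimate of the standard deviation."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - np.median(x)) / mad(x) > threshold

# Example: only the last point is flagged.
print(mad_outliers([9.8, 10.1, 10.0, 9.9, 10.2, 14.0]))
```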
But how good is this method? To check it, I did the following experiment. I generated data sets drawn from a normal distribution with a constant mean and standard deviation, and then computed the probability of a false positive; that is, I computed how likely it was that a point would be labeled as an outlier by MAD when in fact it was less than 3 standard deviations from the mean. Here is the resulting probability, plotted as a function of the data set size:
The above plot shows that when the data set is small (say, 10 data points), about 20% of the points that the algorithm picks as outliers are in fact within 3 standard deviations of the mean. As the data set grows larger, the probability of false positives declines and the algorithm does better. But even for a data set of size 20, there is better than a 15% chance that a point labeled as bad is in fact not bad.
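Here is one way to reproduce this experiment in simulation (a sketch under my own assumptions about the number of simulations and the random seed; the post does not give its exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def false_positive_rate(n, detector, n_sims=20_000):
    """Among points the detector flags in standard-normal samples of
    size n, estimate the fraction that are actually within 3 true
    standard deviations of the true mean."""
    flagged = false = 0
    for _ in range(n_sims):
        x = rng.standard_normal(n)
        is_out = detector(x)
        flagged += is_out.sum()
        false += (is_out & (np.abs(x) < 3)).sum()
    return false / flagged if flagged else float("nan")

print(false_positive_rate(10, mad_outliers))
```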
Median Deviation of the Medians (MDM)
Rousseeuw and Croux (1993) suggested a method that, as we will see, is better. For each data point xi, we find its distance to every other data point xj and take the median of those distances. Doing this for all data points gives us n medians, and we then find the median of this new data set:

MDM = c · median_i( median_j≠i( |xi − xj| ) )
If we set c = 1.1926, then MDM is a robust estimate of the standard deviation of the data set, assuming that the true underlying data came from a Gaussian distribution. For our data set above, the figure shows this estimate of the standard deviation.
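A sketch of the MDM computation in the same style (the naming is mine; this simple version computes all O(n²) pairwise distances):

```python
import numpy as np

def mdm(x, c=1.1926):
    """For each point, take the median of its distances to the other
    points; MDM is c times the median of those n medians."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])  # n-by-n pairwise distances
    # Median over the other points (drop the zero self-distance).
    inner = np.array([np.median(np.delete(row, i))
                      for i, row in enumerate(d)])
    return c * np.median(inner)
```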
To check how this method compares with MAD, I generated data sets drawn from a normal distribution with a constant mean and standard deviation, and then computed the probability of a false positive; that is, I computed how likely it was that a point would be labeled as an outlier by MDM when in fact it was less than 3 standard deviations from the mean. Here is the resulting probability, plotted as a function of the data set size:
The above plot shows that regardless of the size of the data set (here ranging from 6 data points to 20), a data point that MDM labels as an outlier has about a 9% chance of being a false positive, i.e., of not actually being an outlier. For small data sets, MDM is two to three times better than MAD.
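The comparison can be run with the same simulation sketch by swapping in an MDM-based detector (again my own naming; the exact numbers will depend on the simulation settings):

```python
def mdm_outliers(x, threshold=3.0):
    """Flag points more than `threshold` MDMs from the median."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - np.median(x)) / mdm(x) > threshold

for n in (6, 10, 20):
    print(n, false_positive_rate(n, mad_outliers),
          false_positive_rate(n, mdm_outliers))
```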
References

Hampel FR (1974) The influence curve and its role in robust estimation. Journal of the American Statistical Association 69:383-393.

Rousseeuw PJ, Croux C (1993) Alternatives to the median absolute deviation. Journal of the American Statistical Association 88:1273-1283.
Comments

But aren't the outliers the data points that are most likely to yield interesting results when they are closely examined?

I thought that most scientific discoveries came when someone said, "Hmmm. Now that's funny."

This was very useful. Thanks for posting it. I am curious to know how these two methods compare in terms of sensitivity (false negatives)?

Answering Michael Turner: I guess the point of detecting outliers is not always to exclude them. You are absolutely correct; it all depends on the dataset. You closely examine them and, after ruling out technical reasons, some can be real biological outliers. Anyhow, it would be useful to detect the outliers in your data using these methods.