Outlier exclusion

Finding and discarding faulty and biased values

The interquartile range is the difference between the third and the first quartile, i.e., between the 75th and the 25th percentile. If the data are normally distributed, 50% of values lie inside the interquartile range, 82.3% inside its twofold, 99.3% inside its fourfold and more than 99.999% inside its sevenfold. The situation is different for a lognormal distribution: the interquartile range again covers 50% of values and its twofold covers 83.9%, but its fourfold covers only 92.2% and even its sevenfold covers only 96.7% of values.
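The percentages for the normal distribution can be checked with a few lines of Python. This is only a sketch: it assumes that the "n-fold" of the interquartile range means an interval centred on the median and n times as wide as the interquartile range, and the function name `normal_coverage` is made up for this illustration.

```python
from statistics import NormalDist


def normal_coverage(multiple):
    """Share of a normal distribution inside an interval centred on the
    median and `multiple` times as wide as the interquartile range
    (assumed reading of "n-fold of the interquartile range")."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    iqr = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)
    half_width = multiple * iqr / 2
    return std_normal.cdf(half_width) - std_normal.cdf(-half_width)


for multiple in (1, 2, 4, 7):
    print(f"{multiple}-fold: {100 * normal_coverage(multiple):.4f} %")
```

The result does not depend on the mean or the standard deviation, so the standard normal distribution is sufficient. No equivalent is given for the lognormal case, because there the coverage depends on the shape parameter of the distribution and on how the interval is placed.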

Values that are substantially shifted compared to the others (up or down along the concentration axis, i.e., the y-axis of the plot) should attract your attention. They are called outliers and are often the product of an error in measurement or in data creation, transfer or storage. However, outliers can also reflect an unusual but real event in the environment, and therefore need not be a meaningless part of the time series. Outlier exclusion thus demands a very careful and consistent approach.

Depending on the purpose of the study, it is necessary to decide whether outliers will be removed from the data, and which ones, since they can strongly bias the final results of the analysis.

For example, if PAHs are monitored at a site and an ad hoc event occurs in the area (e.g., an industrial accident, or simply leaves being burned in a garden), it is of key importance to decide whether the resulting value reflects a typical (and therefore permanent) environmental burden, in which case it should be kept, or an extraordinary deviation, which should be excluded from the data.

If we do not know the process that led to the outlier, we need guidance (i.e., a threshold value) indicating which values are too high (or possibly too low) and should therefore be removed.

Statistical practice offers many methods, of which the most common are:

1. a parametric approach that excludes everything outside the range mean ± 3 ∙ standard deviation (for normally distributed data), or

2. a non-parametric approach that excludes values outside a multiple of the interquartile range (try both in our example).
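The two approaches listed above can be sketched in Python as follows. The function names, the sample concentrations and the default multiplier of 1.5 for the interquartile range are assumptions made for illustration only. The non-parametric variant is implemented here as Tukey's fences (first quartile minus k times the interquartile range, third quartile plus k times the interquartile range), which is one common reading of "a multiple of the interquartile range" and not necessarily the one used in the example.

```python
import statistics


def sd_outliers(values, k=3.0):
    """Parametric rule: values outside mean ± k * sample standard deviation."""
    centre = statistics.mean(values)
    spread = statistics.stdev(values)
    low, high = centre - k * spread, centre + k * spread
    return [v for v in values if v < low or v > high]


def iqr_outliers(values, k=1.5):
    """Non-parametric rule (Tukey's fences, an assumed variant):
    values below Q1 - k * IQR or above Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical concentrations with one suspicious record (25.0).
concentrations = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 1.0, 0.9,
                  1.2, 1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.1, 1.0, 25.0]

print(sd_outliers(concentrations))
print(iqr_outliers(concentrations))
```

Note that the parametric rule is itself sensitive to outliers, because an extreme value inflates both the mean and the standard deviation. With the sample standard deviation, no value in a data set of ten or fewer records can lie more than three standard deviations from the mean, so the rule only becomes useful for longer series.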

Determination of threshold values for outlier exclusion

Unlike values below the LoQ, which represent unknown but real values, outliers are usually records without any real basis: the value has been destroyed completely by an exceptional process such as a device measurement error, an incorrect transcription of a numerical value, or an incident in the measurement area. Exclusion of outliers is therefore reasonable, but it must be well justified.

There are numerous methods for determining the threshold values. The parametric ones combine a measure of central tendency with a multiple of a measure of concentration variability: either the arithmetic mean and the standard deviation, or the geometric mean and the geometric standard deviation.

In the non-parametric methods, the median is used as the central tendency and the geometric standard deviation as the measure of variability.
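A minimal Python sketch of the three combinations described above is given here. Several details are assumptions, not taken from the text: the multiplier `k` is left as a free parameter, the arithmetic thresholds are taken as mean ± k ∙ standard deviation, and the thresholds based on the geometric standard deviation (GSD) are assumed to be multiplicative, i.e., centre / GSD^k and centre ∙ GSD^k. The function name `outlier_thresholds` and the `method` labels are hypothetical.

```python
import math
import statistics


def outlier_thresholds(values, k=3.0, method="arithmetic"):
    """Return (lower, upper) thresholds for outlier exclusion.

    method="arithmetic": mean ± k * standard deviation
    method="geometric":  geometric mean divided/multiplied by GSD**k (assumed form)
    method="median":     median divided/multiplied by GSD**k (assumed form)
    """
    if method == "arithmetic":
        centre = statistics.mean(values)
        spread = statistics.stdev(values)
        return centre - k * spread, centre + k * spread

    # The geometric measures are only defined for positive values.
    logs = [math.log(v) for v in values]
    gsd = math.exp(statistics.stdev(logs))  # geometric standard deviation

    if method == "geometric":
        centre = math.exp(statistics.mean(logs))  # geometric mean
    elif method == "median":
        centre = statistics.median(values)
    else:
        raise ValueError(f"unknown method: {method!r}")

    return centre / gsd ** k, centre * gsd ** k


# Hypothetical concentrations, for illustration only.
sample = [1.0, 2.0, 4.0, 8.0, 16.0]
for name in ("arithmetic", "geometric", "median"):
    print(name, outlier_thresholds(sample, k=3.0, method=name))
```

Values below the lower or above the upper threshold would then be candidates for exclusion. The multiplicative form keeps the lower threshold positive, which suits concentration data that are roughly lognormally distributed.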


