Outliers exclusion
Finding and discarding faulty and biased values
Values that are substantially shifted (up or down along the concentration y-axis of the plot) compared with the other values should attract your attention. They are called outliers and are often the product of an error in measurement, data creation, transfer or storage. However, outliers can also describe an unusual but real event in the environment and therefore need not be a meaningless part of the time series. Outlier exclusion thus demands a very careful and consistent approach.
Depending on the purpose of the study, it is necessary to decide whether outliers will be removed from the data, and which ones, since they could strongly bias the final results of the analysis.
For example, if PAHs are monitored at a site and some ad hoc event occurs in the area (e.g., an industrial accident, but also something as simple as burning leaves in a garden), it is of key importance to decide whether the value reflects a typical (and therefore permanent) environmental burden, in which case it should not be excluded, or an extraordinary deviation, which should be excluded from the data.
If we do not know the process that led to the outlier, we need guidance (i.e., a threshold value) to indicate which values are too high (or, possibly, too low) and should be removed.
There are many methods in statistical practice, of which the most common are:
1. a parametric approach, which excludes everything outside the range mean ± 3 ∙ standard deviation (for a normal distribution), or
2. a non-parametric approach, which excludes values outside a multiple of the interquartile range (try both in our example).
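The two approaches listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed implementation: the function names, the sample concentrations and the use of NumPy are assumptions made for the example. The text does not say which multiple of the interquartile range to use or how the fences are anchored, so the sketch assumes the widely used convention of fences at Q1 − 1.5 ∙ IQR and Q3 + 1.5 ∙ IQR.

```python
import numpy as np


def parametric_limits(values, k=3.0):
    """Return (lower, upper) = mean -/+ k * sample standard deviation."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    sd = x.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)
    return mean - k * sd, mean + k * sd


def iqr_limits(values, k=1.5):
    """Return (lower, upper) = Q1 - k * IQR, Q3 + k * IQR.

    k = 1.5 is an assumed, conventional multiple; the text leaves it open.
    """
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, limits):
    """Boolean mask that is True where a value lies outside the limits."""
    x = np.asarray(values, dtype=float)
    lower, upper = limits
    return (x < lower) | (x > upper)


# Hypothetical concentrations: 19 ordinary values and one suspicious record.
concentrations = [1.2, 0.9, 1.1, 1.4, 1.0, 1.3, 0.8, 1.2, 1.1, 1.0,
                  0.9, 1.1, 1.3, 1.0, 1.2, 0.9, 1.1, 1.0, 1.2, 9.5]

parametric_flags = flag_outliers(concentrations, parametric_limits(concentrations))
iqr_flags = flag_outliers(concentrations, iqr_limits(concentrations))
print("3-SD rule flags:", np.asarray(concentrations)[parametric_flags])
print("IQR rule flags: ", np.asarray(concentrations)[iqr_flags])
```

Note that the mean and the standard deviation are themselves inflated by the outlier, so in short series the parametric rule may fail to flag even an obviously extreme value. This is one reason to compare it with the non-parametric rule before deciding what to exclude.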
Determination of threshold values for outliers exclusion
Unlike the LoQs, which represent unknown but real values, outliers are usually records without any real basis: the value is completely destroyed by an exceptional process such as a device measurement error, incorrect transcription of a numerical value or an incident in the measurement area. Exclusion of outliers is therefore reasonable; however, it must be well justified.
There are numerous methods for determining the threshold values. The parametric ones use a measure of central tendency and different multiples of a measure of concentration variance, such as the mean and standard deviation:
$$\text{mean} \pm k \cdot \text{standard deviation} \quad (k = \text{chosen multiple, e.g. } 3)$$
or the geometric mean and geometric standard deviation.
In the case of non-parametric methods, the median is used as the central tendency and the geometric standard deviation as the measure of variance.
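Thresholds based on the geometric standard deviation (GSD) could be computed as in the following sketch. It rests on an assumption that is not stated in the text: because the GSD is a multiplicative factor, the limits are taken as center ÷ GSD^k and center × GSD^k, where the center is either the geometric mean or the median and k is the chosen multiple. The function name, its parameters and the sample data are hypothetical.

```python
import numpy as np


def geometric_limits(values, k=3.0, center="geometric_mean"):
    """Return (lower, upper) = center / GSD**k, center * GSD**k.

    Assumed multiplicative form; 'center' is "geometric_mean" or "median".
    """
    x = np.asarray(values, dtype=float)
    if np.any(x <= 0):
        raise ValueError("geometric statistics need strictly positive values")

    logs = np.log(x)
    gsd = np.exp(logs.std(ddof=1))  # geometric standard deviation

    if center == "geometric_mean":
        c = np.exp(logs.mean())     # geometric mean
    elif center == "median":
        c = np.median(x)
    else:
        raise ValueError("center must be 'geometric_mean' or 'median'")

    return c / gsd**k, c * gsd**k


# Hypothetical values spanning two orders of magnitude.
sample = [1.0, 10.0, 100.0]
print(geometric_limits(sample, k=1.0, center="geometric_mean"))
print(geometric_limits(sample, k=1.0, center="median"))
```

Working on the logarithmic scale keeps both limits positive, which suits concentration data; the additive mean ± k ∙ standard deviation rule can produce a negative lower limit for the same series.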
3: Outliers exclusion