Visual inspection of the data

Filtering damaged datasets, drawing plots and revealing hidden patterns

Most data on POPs concentrations exhibit a roughly log-normal distribution, which corresponds to processes described by a first-order chemical equation. Usually, if no new source of the pollutant is present, the histogram should have a log-normal shape. In contrast, if there is a source of the pollutant in the environment, concentrations do not decline as quickly and the concentration level settles near a specific value. In such cases, a normal or more complicated distribution tends to occur.

As a first step of any POPs data analysis, visual inspection of the data is highly recommended. A simple time series plot with time on the x-axis and concentration on the y-axis (on a linear or logarithmic scale), in which individual measurements are represented as points, can provide valuable information on seasonality, variance and often also the long-term trend.

A simple visualisation can also serve as a good basis for deciding whether it is necessary to exclude outliers or perform some other pre-processing; advanced analysts may also use it to suggest a data transformation.

If there are many time points with high seasonality or variance, it can be useful to connect consecutive points by a polyline; sometimes a hidden seasonality or trend emerges without any further computation (see our examples).

There are other methods of visualisation beyond the simple time series plot. A histogram, whose columns represent the number of values falling in selected intervals on the x-axis, can suggest the statistical distribution of the concentrations, which in turn can reveal processes in the environment.

The variance of the concentration values is apparent from the box & whisker plot, which summarises all the values along the y-axis. Selected quantiles are highlighted, usually the minimum, maximum, quartiles and median.

Time series plot

This is a simple xy plot, where time is on the x-axis and concentration is on the y-axis. Each POPs measurement is depicted as a point whose x-coordinate is the midpoint of the measurement period and whose y-coordinate is the resulting concentration. It can be useful to connect the points by a polyline, i.e. a set of consecutive line segments of the form

$y = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1), \quad x_1 \le x \le x_2$

for consecutive points [x1;y1], [x2;y2].
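The line segment between two consecutive points can be evaluated with a few lines of code. The following is a minimal sketch, not the tool's actual implementation; the function name `polyline_value` and the sample data are hypothetical, and the points are assumed to be sorted by time with distinct x values.

```python
from bisect import bisect_right

def polyline_value(points, x):
    """Evaluate the polyline through `points` (sorted by x) at position x."""
    xs = [p[0] for p in points]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the measured time range")
    # Index of the right end point of the segment [x1, x2] containing x.
    i = min(bisect_right(xs, x), len(xs) - 1)
    (x1, y1), (x2, y2) = points[i - 1], points[i]
    # Straight line through [x1; y1] and [x2; y2].
    return y1 + (y2 - y1) / (x2 - x1) * (x - x1)

# Hypothetical measurements: (time in days, concentration in ng/filter).
samples = [(0.0, 4.0), (7.0, 2.0), (14.0, 3.0)]
halfway = polyline_value(samples, 10.5)  # midway between the last two points
```

At the measured time points the polyline returns the measured concentrations themselves; between them it interpolates linearly.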

Trend plot

The trend plot adds a trend line to the time series plot. If one of the 30 example compounds measured in Kosetice is chosen, an exponential trend line fitted by the least squares method is drawn, of the general form

$y = a e^{bx}$

Otherwise, in the case of user-defined input, a linear trend is depicted, of the general form

$y = a + bx$

where a and b denote the fitted coefficients.
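Both kinds of trend line can be sketched with ordinary least squares. The example below is an illustration under stated assumptions: the function names and the synthetic data are hypothetical, and the exponential fit is done by fitting a straight line to ln(y), which is a common simplification. The text does not say whether the plotted trend is obtained this way or by a direct non-linear fit, and the two can give slightly different coefficients because the log-scale fit minimises squared errors of ln(y) rather than of y.

```python
import math

def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def fit_exponential(xs, ys):
    """Fit y = a * exp(b*x) via a linear fit to ln(y); requires all y > 0."""
    ln_a, b = fit_linear(xs, [math.log(y) for y in ys])
    return math.exp(ln_a), b

# Synthetic, noise-free data decaying as 10 * exp(-0.5 * t) (hypothetical).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
conc = [10.0 * math.exp(-0.5 * t) for t in times]
a, b = fit_exponential(times, conc)  # recovers a close to 10 and b close to -0.5
```

A negative coefficient b in the exponential fit corresponds to a decreasing long-term trend.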

Histogram

Although the histogram is basically an xy plot as well, the axes have a different meaning here. The concentration is now on the x-axis, and the y-axis shows the frequency of concentration values inside the ranges defined by the width of the columns. E.g. if a column starts at 1 ng/filter and ends at 2 ng/filter on the x-axis and its height is 21, then 21 values are higher than 1 ng/filter and lower than or equal to 2 ng/filter.
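The counting rule in the example above (higher than the lower edge, lower than or equal to the upper edge) can be sketched as follows. The function name, the bin settings and the data are hypothetical and only serve as an illustration.

```python
import math

def histogram_counts(values, start, width, n_bins):
    """Count values in the intervals (start + k*width, start + (k+1)*width]."""
    counts = [0] * n_bins
    for v in values:
        # ceil(...) - 1 puts a value lying exactly on an upper edge
        # into the lower column, matching the "lower or equal" rule.
        k = math.ceil((v - start) / width) - 1
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical concentrations in ng/filter.
values = [0.5, 1.0, 1.2, 1.7, 2.0, 2.0, 2.5, 3.0]
counts = histogram_counts(values, start=0.0, width=1.0, n_bins=3)
print(counts)  # [2, 4, 2]
```

Here the column from 1 to 2 ng/filter has height 4, because the values 1.2, 1.7, 2.0 and 2.0 fall into it, while the value 1.0 belongs to the column below.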

For the example compounds, a log-normal curve is added to the histogram, showing the ideal log-normal distribution that is expected. The curve has the following shape:

$f(y) = \frac{1}{y \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right)$

where the notation is the same as in the previous equations, i.e. the concentration is denoted by y and the frequency by f(y), and μ and σ denote the mean and standard deviation of ln y. Note that there is no variable x in the equation: we consider the distribution of the values to be time-independent.
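A log-normal curve of this kind can be sketched from the standard log-normal density. The code below is an illustration, not the tool's implementation: the function names are hypothetical, the parameters are estimated as the mean and standard deviation of the log-transformed concentrations, and the scaling of the density to column heights (number of values times column width) is an assumption about how such a curve is usually overlaid on a frequency histogram.

```python
import math

def lognormal_pdf(y, mu, sigma):
    """Standard log-normal probability density at concentration y."""
    if y <= 0:
        return 0.0
    z = (math.log(y) - mu) / sigma
    return math.exp(-0.5 * z * z) / (y * sigma * math.sqrt(2 * math.pi))

def fit_lognormal(values):
    """Estimate mu and sigma as the mean and std. deviation of ln(values)."""
    logs = [math.log(v) for v in values]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs))
    return mu, sigma

def expected_count(y, mu, sigma, n_values, bin_width):
    """Density rescaled to histogram frequencies (assumed scaling)."""
    return n_values * bin_width * lognormal_pdf(y, mu, sigma)

# Hypothetical concentrations in ng/filter.
values = [0.8, 1.1, 1.5, 2.0, 2.4, 3.9]
mu, sigma = fit_lognormal(values)
curve = [expected_count(y, mu, sigma, len(values), 1.0) for y in (0.5, 1.5, 2.5, 3.5)]
```

The time of each measurement does not enter the computation, in line with the time-independent view of the distribution.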

Box & whisker plot

In the box & whisker plot, there are five lines corresponding to the values of interest. The concentration is now back on the y-axis; the dark violet line in the middle of each column depicts the median (2nd quartile) of the concentration, the box margins denote the 1st and 3rd quartiles, and the whiskers denote the minimum and maximum measured values (0th and 4th quartiles). As in the previous case, these statistics are time-independent:

$P(y < Q(p)) = p$, i.e. $Q(p) = F^{-1}(p)$

where Q(p) denotes the p-th quantile, i.e. the (4p)-th quartile, P denotes probability and F denotes the distribution function, obtained as the integral of the probability density function.
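The five values drawn in the box & whisker plot can be computed from a sample as sketched below. The function name and the data are hypothetical. Note that several conventions exist for estimating quartiles from a finite sample; the "inclusive" method of Python's `statistics.quantiles` is used here as an assumption, and the plotting tool may use a different one.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    # n=4 yields the three cut points between the four quarters of the data.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical concentrations in ng/filter; the order does not matter.
summary = five_number_summary([5, 1, 4, 2, 3])
print(summary)  # (1, 2.0, 3.0, 4.0, 5)
```

The first and last values correspond to the whiskers (0th and 4th quartiles), the second and fourth to the box margins, and the middle one to the median line.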


