From
01 September 2004
to
31 May 2007

Détection de valeurs aberrantes dans des mélanges de distributions dissymétriques pour des ensembles de données avec contraintes spatiales

Outlier detection in mixtures of asymmetric distributions for datasets with spatial constraints

Context

Over the last decade, areas of research such as precision farming and geographical information systems (GIS) have grown very rapidly. As these techniques have developed, laboratories have faced increased demand for all kinds of analyses, covering for example the chemical composition of soils, nitrate or heavy metal content, and wheat quality. Analytical laboratories must therefore manage large volumes of data, and automating the collection and recording of observations raises the issue of the quality of the information generated. Automated data acquisition makes it increasingly difficult for the user to judge the significance and scale of the data, and in some cases to assess whether the data are appropriate. Despite all the precautions that may be taken to standardise the data, errors can occur with respect to units, orders of magnitude, etc. Such problems create a need to detect outliers in databases. The study of outliers is an informal process of data examination carried out before a more detailed analysis (statistical processing, cartographic representation, etc.) with clearly defined aims. Methods for detecting outliers are therefore essential in database management, especially when integrating new observations, in order to build consistent sets of information. This work is part of a line of research on outlier detection methods that can be applied operationally to databases containing geographical information.

Objectives

The general aim of this study is to propose an operational method for detecting outliers that can be applied to large sets of spatially referenced data. The method must support statistically justified acceptance or rejection of data while taking into account spatial consistency, which may be affected, for example, by the presence of soil associations in the district. It must also be able to check data samples drawn from very asymmetric distributions. To achieve this aim, a method has to be devised that determines optimally the limiting values beyond which a value submitted for inclusion in a database is regarded as aberrant, taking the spatial component into account. Another aim is to set up a frame of reference for each specific geographical unit, such as a district or a grouping of neighbouring districts. A spatial clustering of districts based on distribution parameters would allow districts with similar features to be merged. These features should correspond to analogous pedological zones. Parameters calculated from the clustered zones will provide a robust system of data validation with spatial constraints. The Unit of Biometry has managed the RéQuaSud database since 1994. The method developed will be applied to improve the quality of the information provided by this database.
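The idea of limiting values computed per geographical unit can be illustrated with a minimal sketch. It is not the method developed in the project: it assumes strictly positive values, reduces right skew with a log transform, and sets limits at the median plus or minus `k` scaled median absolute deviations (MAD) within each district. The function names, the `(district, value)` record layout, and the default `k = 3.0` are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict
from statistics import median


def robust_limits(values, k=3.0):
    """Return (lower, upper) limits on the original scale.

    Assumption: all values are > 0, so the log transform is defined.
    Limits are median +/- k * scaled MAD, computed on the log scale.
    """
    logs = [math.log(v) for v in values]
    centre = median(logs)
    mad = median([abs(x - centre) for x in logs])
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    scale = 1.4826 * mad
    return math.exp(centre - k * scale), math.exp(centre + k * scale)


def flag_by_district(records, k=3.0):
    """Flag values outside the limits of their own district.

    records: iterable of (district, value) pairs (hypothetical layout).
    Returns the list of (district, value) pairs regarded as aberrant.
    """
    groups = defaultdict(list)
    for district, value in records:
        groups[district].append(value)

    flagged = []
    for district, values in groups.items():
        lower, upper = robust_limits(values, k)
        flagged.extend((district, v) for v in values if v < lower or v > upper)
    return flagged


# Made-up example data: two districts with different typical levels.
records = [("A", v) for v in (10, 12, 11, 13, 12, 11, 1200)]
records += [("B", v) for v in (100, 110, 105, 95, 102, 0.1)]
suspect = flag_by_district(records)
```

Because the limits are computed separately for each district, a value that is ordinary in district B (around 100) is not judged against the much lower levels of district A. The sketch breaks down when more than half of a district's values are identical (the MAD is then zero) and when a district mixes several soil associations, which is one of the problems the project sets out to address.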

Description of tasks

  • The chemical composition of soils is an important part of the information held in spatially referenced databases. Such analyses can be referenced accurately to the plot from which the sample was taken (GPS), but in most cases they are referenced to the district in which the plot is situated. In the latter case, account has to be taken of the problems generally associated with spatial constraints, due to the soil associations that can be found in the district.
  • Frequency distributions of the element values studied in soil analysis are right-skewed and heavy-tailed. The presence of a large number of very high or extreme values on the right of the distribution makes it difficult to estimate the parameters required to perform outlier detection tests.
  • Another problem is the possible mixing of several asymmetric distributions within the database, due to the presence of various soil associations in a single district.
  • Outliers on the left of the distributions also require close attention. Most of the existing work on extreme values concentrates only on the right of asymmetric distributions.
  • The final point to be considered is the ease of use of the proposed methods. Methods for detecting outliers need to be quick and easy to apply, in view of the volume of data in databases.
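The points about skewness and left-hand outliers can be made concrete with a small sketch. It uses a "double MAD" rule, a generic robust technique that is not taken from the project: the spread is estimated separately below and above the median, so a long right tail does not widen the lower limit. The function names, the default `k = 3.0`, and the sample values are assumptions made for illustration only.

```python
from statistics import median


def double_mad_limits(values, k=3.0):
    """Return (lower, upper) limits with a separate scale on each side.

    The left scale uses only deviations of values at or below the median,
    the right scale only deviations of values at or above it.
    """
    m = median(values)
    left_dev = [m - v for v in values if v <= m]
    right_dev = [v - m for v in values if v >= m]
    # 1.4826 is the usual consistency constant of the MAD for normal data.
    left_scale = 1.4826 * median(left_dev)
    right_scale = 1.4826 * median(right_dev)
    return m - k * left_scale, m + k * right_scale


def split_outliers(values, k=3.0):
    """Return (low, high): values below the lower and above the upper limit."""
    lower, upper = double_mad_limits(values, k)
    low = sorted(v for v in values if v < lower)
    high = sorted(v for v in values if v > upper)
    return low, high


# Made-up right-skewed sample with one suspiciously small value.
sample = [1, 8, 9, 9, 10, 10, 10, 11, 12, 14, 18, 25, 120]
low, high = split_outliers(sample)
```

The rule involves only medians, which keeps it quick to apply to large volumes of data. It does not handle mixtures of distributions within one district, and either scale is zero when more than half of the values on that side equal the median.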

Partners

  • Prof. J.J. Claustriaux, Gembloux Agricultural University, Unit of Statistics and Computer Science
  • Prof. J. Beirlant, Catholic University Leuven, Department of Mathematics

Funding

  • CRA-W - Walloon Agricultural Research Centre