Data pre-processing to improve the mining of large feed databases.

Maroto-Molina, F; Gómez-Cabrera, A; Guerrero-Ginel, J E; Garrido-Varo, A; Sauvant, D; Tran, G; Heuzé, V; Pérez-Marín, D C.

Animal ; 7(7): 1128-36, 2013 Jul.

Article in English | MEDLINE | ID: mdl-23473337

ABSTRACT

The information stored in animal feed databases is highly variable, in terms of both provenance and quality; therefore, data pre-processing is essential to ensure reliable results. Yet pre-processing tends, at best, to be unsystematic; at worst, it may be ignored altogether. This paper sought to develop a systematic approach to the various stages involved in pre-processing to improve feed database outputs. The database used contained analytical and nutritional data on roughly 20 000 alfalfa samples. A range of techniques was examined for integrating data from different sources, for detecting duplicates and, particularly, for detecting outliers. Special attention was paid to the comparison of univariate and multivariate solutions. Major issues relating to the heterogeneous nature of the data contained in this database were explored, the observed outliers were characterized and ad hoc routines were designed for error control. Finally, a heuristic diagram was designed to systematize the various aspects involved in the detection and management of outliers and errors.
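
The abstract does not say which univariate or multivariate techniques were compared. As a minimal sketch of why the two can disagree, the Python below assumes a per-attribute z-score rule for the univariate case and a squared Mahalanobis distance for the multivariate case. The data are synthetic, and the function names and thresholds are illustrative choices, not the authors' routines.

```python
import math
from statistics import mean, stdev

def univariate_flags(rows, z_thresh=3.0):
    """Flag a row if any single attribute lies more than z_thresh SDs from its mean."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [any(abs(v - m) / s > z_thresh for v, m, s in zip(row, mus, sds))
            for row in rows]

def mahalanobis_flags(rows, d2_thresh):
    """Flag a row if its squared Mahalanobis distance exceeds d2_thresh (two attributes only)."""
    xs, ys = zip(*rows)
    n = len(rows)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    flags = []
    for x, y in rows:
        dx, dy = x - mx, y - my
        # d^2 = diff^T * inverse(cov) * diff, written out for the 2x2 case
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        flags.append(d2 > d2_thresh)
    return flags

# Synthetic data: two strongly correlated attributes (not real feed values).
rows = []
for i in range(200):
    x = -2 + 4 * i / 199
    rows.append((x, x + 0.2 * math.sin(i)))
# A sample that is unremarkable on each attribute alone but breaks the correlation.
rows.append((1.5, -1.5))

uni = univariate_flags(rows)
# 13.8 approximates the 0.999 quantile of a chi-square distribution with 2 df.
multi = mahalanobis_flags(rows, d2_thresh=13.8)
print("univariate flags:", sum(uni), "| multivariate flags:", sum(multi))
```

In this sketch the appended sample passes the z-score check on both attributes, and only the multivariate distance flags it, because that distance accounts for the correlation between the attributes.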

Subject(s)

Animal Feed, Animal Husbandry/methods, Data Mining/methods, Databases, Factual, Data Interpretation, Statistical, Medicago sativa

Handling of missing data to improve the mining of large feed databases.

Maroto-Molina, F; Gómez-Cabrera, A; Guerrero-Ginel, J E; Garrido-Varo, A; Sauvant, D; Tran, G; Heuzé, V; Pérez-Marín, D C.

J Anim Sci ; 91(1): 491-500, 2013 Jan.

Article in English | MEDLINE | ID: mdl-23048146

ABSTRACT

Feed databases often have missing data. Despite their potentially major effect on data analysis (e.g., as a source of biased results and loss of statistical power), database managers and nutrition researchers have paid little attention to missing data. This study evaluated various methods of handling missing data using mining outputs from a database containing data on chemical composition and nutritive value for 18,864 alfalfa samples. A complete reference dataset was obtained comprising the 2,303 cases with no missing data for the attributes CP, crude fiber (CF), NDF, ADF and ADL. This dataset was used to simulate 2 types of missing data (at random and not at random), each with 2 loss intensities (33 and 66%), thus yielding a total of 4 incomplete datasets. Missing data from these datasets were handled using 2 deletion methods and 4 imputation methods, and outputs in terms of the identification and typing of alfalfa (using ANOVA and descriptive statistics) and of correlations between attributes (using regressions) were compared with outputs from the complete dataset. Imputation methods, particularly model-based versions, were found to perform better than deletion methods in terms of maximizing information use and minimizing bias, although the extent of the differences between methods depended on the type of missing data. The best approximation to the uncertainty value was provided by multiple imputation methods. It was concluded that the choice of the most suitable method for handling missing data depended both on the type of missing data and on the purpose of data analysis.
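
The abstract does not name the individual deletion and imputation methods. The Python sketch below assumes one of each, listwise deletion and single regression imputation, and mimics the study design on synthetic data: a complete dataset, a simulated loss "not at random" of about 33%, and a comparison of each method's estimate against the complete-data value. The attribute values and the linear relation between them are invented for illustration and are not taken from the study.

```python
import math
from statistics import mean

# Synthetic, illustrative values only: a predictor (labelled cp) and a
# correlated attribute (labelled ndf) with a small deterministic disturbance.
n = 300
cp = [15 + 10 * i / (n - 1) for i in range(n)]
ndf = [60 - 1.2 * x + 1.5 * math.sin(i) for i, x in enumerate(cp)]
true_mean = mean(ndf)  # reference value from the complete dataset

# Simulate loss "not at random": blank out the highest third of ndf values.
cutoff = sorted(ndf)[n - n // 3]
observed = [y if y < cutoff else None for y in ndf]

# Method 1: listwise deletion, i.e. keep only the complete cases.
complete = [(x, y) for x, y in zip(cp, observed) if y is not None]
deletion_mean = mean(y for _, y in complete)

# Method 2: regression imputation, i.e. fit ndf ~ cp on the complete cases
# by least squares and predict the missing ndf values from cp.
mx = mean(x for x, _ in complete)
my = deletion_mean
slope = (sum((x - mx) * (y - my) for x, y in complete)
         / sum((x - mx) ** 2 for x, _ in complete))
intercept = my - slope * mx
imputed = [y if y is not None else intercept + slope * x
           for x, y in zip(cp, observed)]
imputed_mean = mean(imputed)

print(f"complete data: {true_mean:.2f}")
print(f"listwise deletion: {deletion_mean:.2f}")
print(f"regression imputation: {imputed_mean:.2f}")
```

Single imputation of this kind recovers the mean in this toy setting but understates uncertainty, which is the gap that multiple imputation, the approach the abstract reports as best for the uncertainty value, is designed to address.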

Subject(s)

Animal Feed, Data Mining/methods, Databases, Factual, Data Interpretation, Statistical
