1.
BMC Bioinformatics ; 25(1): 188, 2024 May 14.
Article in English | MEDLINE | ID: mdl-38745112

ABSTRACT

BACKGROUND: Microbiome dysbiosis has recently been associated with different diseases and disorders. In this context, machine learning (ML) approaches can be useful either to identify new patterns or to learn predictive models. However, the data fed to ML methods can be subject to different sampling, sequencing and preprocessing techniques. Each choice in the pipeline can lead to a different view (i.e., feature set) of the same individuals, which classical (single-view) ML approaches may fail to consider simultaneously. Moreover, some views may be incomplete, i.e., some individuals may be missing in some views, possibly due to the absence of some measurements or to the fact that some features are not available/applicable for all the individuals. Multi-view learning methods are a possible way to consider multiple feature sets for the same individuals, but most existing multi-view learning methods are limited to binary classification tasks or cannot work with incomplete views. RESULTS: We propose irBoost.SH, an extension of the multi-view boosting algorithm rBoost.SH, based on multi-armed bandits. irBoost.SH solves multi-class classification tasks and can analyze incomplete views. At each iteration, it identifies one winning view using adversarial multi-armed bandits and uses its predictions to update a shared instance weight distribution in a learning process based on boosting. In our experiments, performed on 5 multi-view microbiome datasets, the model learned by irBoost.SH always outperforms the best model learned from a single view, its closest competitor rBoost.SH, and the model learned by a multi-view approach based on feature concatenation, reaching an F1-score improvement of 11.8% in the prediction of autism spectrum disorder and of 114% in the prediction of colorectal cancer. CONCLUSIONS: The proposed method irBoost.SH exhibited outstanding performance in our experiments, also when compared with competitor approaches. The obtained results confirm that irBoost.SH can fruitfully be adopted for the analysis of microbiome data, due to its capability to simultaneously exploit multiple feature sets obtained through different sequencing and preprocessing pipelines.


Subject(s)
Algorithms , Machine Learning , Microbiota , Humans
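
The abstract of record 1 says that a winning view is identified at each iteration with adversarial multi-armed bandits. As a rough illustration only, the sketch below uses generic EXP3, a standard adversarial bandit algorithm, to choose among views. It is not the authors' irBoost.SH implementation: the function names, the exploration rate `gamma`, and the 0/1 reward (standing in for "the chosen view's prediction was correct") are all assumptions made for this example, and the boosting update of the shared instance weights is not modelled.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalised arm weights with a uniform exploration term."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights, probs, chosen, reward, gamma):
    """Return updated weights after observing `reward` in [0, 1] for `chosen`."""
    k = len(weights)
    estimated = reward / probs[chosen]  # importance-weighted reward estimate
    new = list(weights)
    new[chosen] *= math.exp(gamma * estimated / k)
    top = max(new)  # rescale so the weights cannot overflow
    return [w / top for w in new]


def run_view_selection(view_accuracy, rounds=2000, gamma=0.1, seed=0):
    """Simulate view selection; `view_accuracy` is a hypothetical per-view hit rate."""
    rng = random.Random(seed)
    weights = [1.0] * len(view_accuracy)
    pulls = [0] * len(view_accuracy)
    for _ in range(rounds):
        probs = exp3_probabilities(weights, gamma)
        chosen = rng.choices(range(len(weights)), weights=probs)[0]
        # Assumed reward: 1 if the chosen view classifies a sample correctly.
        reward = 1.0 if rng.random() < view_accuracy[chosen] else 0.0
        weights = exp3_update(weights, probs, chosen, reward, gamma)
        pulls[chosen] += 1
    return pulls


if __name__ == "__main__":
    # Three hypothetical views with made-up accuracies.
    print(run_view_selection([0.2, 0.9, 0.3]))
```

The uniform exploration term keeps every view's selection probability at or above `gamma / k`, so a view that performs poorly early on can still be chosen again later.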
2.
Sci Rep ; 13(1): 3205, 2023 02 24.
Article in English | MEDLINE | ID: mdl-36828900

ABSTRACT

Pollen monitoring has become data-intensive in recent years as real-time detectors are deployed to classify airborne pollen grains. Machine learning models, with a focus on deep learning, have an essential role in the pollen classification task. In this study we developed an explainable framework to unveil a deep learning model for pollen classification. The model works on data from a single-particle detector (Rapid-E) that records, for each particle, an optical fingerprint consisting of scattered light and laser-induced fluorescence. Morphological properties of a particle are sensed through the light scattering process, while chemical properties are encoded in the fluorescence spectrum and fluorescence lifetime induced by a high-resolution laser. By utilizing these three data modalities (scattering, spectrum, and lifetime), deep learning-based models with millions of parameters are trained to distinguish different pollen classes, but a proper understanding of the decisions of such black-box models demands additional methods. Our study provides the first results of applying explainable artificial intelligence (xAI) methodology to the pollen classification model. The extracted knowledge on the important features that contribute to predicting particular pollen classes is further examined from the perspective of domain knowledge and compared to available reference data on pollen size, shape, and laboratory spectrofluorometer measurements.


Subject(s)
Artificial Intelligence , Deep Learning , Spectrometry, Fluorescence , Data Collection , Pollen
3.
Sci Total Environ ; 851(Pt 2): 158234, 2022 Dec 10.
Article in English | MEDLINE | ID: mdl-36007635

ABSTRACT

Pollen is the most common cause of seasonal allergies, affecting over 33 % of the European population, even when considering only grasses. Informing the population and clinicians in real time about the actual presence of pollen in the atmosphere is essential to reduce its harmful health and economic impact. Thus, there is a growing network of automatic particle analysers, and the reproducibility and transferability of implemented models are recommended, since a reference dataset for local pollen of interest needs to be collected for each device to classify pollen, which is complex and time-consuming. Therefore, it would be beneficial to incorporate reference datasets collected from other devices in different locations. However, it must be considered that laser-induced data are prone to device-specific noise due to laser and detector sensitivity. This study collected data from two Rapid-E bioaerosol identifiers in Serbia and Italy and implemented a multi-modal convolutional neural network for pollen classification. We showed that models lost performance when trained on data from one device and tested on another, not only in terms of recognition ability but also in comparison with the manual measurements from Hirst-type traps. To enable pollen classification with just one model in both study locations, we first included the missing pollen classes in the dataset from the other study location, but this showed poor results, implying that data of one pollen class from different devices differ more than data of different pollen classes from one device. Combining all available reference data in a single model enabled the classification of a higher number of pollen classes in both study locations. Finally, we implemented a domain adaptation method, which improved the recognition ability and the correlations of transferred models only for several pollen classes.


Subject(s)
Neural Networks, Computer , Pollen , Reproducibility of Results , Atmosphere , Poaceae , Allergens
4.
Sci Rep ; 11(1): 23109, 2021 11 30.
Article in English | MEDLINE | ID: mdl-34848748

ABSTRACT

Tomato is an important commercial product which is perishable by nature and highly susceptible to fungal incidence once it is harvested. Not all tomatoes are equally vulnerable to pathogenic fungi, and early detection of the vulnerable ones can help in taking timely preventive actions, ranging from isolating tomato batches to adjusting storage conditions, but also in making the right business decisions, such as dynamic pricing based on quality or better shelf-life estimates. More importantly, early detection of vulnerable produce can help in taking timely actions to minimize potential post-harvest losses. This paper investigates near-infrared (NIR) hyperspectral imaging (1000-1700 nm) and machine learning to build models that automatically predict the susceptibility of sepals of recently harvested tomatoes to future fungal infections. Hyperspectral images of newly harvested tomatoes (cultivar Brioso) from 5 different growers were acquired before the onset of any visible fungal infection. After imaging, the tomatoes were placed under controlled conditions suited for fungal germination and growth for a 4-day period, and then imaged using normal color cameras. All sepals in the color images were ranked for fungal severity using crowdsourcing, and the final severity of each sepal was fused using principal component analysis. A novel hyperspectral data processing pipeline is presented which was used to automatically segment the tomato sepals from spectral images with multiple tomatoes connected via a truss. The key modelling question addressed in this research is whether there is a correlation between the hyperspectral data captured at harvest and the fungal infection observed 4 days later. Using 10-fold and group k-fold cross-validation, XG-Boost and Random Forest based regression models were trained on the features derived from the hyperspectral data corresponding to each sepal in the training set and tested on a hold-out test set. The best model achieved a Pearson correlation of 0.837, showing that there is a strong linear correlation between the NIR spectra and the future fungal severity of the sepal. The sepal-specific predictions were aggregated to predict the susceptibility of individual tomatoes, and a correlation of 0.92 was found. Besides modelling, the focus is also on model interpretation, particularly to understand which spectral features are most relevant to model prediction. Two approaches to model interpretation were explored, feature importance and SHAP (SHapley Additive exPlanations), resulting in similar conclusions that the NIR range between 1390-1420 nm contributes most to the model's final decision.


Subject(s)
Plant Diseases/genetics , Plant Diseases/microbiology , Solanum lycopersicum/microbiology , Spectroscopy, Near-Infrared/methods , Algorithms , Calibration , Crops, Agricultural , Deep Learning , Fruit/microbiology , Solanum lycopersicum/genetics , Machine Learning , Microbiology , Pattern Recognition, Automated , Plant Diseases/prevention & control , Principal Component Analysis , Reproducibility of Results , Software
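
The abstract of record 4 evaluates its regression models with group k-fold cross-validation and reports Pearson correlations. The sketch below shows, in plain Python, how those two pieces can be computed. It is an illustration under assumptions, not the paper's pipeline: the abstract does not say which variable defines the groups (grower is used here as a guess), and the function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs so that no group is split across folds."""
    unique = sorted(set(groups))
    for fold in range(k):
        # Assign whole groups to folds round-robin.
        held_out = {g for i, g in enumerate(unique) if i % k == fold}
        test = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, test


if __name__ == "__main__":
    # Hypothetical grower label for each sepal sample.
    growers = ["g1", "g1", "g2", "g3", "g3", "g2"]
    for train, test in group_k_fold(growers, k=3):
        print("train:", train, "test:", test)
    # Toy predicted vs. observed severities (made-up numbers).
    print(pearson([0.1, 0.4, 0.5, 0.9], [0.2, 0.3, 0.6, 0.8]))
```

Keeping all samples from one group in the same fold prevents a model from being tested on samples that are closely related to ones it was trained on.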
5.
Stud Health Technol Inform ; 285: 165-170, 2021 Oct 27.
Article in English | MEDLINE | ID: mdl-34734869

ABSTRACT

In this study, we investigate faecal microbiota composition in an attempt to evaluate the performance of classification algorithms in identifying Inflammatory Bowel Disease (IBD) and its two types: Crohn's disease (CD) and ulcerative colitis (UC). Of the many algorithms investigated, a random forest (RF) classifier was selected for detailed evaluation in a three-class (CD versus UC versus nonIBD) classification task and two binary (nonIBD versus IBD and CD versus UC) classification tasks. We dealt with class imbalance and performed an extensive parameter search, dimensionality reduction and two-level classification. In three-class classification, our best model reaches an F1 score of 91% on average, which confirms the strong connection between IBD and the gastrointestinal microbiome. Among the most important features in three-class classification are the species Staphylococcus hominis, Porphyromonas endodontalis, Slackia piriformis and the genus Bacteroidetes.


Subject(s)
Colitis, Ulcerative , Crohn Disease , Gastrointestinal Microbiome , Inflammatory Bowel Diseases , Actinobacteria , Bacteroidetes , Colitis, Ulcerative/diagnosis , Colitis, Ulcerative/microbiology , Crohn Disease/diagnosis , Crohn Disease/microbiology , Humans , Inflammatory Bowel Diseases/diagnosis , Inflammatory Bowel Diseases/microbiology , Machine Learning , Porphyromonas endodontalis , Staphylococcus hominis
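
The abstract of record 5 reports an average F1 score for a three-class task. One common way to obtain a single F1 figure for several classes is macro-averaging (the unweighted mean of per-class F1 scores). The sketch below computes it in plain Python. Whether the paper averaged over classes, folds, or both is not stated in the abstract, so this is only one possible reading, and the labels in the usage example are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["CD", "CD", "UC", "UC", "nonIBD", "nonIBD"]
    predicted = ["CD", "UC", "UC", "UC", "nonIBD", "nonIBD"]
    print(round(macro_f1(truth, predicted), 4))
```

Macro-averaging gives each class equal weight regardless of its size, which matters when classes are imbalanced, as the abstract notes they were.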
6.
Sci Rep ; 10(1): 3421, 2020 02 25.
Article in English | MEDLINE | ID: mdl-32099053

ABSTRACT

In this study we used meteorological parameters and predictive modelling interpreted by model explanation to develop stress metrics that indicate the presence of drought and heat stress in a specific environment. We started from the extreme temperature and precipitation indices, modified some of them and introduced additional drought indices relevant to the analysis. Based on maize's sensitivity to stress, the growing season was divided into four stages. The features were calculated throughout the growing season and split into two groups, one for drought and the other for heat stress. The generated meteorological features were combined with soil features and fed to a random forest regression model for yield prediction. Model explanation gave us the contribution of features to yield decrease, from which we estimated the total amount of stress in the environments, which represents a new environmental index. Using this index we ranked the environments according to the level of stress. More than 2400 hybrids were tested across the environments where they were grown and, based on yield stability, they were marked as either tolerant or susceptible to heat, drought or combined heat and drought stress. The presented methodology and results were produced within the Syngenta Crop Challenge 2019.


Subject(s)
Acclimatization , Genotype , Heat-Shock Response , Hybridization, Genetic , Models, Biological , Zea mays , Crop Production , Meteorology , Plant Leaves/genetics , Plant Leaves/growth & development , Zea mays/genetics , Zea mays/growth & development
7.
PLoS One ; 12(9): e0184198, 2017.
Article in English | MEDLINE | ID: mdl-28863173

ABSTRACT

The aim of this work was to develop a method for selecting optimal soybean varieties for the American Midwest using data analytics. We extracted knowledge about 174 varieties from the dataset, which contained information about weather, soil, yield and regional statistical parameters. Next, we predicted the yield of each variety in each of 6,490 observed subregions of the Midwest. Furthermore, yield was predicted for all the possible weather scenarios approximated by 15 historical weather instances contained in the dataset. Using predicted yields and the covariance between varieties across different weather scenarios, we performed portfolio optimisation. In this way, for each subregion, we obtained a selection of varieties that proved superior to others in terms of the amount and stability of yield. According to the rules of the Syngenta Crop Challenge, for which this research was conducted, we aggregated the results across all subregions and selected up to five soybean varieties that should be distributed across the network of seed retailers. The work presented in this paper was the winning solution for the Syngenta Crop Challenge 2017.


Subject(s)
Crops, Agricultural , Glycine max/genetics , Weather , Agriculture/methods , Climate Change , Midwestern United States , Models, Statistical , Regression Analysis , Seeds/genetics , Uncertainty
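
The abstract of record 7 describes portfolio optimisation over predicted yields and the covariance between varieties across weather scenarios. The abstract does not give the formulation, so the sketch below is a textbook mean-variance illustration under simplifying assumptions: varieties are mixed in equal shares, candidate mixes are enumerated by brute force, and the score is expected yield minus a `risk_aversion` multiple of the variance. The function names, the `risk_aversion` parameter and the yield figures are all hypothetical.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance of two yield series over the same scenarios."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def portfolio_stats(yields, chosen):
    """Expected yield and variance of an equal-share mix of `chosen` varieties."""
    w = 1.0 / len(chosen)
    expected = sum(w * mean(yields[v]) for v in chosen)
    variance = sum(
        w * w * covariance(yields[a], yields[b]) for a in chosen for b in chosen
    )
    return expected, variance


def best_portfolio(yields, max_size, risk_aversion):
    """Return the mix of up to `max_size` varieties with the best score."""
    best = None
    for size in range(1, max_size + 1):
        for chosen in combinations(sorted(yields), size):
            expected, variance = portfolio_stats(yields, chosen)
            score = expected - risk_aversion * variance
            if best is None or score > best[0]:
                best = (score, chosen)
    return best[1]


if __name__ == "__main__":
    # Made-up predicted yields of three varieties under three weather scenarios.
    predicted = {"A": [50, 60, 40], "B": [45, 35, 55], "C": [30, 30, 30]}
    print(best_portfolio(predicted, max_size=3, risk_aversion=1.0))
```

In the toy data, varieties A and B respond to weather in opposite directions, so mixing them cancels most of the variance; this is the kind of effect that using covariance between varieties is meant to capture.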
8.
Stud Health Technol Inform ; 224: 181-3, 2016.
Article in English | MEDLINE | ID: mdl-27225576

ABSTRACT

Lumbar disc herniation (LDH) is the most common disease requiring surgical intervention among the working population. This study aims to predict the return to work after operative treatment of LDH, based on an observational study including 153 patients. The classification problem was approached using decision trees (DT), support vector machines (SVM) and a multilayer perceptron (MLP), combined with the RELIEF algorithm for feature selection. MLP provided the best recall, 0.86, for the class of patients not returning to work, which, combined with the selected features, enables early identification of and personalized targeted interventions for subjects at risk of prolonged disability. The predictive modeling pointed to the most decisive risk factors in prolongation of work absence: psychosocial factors, mobility of the spine, structural changes of facet joints, and professional factors including standing, sitting and microclimate.


Subject(s)
Diskectomy/methods , Intervertebral Disc Displacement/surgery , Return to Work , Treatment Outcome , Algorithms , Decision Trees , Female , Humans , Male , Microsurgery/methods , Models, Theoretical , Occupational Medicine , Serbia , Support Vector Machine
9.
Sci Rep ; 6: 19342, 2016 Jan 13.
Article in English | MEDLINE | ID: mdl-26758042

ABSTRACT

An increasing amount of geo-referenced mobile phone data enables the identification of behavioral patterns, habits and movements of people. With this data, we can extract knowledge potentially useful for many applications, including the one tackled in this study - understanding the spatial variation of epidemics. We explored the datasets collected by a cell phone service provider and linked them to spatial HIV prevalence rates estimated from publicly available surveys. For that purpose, 224 features were extracted from mobility and connectivity traces and related to the level of HIV epidemic in 50 Ivory Coast departments. By means of regression models, we evaluated the predictive ability of the extracted features. Several models predicted HIV prevalence values that are highly correlated (>0.7) with actual values. Through contribution analysis we identified key elements that correlate with the rate of infections and could serve as a proxy for epidemic monitoring. Our findings indicate that night connectivity and activity, the spatial area covered by users and overall migrations are strongly linked to HIV. By visualizing the communication and mobility flows, we strove to explain the spatial structure of epidemics. We discovered that strong ties and hubs in communication and mobility align with HIV hot spots.


Subject(s)
Cell Phone , HIV Infections/epidemiology , Population Surveillance , Spatial Analysis , Adolescent , Adult , Geography, Medical , Humans , Middle Aged , Prevalence , Serbia/epidemiology , Young Adult
10.
IEEE J Biomed Health Inform ; 19(2): 698-708, 2015 Mar.
Article in English | MEDLINE | ID: mdl-24733033

ABSTRACT

Recent developments in molecular biology and techniques for genome-wide data acquisition have resulted in an abundance of data to profile genes and predict their function. These datasets may come from diverse sources, and it is an open question how to address them jointly and fuse them into a joint prediction model. A prevailing technique to identify groups of related genes that exhibit similar profiles is profile-based clustering. Cluster inference may benefit from consensus across different clustering models. In this paper, we propose a technique that develops separate gene clusters from each of the available data sources and then fuses them by means of nonnegative matrix factorization. We use gene profile data on the budding yeast S. cerevisiae to demonstrate that this approach can successfully integrate heterogeneous datasets and yield high-quality clusters that could otherwise not be inferred by simply merging the gene profiles prior to clustering.


Subject(s)
Computational Biology/methods , Gene Expression Profiling/methods , Models, Statistical , Algorithms , Cluster Analysis , Databases, Genetic , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism