Search | VHL Regional Portal

A machine learning-enabled open biodata resource inventory from the scientific literature.

Imker, Heidi J; Schackart, Kenneth E; Istrate, Ana-Maria; Cook, Charles E.

PLoS One ; 18(11): e0294812, 2023.

Article in English | MEDLINE | ID: mdl-38015968

ABSTRACT

Modern biological research depends on data resources. These resources archive difficult-to-reproduce data and provide added-value aggregation, curation, and analyses. Collectively, they constitute a global infrastructure of biodata resources. While the organic proliferation of biodata resources has enabled incredible research, sustained support for the individual resources that make up this distributed infrastructure is a challenge. The Global Biodata Coalition (GBC) was established by research funders in part to aid in developing sustainable funding strategies for biodata resources. An important component of this work is understanding the scope of the resource infrastructure; how many biodata resources there are, where they are, and how they are supported. Existing registries require self-registration and/or extensive curation, and we sought to develop a method for assembling a global inventory of biodata resources that could be periodically updated with minimal human intervention. The approach we developed identifies biodata resources using open data from the scientific literature. Specifically, we used a machine learning-enabled natural language processing approach to identify biodata resources from titles and abstracts of life sciences publications contained in Europe PMC. Pretrained BERT (Bidirectional Encoder Representations from Transformers) models were fine-tuned to classify publications as describing a biodata resource or not and to predict the resource name using named entity recognition. To improve the quality of the resulting inventory, low-confidence predictions and potential duplicates were manually reviewed. Further information about the resources were then obtained using article metadata, such as funder and geolocation information. These efforts yielded an inventory of 3112 unique biodata resources based on articles published from 2011-2021. The code was developed to facilitate reuse and includes automated pipelines. All products of this effort are released under permissive licensing, including the biodata resource inventory itself (CC0) and all associated code (BSD/MIT).

Subject(s)

Biological Science Disciplines , Publications , Humans , Archives , Electric Power Supplies , Machine Learning

Evaluation of computational phage detection tools for metagenomic datasets.

Schackart, Kenneth E; Graham, Jessica B; Ponsero, Alise J; Hurwitz, Bonnie L.

Front Microbiol ; 14: 1078760, 2023.

Article in English | MEDLINE | ID: mdl-36760501

ABSTRACT

Introduction: As new computational tools for detecting phage in metagenomes are being rapidly developed, a critical need has emerged to develop systematic benchmarks. Methods: In this study, we surveyed 19 metagenomic phage detection tools, 9 of which could be installed and run at scale. Those 9 tools were assessed on several benchmark challenges. Fragmented reference genomes are used to assess the effects of fragment length, low viral content, phage taxonomy, robustness to eukaryotic contamination, and computational resource usage. Simulated metagenomes are used to assess the effects of sequencing and assembly quality on the tool performances. Finally, real human gut metagenomes and viromes are used to assess the differences and similarities in the phage communities predicted by the tools. Results: We find that the various tools yield strikingly different results. Generally, tools that use a homology approach (VirSorter, MARVEL, viralVerify, VIBRANT, and VirSorter2) demonstrate low false positive rates and robustness to eukaryotic contamination. Conversely, tools that use a sequence composition approach (VirFinder, DeepVirFinder, Seeker), and MetaPhinder, have higher sensitivity, including to phages with less representation in reference databases. These differences led to widely differing predicted phage communities in human gut metagenomes, with nearly 80% of contigs being marked as phage by at least one tool and a maximum overlap of 38.8% between any two tools. While the results were more consistent among the tools on viromes, the differences in results were still significant, with a maximum overlap of 60.65%. Discussion: Importantly, the benchmark datasets developed in this study are publicly available and reusable to enable the future comparability of new tools developed.

Machine Learning Enhances the Performance of Bioreceptor-Free Biosensors.

Schackart, Kenneth E; Yoon, Jeong-Yeol.

Sensors (Basel) ; 21(16)2021 Aug 17.

Article in English | MEDLINE | ID: mdl-34450960

ABSTRACT

Since their inception, biosensors have frequently employed simple regression models to calculate analyte composition based on the biosensor's signal magnitude. Traditionally, bioreceptors provide excellent sensitivity and specificity to the biosensor. Increasingly, however, bioreceptor-free biosensors have been developed for a wide range of applications. Without a bioreceptor, maintaining strong specificity and a low limit of detection have become the major challenge. Machine learning (ML) has been introduced to improve the performance of these biosensors, effectively replacing the bioreceptor with modeling to gain specificity. Here, we present how ML has been used to enhance the performance of these bioreceptor-free biosensors. Particularly, we discuss how ML has been used for imaging, Enose and Etongue, and surface-enhanced Raman spectroscopy (SERS) biosensors. Notably, principal component analysis (PCA) combined with support vector machine (SVM) and various artificial neural network (ANN) algorithms have shown outstanding performance in a variety of tasks. We anticipate that ML will continue to improve the performance of bioreceptor-free biosensors, especially with the prospects of sharing trained models and cloud computing for mobile computation. To facilitate this, the biosensing community would benefit from increased contributions to open-access data repositories for biosensor data.

Subject(s)

Biosensing Techniques , Machine Learning , Neural Networks, Computer , Spectrum Analysis, Raman , Support Vector Machine

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL