Results 1 - 20 of 288
1.
Database (Oxford) ; 2024; 2024 May 28.
Article in English | MEDLINE | ID: mdl-38805753

ABSTRACT

While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support the biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important for understanding disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared towards training tools that help in the curation of scientific literature. To overcome the hurdles posed by the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, providing assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. These text fragments were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, was used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset of 8442 fragments, 794 of them positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, a significant improvement over the non-fine-tuned model that underlines the relevance of the DUVEL dataset. This study shows how AL can play an important role in the creation of bioRE datasets relevant for biomedical curation applications. DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
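
The fine-tuning setup described here follows the standard Hugging Face sequence-classification recipe. Below is a minimal sketch of it, assuming the datasets and transformers libraries; the column names ("text", "label") and the exact BiomedBERT checkpoint identifier are assumptions to be checked against the dataset card.

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # DUVEL is hosted at https://huggingface.co/datasets/cnachteg/duvel
    dataset = load_dataset("cnachteg/duvel")

    # Checkpoint name is an assumption; any of the four evaluated models
    # (BiomedBERT, BiomedBERT-large, BioLinkBERT, BioM-BERT) could be used.
    checkpoint = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    def tokenize(batch):
        # "text" is assumed to hold the gene-variant-gene-variant fragment
        return tokenizer(batch["text"], truncation=True, max_length=512)

    encoded = dataset.map(tokenize, batched=True)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="duvel-biomedbert", num_train_epochs=3),
        train_dataset=encoded["train"],
        eval_dataset=encoded.get("validation", encoded.get("test")),
        tokenizer=tokenizer,
    )
    trainer.train()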


Subject(s)
Supervised Machine Learning , Humans , Data Mining/methods , Data Curation/methods , Databases, Genetic
2.
Database (Oxford) ; 2024; 2024 May 15.
Article in English | MEDLINE | ID: mdl-38748636

ABSTRACT

Breast cancer is notorious for its high mortality and heterogeneity, which result in different therapeutic responses. Classical biomarkers have been identified and successfully commercialized to predict the outcome of breast cancer patients. With the development of sequencing techniques, a growing number of biomarkers, including non-coding RNAs, have been reported as prognostic markers for breast cancer. However, there are currently no databases dedicated to the curation and characterization of prognostic markers for breast cancer. Therefore, we constructed a curated database of prognostic markers of breast cancer (PMBC). PMBC consists of 1070 markers covering mRNAs, lncRNAs, miRNAs and circRNAs. These markers are enriched in various cancer- and epithelial-related functions, including mitogen-activated protein kinase signaling. We mapped the prognostic markers onto the ceRNA network from starBase. The lncRNA NEAT1 competes with 11 RNAs, including lncRNAs and mRNAs. The majority of the ceRNAs of ABAT are pseudogenes. Topology analysis of the ceRNA network reveals that known prognostic RNAs have higher closeness than random RNAs. Among all the biomarkers, prognostic lncRNAs have a higher degree, while prognostic mRNAs have significantly higher closeness, than random RNAs. These results indicate that lncRNAs play important roles in maintaining the interactions between lncRNAs and their ceRNAs, which might be used as a characteristic to prioritize prognostic lncRNAs based on the ceRNA network. PMBC provides a user-friendly interface with detailed information about individual prognostic markers, which will facilitate the precision treatment of breast cancer. PMBC is available at the following URL: http://www.pmbreastcancer.com/.
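
The degree/closeness comparison described above can be reproduced on any ceRNA edge list with standard graph tooling. A minimal sketch with networkx, where the edge-list file and marker-list file names are illustrative placeholders:

    import random
    import networkx as nx

    # Build the ceRNA network from a tab-separated edge list of RNA pairs.
    G = nx.read_edgelist("cerna_edges.tsv", delimiter="\t")
    prognostic = {line.strip() for line in open("pmbc_markers.txt")}

    closeness = nx.closeness_centrality(G)
    markers = [n for n in G if n in prognostic]
    background = random.sample(list(G.nodes), len(markers))

    mean = lambda xs: sum(xs) / len(xs)
    print("prognostic closeness:", mean([closeness[n] for n in markers]))
    print("random closeness:", mean([closeness[n] for n in background]))
    print("prognostic degree:", mean([G.degree(n) for n in markers]))
    print("random degree:", mean([G.degree(n) for n in background]))

A real analysis would repeat the random sampling many times to obtain an empirical p-value rather than a single background draw.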


Subject(s)
Biomarkers, Tumor , Breast Neoplasms , Databases, Genetic , Humans , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Female , Biomarkers, Tumor/genetics , Prognosis , RNA, Long Noncoding/genetics , Gene Regulatory Networks , Data Curation/methods , RNA, Messenger/genetics , RNA, Messenger/metabolism , Gene Expression Regulation, Neoplastic
3.
BMC Bioinformatics ; 25(1): 184, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38724907

ABSTRACT

BACKGROUND: Major advances in sequencing technologies and in the sharing of data and metadata have resulted in a wealth of publicly available datasets. However, working with, and especially curating, public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these have limitations that often lead to the need for further in-house curation and processing. RESULTS: Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a Python 3 package designed to accompany and guide the researcher during the curation of metadata and FASTQ files of public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from other sources. CONCLUSIONS: The OMD Curation Toolkit thus offers valuable tools for the in-house curation previously needed to re-use public omics data. Owing to its workflow structure and capabilities, it can easily be used by, and benefit, investigators developing novel omics meta-analyses based on sequencing data.
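
The toolkit's own API is not shown in the abstract, so as an illustration of the collection step it standardizes, here is a sketch that pulls run metadata for a public project from the ENA Portal API (a real endpoint; the project accession is a hypothetical placeholder):

    import csv
    import io
    import requests

    params = {
        "accession": "PRJEB12345",  # hypothetical ENA project accession
        "result": "read_run",
        "fields": "run_accession,sample_accession,fastq_ftp,fastq_md5",
        "format": "tsv",
    }
    resp = requests.get("https://www.ebi.ac.uk/ena/portal/api/filereport",
                        params=params, timeout=60)
    resp.raise_for_status()

    # Each row describes one sequencing run and where to fetch its FASTQ files.
    for row in csv.DictReader(io.StringIO(resp.text), delimiter="\t"):
        print(row["run_accession"], row["fastq_ftp"])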


Subject(s)
Data Curation , Software , Workflow , Data Curation/methods , Metadata , Databases, Genetic , Genomics/methods , Computational Biology/methods
4.
Database (Oxford) ; 2024; 2024 May 24.
Article in English | MEDLINE | ID: mdl-38788333

ABSTRACT

Multiple sclerosis (MS) is the most common inflammatory demyelinating disease of the central nervous system. 'Omics' technologies (genomics, transcriptomics, proteomics) and associated drug information have begun reshaping our understanding of multiple sclerosis. However, these data are scattered across numerous references, making them challenging to fully utilize. We manually mined and compiled these data in the Multiple Sclerosis Gene Database (MSGD), which we intend to keep updating. We screened 5485 publications to construct the current version of MSGD. MSGD comprises 6255 entries, including 3274 variant entries, 1175 RNA entries, 418 protein entries, 313 knockout entries, 612 drug entries and 463 high-throughput entries. Each entry contains detailed information, such as species, disease type, detailed gene descriptions (such as official gene symbols) and original references. MSGD is freely accessible and provides a user-friendly web interface. Users can easily search for genes of interest, view their expression patterns and detailed information, manage gene sets and submit new MS-gene associations through the platform. The primary principle behind MSGD's design is to provide an exploratory platform that minimizes filtration and interpretation barriers while ensuring a highly accessible presentation of the data. This initiative is expected to significantly assist researchers in deciphering gene mechanisms and improving the prevention, diagnosis and treatment of MS. Database URL: http://bio-bigdata.hrbmu.edu.cn/MSGD.


Subject(s)
Databases, Genetic , Multiple Sclerosis , Proteomics , Transcriptome , Multiple Sclerosis/genetics , Humans , Proteomics/methods , Transcriptome/genetics , Data Curation/methods , Genomics/methods
5.
PLoS One ; 19(4): e0301772, 2024.
Article in English | MEDLINE | ID: mdl-38662657

ABSTRACT

In recent years, with the trend toward open science, there have been many efforts to share research data on the internet. To promote research data sharing, data curation is essential for making the data interpretable and reusable. In research fields such as the life sciences, earth sciences, and social sciences, tasks and procedures have already been developed to implement efficient data curation that meets the needs and customs of each field. However, open science requires not only data sharing within research fields but also interdisciplinary data sharing. To this end, knowledge of data curation across research fields is surveyed, analyzed, and organized as an ontology in this paper. For the survey, existing vocabularies and procedures were collected and compared, and interviews with data curators at research institutes in different fields were conducted to clarify commonalities and differences in data curation across fields. It turned out that the granularity of the tasks and procedures that constitute the building blocks of data curation is not formalized; without a method to overcome this gap, it will be challenging to promote interdisciplinary reuse of research data. Based on this analysis, an ontology of the data curation process is proposed that describes data curation processes in different fields in a universal way. It is expressed in OWL and shown to be logically valid and consistent. The ontology successfully represents the data curation activities in the different fields captured in the interviews as processes. It is also helpful for identifying the functions that systems supporting the data curation process should provide. This study contributes to building a knowledge framework for an interdisciplinary understanding of data curation activities in different fields.
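
As a toy illustration of what such an OWL formalization can look like, the sketch below models a curation process and its tasks with rdflib; every class and property name is a hypothetical stand-in, not a term from the proposed ontology.

    from rdflib import Graph, Namespace, RDF, RDFS
    from rdflib.namespace import OWL

    DCP = Namespace("http://example.org/data-curation-process#")
    g = Graph()
    g.bind("dcp", DCP)
    g.bind("owl", OWL)

    # A curation process is composed of tasks of a common granularity.
    g.add((DCP.CurationProcess, RDF.type, OWL.Class))
    g.add((DCP.CurationTask, RDF.type, OWL.Class))
    g.add((DCP.hasTask, RDF.type, OWL.ObjectProperty))
    g.add((DCP.hasTask, RDFS.domain, DCP.CurationProcess))
    g.add((DCP.hasTask, RDFS.range, DCP.CurationTask))

    # Field-specific tasks specialize the shared task class, which is what
    # allows processes from different fields to be compared.
    g.add((DCP.MetadataNormalization, RDFS.subClassOf, DCP.CurationTask))

    print(g.serialize(format="turtle"))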


Subject(s)
Data Curation , Information Dissemination , Data Curation/methods , Information Dissemination/methods , Humans , Knowledge , Internet
6.
Bioinformatics ; 39(4); 2023 04 03.
Article in English | MEDLINE | ID: mdl-36916735

ABSTRACT

MOTIVATION: Biomedical identifier resources (such as ontologies, taxonomies, and controlled vocabularies) commonly overlap in scope and contain equivalent entries under different identifiers. Maintaining mappings between these entries is crucial for interoperability and for the integration of data and knowledge. However, there are substantial gaps in the available mappings, motivating their semi-automated curation. RESULTS: Biomappings implements a curation workflow for missing mappings that combines automated prediction with human-in-the-loop curation. It supports multiple prediction approaches and provides a web-based user interface for reviewing predicted mappings for correctness, combined with automated consistency checking. Predicted and curated mappings are made available in public, version-controlled resource files on GitHub. Biomappings currently makes available 9274 curated mappings and 40 691 predicted ones, providing previously missing mappings between widely used identifier resources covering small molecules, cell lines, diseases, and other concepts. We demonstrate the value of Biomappings in case studies involving the prediction and curation of missing mappings among cancer cell lines as well as small molecules tested in clinical trials. We also describe how previously missing mappings curated using Biomappings were contributed back to multiple widely used community ontologies. AVAILABILITY AND IMPLEMENTATION: The data and code are available under the CC0 and MIT licenses at https://github.com/biopragmatics/biomappings.
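
Since the mappings ship as version-controlled resource files alongside a Python package, consuming them can take a few lines. A sketch under the assumption that the package exposes load_mappings() and load_predictions() returning dict-like rows (names recalled from the project's documentation; verify against the repository):

    # Assumed loader names; check https://github.com/biopragmatics/biomappings.
    from biomappings import load_mappings, load_predictions

    curated = load_mappings()       # human-reviewed mappings
    predicted = load_predictions()  # automated predictions awaiting review
    print(len(curated), "curated;", len(predicted), "predicted")

    # Column names below mirror the TSV headers as recalled; treat as assumptions.
    row = curated[0]
    print(row["source prefix"], row["source identifier"],
          "->", row["target prefix"], row["target identifier"])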


Subject(s)
Data Curation , Vocabulary, Controlled , Humans , Data Curation/methods , Software , User-Computer Interface
7.
Int J Neural Syst ; 32(9): 2250043, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35912583

ABSTRACT

A practical problem in supervised deep learning for medical image segmentation is the lack of labeled data, which is expensive and time-consuming to acquire. In contrast, a considerable amount of unlabeled data is available in the clinic. To make better use of the unlabeled data and improve generalization from limited labeled data, this paper presents a novel semi-supervised segmentation method based on multi-task curriculum learning. Here, curriculum learning means that when training the network, simpler knowledge is learned first to assist the learning of more difficult knowledge. Concretely, our framework consists of a main segmentation task and two auxiliary tasks, i.e. a feature regression task and a target detection task. The two auxiliary tasks predict relatively simple image-level attributes and bounding boxes as pseudo labels for the main segmentation task, enforcing that the pixel-level segmentation result match the distribution of these pseudo labels. In addition, to address class imbalance in the images, a bounding-box-based attention (BBA) module is embedded, enabling the segmentation network to focus more on the target region than on the background. Furthermore, to alleviate the adverse effects of possible deviations in the pseudo labels, error tolerance mechanisms are adopted in the auxiliary tasks, including an inequality constraint and bounding-box amplification. Our method is validated on the ACDC2017 and PROMISE12 datasets. Experimental results demonstrate that, compared with the fully supervised method and state-of-the-art semi-supervised methods, our method yields much better segmentation performance on a small labeled dataset. Code is available at https://github.com/DeepMedLab/MTCL.
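
The inequality-constraint idea, penalizing deviation from a pseudo label only beyond a tolerance margin, can be sketched in a few lines of PyTorch; this is a schematic reading of the abstract, not the authors' implementation, and all loss weights are illustrative:

    import torch
    import torch.nn.functional as F

    def tolerant_loss(pred, pseudo, margin=0.1):
        # Deviations from the (possibly noisy) pseudo label are only
        # penalized once they exceed the tolerance margin.
        return F.relu((pred - pseudo).abs() - margin).mean()

    def total_loss(seg_logits, seg_labels, attr_pred, attr_pseudo,
                   box_pred, box_pseudo, w_attr=0.1, w_box=0.1):
        loss_seg = F.cross_entropy(seg_logits, seg_labels)   # labeled data
        loss_attr = tolerant_loss(attr_pred, attr_pseudo)    # auxiliary task 1
        loss_box = tolerant_loss(box_pred, box_pseudo)       # auxiliary task 2
        return loss_seg + w_attr * loss_attr + w_box * loss_box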


Subject(s)
Curriculum , Supervised Machine Learning , Data Curation/methods , Data Curation/standards , Datasets as Topic/standards , Datasets as Topic/supply & distribution , Image Processing, Computer-Assisted/methods , Supervised Machine Learning/classification , Supervised Machine Learning/statistics & numerical data , Supervised Machine Learning/trends
8.
Metabolomics ; 18(6): 40, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35699774

ABSTRACT

INTRODUCTION: The accuracy of feature annotation and metabolite identification in biological samples is a key element of metabolomics research. However, the annotation process is often hampered by the lack of spectral reference data in experimental conditions, as well as by logistical difficulties in spectral data management and in the exchange of annotations between laboratories. OBJECTIVES: To design an open-source infrastructure for hosting both nuclear magnetic resonance (NMR) and mass spectrometry (MS) spectra, with an ergonomic web interface and web services to support metabolite annotation and laboratory data management. METHODS: We developed the PeakForest infrastructure, an open-source Java tool with application programming interfaces that can be deployed locally to organize spectral data for metabolome annotation in laboratories. Standardized operating procedures and formats were included to ensure data quality and interoperability, in line with international recommendations and FAIR principles. RESULTS: PeakForest is able to capture and store experimental MS and NMR spectral metadata as well as collect and display signal annotations. This modular system provides a structured database with inbuilt tools to curate information and to browse and reuse spectral information in data treatment. PeakForest offers data formalization and centralization at the laboratory level, facilitating the sharing of spectral data across laboratories and integration into public databases. CONCLUSION: PeakForest is a comprehensive resource that addresses a technical bottleneck, namely large-scale spectral data annotation and metabolite identification for metabolomics laboratories with multiple instruments. PeakForest databases can be used in conjunction with bespoke data analysis pipelines in the Galaxy environment, offering the opportunity to meet the evolving needs of metabolomics research. Developed and tested by the French metabolomics community, PeakForest is freely available at https://github.com/peakforest .


Subject(s)
Metabolomics , Metadata , Data Curation/methods , Mass Spectrometry/methods , Metabolome , Metabolomics/methods
9.
Anesth Analg ; 134(2): 380-388, 2022 02 01.
Article in English | MEDLINE | ID: mdl-34673658

ABSTRACT

BACKGROUND: The retrospective analysis of electroencephalogram (EEG) signals acquired from patients under general anesthesia is crucial for understanding the state of the patient's unconscious brain. However, the creation of such a database is often tedious and cumbersome and involves human labor. Hence, we developed a Raspberry Pi-based system for archiving EEG signals recorded from patients under anesthesia in operating rooms (ORs) with minimal human involvement. METHODS: Using this system, we archived patient EEG signals from over 500 unique surgeries at the Emory University Orthopaedics and Spine Hospital, Atlanta, over about 18 months. For this, we developed a software package that runs on a Raspberry Pi and archives patient EEG signals from a SedLine Root EEG Monitor (Masimo) to secure, Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud storage. The OR number corresponding to each surgery was archived along with the EEG signal to facilitate retrospective analysis. We retrospectively processed the archived EEG signals and performed signal quality checks. We also proposed a formula to compute the proportion of true EEG signal and calculated the corresponding statistics. Further, we curated and interleaved patient medical record information with the corresponding EEG signals. RESULTS: We retrospectively processed the EEG signals to demonstrate a statistically significant negative correlation between the relative alpha power (8-12 Hz) of the EEG signal captured under anesthesia and the patient's age. CONCLUSIONS: Our system is a standalone EEG archiver developed using low-cost, readily available hardware. We demonstrated that one can create a large-scale EEG database with minimal human involvement. Moreover, we showed that the captured EEG signal is of good quality for retrospective analysis and combined the EEG signals with the patients' medical records. This project's software has been released under an open-source license to enable others to use and contribute.
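
The relative-alpha-versus-age analysis reduces to a power spectral density estimate plus a correlation test. A minimal sketch with SciPy; the record structure and the 0.5-40 Hz broadband normalization range are assumptions:

    import numpy as np
    from scipy.signal import welch
    from scipy.stats import pearsonr

    def relative_alpha_power(eeg, fs):
        # Welch PSD; relative alpha = alpha-band power / broadband power.
        freqs, psd = welch(eeg, fs=fs, nperseg=int(4 * fs))
        broadband = psd[(freqs >= 0.5) & (freqs <= 40)].sum()
        alpha = psd[(freqs >= 8) & (freqs <= 12)].sum()
        return alpha / broadband

    # records: (signal, sampling_rate_hz, patient_age) tuples from the archive;
    # the demo data below are random placeholders.
    rng = np.random.default_rng(0)
    records = [(rng.standard_normal(30 * 128), 128, age) for age in (25, 45, 65)]
    alphas = [relative_alpha_power(sig, fs) for sig, fs, _ in records]
    ages = [age for _, _, age in records]
    r, p = pearsonr(alphas, ages)  # the paper reports a significant negative r
    print(f"r={r:.2f}, p={p:.3g}")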


Subject(s)
Data Curation/methods , Electroencephalography/instrumentation , Electroencephalography/methods , Monitoring, Intraoperative/instrumentation , Monitoring, Intraoperative/methods , Adult , Aged , Aged, 80 and over , Data Management/instrumentation , Data Management/methods , Female , Humans , Male , Middle Aged , Retrospective Studies , Young Adult
10.
Drug Discov Today ; 27(1): 207-214, 2022 01.
Article in English | MEDLINE | ID: mdl-34332096

ABSTRACT

Standardizing data is crucial for preserving and exchanging scientific information. In particular, recording the context in which data were created ensures that information remains findable, accessible, interoperable, and reusable. Here, we introduce the concept of self-reporting data assets (SRDAs), which preserve data together with their contextual information. SRDAs are an abstract concept, which requires a suitable data format for implementation. Four promising data formats or languages are widely used to represent data in pharma: JCAMP-DX, JSON, AnIML and, more recently, the Allotrope Data Format (ADF). Here, we evaluate these four options against multiple criteria in common use cases within the pharmaceutical industry. The evaluation shows that ADF is the most suitable format for the implementation of SRDAs.
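
To make the concept concrete, here is a toy SRDA rendered as JSON, one of the four evaluated formats; all field names are invented for illustration:

    import json

    # A self-reporting data asset: measured values travel together with the
    # context in which they were created (instrument, method, operator, units).
    srda = {
        "data": {
            "wavelength": [200, 210, 220],
            "absorbance": [0.12, 0.34, 0.29],
        },
        "context": {
            "instrument": "UV/Vis spectrometer",
            "method": "SOP-123 rev 4",
            "operator": "analyst-07",
            "created": "2021-06-01T09:30:00Z",
            "units": {"wavelength": "nm", "absorbance": "AU"},
        },
    }
    print(json.dumps(srda, indent=2))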


Subject(s)
Data Accuracy , Data Curation , Drug Industry , Information Dissemination/methods , Research Design/standards , Data Curation/methods , Data Curation/standards , Diffusion of Innovation , Drug Industry/methods , Drug Industry/organization & administration , Humans , Proof of Concept Study , Reference Standards , Technology, Pharmaceutical/methods
11.
Public Health Rep ; 137(2): 197-202, 2022.
Article in English | MEDLINE | ID: mdl-34969294

ABSTRACT

The public health crisis created by the COVID-19 pandemic has spurred a deluge of scientific research aimed at informing the public health and medical response to the pandemic. However, early in the pandemic, those working in frontline public health and clinical care had insufficient time to parse the rapidly evolving evidence and use it for decision-making. Academics in public health and medicine were well placed to translate the evidence for use by frontline clinicians and public health practitioners. The Novel Coronavirus Research Compendium (NCRC), a group of >60 faculty and trainees across the United States, formed in March 2020 with the goal of quickly triaging and reviewing the large volume of preprints and peer-reviewed publications on SARS-CoV-2 and COVID-19 and summarizing the most important, novel evidence to inform the pandemic response. From April 6 through December 31, 2020, NCRC teams screened 54 192 peer-reviewed articles and preprints, of which 527 were selected for review and uploaded to the NCRC website for public consumption. Most articles were peer-reviewed publications (n = 395, 75.0%), published in 102 journals; 25.1% (n = 132) of the articles reviewed were preprints. The NCRC is a successful model of how academics can translate scientific knowledge for practitioners and help build capacity for this work among students. This approach could be used for health problems beyond COVID-19, but the effort is resource intensive and may not be sustainable in the long term.


Subject(s)
COVID-19 , Data Curation/methods , Information Dissemination/methods , Interdisciplinary Research/organization & administration , Peer Review, Research , Preprints as Topic , SARS-CoV-2 , Humans , Public Health , United States
12.
Nucleic Acids Res ; 50(D1): D578-D586, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34718729

ABSTRACT

The Complex Portal (www.ebi.ac.uk/complexportal) is a manually curated, encyclopaedic database of macromolecular complexes with known function from a range of model organisms. It summarizes complex composition, topology and function, along with links to a large range of domain-specific resources (e.g. wwPDB, EMDB and Reactome). Since the last update in 2019, we have produced a first draft complexome for Escherichia coli, maintained and updated that of Saccharomyces cerevisiae, added over 40 coronavirus complexes and increased the human complexome to over 1100 complexes, including approximately 200 complexes that act as targets for viral proteins or are part of the immune system. The display of protein features in ComplexViewer has been improved, and the participant table is now colour-coordinated with the nodes in ComplexViewer. Community collaboration has expanded, for example by contributing to an analysis of putative transcription cofactors and by providing data accessible to semantic web tools through Wikidata, which is now populated with manually curated Complex Portal content through a new bot. Our data license is now CC0 to encourage data reuse. Users are encouraged to get in touch, provide us with feedback and send curation requests through the 'Support' link.
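
Complex Portal content can also be queried programmatically. A sketch against its REST web service, with the endpoint path and response keys recalled from the service documentation, so treat both as assumptions to verify:

    import requests

    # Search the Complex Portal web service (path and JSON keys recalled from
    # the docs; verify at https://www.ebi.ac.uk/complexportal before relying on them).
    resp = requests.get(
        "https://www.ebi.ac.uk/intact/complex-ws/search/ndc80",
        headers={"Accept": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    for hit in resp.json().get("elements", []):
        print(hit.get("complexAC"), "-", hit.get("complexName"))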


Subject(s)
Data Curation/methods , Databases, Protein , Multiprotein Complexes/chemistry , Coronavirus/chemistry , Data Visualization , Databases, Chemical , Enzymes/chemistry , Enzymes/metabolism , Escherichia coli/chemistry , Humans , International Cooperation , Molecular Sequence Annotation , Multiprotein Complexes/metabolism , User-Computer Interface
13.
AMIA Annu Symp Proc ; 2022: 884-891, 2022.
Article in English | MEDLINE | ID: mdl-37128469

ABSTRACT

Data curation is a bottleneck for many informatics pipelines. A specific example is aggregating data from preclinical studies to identify novel genetic pathways for atherosclerosis in humans. This requires extracting data from published mouse studies, such as the perturbed gene and its impact on lesion size and plaque inflammation, which is non-trivial. Curation efforts are resource-heavy, with curators manually extracting data from hundreds of publications. In this work, we describe the development of a semi-automated curation tool to accelerate data extraction. We use natural language processing (NLP) methods to auto-populate a web-based form, which is then reviewed by a curator. We conducted a controlled user study to evaluate the curation tool. Our NLP model has 70% accuracy on categorical fields, and our curation tool reduces task completion time by 49% compared with manual curation.
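
As a stand-in for the paper's NLP model (whose architecture the abstract does not specify), here is a minimal scikit-learn sketch of predicting one categorical form field from a sentence; the training examples are invented:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy examples: sentence -> direction of the effect on lesion size.
    texts = [
        "Apoe deletion increased aortic lesion area",
        "The treatment reduced plaque inflammation and lesion size",
        "Knockout mice showed larger lesions than controls",
        "The inhibitor decreased lesion size",
    ]
    labels = ["increase", "decrease", "increase", "decrease"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    # The prediction pre-populates the form field; a curator then reviews it.
    print(model.predict(["Overexpression of gene X enlarged the lesions"]))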


Subject(s)
Data Curation , Natural Language Processing , Humans , Animals , Mice , Data Curation/methods , Publications
14.
J Comput Biol ; 28(12): 1248-1257, 2021 12.
Article in English | MEDLINE | ID: mdl-34898255

ABSTRACT

Prostate cancer (PCa) is the second most lethal malignancy in men worldwide. In the past, numerous research groups have investigated the omics profiles of patients and scrutinized biomarkers for the diagnosis and prognosis of PCa. However, information related to these biomarkers is widely scattered across numerous resources in complex textual format, which hinders understanding of the tumorigenesis of this malignancy and the identification of robust signatures. To create a comprehensive resource, we collected all the relevant literature on PCa biomarkers from PubMed. We scrutinized the extensive information about each biomarker from a total of 412 unique research articles. Each entry of the database incorporates PubMed ID, biomarker name, biomarker type, biomolecule, source, subjects, validation status, and performance measures such as sensitivity, specificity, and hazard ratio (HR). In this study, we present ProCanBio, a manually curated database that maintains detailed data on 2053 entries of potential PCa biomarkers obtained from 412 publications in a user-friendly tabular format. Among them are 766 protein-based, 507 RNA-based, 260 miRNA-based, and 122 metabolite-based biomarkers, and 157 genomic mutations. To explore the information in the resource, a web-based interactive platform was developed with searching and browsing facilities. To the best of the authors' knowledge, no existing resource consolidates the information contained in all the published literature. Besides this, ProCanBio is freely available and is compatible with most web browsers and devices. We anticipate this resource will be highly useful for the research community involved in the area of prostate malignancy.


Subject(s)
Biomarkers, Tumor/genetics , Biomarkers, Tumor/metabolism , Data Curation/methods , Prostatic Neoplasms/genetics , Prostatic Neoplasms/metabolism , Databases, Factual , Gene Regulatory Networks , Humans , Male , Metabolomics , MicroRNAs/genetics , Mutation , Prognosis , Protein Interaction Maps , User-Computer Interface , Web Browser
15.
PLoS Biol ; 19(12): e3001464, 2021 12.
Article in English | MEDLINE | ID: mdl-34871295

ABSTRACT

The UniProt knowledgebase is a public database for protein sequence and function, covering the tree of life and over 220 million protein entries. Now, the whole community can use a new crowdsourcing annotation system to help scale up UniProt curation and receive proper attribution for their biocuration work.


Subject(s)
Crowdsourcing/methods , Data Curation/methods , Molecular Sequence Annotation/methods , Amino Acid Sequence/genetics , Computational Biology/methods , Databases, Protein/trends , Humans , Literature , Proteins/metabolism , Stakeholder Participation
16.
PLoS One ; 16(12): e0260758, 2021.
Article in English | MEDLINE | ID: mdl-34879097

ABSTRACT

This study aims to solve the overfitting problem caused by insufficient labeled images in automatic image annotation. We propose a transfer learning model called CNN-2L that incorporates the label localization strategy described in this study. The model consists of an InceptionV3 network pretrained on the ImageNet dataset and a label localization algorithm. First, the pretrained InceptionV3 network extracts features from the target dataset that are used to train a specific classifier, and the entire network is fine-tuned to obtain an optimal model. Then, the obtained model is used to derive the probabilities of the predicted labels. For this purpose, we introduce a squeeze-and-excitation (SE) module into the network architecture that augments useful feature information, inhibits useless feature information, and conducts feature reweighting. Next, we perform label localization to obtain the label probabilities and determine the final label set for each image. During this process, the number of labels must be determined. The optimal K value is obtained experimentally and used to determine the number of predicted labels, thereby solving the empty-label-set problem that occurs when the predicted label values of images are below a fixed threshold. Experiments on the Corel5k multilabel image dataset verify that CNN-2L improves the labeling precision by 18% and 15% compared with the traditional multiple-Bernoulli relevance model (MBRM) and joint equal contribution (JEC) algorithms, respectively, and improves the recall by 6% compared with JEC. Additionally, it improves the precision by 20% and 11% compared with the deep learning methods Weight-KNN and adaptive hypergraph learning (AHL), respectively. Although CNN-2L fails to improve the recall compared with the semantic extension model (SEM), it improves the overall F1 score by 1%. The experimental results reveal that the proposed transfer learning model based on a label localization strategy is effective for automatic image annotation and substantially boosts multilabel image annotation performance.
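
The backbone-plus-top-K recipe can be sketched with Keras; the SE module and the label localization algorithm itself are omitted, the 260-label output matches the Corel5k vocabulary, and the K shown is an illustrative value rather than the experimentally tuned one:

    import numpy as np
    from tensorflow.keras import Model, layers
    from tensorflow.keras.applications import InceptionV3

    NUM_LABELS, K = 260, 5  # Corel5k label vocabulary; K here is illustrative

    # ImageNet-pretrained backbone with a new multi-label classification head.
    base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                       input_shape=(299, 299, 3))
    outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(base.output)
    model = Model(base.input, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    # ... fine-tune on the target dataset here ...

    # Keeping the top-K labels per image (instead of thresholding probabilities)
    # avoids empty label sets for low-confidence images.
    probs = model.predict(np.random.rand(1, 299, 299, 3))  # placeholder image
    top_k = np.argsort(probs[0])[::-1][:K]
    print("predicted label indices:", top_k)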


Subject(s)
Algorithms , Data Curation/methods , Deep Learning , Image Processing, Computer-Assisted/methods , Neural Networks, Computer , Tomography, X-Ray Computed/methods , Humans
17.
Nat Methods ; 18(11): 1377-1385, 2021 11.
Article in English | MEDLINE | ID: mdl-34711973

ABSTRACT

Liquid chromatography-high-resolution mass spectrometry (LC-MS)-based metabolomics aims to identify and quantify all metabolites, but most LC-MS peaks remain unidentified. Here we present a global network optimization approach, NetID, to annotate untargeted LC-MS metabolomics data. The approach aims to generate, for all experimentally observed ion peaks, annotations that match the measured masses, retention times and (when available) tandem mass spectrometry fragmentation patterns. Peaks are connected based on mass differences reflecting adduction, fragmentation, isotopes, or feasible biochemical transformations. Global optimization generates a single network linking most observed ion peaks, enhances peak assignment accuracy, and produces chemically informative peak-peak relationships, including for peaks lacking tandem mass spectrometry spectra. Applying this approach to yeast and mouse data, we identified five previously unrecognized metabolites (thiamine derivatives and N-glucosyl-taurine). Isotope tracer studies indicate active flux through these metabolites. Thus, NetID applies existing metabolomic knowledge and global optimization to substantially improve annotation coverage and accuracy in untargeted metabolomics datasets, facilitating metabolite discovery.
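
A toy sketch of the edge-construction step that precedes NetID's global optimization: connect peak pairs whose mass difference matches a known transformation within a ppm tolerance (the masses and the transformation table are illustrative; the optimization itself is not shown):

    # Connect LC-MS peaks whose mass differences match known transformations.
    peaks = [116.0706, 118.0862, 132.1019, 146.1176]  # illustrative m/z values
    transforms = {"+2H": 2.0157, "+CH2": 14.0157, "+CO": 27.9949}
    TOL_PPM = 10

    edges = []
    for i, mi in enumerate(peaks):
        for j, mj in enumerate(peaks):
            if j <= i:
                continue
            for name, delta in transforms.items():
                if abs((mj - mi) - delta) < TOL_PPM * 1e-6 * mj:
                    edges.append((i, j, name))
    print(edges)  # e.g. (0, 1, '+2H'), (1, 2, '+CH2'), ...

NetID then selects one consistent annotation per peak by optimizing over the whole network; this sketch only generates the candidate edges.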


Subject(s)
Algorithms , Data Curation/standards , Liver/metabolism , Metabolome , Metabolomics/standards , Saccharomyces cerevisiae/metabolism , Animals , Chromatography, Liquid/methods , Data Curation/methods , Metabolomics/methods , Mice , Tandem Mass Spectrometry/methods
18.
Int J Mol Sci ; 22(17); 2021 Sep 06.
Article in English | MEDLINE | ID: mdl-34502531

ABSTRACT

Interactions between proteins are essential to any cellular process and constitute the basis for the molecular networks that determine the functional state of a cell. With the technical advances of recent years, an astonishingly high number of protein-protein interactions has been revealed. However, the interactome of O-linked N-acetylglucosamine transferase (OGT), the sole enzyme adding O-linked β-N-acetylglucosamine (O-GlcNAc) onto its target proteins, has remained largely undefined. To that end, we collated OGT-interacting proteins experimentally identified over the past several decades. Rigorous curation of datasets from public repositories and O-GlcNAc-focused publications led to the identification of up to 929 high-stringency OGT interactors from the multiple species studied (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, and others). Among them, 784 human proteins were found to be interactors of human OGT. Moreover, these proteins span a very diverse range of functional classes (e.g., DNA repair, RNA metabolism, translational regulation, and cell cycle), with significant enrichment in the regulation of transcription and (co)translation. Our dataset demonstrates that OGT is likely a hub protein in cells. A web server, OGT-Protein Interaction Network (OGT-PIN), has also been created and is freely accessible.


Subject(s)
Acetylglucosamine/metabolism , Data Curation/methods , Databases, Protein/statistics & numerical data , N-Acetylglucosaminyltransferases/metabolism , Protein Interaction Maps , Protein Processing, Post-Translational , Animals , Arabidopsis Proteins/metabolism , Drosophila Proteins/metabolism , Humans , Mice , Rats
19.
STAR Protoc ; 2(3): 100705, 2021 09 17.
Article in English | MEDLINE | ID: mdl-34458864

ABSTRACT

Cell type annotation is an important step in the analysis of single-cell RNA-seq data. CellO is a machine-learning-based tool for annotating cells using the Cell Ontology, a rich hierarchy of known cell types. We provide a protocol for using the CellO Python package to annotate human cells. We demonstrate how to use CellO in conjunction with Scanpy, a Python library for performing single-cell analysis, to annotate a lung tissue data set, interpret its hierarchically structured cell type annotations, and create publication-ready figures. For complete details on the use and execution of this protocol, please refer to Bernstein et al. (2021).
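
A condensed sketch of the Scanpy-plus-CellO flow the protocol describes; the input file name is a placeholder, and the CellO function name, arguments, and output column are recalled from the package documentation, so verify them before use:

    import scanpy as sc
    import cello  # https://github.com/deweylab/CellO

    # Standard Scanpy preprocessing and clustering on a lung data set.
    adata = sc.read_h5ad("lung_tissue.h5ad")  # placeholder file name
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata)
    sc.tl.umap(adata)

    # Annotate clusters against the Cell Ontology (API recalled from the
    # CellO docs; treat the call signature as an assumption).
    cello.scanpy_cello(adata, clust_key="leiden", rsrc_loc=".")
    sc.pl.umap(adata, color="Most specific cell type")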


Subject(s)
Data Curation/methods , RNA-Seq/methods , Sequence Analysis, RNA/methods , Biological Ontologies , Computational Biology/methods , Humans , Machine Learning , Single-Cell Analysis/methods , Software , Transcriptome/genetics , Exome Sequencing/methods
20.
Biochim Biophys Acta Gene Regul Mech ; 1864(11-12): 194753, 2021.
Article in English | MEDLINE | ID: mdl-34461312

ABSTRACT

The number of papers published in biomedical research makes it nearly impossible for a researcher to keep up to date. This is where manually curated databases contribute, facilitating access to knowledge. However, the structure required by databases strongly limits the type of valuable information that can be incorporated. Here, we present Lisen&Curate, a curation system that facilitates linking sentences or parts of sentences (both considered sources) in articles with their corresponding curated objects, so that rich additional information about these objects is easily available to users. These sources will be offered both within RegulonDB and within a new database, L-Regulon. To show the relevance of our work, two senior curators used Lisen&Curate to curate 31 articles on the regulation of transcription initiation in E. coli. As a result, 194 objects were curated and 781 sources were recorded. We also found that these sources are useful for developing automatic approaches to detect objects in articles, by observing word-frequency patterns and by carrying out an open information extraction task. Sources may also help in elaborating a controlled vocabulary of experimental methods. Finally, we discuss our ecosystem of interconnected applications, RegulonDB, L-Regulon, and Lisen&Curate, to facilitate access to knowledge on the regulation of transcription initiation in bacteria. We see our proposal as a starting point for changing the way experimentalists connect a piece of knowledge with its evidence in RegulonDB.


Subject(s)
Data Curation/methods , Databases, Genetic , Gene Expression Regulation, Bacterial , Transcription Initiation, Genetic , Escherichia coli/genetics