Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine.

Islamaj Dogan, Rezarta; Kim, Sun; Chatr-Aryamontri, Andrew; Wei, Chih-Hsuan; Comeau, Donald C; Antunes, Rui; Matos, Sérgio; Chen, Qingyu; Elangovan, Aparna; Panyam, Nagesh C; Verspoor, Karin; Liu, Hongfang; Wang, Yanshan; Liu, Zhuang; Altinel, Berna; Hüsünbeyi, Zehra Melce; Özgür, Arzucan; Fergadis, Aris; Wang, Chen-Kai; Dai, Hong-Jie; Tran, Tung; Kavuluru, Ramakanth; Luo, Ling; Steppi, Albert; Zhang, Jinfeng; Qu, Jinchan; Lu, Zhiyong.

Database (Oxford) ; 2019; 2019 01 01.

Article in English | MEDLINE | ID: mdl-30689846

ABSTRACT

The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of the BioCreative VI Precision Medicine Track was to extract this particular type of information, and the track was organized in two tasks: (i) a document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) a relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods.
Given the level of participation and the individual team results, we find the precision medicine track to be successful in engaging the text-mining research community. The track also produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPIs affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
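
The abstract reports precision, recall and F-score against human annotations. As a rough illustration of how such scores are usually defined, the sketch below compares a set of documents a system marks as relevant with a gold-standard set. It assumes the F-score is the balanced F1 measure; the function name and the document identifiers are hypothetical, and the track's official evaluation (including ranking-based average precision and homolog-aware matching for relations) is not reproduced here.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted items against a set of gold-standard items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    # Balanced F1: harmonic mean of precision and recall (an assumption here).
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical document identifiers, for illustration only.
gold_relevant = {"1001", "1002", "1003", "1004"}
system_relevant = {"1001", "1002", "1005"}

p, r, f = precision_recall_f1(system_relevant, gold_relevant)
print(f"precision={p:.4f} recall={r:.4f} F1={f:.4f}")
```

Note that the "best" precision, recall and F-score quoted in the abstract are maxima over different submitted systems, so they need not satisfy the F1 relation with one another.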

Subject(s)

Data Mining/methods , Databases, Protein , Mutation , Precision Medicine/methods , Protein Interaction Maps , Software , Computational Biology/methods , Humans , Mutation/genetics , Mutation/physiology , Protein Interaction Mapping , Protein Interaction Maps/genetics , Protein Interaction Maps/physiology

Automatic query generation using word embeddings for retrieving passages describing experimental methods.

Aydin, Ferhat; Hüsünbeyi, Zehra Melce; Özgür, Arzucan.

Database (Oxford) ; 2017; 2017.

Article in English | MEDLINE | ID: mdl-28077568

ABSTRACT

Information regarding the physical interactions among proteins is crucial, since protein-protein interactions (PPIs) are central for many biological processes. The experimental techniques used to verify PPIs are vital for characterizing and assessing the reliability of the identified PPIs. Much of the information about PPIs and the experimental methods is only available in the text of the scientific publications that report them. In this study, we approach the problem of identifying passages with experimental methods for physical interactions between proteins as an information retrieval search task. The baseline system is based on query matching, where the queries are generated by utilizing the names (including synonyms) of the experimental methods in the Proteomics Standard Initiative-Molecular Interactions (PSI-MI) ontology. We propose two methods, where the baseline queries are expanded by including additional relevant terms. The first method is a supervised approach, where the most salient terms for each experimental method are obtained by using the term frequency-relevance frequency (tf.rf) metric over 13 articles from our manually annotated data set of 30 full text articles, which is made publicly available. The second method is an unsupervised approach, where the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are obtained by utilizing a large unlabeled full text corpus. The proposed methods are evaluated on the test set consisting of 17 articles. Both methods obtain higher recall scores compared with the baseline, with a loss in precision. Besides higher recall, the word embeddings based approach achieves higher F-measure than the baseline and the tf.rf based methods.
We also show that incorporating gene name and interaction keyword identification leads to improved precision and F-measure scores for all three evaluated methods. The tf.rf based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word embeddings based approach is a novel contribution of this article. Database URL: https://github.com/ferhtaydn/biocemid/.
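
The abstract names the tf.rf metric but does not give its formula. The sketch below assumes the commonly cited definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a is the number of positive-class documents containing the term and c is the number of negative-class documents containing it. The function name, the toy documents and the tokens are hypothetical and are not taken from the authors' repository.

```python
import math
from collections import Counter


def tf_rf(term, passage_tokens, positive_docs, negative_docs):
    """Weight a term by term frequency times relevance frequency.

    positive_docs / negative_docs are iterables of token sets, e.g. passages
    that do / do not describe a given experimental method (assumed setup).
    """
    tf = Counter(passage_tokens)[term]
    a = sum(1 for doc in positive_docs if term in doc)
    c = sum(1 for doc in negative_docs if term in doc)
    # Assumed rf definition; max(1, c) avoids division by zero.
    rf = math.log2(2 + a / max(1, c))
    return tf * rf


# Toy data, for illustration only.
positive = [{"yeast", "two", "hybrid", "bait"}, {"bait", "prey"}]
negative = [{"antibody", "blot"}, {"bait", "blot"}]
passage = ["bait", "prey", "bait"]

bait_score = tf_rf("bait", passage, positive, negative)
prey_score = tf_rf("prey", passage, positive, negative)
print(bait_score, prey_score)
```

Under this definition, terms that occur mostly in the positive class receive a larger rf, which is one way the "most salient terms" for a method could be ranked before being added to a query.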

Subject(s)

Biological Ontologies , Data Curation , Data Mining/methods , Databases, Protein , Proteins , Proteins/genetics , Proteins/metabolism

BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID.

Kim, Sun; Islamaj Dogan, Rezarta; Chatr-Aryamontri, Andrew; Chang, Christie S; Oughtred, Rose; Rust, Jennifer; Batista-Navarro, Riza; Carter, Jacob; Ananiadou, Sophia; Matos, Sérgio; Santos, André; Campos, David; Oliveira, José Luís; Singh, Onkar; Jonnagaddala, Jitendra; Dai, Hong-Jie; Su, Emily Chia-Yu; Chang, Yung-Chun; Su, Yu-Chen; Chu, Chun-Han; Chen, Chien Chin; Hsu, Wen-Lian; Peng, Yifan; Arighi, Cecilia; Wu, Cathy H; Vijay-Shanker, K; Aydin, Ferhat; Hüsünbeyi, Zehra Melce; Özgür, Arzucan; Shin, Soo-Yong; Kwon, Dongseop; Dolinski, Kara; Tyers, Mike; Wilbur, W John; Comeau, Donald C.

Database (Oxford) ; 2016; 2016.

Article in English | MEDLINE | ID: mdl-27589962

ABSTRACT

BioC is a simple XML format for text, annotations and relations, and was developed to achieve interoperability for biomedical text processing. Following the success of BioC in BioCreative IV, the BioCreative V BioC track addressed a collaborative task to build an assistant system for BioGRID curation. In this paper, we describe the framework of the collaborative BioC task and discuss our findings based on the user survey. This track consisted of eight subtasks including gene/protein/organism named entity recognition, protein-protein/genetic interaction passage identification and annotation visualization. Using BioC as their data-sharing and communication medium, nine teams, world-wide, participated and contributed either new methods or improvements of existing tools to address different subtasks of the BioC track. Results from different teams were shared in BioC and made available to other teams as they addressed different subtasks of the track. In the end, all submitted runs were merged using a machine learning classifier to produce an optimized output. The biocurator assistant system was evaluated by four BioGRID curators in terms of practical usability. The curators' feedback was overall positive and highlighted the user-friendly design and the convenient gene/protein curation tool based on text mining. Database URL: http://www.biocreative.org/tasks/biocreative-v/track-1-bioc/.
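
To give a feel for what a BioC document with text and annotations can look like, the sketch below parses a small hand-written sample using Python's standard library. The element names (collection, document, passage, annotation, infon, location) follow the BioC format as generally documented; the sample content, the identifiers and the read_annotations helper are hypothetical and were not produced by any system in the track.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in BioC style, for illustration only.
SAMPLE = """<collection>
  <source>example</source>
  <date>2016-01-01</date>
  <key>example.key</key>
  <document>
    <id>D1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 binds BARD1.</text>
      <annotation id="T1">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (doc id, annotation id, type, offset, length, text) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            for ann in passage.iter("annotation"):
                # infon elements are key/value metadata attached to an annotation.
                infons = {i.get("key"): i.text for i in ann.findall("infon")}
                loc = ann.find("location")
                rows.append((
                    doc_id,
                    ann.get("id"),
                    infons.get("type"),
                    int(loc.get("offset")),
                    int(loc.get("length")),
                    ann.findtext("text"),
                ))
    return rows


rows = read_annotations(SAMPLE)
print(rows)
```

Because every team reads and writes the same structure, one team's entity annotations can be consumed by another team's passage classifier without a custom converter, which is the interoperability point the abstract makes.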

Subject(s)

Data Curation/methods , Data Mining/methods , Electronic Data Processing/methods , Information Dissemination/methods
