Search | VHL Regional Portal

Improving candidate Biosynthetic Gene Clusters in fungi through reinforcement learning.

Almeida, Hayda; Tsang, Adrian; Diallo, Abdoulaye Baniré.

Bioinformatics ; 38(16): 3984-3991, 2022 08 10.

Article in English | MEDLINE | ID: mdl-35762945

ABSTRACT

MOTIVATION: Precise identification of Biosynthetic Gene Clusters (BGCs) is a challenging task. Performance of BGC discovery tools is limited by their capacity to accurately predict components belonging to candidate BGCs, often overestimating cluster boundaries. To support optimizing the composition and boundaries of candidate BGCs, we propose a reinforcement learning approach relying on protein domains and functional annotations from expert curated BGCs. RESULTS: The proposed reinforcement learning method aims to improve candidate BGCs obtained with state-of-the-art tools. It was evaluated on candidate BGCs obtained for two fungal genomes, Aspergillus niger and Aspergillus nidulans. The results highlight an improvement in gene precision of more than 15% for TOUCAN, fungiSMASH and DeepBGC, and in cluster precision of more than 25% for fungiSMASH and DeepBGC, allowing these tools to obtain almost perfect precision in cluster prediction. This can pave the way for optimizing current prediction of candidate BGCs in fungi, while minimizing the curation effort required by domain experts. AVAILABILITY AND IMPLEMENTATION: https://github.com/bioinfoUQAM/RL-bgc-components. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Fungi , Multigene Family , Fungi/genetics , Genome, Fungal , Biosynthetic Pathways/genetics
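
The abstract above reports gains in gene precision without defining the metric. As a rough illustration only, the sketch below assumes gene precision means the fraction of genes in a candidate cluster that also appear in the curated cluster. The function name, gene identifiers, and example clusters are hypothetical and are not taken from the paper or its repository.

```python
def gene_precision(predicted_genes, true_genes):
    """Fraction of predicted genes that belong to the curated cluster.

    Assumed definition for illustration; the paper may compute it differently.
    """
    predicted = set(predicted_genes)
    if not predicted:
        return 0.0
    return len(predicted & set(true_genes)) / len(predicted)


# Hypothetical candidate cluster whose boundaries are overestimated.
candidate = ["g1", "g2", "g3", "g4", "g5"]
# Hypothetical curated cluster.
curated = ["g2", "g3", "g4"]
# Hypothetical candidate after one flanking gene has been trimmed.
trimmed = ["g2", "g3", "g4", "g5"]

before = gene_precision(candidate, curated)
after = gene_precision(trimmed, curated)
print(before, after)
```

Under this assumed definition, trimming a gene that lies outside the curated cluster raises precision, which is the kind of boundary refinement the abstract describes.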

TOUCAN: a framework for fungal biosynthetic gene cluster discovery.

Almeida, Hayda; Palys, Sylvester; Tsang, Adrian; Diallo, Abdoulaye Baniré.

NAR Genom Bioinform ; 2(4): lqaa098, 2020 Dec.

Article in English | MEDLINE | ID: mdl-33575642

ABSTRACT

Fungal secondary metabolites (SMs) are an important source of numerous bioactive compounds largely applied in the pharmaceutical industry, as in the production of antibiotics and anticancer medications. The discovery of novel fungal SMs can potentially benefit human health. Identifying biosynthetic gene clusters (BGCs) involved in the biosynthesis of SMs can be a costly and complex task, especially due to the genomic diversity of fungal BGCs. Previous studies on fungal BGC discovery present limited scope and can restrict the discovery of new BGCs. In this work, we introduce TOUCAN, a supervised learning framework for fungal BGC discovery. Unlike previous methods, TOUCAN is capable of predicting BGCs on amino acid sequences, facilitating its use on newly sequenced and not yet curated data. It relies on three main pillars: rigorous selection of datasets by BGC experts; combination of functional, evolutionary and compositional features coupled with outperforming classifiers; and robust post-processing methods. TOUCAN's best-performing model yields an F-measure of 0.982 on BGC regions in the Aspergillus niger genome. Overall results show that TOUCAN outperforms previous approaches. TOUCAN focuses on fungal BGCs but can be easily adapted to expand its scope to process other species or include new features.

Data Sampling and Supervised Learning for HIV Literature Screening.

Almeida, Hayda; Meurs, Marie-Jean; Kosseim, Leila; Tsang, Adrian.

IEEE Trans Nanobioscience ; 15(4): 354-361, 2016 06.

Article in English | MEDLINE | ID: mdl-28113721

ABSTRACT

This paper presents a supervised learning approach to support the screening of HIV literature. The manual screening of biomedical literature is an important task in the process of systematic reviews. Researchers and curators have the very demanding, time-consuming and error-prone task of manually identifying documents that should be included in a systematic review concerning a specific problem. We developed a supervised learning approach to support screening tasks, by automatically flagging potentially relevant documents from a list retrieved by a literature database search. To overcome the main issues associated with the automatic literature screening task, we evaluated the use of data sampling, feature combinations, and feature selection methods, generating a total of 105 classification models. The models yielding the best results were composed of a Logistic Model Trees classifier, a fairly balanced training set, and a feature combination of Bag-Of-Words and MeSH terms. According to our results, the system correctly labels the great majority of relevant documents, making it usable to support HIV systematic reviews and allowing researchers to assess a greater number of documents in less time.

mycoCLAP, the database for characterized lignocellulose-active proteins of fungal origin: resource and text mining curation support.

Strasser, Kimchi; McDonnell, Erin; Nyaga, Carol; Wu, Min; Wu, Sherry; Almeida, Hayda; Meurs, Marie-Jean; Kosseim, Leila; Powlowski, Justin; Butler, Greg; Tsang, Adrian.

Database (Oxford) ; 2015.

Article in English | MEDLINE | ID: mdl-25754864

ABSTRACT

Enzymes active on components of lignocellulosic biomass are used for industrial applications ranging from food processing to biofuels production. These include a diverse array of glycoside hydrolases, carbohydrate esterases, polysaccharide lyases and oxidoreductases. Fungi are prolific producers of these enzymes, spurring fungal genome sequencing efforts to identify and catalogue the genes that encode them. To facilitate the functional annotation of these genes, biochemical data on over 800 fungal lignocellulose-degrading enzymes have been collected from the literature and organized into the searchable database, mycoCLAP (http://mycoclap.fungalgenomics.ca). First implemented in 2011, and updated as described here, mycoCLAP is capable of ranking search results according to closest biochemically characterized homologues: this improves the quality of the annotation, and significantly decreases the time required to annotate novel sequences. The database is freely available to the scientific community, as are the open source applications based on natural language processing developed to support the manual curation of mycoCLAP. Database URL: http://mycoclap.fungalgenomics.ca.

Subject(s)

Data Mining , Databases, Genetic , Enzymes , Fungal Proteins , Genes, Fungal , Lignin/metabolism , Natural Language Processing , Data Curation , Enzymes/genetics , Enzymes/metabolism , Fungal Proteins/genetics , Fungal Proteins/metabolism

Machine learning for biomedical literature triage.

Almeida, Hayda; Meurs, Marie-Jean; Kosseim, Leila; Butler, Greg; Tsang, Adrian.

PLoS One ; 9(12): e115892, 2014.

Article in English | MEDLINE | ID: mdl-25551575

ABSTRACT

This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models, by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine and Logistic Model Trees). The results show that the most fitting model to handle the imbalanced datasets of the triage classification task is obtained by using domain relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.

Subject(s)

Databases, Bibliographic , Medical Informatics/methods , Support Vector Machine , Algorithms , Bayes Theorem , Decision Trees , Models, Theoretical
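
The abstract above names under-sampling as part of the best-performing triage model but does not describe the procedure. The sketch below shows plain random under-sampling of the majority class as one common way to do it. The function name, the `ratio` parameter, and the toy data are assumptions for illustration and do not reproduce the authors' sampling factors.

```python
import random


def undersample(examples, labels, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority-class examples until the majority class has
    at most `ratio` times as many examples as all other classes combined."""
    rng = random.Random(seed)
    pairs = list(zip(examples, labels))
    minority = [p for p in pairs if p[1] != majority_label]
    majority = [p for p in pairs if p[1] == majority_label]
    keep = min(len(majority), int(len(minority) * ratio))
    combined = minority + rng.sample(majority, keep)
    rng.shuffle(combined)
    return [x for x, _ in combined], [y for _, y in combined]


# Hypothetical imbalanced triage set: 10 relevant vs 90 irrelevant documents.
docs = [f"doc{i}" for i in range(100)]
tags = ["relevant"] * 10 + ["irrelevant"] * 90

balanced_docs, balanced_tags = undersample(docs, tags, majority_label="irrelevant")
print(len(balanced_docs), balanced_tags.count("relevant"))
```

With `ratio=1.0` the training set ends up evenly balanced; a larger ratio keeps more of the majority class, which is one way to experiment with different sampling factors.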
