Search | BVS Regional Portal

Data Sampling and Supervised Learning for HIV Literature Screening.

Almeida, Hayda; Meurs, Marie-Jean; Kosseim, Leila; Tsang, Adrian.

IEEE Trans Nanobioscience ; 15(4): 354-361, 2016 06.

Article in English | MEDLINE | ID: mdl-28113721

ABSTRACT

This paper presents a supervised learning approach to support the screening of HIV literature. The manual screening of biomedical literature is an important task in the process of systematic reviews. Researchers and curators have the very demanding, time-consuming and error-prone task of manually identifying documents that should be included in a systematic review concerning a specific problem. We developed a supervised learning approach to support screening tasks, by automatically flagging potentially relevant documents from a list retrieved by a literature database search. To overcome the main issues associated with the automatic literature screening task, we evaluated the use of data sampling, feature combinations, and feature selection methods, generating a total of 105 classification models. The models yielding the best results were composed of a Logistic Model Trees classifier, a fairly balanced training set, and feature combination of Bag-Of-Words and MeSH terms. According to our results, the system correctly labels the great majority of relevant documents, making it usable to support HIV systematic reviews to allow researchers to assess a greater number of documents in less time.
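The feature combination mentioned in the abstract (Bag-Of-Words plus MeSH terms) can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `combine_features`, the `bow=` and `mesh=` prefixes, and the tokenization rule are assumptions made here for illustration only.

```python
import re
from collections import Counter


def combine_features(abstract, mesh_terms):
    """Merge bag-of-words counts with binary MeSH-term indicators.

    Hypothetical helper: the prefixes keep the two feature families
    from colliding when a word and a MeSH term share the same text.
    """
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", abstract.lower())
    # Bag-of-words: one count per distinct token.
    features = Counter(f"bow={token}" for token in tokens)
    # MeSH terms: presence/absence indicator per assigned term.
    for term in mesh_terms:
        features[f"mesh={term.lower()}"] = 1
    return dict(features)


example = combine_features(
    "HIV screening of HIV literature",
    ["HIV Infections"],
)
```

The resulting dictionary could be passed to any classifier that accepts sparse feature maps; the abstract reports Logistic Model Trees as the best-performing choice.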

mycoCLAP, the database for characterized lignocellulose-active proteins of fungal origin: resource and text mining curation support.

Strasser, Kimchi; McDonnell, Erin; Nyaga, Carol; Wu, Min; Wu, Sherry; Almeida, Hayda; Meurs, Marie-Jean; Kosseim, Leila; Powlowski, Justin; Butler, Greg; Tsang, Adrian.

Database (Oxford) ; 2015, 2015.

Article in English | MEDLINE | ID: mdl-25754864

ABSTRACT

Enzymes active on components of lignocellulosic biomass are used for industrial applications ranging from food processing to biofuels production. These include a diverse array of glycoside hydrolases, carbohydrate esterases, polysaccharide lyases and oxidoreductases. Fungi are prolific producers of these enzymes, spurring fungal genome sequencing efforts to identify and catalogue the genes that encode them. To facilitate the functional annotation of these genes, biochemical data on over 800 fungal lignocellulose-degrading enzymes have been collected from the literature and organized into the searchable database, mycoCLAP (http://mycoclap.fungalgenomics.ca). First implemented in 2011, and updated as described here, mycoCLAP is capable of ranking search results according to closest biochemically characterized homologues: this improves the quality of the annotation, and significantly decreases the time required to annotate novel sequences. The database is freely available to the scientific community, as are the open source applications based on natural language processing developed to support the manual curation of mycoCLAP. Database URL: http://mycoclap.fungalgenomics.ca.

Subjects

Data Mining ; Databases, Genetic ; Enzymes ; Fungal Proteins ; Genes, Fungal ; Lignin/metabolism ; Natural Language Processing ; Data Curation ; Enzymes/genetics ; Enzymes/metabolism ; Fungal Proteins/genetics ; Fungal Proteins/metabolism

Machine learning for biomedical literature triage.

Almeida, Hayda; Meurs, Marie-Jean; Kosseim, Leila; Butler, Greg; Tsang, Adrian.

PLoS One ; 9(12): e115892, 2014.

Article in English | MEDLINE | ID: mdl-25551575

ABSTRACT

This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models, by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine and Logistic Model Trees). The results show that the most fitting model to handle the imbalanced datasets of the triage classification task is obtained by using domain relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.

Subjects

Databases, Bibliographic ; Medical Informatics/methods ; Support Vector Machine ; Algorithms ; Bayes Theorem ; Decision Trees ; Models, Theoretical
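The under-sampling technique named in the abstract above can be sketched as follows. This is a generic random under-sampler, not the authors' code; the function name `undersample`, the `ratio` parameter (negatives kept per positive), and the toy data are assumptions introduced for illustration.

```python
import random


def undersample(documents, labels, ratio=1.0, seed=0):
    """Keep every positive document and a random subset of negatives.

    `ratio` is the hypothetical number of negatives retained per
    positive; 1.0 yields a balanced training set.
    """
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    # Never request more negatives than are available.
    n_keep = min(len(negatives), int(ratio * len(positives)))
    kept = sorted(positives + rng.sample(negatives, n_keep))
    return [documents[i] for i in kept], [labels[i] for i in kept]


# Toy imbalanced set: 2 relevant documents out of 10.
docs = [f"doc{i}" for i in range(10)]
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
sampled_docs, sampled_labels = undersample(docs, labels)
```

Under-sampling should be applied to the training split only, so that evaluation still reflects the original class imbalance.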
