
Computer-assisted curation of a human regulatory core network from the biological literature.

Thomas, Philippe; Durek, Pawel; Solt, Illés; Klinger, Bertram; Witzel, Franziska; Schulthess, Pascal; Mayer, Yvonne; Tikk, Domonkos; Blüthgen, Nils; Leser, Ulf.

Bioinformatics ; 31(8): 1258-66, 2015 Apr 15.

Article in English | MEDLINE | ID: mdl-25433699

ABSTRACT

MOTIVATION: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome-scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases together currently contain only 503 regulatory relations between human TFs. RESULTS: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences by a machine-learning approach. The top-2500 sentences contained ~900 sentences that encompass relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not present in a regulatory database before. Full-text curation allowed us to obtain detailed information on the strength of experimental evidence supporting a relationship. CONCLUSIONS: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state-of-the-art. AVAILABILITY AND IMPLEMENTATION: The web service is freely accessible at http://fastforward.sys-bio.net/. CONTACT: leser@informatik.hu-berlin.de or nils.bluethgen@charite.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Gene Regulatory Networks , Genome, Human , Information Storage and Retrieval/methods , MEDLINE , Neoplasms/metabolism , Transcription Factors/metabolism , Artificial Intelligence , Computer Simulation , Data Mining , Databases, Factual , Gene Expression Profiling , Gene Expression Regulation , Humans , Models, Biological , Neoplasms/classification , Neoplasms/genetics , Transcription Factors/genetics

A detailed error analysis of 13 kernel methods for protein-protein interaction extraction.

Tikk, Domonkos; Solt, Illés; Thomas, Philippe; Leser, Ulf.

BMC Bioinformatics ; 14: 12, 2013 Jan 16.

Article in English | MEDLINE | ID: mdl-23323857

ABSTRACT

BACKGROUND: Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level. RESULTS: We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles using dissimilar kernels leads to a significant performance gain. However, our analysis also reveals that characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance. CONCLUSIONS: Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.

Subjects

Algorithms , Protein Interaction Mapping/methods
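
The abstract above reports that ensembles built from dissimilar kernels improve performance. The following is a minimal sketch of one simple way to combine classifiers, plain majority voting. It is an illustrative assumption and not necessarily the ensemble scheme used by the authors; the names `kernel_a`, `kernel_b`, `kernel_c` and the toy labels are hypothetical.

```python
def majority_vote(predictions):
    """Combine binary predictions from several classifiers.

    `predictions` is a list of equally long lists of booleans, one list
    per classifier. A candidate pair is labelled positive when more than
    half of the classifiers vote positive.
    """
    n_classifiers = len(predictions)
    return [2 * sum(votes) > n_classifiers for votes in zip(*predictions)]


def accuracy(pred, gold):
    """Fraction of instances on which `pred` agrees with `gold`."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


# Hypothetical toy data: each classifier errs on a different instance,
# which mimics classifiers that are "dissimilar" in their mistakes.
gold = [True, True, False, False, True, False]
kernel_a = [False, True, False, False, True, False]  # wrong on instance 0
kernel_b = [True, True, True, False, True, False]    # wrong on instance 2
kernel_c = [True, True, False, False, False, False]  # wrong on instance 4

ensemble = majority_vote([kernel_a, kernel_b, kernel_c])
```

In this toy setting every individual classifier makes one error, while the vote recovers all gold labels because the errors do not overlap. Classifiers that share the same input representation would tend to make the same errors, and voting would then help much less.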

Improving textual medication extraction using combined conditional random fields and rule-based systems.

Tikk, Domonkos; Solt, Illés.

J Am Med Inform Assoc ; 17(5): 540-4, 2010.

Article in English | MEDLINE | ID: mdl-20819860

ABSTRACT

OBJECTIVE: In the i2b2 Medication Extraction Challenge, medication names together with details of their administration were to be extracted from medical discharge summaries. DESIGN: The task of the challenge was decomposed into three pipelined components: named entity identification, context-aware filtering and relation extraction. For named entity identification (NEI), a rule-based (RB) method that was used in our overall fifth place-ranked solution at the challenge was investigated first. Second, a conditional random fields (CRF) approach to NEI, developed after the completion of the challenge, is presented. The CRF models are trained on the 17 ground truth documents and on the output of the rule-based NEI component on all documents, a larger but potentially inaccurate training dataset. For both NEI approaches, their effect on relation extraction performance was investigated. The filtering and relation extraction components are both rule-based. MEASUREMENTS: In addition to the official entry level evaluation of the challenge, entity level analysis is also provided. RESULTS: On the test data an entry level F(1)-score of 80% was achieved for exact matching and 81% for inexact matching with the RB-NEI component. The CRF trained only on the ground truth documents produces a significantly weaker result, but with the additional, potentially inaccurate training data the CRF outperforms the rule-based model with 81% exact and 82% inexact F(1)-score (p<0.02). CONCLUSION: This study shows that a simple rule-based method is on a par with more complicated machine learners; CRF models can benefit from the addition of the potentially inaccurate training data when only very few training documents are available. Such training data could be generated using the outputs of rule-based methods.

Subjects

Electronic Health Records , Information Storage and Retrieval/methods , Natural Language Processing , Pharmaceutical Preparations , Humans , Pharmaceutical Preparations/administration & dosage , Software Design

Simple tricks for improving pattern-based information extraction from the biomedical literature.

Nguyen, Quang Long; Tikk, Domonkos; Leser, Ulf.

J Biomed Semantics ; 1(1): 9, 2010 Sep 24.

Article in English | MEDLINE | ID: mdl-20868467

ABSTRACT

BACKGROUND: Pattern-based approaches to relation extraction have shown very good results in many areas of biomedical text mining. However, defining the right set of patterns is difficult; approaches are either manual, incurring high cost, or automatic, often resulting in large sets of noisy patterns. RESULTS: We propose several techniques for filtering sets of automatically generated patterns and analyze their effectiveness for different extraction tasks, as defined in the recent BioNLP 2009 shared task. We focus on simple methods that only take into account the complexity of the pattern and the complexity of the texts the patterns are applied to. We show that our techniques, despite their simplicity, yield large improvements in all tasks we analyzed. For instance, they raise the F-score for the task of extracting gene expression events from 24.8% to 51.9%. CONCLUSIONS: Even very simple filtering techniques may significantly improve the F-score of an information extraction method based on automatically generated patterns. Furthermore, the application of such methods yields a considerable speed-up, as fewer matches need to be analysed. Due to their simplicity, the proposed filtering techniques should also be applicable to other methods using linguistic patterns for information extraction.
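
The abstract above describes filtering that looks only at the complexity of a pattern and the complexity of the text it is applied to. The sketch below illustrates that idea under strong assumptions: complexity is approximated by the number of whitespace-separated elements, and the thresholds `max_elements` and `max_tokens`, the pattern strings and the function names are all hypothetical. The actual complexity measures used by the authors are not given in the abstract.

```python
def pattern_complexity(pattern: str) -> int:
    """Assumed proxy for complexity: number of whitespace-separated elements."""
    return len(pattern.split())


def filter_patterns(patterns, max_elements=5):
    """Keep only patterns whose assumed complexity is at most `max_elements`."""
    return [p for p in patterns if pattern_complexity(p) <= max_elements]


def filter_sentences(sentences, max_tokens=30):
    """Skip overly long sentences before any pattern matching is attempted."""
    return [s for s in sentences if len(s.split()) <= max_tokens]


# Hypothetical automatically generated patterns (GENE is a placeholder slot).
patterns = [
    "GENE expression",
    "expression of GENE",
    "GENE was found to be in some way related to expression",
]
kept = filter_patterns(patterns)

sentences = ["IL-2 expression was induced .", " ".join(["token"] * 50)]
short_sentences = filter_sentences(sentences)
```

Dropping long patterns and long sentences before matching also reduces the number of match attempts, which is consistent with the speed-up mentioned in the abstract.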

A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature.

Tikk, Domonkos; Thomas, Philippe; Palaga, Peter; Hakenberg, Jörg; Leser, Ulf.

PLoS Comput Biol ; 6: e1000837, 2010 Jul 01.

Article in English | MEDLINE | ID: mdl-20617200

ABSTRACT

The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein-protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. Nevertheless, our study shows that three kernels are clearly superior to the other methods.

Subjects

Data Mining/methods , Databases, Protein , Natural Language Processing , Protein Interaction Mapping/methods , Proteins/classification , Algorithms , Area Under Curve , Decision Trees , Models, Molecular , Reproducibility of Results

Semantic classification of diseases in discharge summaries using a context-aware rule-based classifier.

Solt, Illés; Tikk, Domonkos; Gál, Viktor; Kardkovács, Zsolt T.

J Am Med Inform Assoc ; 16(4): 580-4, 2009.

Article in English | MEDLINE | ID: mdl-19390101

ABSTRACT

OBJECTIVE: Automated and disease-specific classification of textual clinical discharge summaries is of great importance in human life science, as it helps physicians to conduct medical studies by providing statistically relevant data for analysis. This can be further facilitated if, at the labeling of discharge summaries, semantic labels are also extracted from text, such as whether a given disease is present, absent, questionable in a patient, or is unmentioned in the document. The authors present a classification technique that successfully solves the semantic classification task. DESIGN: The authors introduce a context-aware rule-based semantic classification technique for use on clinical discharge summaries. The classification is performed in subsequent steps. First, some misleading parts are removed from the text; then the text is partitioned into positive, negative, and uncertain context segments; then a sequence of binary classifiers is applied to assign the appropriate semantic labels. MEASUREMENTS: For evaluation the authors used the documents of the i2b2 Obesity Challenge and adopted its evaluation measures: F(1)-macro and F(1)-micro. RESULTS: On the two subtasks of the Obesity Challenge (textual and intuitive classification) the system performed very well, and achieved an F(1)-macro = 0.80 for the textual and F(1)-macro = 0.67 for the intuitive tasks, and obtained second place at the textual and first place at the intuitive subtasks of the challenge. CONCLUSIONS: The authors show in the paper that a simple rule-based classifier can tackle the semantic classification task more successfully than machine learning techniques, if the training data are limited and some semantic labels are very sparse.

Subjects

Disease/classification , Natural Language Processing , Obesity , Patient Discharge , Artificial Intelligence , Classification/methods , Comorbidity , Humans , Semantics
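
The abstract above reports both F(1)-macro and F(1)-micro and notes that some semantic labels are very sparse. The sketch below shows the standard definitions of the two averages for single-label classification and why they diverge on sparse labels. The toy labels and the function names are hypothetical, and the exact averaging procedure of the i2b2 Obesity Challenge is not reproduced here.

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred):
    """Return (macro F1, micro F1) for two equally long label sequences."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all labels, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: a classifier that always answers "present".
gold = ["present"] * 8 + ["absent"] * 2
pred = ["present"] * 10
macro, micro = macro_micro_f1(gold, pred)
```

Here the micro average is 0.8 while the macro average is about 0.44, because the rare "absent" label is never predicted and contributes an F1 of zero. This is why macro-averaged scores are the more demanding measure when some labels are sparse.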
