Search | BVS Regional Portal (test)

Computer-assisted curation of a human regulatory core network from the biological literature.

Thomas, Philippe; Durek, Pawel; Solt, Illés; Klinger, Bertram; Witzel, Franziska; Schulthess, Pascal; Mayer, Yvonne; Tikk, Domonkos; Blüthgen, Nils; Leser, Ulf.

Bioinformatics ; 31(8): 1258-66, 2015 Apr 15.

Article in English | MEDLINE | ID: mdl-25433699

ABSTRACT

MOTIVATION: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome-scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases currently contain together only 503 regulatory relations between human TFs. RESULTS: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences by a machine-learning approach. The top-2500 sentences contained ~900 sentences that encompass relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not present in a regulatory database before. Full-text curation allowed us to obtain detailed information on the strength of experimental evidence supporting a relationship. CONCLUSIONS: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state-of-the-art. AVAILABILITY AND IMPLEMENTATION: Web-service is freely accessible at http://fastforward.sys-bio.net/. CONTACT: leser@informatik.hu-berlin.de or nils.bluethgen@charite.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Gene Regulatory Networks; Genome, Human; Information Storage and Retrieval/methods; MEDLINE; Neoplasms/metabolism; Transcription Factors/metabolism; Artificial Intelligence; Computer Simulation; Data Mining; Databases, Factual; Gene Expression Profiling; Gene Expression Regulation; Humans; Models, Biological; Neoplasms/classification; Neoplasms/genetics; Transcription Factors/genetics

A detailed error analysis of 13 kernel methods for protein-protein interaction extraction.

Tikk, Domonkos; Solt, Illés; Thomas, Philippe; Leser, Ulf.

BMC Bioinformatics ; 14: 12, 2013 Jan 16.

Article in English | MEDLINE | ID: mdl-23323857

ABSTRACT

BACKGROUND: Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level. RESULTS: We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles using dissimilar kernels leads to significant performance gain. However, our analysis also reveals that characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same line as current ones, will deliver breakthroughs in extraction performance. CONCLUSIONS: Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements shall be sought after rather in novel feature sets than in novel kernel functions.
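The instance-level view described in this abstract can be illustrated with a small sketch. It assumes each method outputs one boolean prediction per candidate protein pair and combines methods by simple majority vote; the abstract does not say how the authors built their ensembles, so the voting rule, the function names, and the toy data are all illustrative assumptions.

```python
def majority_vote(predictions):
    """Combine per-method boolean predictions by simple majority.

    predictions: list of lists, one inner list per method,
    each holding one bool per candidate pair.
    """
    n_methods = len(predictions)
    return [sum(votes) * 2 > n_methods for votes in zip(*predictions)]


def misclassification_rate(predictions, gold):
    """Fraction of methods that get each pair wrong (1.0 = all methods fail)."""
    n_methods = len(predictions)
    return [
        sum(p != g for p in votes) / n_methods
        for votes, g in zip(zip(*predictions), gold)
    ]


# Toy data (hypothetical): three methods, four candidate pairs.
preds = [
    [True, False, True, False],
    [True, True, False, False],
    [False, True, True, False],
]
gold = [True, False, True, True]

ensemble = majority_vote(preds)
rates = misclassification_rate(preds, gold)
# "Difficult" pairs: misclassified by most methods.
difficult = [i for i, r in enumerate(rates) if r > 0.5]
print(ensemble, difficult)
```

In this toy case the last pair is missed by every method, so no vote among them can recover it, which mirrors the abstract's point that ensembles help only where the combined methods disagree.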

Subjects

Algorithms; Protein Interaction Mapping/methods

The gene normalization task in BioCreative III.

Lu, Zhiyong; Kao, Hung-Yu; Wei, Chih-Hsuan; Huang, Minlie; Liu, Jingchen; Kuo, Cheng-Ju; Hsu, Chun-Nan; Tsai, Richard Tzong-Han; Dai, Hong-Jie; Okazaki, Naoaki; Cho, Han-Cheol; Gerner, Martin; Solt, Illes; Agarwal, Shashank; Liu, Feifan; Vishnyakova, Dina; Ruch, Patrick; Romacker, Martin; Rinaldi, Fabio; Bhattacharya, Sanmitra; Srinivasan, Padmini; Liu, Hongfang; Torii, Manabu; Matos, Sergio; Campos, David; Verspoor, Karin; Livingston, Kevin M; Wilbur, W John.

BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.

Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
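The relative improvements quoted in this abstract follow directly from the reported TAP-k scores. A minimal check, using only the figures given in the abstract (the variable and function names are illustrative):

```python
# TAP-k scores reported in the abstract for the 50 gold-standard articles.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}
print(improvements)  # {5: 12.4, 10: 21.8, 20: 26.6}
```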

Subjects

Algorithms; Data Mining/methods; Genes; Animals; Data Mining/standards; Humans; National Library of Medicine (U.S.); Periodicals as Topic; United States

The GNAT library for local and remote gene mention normalization.

Hakenberg, Jörg; Gerner, Martin; Haeussler, Maximilian; Solt, Illés; Plake, Conrad; Schroeder, Michael; Gonzalez, Graciela; Nenadic, Goran; Bergman, Casey M.

Bioinformatics ; 27(19): 2769-71, 2011 Oct 01.

Article in English | MEDLINE | ID: mdl-21813477

ABSTRACT

SUMMARY: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987. AVAILABILITY: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net. CONTACT: jorg.hakenberg@roche.com.

Subjects

Data Mining; Gene Library; Electronic Data Processing; Genes; Internet; Proteins; Publishing; Terminology as Topic

Improving textual medication extraction using combined conditional random fields and rule-based systems.

Tikk, Domonkos; Solt, Illés.

J Am Med Inform Assoc ; 17(5): 540-4, 2010.

Article in English | MEDLINE | ID: mdl-20819860

ABSTRACT

OBJECTIVE: In the i2b2 Medication Extraction Challenge, medication names together with details of their administration were to be extracted from medical discharge summaries. DESIGN: The task of the challenge was decomposed into three pipelined components: named entity identification (NEI), context-aware filtering and relation extraction. For NEI, first a rule-based (RB) method, used in our overall fifth-place solution at the challenge, was investigated. Second, a conditional random fields (CRF) approach to NEI, developed after the completion of the challenge, is presented. The CRF models are trained on the 17 ground truth documents and on the output of the rule-based NEI component on all documents, a larger but potentially inaccurate training dataset. For both NEI approaches, their effect on relation extraction performance was investigated. The filtering and relation extraction components are both rule-based. MEASUREMENTS: In addition to the official entry level evaluation of the challenge, entity level analysis is also provided. RESULTS: On the test data an entry level F(1)-score of 80% was achieved for exact matching and 81% for inexact matching with the RB-NEI component. Trained on the ground truth documents alone, the CRF produces a significantly weaker result, but with the rule-based output added as training data, the CRF outperforms the rule-based model with 81% exact and 82% inexact F(1)-score (p<0.02). CONCLUSION: This study shows that a simple rule-based method is on a par with more complicated machine learners; CRF models can benefit from the addition of the potentially inaccurate training data, when only very few training documents are available. Such training data could be generated using the outputs of rule-based methods.

Subjects

Electronic Health Records; Information Storage and Retrieval/methods; Natural Language Processing; Pharmaceutical Preparations; Humans; Pharmaceutical Preparations/administration & dosage; Software Design

Semantic classification of diseases in discharge summaries using a context-aware rule-based classifier.

Solt, Illés; Tikk, Domonkos; Gál, Viktor; Kardkovács, Zsolt T.

J Am Med Inform Assoc ; 16(4): 580-4, 2009.

Article in English | MEDLINE | ID: mdl-19390101

ABSTRACT

OBJECTIVE: Automated and disease-specific classification of textual clinical discharge summaries is of great importance in human life science, as it helps physicians to conduct medical studies by providing statistically relevant data for analysis. This can be further facilitated if, at the labeling of discharge summaries, semantic labels are also extracted from text, such as whether a given disease is present, absent, questionable in a patient, or is unmentioned in the document. The authors present a classification technique that successfully solves the semantic classification task. DESIGN: The authors introduce a context-aware rule-based semantic classification technique for use on clinical discharge summaries. The classification is performed in successive steps. First, some misleading parts are removed from the text; then the text is partitioned into positive, negative, and uncertain context segments; finally, a sequence of binary classifiers is applied to assign the appropriate semantic labels. MEASUREMENTS: For evaluation the authors used the documents of the i2b2 Obesity Challenge and adopted its evaluation measures, F(1)-macro and F(1)-micro. RESULTS: On the two subtasks of the Obesity Challenge (textual and intuitive classification) the system performed very well, achieving an F(1)-macro = 0.80 for the textual and F(1)-macro = 0.67 for the intuitive task, and obtained second place at the textual and first place at the intuitive subtask of the challenge. CONCLUSIONS: The authors show in the paper that a simple rule-based classifier can tackle the semantic classification task more successfully than machine learning techniques, if the training data are limited and some semantic labels are very sparse.
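The three steps named in the DESIGN section (remove misleading parts, partition into context segments, apply a sequence of binary decisions) can be sketched as follows. This is a toy illustration only: the trigger patterns, the sentence-level granularity of the segments, and the order of the binary decisions are assumptions made for the example, not the authors' actual rules, which the abstract does not give.

```python
import re

# Hypothetical trigger patterns; the real system's rules are not in the abstract.
MISLEADING = re.compile(r"family history", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)
UNCERTAIN = re.compile(r"\b(possible|probable|suspected|rule out)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive sentence splitter; each sentence is treated as one context segment."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_of(sentence):
    """Assign a segment to a negative, uncertain, or positive context."""
    if NEGATION.search(sentence):
        return "negative"
    if UNCERTAIN.search(sentence):
        return "uncertain"
    return "positive"


def classify(text, disease):
    # Step 1: remove misleading parts (here: family-history sentences).
    sentences = [s for s in split_sentences(text) if not MISLEADING.search(s)]
    # Step 2: collect the contexts of the segments that mention the disease.
    contexts = {context_of(s) for s in sentences if disease.lower() in s.lower()}
    # Step 3: a sequence of binary decisions (order is an assumption).
    if "positive" in contexts:
        return "present"
    if "negative" in contexts:
        return "absent"
    if "uncertain" in contexts:
        return "questionable"
    return "unmentioned"


note = "Patient has obesity. No diabetes. Possible asthma. Family history of gout."
for disease in ("obesity", "diabetes", "asthma", "gout"):
    print(disease, classify(note, disease))
```

On the sample note, the four diseases receive the four labels named in the abstract: present, absent, questionable, and unmentioned.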

Subjects

Disease/classification; Natural Language Processing; Obesity; Patient Discharge; Artificial Intelligence; Classification/methods; Comorbidity; Humans; Semantics
