Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 32
Filter
1.
Database (Oxford) ; 20232023 03 31.
Article in English | MEDLINE | ID: mdl-37002680

ABSTRACT

The curation of genomic variants requires collecting evidence not only in variant knowledge bases but also in the literature. However, some variants result in no match when searched in the scientific literature. Indeed, it has been reported that a significant subset of information related to genomic variants are not reported in the full text, but only in the supplementary materials associated with a publication. In the study, we present an evaluation of the use of supplementary data (SD) to improve the retrieval of relevant scientific publications for variant curation. Our experiments show that searching SD enables to significantly increase the volume of documents retrieved for a variant, thus reducing by ∼63% the number of variants for which no match is found in the scientific literature. SD thus represent a paramount source of information for curating variants of unknown significance and should receive more attention by global research infrastructures, which maintain literature search engines. Database URL https://www.expasy.org/resources/variomes.


Subject(s)
Genomics , Search Engine , Databases, Factual
2.
Stud Health Technol Inform ; 294: 839-843, 2022 May 25.
Article in English | MEDLINE | ID: mdl-35612222

ABSTRACT

The importance of genomic data for health is rapidly growing but accessing and gathering information about variants from different sources is hindered by highly heterogeneous representations of variants, as outlined by clinical associations (AMP/ASCO/CAP) in their recommendations. To enable a smooth and effective retrieval of variant-containing documents from different resources, we developed a tool (https://goldorak.hesge.ch/synvar/) that generates for any given SNP - including variant not present in existing databases - its corresponding description at the genome, transcript and protein levels. It provides variant descriptions in the HGVS format as well as in many non-standard formats found in the literature along with database identifiers. We present the SynVar service and evaluate its impact on the recall of a genomic variant curation-support service. Using SynVar to search variants in the literature enables to increase the recall by +133.8% without a strong impact on precision (i.e. 93%).


Subject(s)
Genomics , Databases, Factual
3.
Bioinformatics ; 38(9): 2595-2601, 2022 04 28.
Article in English | MEDLINE | ID: mdl-35274687

ABSTRACT

MOTIVATION: Identification and interpretation of clinically actionable variants is a critical bottleneck. Searching for evidence in the literature is mandatory according to ASCO/AMP/CAP practice guidelines; however, it is both labor-intensive and error-prone. We developed a system to perform triage of publications relevant to support an evidence-based decision. The system is also able to prioritize variants. Our system searches within pre-annotated collections such as MEDLINE and PubMed Central. RESULTS: We assess the search effectiveness of the system using three different experimental settings: literature triage; variant prioritization and comparison of Variomes with LitVar. Almost two-thirds of the publications returned in the top-5 are relevant for clinical decision-support. Our approach enabled identifying 81.8% of clinically actionable variants in the top-3. Variomes retrieves on average +21.3% more articles than LitVar and returns the same number of results or more results than LitVar for 90% of the queries when tested on a set of 803 queries; thus, establishing a new baseline for searching the literature about variants. AVAILABILITY AND IMPLEMENTATION: Variomes is publicly available at https://candy.hesge.ch/Variomes. Source code is freely available at https://github.com/variomes/sibtm-variomes. SynVar is publicly available at https://goldorak.hesge.ch/synvar. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Search Engine , Genomics/methods , Genome , PubMed , Software
4.
Stud Health Technol Inform ; 270: 884-888, 2020 Jun 16.
Article in English | MEDLINE | ID: mdl-32570509

ABSTRACT

The Swiss Variant Interpretation Platform for Oncology is a centralized, joint and curated database for clinical somatic variants piloted by a board of Swiss healthcare institutions and operated by the SIB Swiss Institute of Bioinformatics. To support this effort, SIB Text Mining designed a set of text analytics services. This report focuses on three of those services. First, the automatic annotations of the literature with a set of terminologies have been performed, resulting in a large annotated version of MEDLINE and PMC. Second, a generator of variant synonyms for single nucleotide variants has been developed using publicly available data resources, as well as patterns of non-standard formats, often found in the literature. Third, a literature ranking service enables to retrieve a ranked set of MEDLINE abstracts given a variant and optionally a diagnosis. The annotation of MEDLINE and PMC resulted in a total of respectively 785,181,199 and 1,156,060,212 annotations, which means an average of 26 and 425 annotations per abstract and full-text article. The generator of variant synonyms enables to retrieve up to 42 synonyms for a variant. The literature ranking service reaches a precision (P10) of 63%, which means that almost two-thirds of the top-10 returned abstracts are judged relevant. Further services will be implemented to complete this set of services, such as a service to retrieve relevant clinical trials for a patient and a literature ranking service for full-text articles.


Subject(s)
Computational Biology , Data Mining , Abstracting and Indexing , Humans , MEDLINE , Switzerland
5.
Nucleic Acids Res ; 48(W1): W12-W16, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32379317

ABSTRACT

Thanks to recent efforts by the text mining community, biocurators have now access to plenty of good tools and Web interfaces for identifying and visualizing biomedical entities in literature. Yet, many of these systems start with a PubMed query, which is limited by strong Boolean constraints. Some semantic search engines exploit entities for Information Retrieval, and/or deliver relevance-based ranked results. Yet, they are not designed for supporting a specific curation workflow, and allow very limited control on the search process. The Swiss Institute of Bioinformatics Literature Services (SIBiLS) provide personalized Information Retrieval in the biological literature. Indeed, SIBiLS allow fully customizable search in semantically enriched contents, based on keywords and/or mapped biomedical entities from a growing set of standardized and legacy vocabularies. The services have been used and favourably evaluated to assist the curation of genes and gene products, by delivering customized literature triage engines to different curation teams. SIBiLS (https://candy.hesge.ch/SIBiLS) are freely accessible via REST APIs and are ready to empower any curation workflow, built on modern technologies scalable with big data: MongoDB and Elasticsearch. They cover MEDLINE and PubMed Central Open Access enriched by nearly 2 billion of mapped biomedical entities, and are daily updated.


Subject(s)
Data Mining/methods , Search Engine , MEDLINE , Precision Medicine
6.
Database (Oxford) ; 20202020 01 01.
Article in English | MEDLINE | ID: mdl-32367111

ABSTRACT

In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.


Subject(s)
Deep Learning , Databases, Protein , Knowledge Bases , Molecular Sequence Annotation , Proteins/genetics
7.
Nucleic Acids Res ; 48(D1): D269-D276, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31713636

ABSTRACT

The Database of Protein Disorder (DisProt, URL: https://disprot.org) provides manually curated annotations of intrinsically disordered proteins from the literature. Here we report recent developments with DisProt (version 8), including the doubling of protein entries, a new disorder ontology, improvements of the annotation format and a completely new website. The website includes a redesigned graphical interface, a better search engine, a clearer API for programmatic access and a new annotation interface that integrates text mining technologies. The new entry format provides a greater flexibility, simplifies maintenance and allows the capture of more information from the literature. The new disorder ontology has been formalized and made interoperable by adopting the OWL format, as well as its structure and term definitions have been improved. The new annotation interface has made the curation process faster and more effective. We recently showed that new DisProt annotations can be effectively used to train and validate disorder predictors. We believe the growth of DisProt will accelerate, contributing to the improvement of function and disorder predictors and therefore to illuminate the 'dark' proteome.


Subject(s)
Databases, Protein , Intrinsically Disordered Proteins/chemistry , Biological Ontologies , Data Curation , Molecular Sequence Annotation
8.
Database (Oxford) ; 20182018 01 01.
Article in English | MEDLINE | ID: mdl-30576492

ABSTRACT

The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the thevarious curation tasks: document triage, entity recognition and information extraction.Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations.The evaluation of document triage by neXtA5 precision showed that both curators agree with neXtA5 for 67 (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory.For concept extraction, curators approved 35 (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36 (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59 (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.


Subject(s)
Data Curation/methods , Data Mining/methods , Databases, Factual , Software , Humans
10.
Stud Health Technol Inform ; 235: 186-190, 2017.
Article in English | MEDLINE | ID: mdl-28423780

ABSTRACT

Identifying similar patients might greatly facilitate the treatment of a given patient, enabling to observe the response and outcome to a particular treatment. Case-based retrieval services dealing with natural language processing are of major importance to deal with the significant amount of unstructured clinical data. In this paper, we present the development and evaluation of a case-based retrieval (CBR) service tested on a collection of Italian pediatric cardiology cases. Cases are indexed and a search engine is proposed. Search functionalities, such as interactive MeSH normalization and relevance feedback, are proposed. While the qualitative evaluation aims to provide feedback and recommendations, the quantitative evaluation enables to estimate the precision of the system. In more than half of the cases and for up to two thirds of them, the system is able to suggest a similar episode of care at first rank. With an improvement of the feedback relevance strategy, we can expect an improvement of the precision. The CBR can be expanded to multilingual EHR and other fields.


Subject(s)
Electronic Health Records , Information Storage and Retrieval/methods , Natural Language Processing , Child , Female , Heart Diseases/diagnosis , Heart Diseases/therapy , Humans , Italy , Male , Medical Subject Headings , Search Engine
11.
Article in English | MEDLINE | ID: mdl-27374119

ABSTRACT

The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. Our system, neXtA5, is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE over some predefined axes. This report focuses on three axes: Diseases, the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named-entities as stored in the Biomed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular protein-protein interactions, which require specific relationship extraction capabilities. In parallel, user-friendly interfaces powered with a set of JSON web services are currently being implemented into the neXtProt annotation pipeline.Available on: http://babar.unige.ch:8082/neXtA5Database URL: http://babar.unige.ch:8082/neXtA5/fetcher.jsp.


Subject(s)
Data Curation/methods , Data Mining/methods , Electronic Data Processing/methods , MEDLINE , Search Engine/methods
12.
Article in English | MEDLINE | ID: mdl-26384372

ABSTRACT

Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this perspective, question-answering (QA) engines are designed to display answers, which were automatically extracted from the retrieved documents. Standard QA engines in literature process a user question, then retrieve relevant documents and finally extract some possible answers out of these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step, and exploits curated biological data to infer answers, which are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/.


Subject(s)
Data Mining/methods , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Software , Animals , Humans
13.
Article in English | MEDLINE | ID: mdl-25190367

ABSTRACT

Gene function curation of the literature with Gene Ontology (GO) concepts is one particularly time-consuming task in genomics, and the help from bioinformatics is highly requested to keep up with the flow of publications. In 2004, the first BioCreative challenge already designed a task of automatic GO concepts assignment from a full text. At this time, results were judged far from reaching the performances required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of lack of training data. Ten years later, the available curation data have massively grown. In 2013, the BioCreative IV GO task revisited the automatic GO assignment task. For this issue, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. The subtask A consisted in selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. The subtask B consisted in predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% for hierarchical recall in the top 20 outputted concepts. Contrary to previous competitions, machine learning has this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, this system is able to select evidence sentences for curation and to deliver highly relevant GO concepts. Contrary to previous competitions, machine learning this time outperformed dictionary-based systems. Observed performances are sufficient for being used in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/. DATABASE URL: http://eagl.unige.ch/GOCat4FT/.


Subject(s)
Computational Biology/methods , Data Curation/methods , Gene Ontology , Molecular Sequence Annotation/methods , Proteins/chemistry , Proteins/classification , Software
14.
Stud Health Technol Inform ; 197: 29-33, 2014.
Article in English | MEDLINE | ID: mdl-24743073

ABSTRACT

Employing the bridge between Clinical Information System (CIS) and Clinical Research Environment (CRE) can provide functionality, which is not easily, implemented by traditional legacy EHR system. In this paper, the experience of such implementation at the University Hospitals of Geneva is described. General overview of the mapping of extracted from CIS data to the i2b2 Clinical Data Warehouse is provided. The defined implementation manages to provide the interoperability for the CRE.


Subject(s)
Biomedical Research , Electronic Health Records , Information Storage and Retrieval/methods , Medical Informatics/methods , Medical Record Linkage/methods , Natural Language Processing , Switzerland , Systems Integration
15.
BMC Bioinformatics ; 15 Suppl 1: S15, 2014.
Article in English | MEDLINE | ID: mdl-24564220

ABSTRACT

BACKGROUND: The large increase in the size of patent collections has led to the need of efficient search strategies. But the development of advanced text-mining applications dedicated to patents of the biomedical field remains rare, in particular to address the needs of the pharmaceutical & biotech industry, which intensively uses patent libraries for competitive intelligence and drug development. METHODS: We describe here the development of an advanced retrieval engine to search information in patent collections in the field of medicinal chemistry. We investigate and combine different strategies and evaluate their respective impact on the performance of the search engine applied to various search tasks, which covers the putatively most frequent search behaviours of intellectual property officers in medical chemistry: 1) a prior art search task; 2) a technical survey task; and 3) a variant of the technical survey task, sometimes called known-item search task, where a single patent is targeted. RESULTS: The optimal tuning of our engine resulted in a top-precision of 6.76% for the prior art search task, 23.28% for the technical survey task and 46.02% for the variant of the technical survey task. We observed that co-citation boosting was an appropriate strategy to improve prior art search tasks, while IPC classification of queries was improving retrieval effectiveness for technical survey tasks. Surprisingly, the use of the full body of the patent was always detrimental for search effectiveness. It was also observed that normalizing biomedical entities using curated dictionaries had simply no impact on the search tasks we evaluate. The search engine was finally implemented as a web-application within Novartis Pharma. The application is briefly described in the report. CONCLUSIONS: We have presented the development of a search engine dedicated to patent search, based on state of the art methods applied to patent corpora. We have shown that a proper tuning of the system to adapt to the various search tasks clearly increases the effectiveness of the system. We conclude that different search tasks demand different information retrieval engines' settings in order to yield optimal end-user retrieval.


Subject(s)
Chemistry, Pharmaceutical , Patents as Topic , Search Engine/methods , Algorithms , Information Storage and Retrieval , Internet , Small Molecule Libraries
16.
Stud Health Technol Inform ; 192: 1068, 2013.
Article in English | MEDLINE | ID: mdl-23920842

ABSTRACT

The high heterogeneity of biomedical vocabulary is a major obstacle for information retrieval in large biomedical collections. Therefore, using biomedical controlled vocabularies is crucial for managing these contents. We investigate the impact of query expansion based on controlled vocabularies to improve the effectiveness of two search engines. Our strategy relies on the enrichment of users' queries with additional terms, directly derived from such vocabularies applied to infectious diseases and chemical patents. We observed that query expansion based on pathogen names resulted in improvements of the top-precision of our first search engine, while the normalization of diseases degraded the top-precision. The expansion of chemical entities, which was performed on the second search engine, positively affected the mean average precision. We have shown that query expansion of some types of biomedical entities has a great potential to improve search effectiveness; therefore a fine-tuning of query expansion strategies could help improving the performances of search engines.


Subject(s)
Data Mining/methods , Database Management Systems , Databases, Factual , Medical Subject Headings , Natural Language Processing , Pattern Recognition, Automated/methods , Terminology as Topic , Artificial Intelligence
17.
Database (Oxford) ; 2013: bat041, 2013.
Article in English | MEDLINE | ID: mdl-23842461

ABSTRACT

The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so called thesaurus-based--or dictionary-based--approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), our thesaurus-based system effectiveness is rather constant, reaching from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, reaching from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are more and more efficient to provide assistance to biologists. DATABASE URL: http://eagl.unige.ch/GOCat/


Subject(s)
Data Mining/methods , Databases, Genetic , Molecular Sequence Annotation , Algorithms , Knowledge Bases
18.
PLoS One ; 8(4): e62874, 2013.
Article in English | MEDLINE | ID: mdl-23646153

ABSTRACT

BACKGROUND: Improving antibiotic prescribing practices is an important public-health priority given the widespread antimicrobial resistance. Establishing clinical practice guidelines is crucial to this effort, but their development is a complex task and their quality is directly related to the methodology and source of knowledge used. OBJECTIVE: We present the design and the evaluation of a tool (KART) that aims to facilitate the creation and maintenance of clinical practice guidelines based on information retrieval techniques. METHODS: KART consists of three main modules 1) a literature-based medical knowledge extraction module, which is built upon a specialized question-answering engine; 2) a module to normalize clinical recommendations based on automatic text categorizers; and 3) a module to manage clinical knowledge, which formalizes and stores clinical recommendations for further use. The evaluation of the usability and utility of KART followed the methodology of the cognitive walkthrough. RESULTS: KART was designed and implemented as a standalone web application. The quantitative evaluation of the medical knowledge extraction module showed that 53% of the clinical recommendations generated by KART are consistent with existing clinical guidelines. The user-based evaluation confirmed this result by showing that KART was able to find a relevant antibiotic for half of the clinical scenarios tested. The automatic normalization of the recommendation produced mixed results among end-users. CONCLUSIONS: We have developed an innovative approach for the process of clinical guidelines development and maintenance in a context where available knowledge is increasing at a rate that cannot be sustained by humans. In contrast to existing knowledge authoring tools, KART not only provides assistance to normalize, formalize and store clinical recommendations, but also aims to facilitate knowledge building.


Subject(s)
Search Engine , Software , Anti-Bacterial Agents/therapeutic use , Drug Prescriptions/standards , Evidence-Based Medicine/education , Evidence-Based Medicine/standards , Humans , Internet , Practice Guidelines as Topic/standards
19.
Stud Health Technol Inform ; 186: 155-9, 2013.
Article in English | MEDLINE | ID: mdl-23542988

ABSTRACT

With the vast amount of biomedical data we face the necessity to improve information retrieval processes in biomedical domain. The use of biomedical ontologies facilitated the combination of various data sources (e.g. scientific literature, clinical data repository) by increasing the quality of information retrieval and reducing the maintenance efforts. In this context, we developed Ontology Look-up services (OLS), based on NEWT and MeSH vocabularies. Our services were involved in some information retrieval tasks such as gene/disease normalization. The implementation of OLS services significantly accelerated the extraction of particular biomedical facts by structuring and enriching the data context. The results of precision in normalization tasks were boosted on about 20%.


Subject(s)
Abstracting and Indexing/methods , Data Mining/methods , Database Management Systems , Databases, Bibliographic , Medical Subject Headings , Natural Language Processing , Periodicals as Topic , Semantics , User-Computer Interface
20.
Database (Oxford) ; 2012: bas050, 2012.
Article in English | MEDLINE | ID: mdl-23221176

ABSTRACT

We report on the original integration of an automatic text categorization pipeline, so-called ToxiCat (Toxicogenomic Categorizer), that we developed to perform biomedical documents classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can be basically described as a binary classification task, where a scoring function is used to rank a selected set of articles. Then components of a question-answering system are used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign and finally, a set of answering components and entity recognizer for diseases and chemicals. The main components of the pipeline are publicly available both as web application and web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.


Subject(s)
Data Mining/methods , Databases, Genetic/classification , Periodicals as Topic , Toxicogenetics , Internet , Molecular Sequence Annotation , Semantics , Support Vector Machine , Workflow
SELECTION OF CITATIONS
SEARCH DETAIL
...