1.
Bioinformatics ; 30(14): 1974-82, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24681905

ABSTRACT

MOTIVATION: Post-translational modifications (PTMs) are important steps in the maturation of proteins. Several models exist to predict specific PTMs, from manually detected patterns to machine learning methods. On one hand, the manual detection of patterns does not provide the most efficient classifiers and requires a substantial workload, and on the other hand, models built by machine learning methods are hard to interpret and do not increase biological knowledge. Therefore, we developed a novel method based on pattern discovery and decision trees to predict PTMs. The proposed algorithm builds a decision tree by coupling the C4.5 algorithm with genetic algorithms, producing high-performance white box classifiers. Our method was tested on the initiator methionine cleavage (IMC) and N(α)-terminal acetylation (N-Ac), two of the most common PTMs. RESULTS: The resulting classifiers perform well when compared with existing models. On a set of eukaryotic proteins, they display a cross-validated Matthews correlation coefficient of 0.83 (IMC) and 0.65 (N-Ac). When used to predict potential substrates of N-terminal acetyltransferase B and N-terminal acetyltransferase C, our classifiers display better performance than the state of the art. Moreover, we present an analysis of the model predicting IMC for Homo sapiens proteins and demonstrate that we are able to extract experimentally known facts without prior knowledge. Those results validate the fact that our method produces white box models. AVAILABILITY AND IMPLEMENTATION: Predictors for IMC and N-Ac and all datasets are freely available at http://terminus.unige.ch/.
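The abstract above reports performance as a Matthews correlation coefficient (MCC). As a reading aid, here is a minimal sketch of how an MCC is computed from confusion-matrix counts. The function name and the convention of returning 0.0 for a zero denominator are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when a marginal total is zero (a common convention,
    assumed here), since the coefficient is undefined in that case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it useful when the positive and negative classes are unbalanced.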


Subject(s)
Protein Processing, Post-Translational , Sequence Analysis, Protein/methods , Acetylation , Acetyltransferases/metabolism , Algorithms , Amino Acid Motifs , Artificial Intelligence , Humans , Methionine/metabolism , Proteins/metabolism , Software
2.
BMC Bioinformatics ; 14: 104, 2013 Mar 22.
Article in English | MEDLINE | ID: mdl-23517090

ABSTRACT

BACKGROUND: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system that extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. RESULTS: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large scale proteomics experiments. CONCLUSIONS: The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.
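To make the idea of pattern-matching extraction of modification type and site concrete, here is a small sketch. The regular expressions, the list of PTM terms, and the function name are all hypothetical choices for illustration; the actual patterns and rules of the published system are not given in the abstract.

```python
import re

# Hypothetical trigger terms and site pattern; a real system would use far more.
PTM_TERMS = re.compile(
    r"(phosphorylat\w*|acetylat\w*|methylat\w*|ubiquitinat\w*|glycosylat\w*)",
    re.IGNORECASE,
)
SITE = re.compile(r"\b(Ser|Thr|Tyr|Lys|Arg|Asn)[- ]?(\d+)\b")


def extract_ptm_sentences(text):
    """Return sentences that mention both a PTM term and a residue site."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        ptm = PTM_TERMS.search(sentence)
        site = SITE.search(sentence)
        if ptm and site:
            hits.append(
                {
                    "sentence": sentence,
                    "ptm": ptm.group(1).lower(),
                    "residue": site.group(1),
                    "position": int(site.group(2)),
                }
            )
    return hits
```

Requiring both a modification term and a site in the same sentence favours precision over recall, which is one plausible design trade-off for a curation aid.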


Subject(s)
Data Mining/methods , Databases, Protein , Knowledge Bases , Protein Processing, Post-Translational , Humans , Molecular Sequence Annotation , Proteomics
3.
Database (Oxford) ; 2012: bas020, 2012.
Article in English | MEDLINE | ID: mdl-22513129

ABSTRACT

Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provides a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.


Subject(s)
Biomedical Research , Data Mining , Natural Language Processing , Workflow , Animals , Databases, Factual , Humans
4.
Bioinformatics ; 26(6): 851-2, 2010 Mar 15.
Article in English | MEDLINE | ID: mdl-20106818

ABSTRACT

SUMMARY: The SwissVar portal provides access to a comprehensive collection of single amino acid polymorphisms and diseases in the UniProtKB/Swiss-Prot database via a unique search engine. In particular, it gives direct access to the newly improved Swiss-Prot variant pages. The key strength of this portal is that it makes it possible to query for similar diseases, as well as the underlying protein products and the molecular details of each variant. In the context of the recently proposed molecular view on diseases, the SwissVar portal should be in a unique position to provide valuable information for researchers and to advance research in this area. AVAILABILITY: The SwissVar portal is available at www.expasy.org/swissvar. CONTACT: anais.mottaz@isb-sib.ch; lina.yip@isb-sib.ch. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Amino Acids/chemistry , Databases, Protein , Phenotype , Polymorphism, Single Nucleotide , Proteins/chemistry , Proteomics/methods , Amino Acid Sequence
5.
BMC Bioinformatics ; 9 Suppl 3: S9, 2008 Apr 11.
Article in English | MEDLINE | ID: mdl-18426554

ABSTRACT

BACKGROUND: This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Functions) as defined in ENTREZ-Gene based on a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. In the suggested approach we merge two independent sentence extraction strategies. The first proposed strategy (LASt) uses argumentative features, inspired by discourse-analysis models. The second extraction scheme (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence; thus providing a full ranking of all possible candidate GeneRiFs. A combination of the two approaches is proposed, which also aims at reducing the size of the selected segment by filtering out non-content bearing rhetorical phrases. RESULTS: Based on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy is already competitive (52.78%). When used in a combined approach, the extraction task clearly shows improvement, achieving a Dice score of over 57% (+10%). CONCLUSIONS: Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.
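The results above are reported as a Dice score between the extracted sentence and the reference GeneRiF. The sketch below shows the classic token-set form of the Dice coefficient. Whitespace tokenization and lowercasing are simplifying assumptions, and the evaluation described in the abstract may have used a different variant.

```python
def dice_score(candidate, reference):
    """Dice coefficient between the token sets of two strings.

    Tokens are lowercased, whitespace-separated words (an assumption
    made for this sketch).
    """
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))
```

A score of 1.0 means the two token sets are identical, and 0.0 means they share no tokens.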


Subject(s)
Algorithms , Genes/genetics , MEDLINE , Natural Language Processing , Pattern Recognition, Automated/methods , Proteins/classification , Proteins/genetics , Artificial Intelligence , Sensitivity and Specificity , Terminology as Topic , Vocabulary, Controlled
6.
BMC Bioinformatics ; 9 Suppl 5: S3, 2008 Apr 29.
Article in English | MEDLINE | ID: mdl-18460185

ABSTRACT

BACKGROUND: Although the UniProt KnowledgeBase is not a medically oriented database, it contains information on more than 2,000 human proteins involved in pathologies. However, these annotations are not standardized, which impairs the interoperability between biological and clinical resources. In order to make these data easily accessible to clinical researchers, we have developed a procedure to link diseases described in the UniProtKB/Swiss-Prot entries to the MeSH disease terminology. RESULTS: We mapped disease names extracted either from the UniProtKB/Swiss-Prot entry comment lines or from the corresponding OMIM entry to MeSH. Different methods were assessed on a benchmark set of 200 disease names manually mapped to MeSH terms. The performance of the retained procedure in terms of precision and recall was 86% and 64% respectively. Using the same procedure, more than 3,000 disease names in Swiss-Prot were mapped to MeSH with comparable efficiency. CONCLUSIONS: This study is a first attempt to link proteins in UniProtKB to medical resources. The indexing we provided will help clinicians and researchers navigate from diseases to genes and from genes to diseases in an efficient way. The mapping is available at: http://research.isb-sib.ch/unimed.


Subject(s)
Databases, Protein , Disease , Semantics , Terminology as Topic , Biomedical Research/methods , Biomedical Research/organization & administration , Humans , Knowledge Bases , Medical Subject Headings , Proteomics/methods , Systems Integration
7.
Stud Health Technol Inform ; 129(Pt 1): 710-5, 2007.
Article in English | MEDLINE | ID: mdl-17911809

ABSTRACT

PROBLEM: Automatic keyword assignment has been widely studied in medical informatics in the context of the MEDLINE database, both to support searching in MEDLINE and to provide an indicative "gist" of the content of an article. Automatic assignment of Medical Subject Headings (MeSH), which is formally an automatic text categorization task, has been proposed using different methods or combinations of methods, including machine learning (naïve Bayes, neural networks, etc.), linguistically motivated methods (syntactic parsing, semantic tagging), or information retrieval. METHODS: In the present study, we propose to evaluate the impact of the argumentative structures of scientific articles to improve the categorization effectiveness of a categorizer, which combines linguistically motivated and information retrieval methods. Our argumentative categorizer, which uses representation levels inherited from the field of discourse analysis, is able to classify sentences of an abstract in four classes: PURPOSE; METHODS; RESULTS and CONCLUSION. For the evaluation, the OHSUMED collection, a sample of MEDLINE, is used as a benchmark. For each abstract in the collection, the result of the argumentative classifier, i.e. the labeling of each sentence with an argumentative class, is used to modify the original ranking of the MeSH categorizer. RESULTS: The most effective combination (+2%, p<0.003) strongly overweights the METHODS section and moderately the RESULTS and CONCLUSION sections. CONCLUSION: Although modest, the improvement brought by argumentative features for text categorization confirms that discourse analysis methods could benefit text mining in scientific digital libraries.
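One way to picture how sentence-level argumentative labels can modify a category ranking is a weighted sum of per-sentence scores. The sketch below is only an illustration: the data layout, the function name, and the weight values are hypothetical, and the abstract does not state how the combination was actually computed.

```python
def rerank(sentence_scores, weights):
    """Re-rank categories by summing class-weighted per-sentence scores.

    sentence_scores: list of (argumentative_class, {category: score}) pairs.
    weights: {argumentative_class: multiplier}; unlisted classes get 1.0.
    Returns category names, highest total first (ties broken alphabetically).
    """
    totals = {}
    for arg_class, category_scores in sentence_scores:
        weight = weights.get(arg_class, 1.0)
        for category, score in category_scores.items():
            totals[category] = totals.get(category, 0.0) + weight * score
    return sorted(totals, key=lambda c: (-totals[c], c))


# Illustrative weights only: METHODS boosted strongly, RESULTS and
# CONCLUSION moderately, loosely mirroring the reported best combination.
EXAMPLE_WEIGHTS = {"METHODS": 2.0, "RESULTS": 1.2, "CONCLUSION": 1.2}
```

With an empty weight table the function reduces to the unweighted ranking, which makes it easy to compare the two orderings on the same input.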


Subject(s)
Abstracting and Indexing/methods , MEDLINE , Natural Language Processing , Information Storage and Retrieval , Libraries, Digital , Medical Subject Headings
8.
Int J Med Inform ; 76(2-3): 195-200, 2007.
Article in English | MEDLINE | ID: mdl-16815739

ABSTRACT

PROBLEM: Key word assignment has been widely used in MEDLINE to provide an indicative "gist" of the content of articles and to help retrieve biomedical articles. Abstracts are also used for this purpose. However, with usually more than 300 words, MEDLINE abstracts can still be regarded as long documents; therefore we design a system to select a unique key sentence. This key sentence must be indicative of the article's content and we assume that an abstract's conclusions are good candidates. We design and assess the performance of an automatic key sentence selector, which classifies sentences into four argumentative moves: PURPOSE, METHODS, RESULTS and CONCLUSION. METHODS: We rely on Bayesian classifiers trained on automatically acquired data. Feature representation, selection and weighting are reported and classification effectiveness is evaluated on the four classes using confusion matrices. We also explore the use of simple heuristics to take the position of sentences into account. Recall, precision and F-scores are computed for the CONCLUSION class. For the CONCLUSION class, the F-score reaches 84%. Automatic argumentative classification using Bayesian learners is feasible on MEDLINE abstracts and should help user navigation in such repositories.
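The abstract evaluates the classifier with confusion matrices and reports precision, recall and F-score for the CONCLUSION class. As an illustration, the sketch below derives those three figures for one class from a confusion matrix stored as nested dictionaries. That layout and the function name are assumptions, and the balanced F1 is used because the abstract does not say which F-score variant was computed.

```python
def class_scores(confusion, label):
    """Precision, recall and F1 for one class.

    confusion[true_class][predicted_class] holds sentence counts.
    """
    tp = confusion.get(label, {}).get(label, 0)
    fn = sum(confusion.get(label, {}).values()) - tp
    fp = sum(
        row.get(label, 0) for true, row in confusion.items() if true != label
    )
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```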


Subject(s)
Abstracting and Indexing , Information Storage and Retrieval/methods , Libraries, Digital , MEDLINE , Natural Language Processing , Artificial Intelligence , Bayes Theorem , Bibliometrics , Periodicals as Topic , Terminology as Topic , Vocabulary, Controlled
9.
J Bioinform Comput Biol ; 5(6): 1215-31, 2007 Dec.
Article in English | MEDLINE | ID: mdl-18172926

ABSTRACT

The UniProt/Swiss-Prot Knowledgebase records about 30,500 variants in 5,664 proteins (Release 52.2). Most of these variants are manually curated single amino acid polymorphisms (SAPs) with references to the literature. In order to keep the list of published documents related to SAPs up to date, an automatic information retrieval method is developed to recover texts mentioning SAPs. The method is based on the use of regular expressions (patterns) and rules for the detection and validation of mutations. When evaluated using a corpus of 9,820 PubMed references, the precision of the retrieval was determined to be 89.5% over all variants. It was also found that the use of nonstandard mutation nomenclature and sequence positional correction is necessary to retrieve a significant number of relevant articles. The method was applied to the 5,664 proteins with variants. This was performed by first submitting a PubMed query to retrieve articles using gene or protein names and a list of mutation-related keywords; the SAP detection procedure was then used to recover relevant documents. The method was found to be efficient in retrieving new references on known polymorphisms. New references on known SAPs will be rendered accessible to the public via the Swiss-Prot variant pages.
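The detection step described above relies on regular expressions for mutation mentions and on checking positions against the protein sequence. The sketch below shows one simple form this could take: it recognises mentions such as `Arg175His` or `G12D` and checks the wild-type residue against a sequence, allowing a small position shift. The patterns, function names, and shift heuristic are assumptions for illustration and do not reproduce the published rules or their handling of nonstandard nomenclature.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA3 = "|".join(THREE_TO_ONE)
AA1 = "".join(THREE_TO_ONE.values())
MUTATION = re.compile(
    rf"\b(?:({AA3})(\d+)({AA3})|([{AA1}])(\d+)([{AA1}]))\b"
)


def find_saps(text):
    """Return (wild_type, position, variant) tuples in one-letter code."""
    found = []
    for m in MUTATION.finditer(text):
        if m.group(1):  # three-letter form, e.g. Arg175His
            found.append(
                (THREE_TO_ONE[m.group(1)], int(m.group(2)), THREE_TO_ONE[m.group(3)])
            )
        else:  # one-letter form, e.g. G12D
            found.append((m.group(4), int(m.group(5)), m.group(6)))
    return found


def validate(sap, sequence, max_shift=1):
    """Return the position shift at which the wild-type residue matches, or None.

    Shifts are tried in order of increasing magnitude, starting with 0.
    """
    wild_type, position, _ = sap
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        index = position - 1 + shift
        if 0 <= index < len(sequence) and sequence[index] == wild_type:
            return shift
    return None
```

Allowing a small shift is one way to tolerate numbering differences between an article and the database sequence, for example when residues are numbered from the mature protein.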


Subject(s)
Databases, Protein , Knowledge Bases , Mutation , Proteins/genetics , Amino Acid Substitution , Computational Biology , Humans , Polymorphism, Genetic , Software , Terminology as Topic
10.
Stud Health Technol Inform ; 116: 835-40, 2005.
Article in English | MEDLINE | ID: mdl-16160362

ABSTRACT

PROBLEM: Key word assignment has been widely used in MEDLINE to provide an indicative "gist" of the content of articles. Abstracts are also used for this purpose. However, with usually more than 300 words, abstracts can still be regarded as long documents; therefore we design a system to select a unique key sentence. This key sentence must be indicative of the article's content and we assume that an abstract's conclusions are good candidates. We design and assess the performance of an automatic key sentence selector, which classifies sentences into four argumentative moves: PURPOSE, METHODS, RESULTS and CONCLUSION. METHODS: We rely on Bayesian classifiers trained on automatically acquired data. Feature representation, selection and weighting are reported and classification effectiveness is evaluated on the four classes using confusion matrices. We also explore the use of simple heuristics to take the position of sentences into account. Recall, precision and F-scores are computed for the CONCLUSION class. For the CONCLUSION class, the F-score reaches 84%. Automatic argumentative classification is feasible on MEDLINE abstracts and should help user navigation in such repositories.
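For readers unfamiliar with Bayesian sentence classification, the sketch below implements a small multinomial naive Bayes classifier with add-one smoothing over whitespace tokens. The class name, the smoothing choice, and the bag-of-words features are assumptions; the abstract does not specify the feature representation or weighting the authors used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentenceClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n in self.class_counts.items():
            logp = math.log(n / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in sentence.lower().split():
                logp += math.log((self.word_counts[label][word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

A usage example with four toy training sentences, one per argumentative move:

```python
clf = NaiveBayesSentenceClassifier().fit(
    [
        "we aimed to assess the role",
        "samples were analysed using regression",
        "the score increased significantly",
        "we conclude that the method is feasible",
    ],
    ["PURPOSE", "METHODS", "RESULTS", "CONCLUSION"],
)
label = clf.predict("we conclude that classification is feasible")
```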


Subject(s)
Bayes Theorem , MEDLINE , Humans , Natural Language Processing
11.
Int J Med Inform ; 74(2-4): 317-24, 2005 Mar.
Article in English | MEDLINE | ID: mdl-15694638

ABSTRACT

Bio-medical knowledge bases are valuable resources for the research community. Original scientific publications are the main source used to annotate them. Medical annotation in Swiss-Prot is specifically targeted at finding and extracting data about human genetic diseases and polymorphisms. Curators have to scan through hundreds of publications to select the relevant ones. This workload can be greatly reduced by using bio-text mining techniques. Using a combination of natural language processing (NLP) techniques and statistical classifiers, we achieve recall points of up to 84% on the potentially interesting documents and a precision of more than 96% in detecting irrelevant documents. Careful analysis of the document pre-processing chain allows us to measure the impact of some steps on the overall result, as well as test different classifier configurations. The best combination was used to create a prototype of a search and classification tool that is currently tested by the database curators.


Subject(s)
Databases, Protein , Statistics as Topic , Genetic Diseases, Inborn/genetics , Humans , Polymorphism, Genetic
12.
Bioinformatics ; 21(8): 1743-4, 2005 Apr 15.
Article in English | MEDLINE | ID: mdl-15613390

ABSTRACT

UNLABELLED: We present a new database, GPSDB (Gene and Protein Synonyms DataBase) which collects gene/protein names, in a species specific way, from 14 main biological resources. A web-based search interface gives access to the database: given a gene/protein name, it retrieves all synonyms for this entity and queries Medline with a set of user-selected terms. AVAILABILITY: GPSDB is freely available from http://biomint.oefai.at/ CONTACT: johann@oefai.at.


Subject(s)
Database Management Systems , Databases, Protein , Information Dissemination/methods , Information Storage and Retrieval/methods , Natural Language Processing , Proteins/genetics , Proteins/metabolism , Terminology as Topic , Documentation/methods , MEDLINE , Proteins/classification , User-Computer Interface , Vocabulary, Controlled
13.
Proteomics ; 4(6): 1537-50, 2004 Jun.
Article in English | MEDLINE | ID: mdl-15174124

ABSTRACT

High-throughput proteomic studies produce a wealth of new information regarding post-translational modifications (PTMs). The Swiss-Prot knowledge base is faced with the challenge of including this information in a consistent and structured way, in order to facilitate easy retrieval and promote understanding by biologist expert users as well as computer programs. We are therefore standardizing the annotation of PTM features represented in Swiss-Prot. Indeed, a controlled vocabulary has been associated with every described PTM. In this paper, we present the major update of the feature annotation, and, by showing a few examples, explain how the annotation is implemented and what it means. Mod-Prot, a future companion database of Swiss-Prot, devoted to the biological aspects of PTMs (i.e., general description of the process, identity of the modification enzyme(s), taxonomic range, mass modification) is briefly described. Finally we encourage once again the scientific community (i.e., both individual researchers and database maintainers) to interact with us, so that we can continuously enhance the quality and swiftness of our services.


Subject(s)
Databases, Protein , Protein Processing, Post-Translational , Computational Biology , Databases, Protein/standards , Forecasting , Information Systems , Sequence Analysis, Protein , Systems Integration
14.
Proteomics ; 4(6): 1626-32, 2004 Jun.
Article in English | MEDLINE | ID: mdl-15174132

ABSTRACT

N-terminal myristoylation is a post-translational modification that causes the addition of a myristate to a glycine in the N-terminal end of the amino acid chain. This work presents neural network (NN) models that learn to discriminate myristoylated and nonmyristoylated proteins. Ensembles of 25 NNs and decision trees were trained on 390 positive sequences and 327 negative sequences. Experiments showed that NN ensembles were more accurate than decision tree ensembles. Our NN predictor, evaluated by the leave-one-out procedure, obtained a false positive error rate of 2.1%. That was better than the PROSITE pattern for myristoylation, for which the false positive error rate was 22.3%. On a recent version of Swiss-Prot (41.2), the NN ensemble predicted 876 myristoylated proteins, while 1150 proteins were predicted by the PROSITE pattern for myristoylation. Finally, compared to the well-known NMT predictor, the NN predictor gave similar results. Our tool is available under http://www.expasy.org/tools/myristoylator/myristoylator.html.
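Two ideas in this abstract lend themselves to a short illustration: combining the outputs of an ensemble and measuring a false positive rate. The sketch below uses a plain majority vote and defines the rate as false positives divided by all true negatives. Both choices are assumptions, since the abstract states neither how the 25 networks were combined nor the exact definition of its error rate.

```python
def majority_vote(member_predictions):
    """True if strictly more than half of the ensemble members predict positive."""
    return 2 * sum(member_predictions) > len(member_predictions)


def false_positive_rate(y_true, y_pred):
    """Fraction of true negatives that were predicted positive."""
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    negatives = sum(1 for t in y_true if not t)
    return false_positives / negatives if negatives else 0.0
```

With an odd number of members, such as 25, a strict majority vote can never tie.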


Subject(s)
Amino Acids/metabolism , Neural Networks, Computer , Protein Processing, Post-Translational , Amino Acid Sequence , Amino Acids/chemistry , Artificial Intelligence , Databases, Factual , Glycine/metabolism , Myristic Acids/metabolism , Probability , Sensitivity and Specificity
15.
Stud Health Technol Inform ; 95: 421-6, 2003.
Article in English | MEDLINE | ID: mdl-14664023

ABSTRACT

The goal of medical annotation of human proteins in Swiss-Prot is to add features specifically intended for researchers working on genetic diseases and polymorphisms. For this purpose, it is necessary to search through a vast number of publications containing relevant information. Promising results have been obtained by applying natural language processing and machine learning techniques to solve this problem. By using the Probabilistic Latent Categorizer on representative query sets, 69% recall and 59% precision were achieved for relevant documents. This classifier also rejected irrelevant abstracts with more than 96% precision. Better linguistic pre-processing of source documents can further improve such a computational approach.


Subject(s)
Databases, Protein , Information Storage and Retrieval/statistics & numerical data , Probability , Switzerland
16.
Bioinformatics ; 19 Suppl 1: i91-4, 2003.
Article in English | MEDLINE | ID: mdl-12855443

ABSTRACT

MOTIVATION: Searching for relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to re-rank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents. RESULTS: With a Probabilistic Latent Categoriser (PLC) we obtained 69% recall and 59% precision for relevant documents in a representative query. As the PLC technique provides the relative contribution of each term to the final document score, we used the Kullback-Leibler symmetric divergence to determine the most discriminating words for Swiss-Prot medical annotation. This information should allow curators to understand classification results better. It also has great value for fine-tuning the linguistic pre-processing of documents, which in turn can improve the overall classifier performance.
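The symmetric Kullback-Leibler divergence mentioned above can be decomposed into one non-negative contribution per term, which gives a natural way to rank discriminating words. The sketch below assumes two term-probability distributions (for example, terms in relevant versus irrelevant documents) given as dictionaries, with a small epsilon standing in for missing terms. The function names, the epsilon smoothing, and the example distributions are assumptions, not details from the paper.

```python
import math


def symmetric_kl_terms(p, q, eps=1e-9):
    """Per-term contributions to D(P||Q) + D(Q||P).

    Each contribution is (p_i - q_i) * log(p_i / q_i), which is never negative.
    Terms absent from one distribution are given probability eps.
    """
    contributions = {}
    for term in set(p) | set(q):
        p_i = p.get(term, eps)
        q_i = q.get(term, eps)
        contributions[term] = (p_i - q_i) * math.log(p_i / q_i)
    return contributions


def most_discriminating(p, q, top_n=10):
    """Terms sorted by decreasing contribution to the symmetric divergence."""
    contributions = symmetric_kl_terms(p, q)
    return sorted(contributions, key=lambda t: (-contributions[t], t))[:top_n]
```

Summing the per-term contributions gives the full symmetric divergence, so the ranking shows directly which words account for most of the difference between the two distributions.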


Subject(s)
Abstracting and Indexing/methods , Databases, Protein , Models, Statistical , Natural Language Processing , Periodicals as Topic/classification , Proteins/chemistry , PubMed , Algorithms , Artificial Intelligence , Documentation/methods , Pattern Recognition, Automated , Proteins/genetics
17.
Comput Biol Chem ; 27(1): 49-58, 2003 Feb.
Article in English | MEDLINE | ID: mdl-12798039

ABSTRACT

Large-scale sequencing of prokaryotic genomes demands the automation of certain annotation tasks currently manually performed in the production of the SWISS-PROT protein knowledgebase. The HAMAP project, or 'High-quality Automated and Manual Annotation of microbial Proteomes', aims to integrate manual and automatic annotation methods in order to enhance the speed of the curation process while preserving the quality of the database annotation. Automatic annotation is only applied to entries that belong to manually defined orthologous families and to entries with no identifiable similarities (ORFans). Many checks are enforced in order to prevent the propagation of wrong annotation and to spot problematic cases, which are channelled to manual curation. The results of this annotation are integrated in SWISS-PROT, and a website is provided at http://www.expasy.org/sprot/hamap/.


Subject(s)
Bacterial Proteins/classification , Bacterial Proteins/physiology , Database Management Systems/trends , Databases, Protein/classification , Databases, Protein/standards , Proteome/classification , Proteome/physiology , Amino Acid Sequence , Database Management Systems/standards , Genome, Bacterial , Molecular Sequence Data