1.
PLoS One ; 15(6): e0233956, 2020.
Article in English | MEDLINE | ID: mdl-32542027

ABSTRACT

BACKGROUND: Surveying the scientific literature is an important part of early drug discovery, and with the ever-increasing number of biomedical publications it is imperative to focus on the most interesting articles. Here we present a project that highlights new understanding (e.g. recently discovered modes of action) and identifies potential drug targets, via a novel, data-driven text mining approach to score type 2 diabetes (T2D) relevance. We focused on monitoring trends and jumps in T2D relevance so that we are informed of important breakthroughs in a timely manner. METHODS: We extracted over 7 million n-grams from PubMed abstracts and then clustered around 240,000 linked to T2D into almost 50,000 T2D relevant 'semantic concepts'. To score papers, we weighted the concepts based on co-mentioning with core T2D proteins. A protein's T2D relevance was determined by combining the scores of the papers mentioning it in the five preceding years. Each week all proteins were ranked according to their T2D relevance. Furthermore, the historical distribution of changes in rank from one week to the next was used to calculate the significance of a change in rank by T2D relevance for each protein. RESULTS: We show that T2D relevant papers, even those not mentioning T2D explicitly, were prioritised by relevant semantic concepts. Well known T2D proteins were therefore enriched among the top scoring proteins. Our 'high jumpers' identified important past developments in the understanding of how certain key proteins relate to T2D, indicating that our method will make us aware of future breakthroughs. In summary, this project facilitated keeping up with current T2D research by repeatedly providing short lists of potential novel targets into our early drug discovery pipeline.


Subject(s)
Data Mining/methods , Diabetes Mellitus, Type 2/drug therapy , Drug Discovery/methods , Algorithms , Humans , Proteins/metabolism , Semantics
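
The abstract of record 1 describes using the historical distribution of week-to-week rank changes to judge whether a protein's change in rank is significant. A minimal sketch of one way to do that is an empirical p-value. The function name, the add-one correction, and the sign convention (positive values mean the protein moved up the ranking) are assumptions made for illustration, not details taken from the paper.

```python
from bisect import bisect_left


def empirical_p_value(historical_changes, observed_change):
    """Share of historical rank changes at least as large as the observed one.

    Assumption: positive values mean a protein moved up in the ranking.
    """
    ordered = sorted(historical_changes)
    # Count historical changes that are >= the observed change.
    at_least_as_large = len(ordered) - bisect_left(ordered, observed_change)
    # Add-one correction (an illustrative choice) keeps the estimate above zero.
    return (at_least_as_large + 1) / (len(ordered) + 1)


# Hypothetical history of weekly rank changes pooled over proteins.
history = [0, 1, -2, 3, 0, -1, 2, 5, 1, 0]
p_jump = empirical_p_value(history, 5)
p_extreme = empirical_p_value(history, 100)
```

A small p-value flags a "high jumper"; in practice the history would need to be far larger than this toy list for the estimate to be meaningful.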
2.
J Cheminform ; 11(1): 19, 2019 Mar 08.
Article in English | MEDLINE | ID: mdl-30850898

ABSTRACT

Most BioCreative tasks to date have focused on assessing the quality of text-mining annotations in terms of precision and recall. Interoperability, speed, and stability are, however, other important factors to consider for practical applications of text mining. For about a decade, we have run named entity recognition (NER) web services, which are designed to be efficient, implemented using a multi-threaded queueing system to robustly handle many simultaneous requests, and hosted at a supercomputer facility. To participate in this new task, we extended the existing NER tagging service with support for the BeCalm API. The tagger suffered no downtime during the challenge and, as in earlier tests, proved to be highly efficient, consistently processing requests of 5000 abstracts in less than half a minute. In fact, the majority of this time was spent not on the NER task but rather on retrieving the document texts from the challenge servers. The latter was found to be the main bottleneck even when hosting a copy of the tagging service on a Raspberry Pi 3, showing that local document storage or caching would be desirable features to include in future revisions of the API standard.

3.
PeerJ ; 3: e1054, 2015.
Article in English | MEDLINE | ID: mdl-26157623

ABSTRACT

For tissues to carry out their functions, they rely on the right proteins to be present. Several high-throughput technologies have been used to map out which proteins are expressed in which tissues; however, the data have not previously been systematically compared and integrated. We present a comprehensive evaluation of tissue expression data from a variety of experimental techniques and show that these agree surprisingly well with each other and with results from literature curation and text mining. We further found that most datasets support the assumed but not demonstrated distinction between tissue-specific and ubiquitous expression. By developing comparable confidence scores for all types of evidence, we show that it is possible to improve both quality and coverage by combining the datasets. To facilitate use and visualization of our work, we have developed the TISSUES resource (http://tissues.jensenlab.org), which makes all the scored and integrated data available through a single user-friendly web interface.

4.
Mol Cell Proteomics ; 14(3): 658-73, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25576301

ABSTRACT

HLA class I molecules reflect the health state of cells to cytotoxic T cells by presenting a repertoire of endogenously derived peptides. However, the extent to which the proteome shapes the peptidome is still largely unknown. Here we present a high-throughput mass-spectrometry-based workflow that allows stringent and accurate identification of thousands of such peptides and direct determination of binding motifs. Applying the workflow to seven cancer cell lines and primary cells yielded more than 22,000 unique HLA peptides across different allelic binding specificities. By computing a score representing the HLA-I sampling density, we show a strong link between protein abundance and HLA presentation (p < 0.0001). When analyzing overpresented proteins, those with at least fivefold higher density score than expected for their abundance, we noticed that they are degraded almost 3 h faster than similar but nonpresented proteins (top 20% abundance class; median half-life 20.8 h versus 23.6 h, p < 0.0001). This validates protein degradation as an important factor for HLA presentation. Ribosomal, mitochondrial respiratory chain, and nucleosomal proteins are particularly well presented. Taking a set of proteins associated with cancer, we compared the predicted immunogenicity of previously validated T-cell epitopes with other peptides from these proteins in our data set. The validated epitopes indeed tend to have higher immunogenic scores than the other detected HLA peptides. Remarkably, we identified five mutated peptides from a human colon cancer cell line, which have very recently been predicted to be HLA-I binders. Altogether, we demonstrate the usefulness of combining MS analysis with immunogenicity prediction for identifying, ranking, and selecting peptides for therapeutic use.


Subject(s)
Antigen Presentation , HLA Antigens/metabolism , Histocompatibility Antigens Class I/immunology , Mass Spectrometry/methods , Peptides/isolation & purification , Proteomics/methods , Cell Line, Tumor , Cells, Cultured , Epitopes, T-Lymphocyte/metabolism , HCT116 Cells , Humans , Neoplasms/immunology , Peptides/immunology , Proteome/immunology , Proteome/isolation & purification
5.
Bioinformatics ; 31(11): 1872-4, 2015 Jun 01.
Article in English | MEDLINE | ID: mdl-25619994

ABSTRACT

The association of organisms to their environments is a key issue in exploring biodiversity patterns. This knowledge has traditionally been scattered, but textual descriptions of taxa and their habitats are now being consolidated in centralized resources. However, structured annotations are needed to facilitate large-scale analyses. Therefore, we developed ENVIRONMENTS, a fast dictionary-based tagger capable of identifying Environment Ontology (ENVO) terms in text. We evaluate the accuracy of the tagger on a new manually curated corpus of 600 Encyclopedia of Life (EOL) species pages. We use the tagger to associate taxa with environments by tagging EOL text content monthly, and integrate the results into the EOL to disseminate them to a broad audience of users. AVAILABILITY AND IMPLEMENTATION: The software and the corpus are available under the open-source BSD and the CC-BY-NC-SA 3.0 licenses, respectively, at http://environments.hcmr.gr.


Subject(s)
Biodiversity , Biological Ontologies , Software , Animals , Data Mining/methods , Ecosystem , Internet
6.
Methods ; 74: 83-9, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25484339

ABSTRACT

Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease-gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease-gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.


Subject(s)
Data Mining/methods , Databases, Genetic , Disease/genetics , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Databases, Genetic/statistics & numerical data , Humans
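
The abstract of record 6 mentions a scoring scheme that counts co-occurrences both within and between sentences. The sketch below illustrates that general idea only: each pair of tagged entities earns a base weight for sharing an abstract and an extra weight for sharing a sentence. The weights `w_doc` and `w_sent`, the function name, and the input layout are hypothetical, and the sketch scores all entity pairs rather than gene-disease pairs specifically; it does not reproduce the formula used by the DISEASES resource.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_scores(documents, w_doc=1.0, w_sent=2.0):
    """Score entity pairs by weighted co-occurrence counts.

    documents: list of abstracts, each a list of sentences, each sentence a
    set of entity names already found by a tagger.
    w_doc, w_sent: hypothetical weights for same-abstract and same-sentence
    co-mentions.
    """
    scores = defaultdict(float)
    for sentences in documents:
        doc_entities = set().union(*sentences)
        # Pairs that appear together in at least one sentence of this abstract.
        sentence_pairs = set()
        for sentence in sentences:
            sentence_pairs.update(combinations(sorted(sentence), 2))
        for pair in combinations(sorted(doc_entities), 2):
            scores[pair] += w_doc
            if pair in sentence_pairs:
                scores[pair] += w_sent
    return dict(scores)


# Toy input: two abstracts with pre-tagged entities per sentence.
docs = [
    [{"INS", "diabetes"}, {"TCF7L2"}],
    [{"TCF7L2", "diabetes"}],
]
pair_scores = cooccurrence_scores(docs)
```

Pairs are stored as sorted tuples, so `("INS", "diabetes")` and `("diabetes", "INS")` map to the same key. A real system would also normalise for how often each entity is mentioned overall.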
7.
Database (Oxford) ; 2014: bau012, 2014.
Article in English | MEDLINE | ID: mdl-24573882

ABSTRACT

Information on protein subcellular localization is important to understand the cellular functions of proteins. Currently, such information is manually curated from the literature, obtained from high-throughput microscopy-based screens and predicted from primary sequence. To get a comprehensive view of the localization of a protein, it is thus necessary to consult multiple databases and prediction tools. To address this, we present the COMPARTMENTS resource, which integrates all sources listed above as well as the results of automatic text mining. The resource is automatically kept up to date with source databases, and all localization evidence is mapped onto common protein identifiers and Gene Ontology terms. To make different types and sources of evidence comparable, we assign confidence scores based on the type and source of the localization evidence. Finally, we visualize the unified localization evidence for a protein on a schematic cell to provide a simple overview. Database URL: http://compartments.jensenlab.org.


Subject(s)
Cell Compartmentation , Databases, Protein , Proteins/metabolism , Data Mining , Humans , Internet , Subcellular Fractions/metabolism
8.
Bioinformatics ; 30(3): 392-7, 2014 Feb 01.
Article in English | MEDLINE | ID: mdl-24273243

ABSTRACT

MOTIVATION: MicroRNAs (miRNAs) are a highly abundant class of non-coding RNA genes involved in cellular regulation and thus also diseases. Despite miRNAs being important disease factors, miRNA-disease associations remain low in number and of variable reliability. Furthermore, existing databases and prediction methods do not explicitly facilitate forming hypotheses about the possible molecular causes of the association, thereby making the path to experimental follow-up longer. RESULTS: Here we present miRPD in which miRNA-Protein-Disease associations are explicitly inferred. Besides linking miRNAs to diseases, it directly suggests the underlying proteins involved, which can be used to form hypotheses that can be experimentally tested. Associations between miRNAs and diseases are inferred by coupling known and predicted miRNA-protein associations with protein-disease associations text mined from the literature. We present scoring schemes that allow us to rank miRNA-disease associations inferred from both curated and predicted miRNA targets by reliability and thereby to create high- and medium-confidence sets of associations. Analyzing these, we find statistically significant enrichment for proteins involved in pathways related to cancer and type I diabetes mellitus, suggesting either a literature bias or a genuine biological trend. We show by example how the associations can be used to extract proteins for disease hypotheses. AVAILABILITY AND IMPLEMENTATION: All datasets, software and a searchable Web site are available at http://mirpd.jensenlab.org.


Subject(s)
Disease/genetics , MicroRNAs/metabolism , Proteins/metabolism , Software , Diabetes Mellitus/genetics , Humans
9.
Nucleic Acids Res ; 42(Database issue): D401-7, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24293645

ABSTRACT

STITCH is a database of protein-chemical interactions that integrates many sources of experimental and manually curated evidence with text-mining information and interaction predictions. Available at http://stitch.embl.de, the resulting interaction network includes 390 000 chemicals and 3.6 million proteins from 1133 organisms. Compared with the previous version, the number of high-confidence protein-chemical interactions in human has increased by 45%, to 367 000. In this version, we added features for users to upload their own data to STITCH in the form of internal identifiers, chemical structures or quantitative data. For example, a user can now upload a spreadsheet with screening hits to easily check which interactions are already known. To increase the coverage of STITCH, we expanded the text mining to include full-text articles and added a prediction method based on chemical structures. We further changed our scheme for transferring interactions between species to rely on orthology rather than protein similarity. This improves the performance within protein families, where scores are now transferred only to orthologous proteins, but not to paralogous proteins. STITCH can be accessed with a web-interface, an API and downloadable files.


Subject(s)
Databases, Protein , Proteins/metabolism , Animals , Data Mining , Humans , Internet , Mice , Pharmaceutical Preparations/chemistry , Protein Interaction Mapping , Proteins/chemistry , Systems Integration
10.
PLoS One ; 8(6): e65390, 2013.
Article in English | MEDLINE | ID: mdl-23823062

ABSTRACT

The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than existing tools and as accurate. Precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.


Subject(s)
Classification , Data Mining/methods , Terminology as Topic
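
Record 10 describes a dictionary-based approach to named entity recognition. The following is a minimal sketch of the general technique (greedy longest match of word sequences against a name dictionary), not the SPECIES implementation. The function name, the `max_words` limit, and the tiny dictionary are assumptions for illustration; the identifiers are NCBI Taxonomy IDs for human and mouse.

```python
import re


def tag_text(text, dictionary, max_words=3):
    """Return (start, end, identifier) for greedy longest dictionary matches.

    dictionary maps lower-cased names to identifiers; max_words is a
    hypothetical cap on the number of words in a name.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            key = text[start:end].lower()
            if key in dictionary:
                matches.append((start, end, dictionary[key]))
                step = n  # skip past the matched words
                break
        i += step
    return matches


# Toy dictionary: names mapped to NCBI Taxonomy identifiers.
species_names = {"homo sapiens": "9606", "human": "9606", "mouse": "10090"}
found = tag_text("Homo sapiens and mouse cells were compared.", species_names)
```

Character offsets are returned so that matches can be highlighted in the original text. A production tagger would also handle abbreviations, plural forms, and ambiguous names.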
11.
J Am Med Inform Assoc ; 20(5): 947-53, 2013.
Article in English | MEDLINE | ID: mdl-23703825

ABSTRACT

OBJECTIVE: Drugs have tremendous potential to cure and relieve disease, but the risk of unintended effects is always present. Healthcare providers increasingly record data in electronic patient records (EPRs), in which we aim to identify possible adverse events (AEs) and, specifically, possible adverse drug events (ADEs). MATERIALS AND METHODS: Based on the undesirable effects section from the summary of product characteristics (SPC) of 7446 drugs, we have built a Danish ADE dictionary. Starting from this dictionary we have developed a pipeline for identifying possible ADEs in unstructured clinical narrative text. We use a named entity recognition (NER) tagger to identify dictionary matches in the text and post-coordination rules to construct ADE compound terms. Finally, we apply post-processing rules and filters to handle, for example, negations and sentences about subjects other than the patient. Moreover, this method allows synonyms to be identified and anatomical location descriptions to be merged, so that effects in the same location can be grouped appropriately. RESULTS: The method identified 1 970 731 (35 477 unique) possible ADEs in a large corpus of 6011 psychiatric hospital patient records. Validation was performed through manual inspection of possible ADEs, resulting in precision of 89% and recall of 75%. DISCUSSION: The presented dictionary-building method could be used to construct other ADE dictionaries. The complication of compound words in Germanic languages was addressed. Additionally, collapsing synonyms and anatomical locations improves the method. CONCLUSIONS: The developed dictionary and method can be used to identify possible ADEs in Danish clinical narratives.


Subject(s)
Data Mining/methods , Dictionaries, Medical as Topic , Drug-Related Side Effects and Adverse Reactions , Electronic Health Records , Denmark , Humans , Narration
12.
Nucleic Acids Res ; 41(Database issue): D808-15, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23203871

ABSTRACT

Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made, particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database (http://string-db.org/) aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements: (i) we extend the automated mining of scientific texts for interaction information, to now also include full-text articles; (ii) we entirely re-designed the algorithm for transferring interactions from one model organism to the other; and (iii) we provide users with statistical information on any functional enrichment observed in their networks.


Subject(s)
Databases, Protein , Protein Interaction Mapping , Algorithms , Data Interpretation, Statistical , Data Mining , Internet , Systems Integration , User-Computer Interface
13.
Nucleic Acids Res ; 36(Web Server issue): W513-8, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18515843

ABSTRACT

We present a new release of the immune epitope database analysis resource (IEDB-AR, http://tools.immuneepitope.org), a repository of web-based tools for the prediction and analysis of immune epitopes. New functionalities have been added to most of the previously implemented tools, and a total of eight new tools were added, including two B-cell epitope prediction tools, four T-cell epitope prediction tools and two analysis tools.


Subject(s)
Epitopes, B-Lymphocyte/chemistry , Epitopes, T-Lymphocyte/chemistry , Software , Computer Graphics , Databases, Factual , Epitopes, B-Lymphocyte/immunology , Epitopes, T-Lymphocyte/immunology , Histocompatibility Antigens Class I/metabolism , Histocompatibility Antigens Class II/metabolism , Internet , Peptides/chemistry , Peptides/immunology , Proteins/chemistry , Proteins/immunology
14.
PLoS One ; 3(3): e1831, 2008 Mar 19.
Article in English | MEDLINE | ID: mdl-18350167

ABSTRACT

BACKGROUND: Cytotoxic T cell (CTL) cross-reactivity is believed to play a pivotal role in generating immune responses but the extent and mechanisms of CTL cross-reactivity remain largely unknown. Several studies suggest that CTL clones can recognize highly diverse peptides, some sharing no obvious sequence identity. The emerging realization in the field is that T cell receptors (TCRs) recognize multiple distinct ligands. PRINCIPAL FINDINGS: First, we analyzed peptide scans of the HIV epitope SLFNTVATL (SFL9) and found that TCR specificity is position dependent and that biochemically similar amino acid substitutions do not drastically affect recognition. Inspired by this, we developed a general model of TCR peptide recognition using amino acid similarity matrices and found that such a model was able to predict the cross-reactivity of a diverse set of CTL epitopes. With this model, we were able to demonstrate that seemingly distinct T cell epitopes, i.e., ones with low sequence identity, are in fact more biochemically similar than expected. Additionally, an analysis of HIV immunogenicity data with our model showed that CTLs tend to respond mostly to peptides that do not resemble self-antigens. CONCLUSIONS: T cell cross-reactivity can thus, to an extent greater than earlier appreciated, be explained by amino acid similarity. The results presented in this paper will help resolve some of the long-lasting discussions in the field of T cell cross-reactivity.


Subject(s)
Cross Reactions , T-Lymphocytes, Cytotoxic/immunology , Amino Acid Sequence , Enzyme-Linked Immunosorbent Assay , Epitopes/chemistry , Epitopes/genetics , Epitopes/immunology , HIV/immunology , Humans , Ligands , Mutation , Receptors, Antigen, T-Cell/chemistry , Receptors, Antigen, T-Cell/genetics , Receptors, Antigen, T-Cell/immunology
15.
PLoS Comput Biol ; 2(6): e65, 2006 Jun 09.
Article in English | MEDLINE | ID: mdl-16789818

ABSTRACT

Recognition of peptides bound to major histocompatibility complex (MHC) class I molecules by T lymphocytes is an essential part of immune surveillance. Each MHC allele has a characteristic peptide binding preference, which can be captured in prediction algorithms, allowing for the rapid scan of entire pathogen proteomes for peptides likely to bind MHC. Here we make public a large set of 48,828 quantitative peptide-binding affinity measurements relating to 48 different mouse, human, macaque, and chimpanzee MHC class I alleles. We use these data to establish a set of benchmark predictions with one neural network method and two matrix-based prediction methods extensively utilized in our groups. In general, the neural network outperforms the matrix-based predictions mainly due to its ability to generalize even on a small amount of data. We also retrieved predictions from tools publicly available on the internet. While differences in the data used to generate these predictions hamper direct comparisons, we do conclude that tools based on combinatorial peptide libraries perform remarkably well. The transparent prediction evaluation on this dataset provides tool developers with a benchmark for comparison of newly developed prediction methods. In addition, to generate and evaluate our own prediction methods, we have established an easily extensible web-based prediction framework that allows automated side-by-side comparisons of prediction methods implemented by experts. This is an advance over the current practice of tool developers having to generate reference predictions themselves, which can lead to underestimating the performance of prediction methods they are not as familiar with as their own. The overall goal of this effort is to provide a transparent prediction evaluation allowing bioinformaticians to identify promising features of prediction methods and providing guidance to immunologists regarding the reliability of prediction tools.


Subject(s)
Histocompatibility Antigens Class I/chemistry , Peptides/chemistry , Animals , Databases, Factual , HLA Antigens/chemistry , Humans , Inhibitory Concentration 50 , Macaca , Mice , Neural Networks, Computer , Pan troglodytes , ROC Curve , Software
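
Record 15 benchmarks prediction methods against measured binding affinities, and its subject terms list both "Inhibitory Concentration 50" and "ROC Curve". As a hedged illustration of how such a comparison can be scored, the sketch below labels peptides as binders when the measured IC50 falls below a cutoff and computes the area under the ROC curve from pairwise comparisons. The 500 nM cutoff, the function name, and the example values are assumptions for illustration and are not taken from the abstract.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    Equals the probability that a randomly chosen positive receives a higher
    score than a randomly chosen negative (ties count as one half).
    """
    positives = [s for is_pos, s in zip(labels, scores) if is_pos]
    negatives = [s for is_pos, s in zip(labels, scores) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical measurements (nM) and predictions (higher = stronger binding).
BINDER_CUTOFF_NM = 500  # assumed threshold, for illustration only
measured_ic50 = [50, 200, 800, 5000, 20000]
predicted = [0.9, 0.4, 0.6, 0.2, 0.1]
is_binder = [ic50 < BINDER_CUTOFF_NM for ic50 in measured_ic50]
auc = roc_auc(is_binder, predicted)
```

The pairwise form is quadratic in the number of peptides, which is fine for a toy example; a rank-based computation would be preferable for a dataset the size of the one described.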
16.
J Biol Phys ; 32(3-4): 335-53, 2006 Oct.
Article in English | MEDLINE | ID: mdl-19669470

ABSTRACT

Over the past decade a number of bioinformatics tools have been developed that use genomic sequences as input to predict to which parts of a microbe the immune system will react, the so-called epitopes. Many predicted epitopes have later been verified experimentally, demonstrating the usefulness of such predictions. At the same time, simulation models have been developed that describe the dynamics of different immune cell populations and their interactions with microbes. These models have been used to explain experimental findings where timing is of importance, such as the time between administration of a vaccine and infection with the microbe that the vaccine is intended to protect against. In this paper, we outline a framework for integration of these two approaches. As an example, we develop a model in which HIV dynamics are correlated with genomics data. For the first time, the fitness of wild-type and mutated virus is assessed by means of a sequence-dependent scoring matrix, derived from a BLOSUM matrix, that links protein sequences to growth rates of the virus in the mathematical model. A combined bioinformatics and systems biology approach can lead to a better understanding of immune system-related diseases where both timing and genomic information are of importance.
