Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 7 de 7
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Database (Oxford) ; 20202020 01 01.
Artigo em Inglês | MEDLINE | ID: mdl-32507889

RESUMO

Modern biology produces data at a staggering rate. Yet, much of these biological data is still isolated in the text, figures, tables and supplementary materials of articles. As a result, biological information created at great expense is significantly underutilised. The protein motif biology field does not have sufficient resources to curate the corpus of motif-related literature and, to date, only a fraction of the available articles have been curated. In this study, we develop a set of tools and a web resource, 'articles.ELM', to rapidly identify the motif literature articles pertinent to a researcher's interest. At the core of the resource is a manually curated set of about 8000 motif-related articles. These articles are automatically annotated with a range of relevant biological data allowing in-depth search functionality. Machine-learning article classification is used to group articles based on their similarity to manually curated motif classes in the Eukaryotic Linear Motif resource. Articles can also be manually classified within the resource. The 'articles.ELM' resource permits the rapid and accurate discovery of relevant motif articles thereby improving the visibility of motif literature and simplifying the recovery of valuable biological insights sequestered within scientific articles. Consequently, this web resource removes a critical bottleneck in scientific productivity for the motif biology field. Database URL: http://slim.icr.ac.uk/articles/.


Assuntos
Motivos de Aminoácidos , Mineração de Dados/métodos , Bases de Dados de Proteínas , Anotação de Sequência Molecular , Anotação de Sequência Molecular/classificação , Anotação de Sequência Molecular/métodos , Publicações/classificação
2.
PLoS Comput Biol ; 15(11): e1007419, 2019 11.
Artigo em Inglês | MEDLINE | ID: mdl-31682632

RESUMO

Automated protein annotation using the Gene Ontology (GO) plays an important role in the biosciences. Evaluation has always been considered central to developing novel annotation methods, but little attention has been paid to the evaluation metrics themselves. Evaluation metrics define how well an annotation method performs and allows for them to be ranked against one another. Unfortunately, most of these metrics were adopted from the machine learning literature without establishing whether they were appropriate for GO annotations. We propose a novel approach for comparing GO evaluation metrics called Artificial Dilution Series (ADS). Our approach uses existing annotation data to generate a series of annotation sets with different levels of correctness (referred to as their signal level). We calculate the evaluation metric being tested for each annotation set in the series, allowing us to identify whether it can separate different signal levels. Finally, we contrast these results with several false positive annotation sets, which are designed to expose systematic weaknesses in GO assessment. We compared 37 evaluation metrics for GO annotation using ADS and identified drastic differences between metrics. We show that some metrics struggle to differentiate between different signal levels, while others give erroneously high scores to the false positive data sets. Based on our findings, we provide guidelines on which evaluation metrics perform well with the Gene Ontology and propose improvements to several well-known evaluation metrics. In general, we argue that evaluation metrics should be tested for their performance and we provide software for this purpose (https://bitbucket.org/plyusnin/ads/). ADS is applicable to other areas of science where the evaluation of prediction results is non-trivial.


Assuntos
Biologia Computacional/métodos , Anotação de Sequência Molecular/classificação , Anotação de Sequência Molecular/métodos , Algoritmos , Benchmarking/métodos , Bases de Dados Genéticas , Bases de Dados de Proteínas , Ontologia Genética/tendências , Reprodutibilidade dos Testes , Software
3.
PLoS Genet ; 10(1): e1004102, 2014 Jan.
Artigo em Inglês | MEDLINE | ID: mdl-24497837

RESUMO

Genome-wide association studies (GWAS) have revolutionized the field of cancer genetics, but the causal links between increased genetic risk and onset/progression of disease processes remain to be identified. Here we report the first step in such an endeavor for prostate cancer. We provide a comprehensive annotation of the 77 known risk loci, based upon highly correlated variants in biologically relevant chromatin annotations--we identified 727 such potentially functional SNPs. We also provide a detailed account of possible protein disruption, microRNA target sequence disruption and regulatory response element disruption of all correlated SNPs at r(2) ≥ 0.88%. 88% of the 727 SNPs fall within putative enhancers, and many alter critical residues in the response elements of transcription factors known to be involved in prostate biology. We define as risk enhancers those regions with enhancer chromatin biofeatures in prostate-derived cell lines with prostate-cancer correlated SNPs. To aid the identification of these enhancers, we performed genomewide ChIP-seq for H3K27-acetylation, a mark of actively engaged enhancers, as well as the transcription factor TCF7L2. We analyzed in depth three variants in risk enhancers, two of which show significantly altered androgen sensitivity in LNCaP cells. This includes rs4907792, that is in linkage disequilibrium (r(2) = 0.91) with an eQTL for NUDT11 (on the X chromosome) in prostate tissue, and rs10486567, the index SNP in intron 3 of the JAZF1 gene on chromosome 7. Rs4907792 is within a critical residue of a strong consensus androgen response element that is interrupted in the protective allele, resulting in a 56% decrease in its androgen sensitivity, whereas rs10486567 affects both NKX3-1 and FOXA-AR motifs where the risk allele results in a 39% increase in basal activity and a 28% fold-increase in androgen stimulated enhancer activity. Identification of such enhancer variants and their potential target genes represents a preliminary step in connecting risk to disease process.


Assuntos
Elementos Facilitadores Genéticos , Anotação de Sequência Molecular/classificação , Neoplasias da Próstata/genética , Elementos de Resposta/genética , Alelos , Cromatina/genética , Regulação Neoplásica da Expressão Gênica , Predisposição Genética para Doença , Estudo de Associação Genômica Ampla , Humanos , Desequilíbrio de Ligação , Masculino , Análise de Sequência com Séries de Oligonucleotídeos , Polimorfismo de Nucleotídeo Único/genética , Neoplasias da Próstata/metabolismo , Neoplasias da Próstata/patologia , Fatores de Risco , Fatores de Transcrição/genética
4.
Artigo em Inglês | MEDLINE | ID: mdl-26355507

RESUMO

Species detection is an important topic in the text mining field. According to the importance of the research topics (e.g., species assignment to genes and document focus species detection), some studies are dedicated to an individual topic. However, no researcher to date has discussed species detection as a general problem. Therefore, we developed a multi-scope species detection model to identify the focus species for different scopes (i.e., gene mention, sentence, paragraph, and global scope of the entire article). Species assignment is one of the bottlenecks of gene name disambiguation. In our evaluation, recognizing the focus species of a gene mention in four different scopes improved the gene name disambiguation. We used the species cue words extracted from articles to estimate the relevance between an article and a species. The relevance score was calculated by our proposed entities frequency-augmented invert species frequency (EF-AISF) formula, which represents the importance of an entity to a species. We also defined a relation guide factor (RGF) to normalize the relevance score. Our method not only achieved better performance than previous methods but also can handle the articles that do not specifically mention a species. In the DECA corpus, we outperformed previous studies and obtained an accuracy of 88.22 percent.


Assuntos
Biologia Computacional/métodos , Mineração de Dados/métodos , Genes/genética , Anotação de Sequência Molecular/classificação , Animais , Humanos , Máquina de Vetores de Suporte
5.
BMC Bioinformatics ; 14: 350, 2013 Dec 03.
Artigo em Inglês | MEDLINE | ID: mdl-24299119

RESUMO

BACKGROUND: Drosophila melanogaster has been established as a model organism for investigating the developmental gene interactions. The spatio-temporal gene expression patterns of Drosophila melanogaster can be visualized by in situ hybridization and documented as digital images. Automated and efficient tools for analyzing these expression images will provide biological insights into the gene functions, interactions, and networks. To facilitate pattern recognition and comparison, many web-based resources have been created to conduct comparative analysis based on the body part keywords and the associated images. With the fast accumulation of images from high-throughput techniques, manual inspection of images will impose a serious impediment on the pace of biological discovery. It is thus imperative to design an automated system for efficient image annotation and comparison. RESULTS: We present a computational framework to perform anatomical keywords annotation for Drosophila gene expression images. The spatial sparse coding approach is used to represent local patches of images in comparison with the well-known bag-of-words (BoW) method. Three pooling functions including max pooling, average pooling and Sqrt (square root of mean squared statistics) pooling are employed to transform the sparse codes to image features. Based on the constructed features, we develop both an image-level scheme and a group-level scheme to tackle the key challenges in annotating Drosophila gene expression pattern images automatically. To deal with the imbalanced data distribution inherent in image annotation tasks, the undersampling method is applied together with majority vote. Results on Drosophila embryonic expression pattern images verify the efficacy of our approach. CONCLUSION: In our experiment, the three pooling functions perform comparably well in feature dimension reduction. The undersampling with majority vote is shown to be effective in tackling the problem of imbalanced data. Moreover, combining sparse coding and image-level scheme leads to consistent performance improvement in keywords annotation.


Assuntos
Drosophila melanogaster/citologia , Drosophila melanogaster/genética , Regulação da Expressão Gênica no Desenvolvimento , Genoma de Inseto/genética , Modelos Genéticos , Anotação de Sequência Molecular/métodos , Animais , Diferenciação Celular/genética , Divisão Celular/genética , Biologia Computacional/classificação , Biologia Computacional/métodos , Drosophila melanogaster/embriologia , Perfilação da Expressão Gênica/classificação , Perfilação da Expressão Gênica/métodos , Ensaios de Triagem em Larga Escala , Anotação de Sequência Molecular/classificação , Valor Preditivo dos Testes , Máquina de Vetores de Suporte
6.
BMC Bioinformatics ; 14: 242, 2013 Aug 08.
Artigo em Inglês | MEDLINE | ID: mdl-23927037

RESUMO

BACKGROUND: Gene Ontology (GO) is a popular standard in the annotation of gene products and provides information related to genes across all species. The structure of GO is dynamic and is updated on a daily basis. However, the popular existing methods use outdated versions of GO. Moreover, these tools are slow to process large datasets consisting of more than 20,000 genes. RESULTS: We have developed GOParGenPy, a platform independent software tool to generate the binary data matrix showing the GO class membership, including parental classes, of a set of GO annotated genes. GOParGenPy is at least an order of magnitude faster than popular tools for Gene Ontology analysis and it can handle larger datasets than the existing tools. It can use any available version of the GO structure and allows the user to select the source of GO annotation. GO structure selection is critical for analysis, as we show that GO classes have rapid turnover between different GO structure releases. CONCLUSIONS: GOParGenPy is an easy to use software tool which can generate sparse or full binary matrices from GO annotated gene sets. The obtained binary matrix can then be used with any analysis environment and with any analysis methods.


Assuntos
Ontologia Genética , Genes , Anotação de Sequência Molecular/métodos , Proteínas/genética , Software , Inteligência Artificial , Anotação de Sequência Molecular/classificação , Proteínas/química , Proteínas/classificação , Ferramenta de Busca/métodos , Software/classificação , Vocabulário Controlado
7.
PLoS One ; 7(10): e47436, 2012.
Artigo em Inglês | MEDLINE | ID: mdl-23077617

RESUMO

In the last years, there was an exponential increase in the number of publicly available genomes. Once finished, most genome projects lack financial support to review annotations. A few of these gene annotations are based on a combination of bioinformatics evidence, however, in most cases, annotations are based solely on sequence similarity to a previously known gene, which was most probably annotated in the same way. As a result, a large number of predicted genes remain unassigned to any functional category despite the fact that there is enough evidence in the literature to predict their function. We developed a classifier trained with term-frequency vectors automatically disclosed from text corpora of an ensemble of genes representative of each functional category of the J. Craig Venter Institute Comprehensive Microbial Resource (JCVI-CMR) ontology. The classifier achieved up to 84% precision with 68% recall (for confidence≥0.4), F-measure 0.76 (recall and precision equally weighted) in an independent set of 2,220 genes, from 13 bacterial species, previously classified by JCVI-CMR into unambiguous categories of its ontology. Finally, the classifier assigned (confidence≥0.7) to functional categories a total of 5,235 out of the ∼24 thousand genes previously in categories "Unknown function" or "Unclassified" for which there is literature in MEDLINE. Two biologists reviewed the literature of 100 of these genes, randomly picket, and assigned them to the same functional categories predicted by the automatic classifier. Our results confirmed the hypothesis that it is possible to confidently assign genes of a real world repository to functional categories, based exclusively on the automatic profiling of its associated literature. The LitProf--Gene Classifier web server is accessible at: www.cebio.org/litprofGC.


Assuntos
Biologia Computacional , Bases de Dados Genéticas , MEDLINE , Anotação de Sequência Molecular , Humanos , Internet , Anotação de Sequência Molecular/classificação , Anotação de Sequência Molecular/métodos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...