Search | BVS Regional Portal

1.

AIRI: Predicting Retention Indices and Their Uncertainties Using Artificial Intelligence.

Geer, Lewis Y; Stein, Stephen E; Mallard, William Gary; Slotta, Douglas J.

J Chem Inf Model ; 64(3): 690-696, 2024 Feb 12.

Article in English | MEDLINE | ID: mdl-38230885

ABSTRACT

The Kováts retention index (RI) is a quantity measured using gas chromatography and is commonly used in the identification of chemical structures. Creating libraries of observed RI values is a laborious task, so we explore the use of a deep neural network for predicting RI values from structure for standard semipolar columns. This network generated predictions with a mean absolute error of 15.1 and, in a quantification of the tail of the error distribution, a 95th percentile absolute error of 46.5. Because of the Artificial Intelligence Retention Indices (AIRI) network's accuracy, it was used to predict RI values for the NIST EI-MS spectral libraries. These RI values are used to improve chemical identification methods and the quality of the library. Estimating uncertainty is an important practical need when using prediction models. To quantify the uncertainty of our network for each individual prediction, we used the outputs of an ensemble of 8 networks to calculate a predicted standard deviation for each RI value prediction. This predicted standard deviation was corrected to follow the error between the observed and predicted RI values. The Z scores using these predicted standard deviations had a standard deviation of 1.52 and a 95th percentile absolute Z score corresponding to a mean RI value of 42.6.

Subjects

Artificial Intelligence; Neural Networks, Computer; Uncertainty
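The ensemble-based uncertainty estimate described in the AIRI abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the eight member predictions are made-up values, the function names are hypothetical, and the correction step that the abstract applies to the predicted standard deviation is not specified there, so it is omitted.

```python
from statistics import mean, stdev


def ensemble_prediction(member_predictions):
    """Return (mean, sample standard deviation) across ensemble members."""
    return mean(member_predictions), stdev(member_predictions)


def z_score(observed_ri, predicted_ri, predicted_sd):
    """Error between observed and predicted RI, in units of the predicted SD."""
    return (observed_ri - predicted_ri) / predicted_sd


# Hypothetical RI predictions from an ensemble of 8 networks for one compound.
members = [1210.0, 1222.0, 1215.0, 1208.0, 1219.0, 1225.0, 1212.0, 1217.0]
ri, sd = ensemble_prediction(members)

# Hypothetical observed RI for the same compound.
z = z_score(1226.0, ri, sd)
print(f"predicted RI = {ri:.1f}, predicted SD = {sd:.2f}, Z = {z:.2f}")
```

If the predicted standard deviations were well calibrated, Z scores collected over many compounds would have a standard deviation near 1; the abstract reports 1.52.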

2.

AIomics: Exploring More of the Proteome Using Mass Spectral Libraries Extended by Artificial Intelligence.

Geer, Lewis Y; Lapin, Joel; Slotta, Douglas J; Mak, Tytus D; Stein, Stephen E.

J Proteome Res ; 22(7): 2246-2255, 2023 07 07.

Article in English | MEDLINE | ID: mdl-37232537

ABSTRACT

The unbounded permutations of biological molecules, including proteins and their constituent peptides, present a dilemma in identifying the components of complex biosamples. Sequence search algorithms used to identify peptide spectra can be expanded to cover larger classes of molecules, including more modifications, isoforms, and atypical cleavage, but at the cost of false positives or false negatives due to the simplified spectra they compute from sequence records. Spectral library searching can help solve this issue by precisely matching experimental spectra to library spectra with excellent sensitivity and specificity. However, compiling spectral libraries that span entire proteomes is pragmatically difficult. Neural networks that predict complete spectra containing a full range of annotated and unannotated ions can be used to replace these simplified spectra with libraries of fully predicted spectra, including modified peptides. Using such a network, we created predicted spectral libraries that were used to rescore matches from a sequence search done over a large search space, including a large number of modifications. Rescoring improved the separation of true and false hits by 82%, yielding an 8% increase in peptide identifications, including a 21% increase in nonspecifically cleaved peptides and a 17% increase in phosphopeptides.

Subjects

Peptide Library; Proteome; Proteome/metabolism; Artificial Intelligence; Tandem Mass Spectrometry; Algorithms; Phosphopeptides; Databases, Protein; Software

3.

iCn3D, a web-based 3D viewer for sharing 1D/2D/3D representations of biomolecular structures.

Wang, Jiyao; Youkharibache, Philippe; Zhang, Dachuan; Lanczycki, Christopher J; Geer, Renata C; Madej, Thomas; Phan, Lon; Ward, Minghong; Lu, Shennan; Marchler, Gabriele H; Wang, Yanli; Bryant, Stephen H; Geer, Lewis Y; Marchler-Bauer, Aron.

Bioinformatics ; 36(1): 131-135, 2020 01 01.

Article in English | MEDLINE | ID: mdl-31218344

ABSTRACT

MOTIVATION: Build a web-based 3D molecular structure viewer focusing on interactive structural analysis. RESULTS: iCn3D (I-see-in-3D) can simultaneously show 3D structure, 2D molecular contacts and 1D protein and nucleotide sequences through an integrated sequence/annotation browser. Pre-defined and arbitrary molecular features can be selected in any of the 1D/2D/3D windows as sets of residues and these selections are synchronized dynamically in all displays. Biological annotations such as protein domains, single nucleotide variations, etc. can be shown as tracks in the 1D sequence/annotation browser. These customized displays can be shared with colleagues or publishers via a simple URL. iCn3D can display structure-structure alignments obtained from NCBI's VAST+ service. It can also display the alignment of a sequence with a structure as identified by BLAST, and thus relate 3D structure to a large fraction of all known proteins. iCn3D can also display electron density maps or electron microscopy (EM) density maps, and export files for 3D printing. The following example URL exemplifies some of the 1D/2D/3D representations: https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html?mmdbid=1TUP&showanno=1&show2d=1&showsets=1. AVAILABILITY AND IMPLEMENTATION: iCn3D is freely available to the public. Its source code is available at https://github.com/ncbi/icn3d. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Base Sequence; Computational Biology; Internet; Models, Molecular; Proteins; Software; Computational Biology/methods; Databases, Genetic; Molecular Conformation; Proteins/chemistry
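The shareable URL mentioned in the iCn3D abstract above can be assembled programmatically. The sketch below only reproduces the query parameters that appear in the abstract's example URL (`mmdbid`, `showanno`, `show2d`, `showsets`); the helper name `icn3d_url` is hypothetical, and the behaviour of any other parameter values is not covered by the source.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in the abstract.
ICN3D_BASE = "https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html"


def icn3d_url(mmdbid):
    """Build a URL with the same display flags as the abstract's example."""
    params = {
        "mmdbid": mmdbid,  # structure identifier, e.g. "1TUP"
        "showanno": 1,     # flag as used in the abstract's example URL
        "show2d": 1,       # flag as used in the abstract's example URL
        "showsets": 1,     # flag as used in the abstract's example URL
    }
    return f"{ICN3D_BASE}?{urlencode(params)}"


url = icn3d_url("1TUP")
print(url)
```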

4.

RefSeq: an update on prokaryotic genome annotation and curation.

Haft, Daniel H; DiCuccio, Michael; Badretdin, Azat; Brover, Vyacheslav; Chetvernin, Vyacheslav; O'Neill, Kathleen; Li, Wenjun; Chitsaz, Farideh; Derbyshire, Myra K; Gonzales, Noreen R; Gwadz, Marc; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Zheng, Chanjuan; Thibaud-Nissen, Françoise; Geer, Lewis Y; Marchler-Bauer, Aron; Pruitt, Kim D.

Nucleic Acids Res ; 46(D1): D851-D860, 2018 01 04.

Article in English | MEDLINE | ID: mdl-29112715

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence, and propagation of improved names onto RefSeq proteins ensures that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Subjects

Data Curation; Databases, Nucleic Acid; Genome; Molecular Sequence Annotation; Prokaryotic Cells; Archaea/genetics; Bacteria/genetics; Databases, Protein; Eukaryota/genetics; Forecasting; Humans; Sequence Homology; Software; Viruses/genetics

5.

CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.

Marchler-Bauer, Aron; Bo, Yu; Han, Lianyi; He, Jane; Lanczycki, Christopher J; Lu, Shennan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Geer, Lewis Y; Bryant, Stephen H.

Nucleic Acids Res ; 45(D1): D200-D203, 2017 01 04.

Article in English | MEDLINE | ID: mdl-27899674

ABSTRACT

NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subjects

Computational Biology/methods; Databases, Protein; Protein Interaction Domains and Motifs; Proteins; Information Dissemination; Internet; Proteins/chemistry; Proteins/classification; Proteins/genetics

6.

Target enhanced 2D similarity search by using explicit biological activity annotations and profiles.

Yu, Xiang; Geer, Lewis Y; Han, Lianyi; Bryant, Stephen H.

J Cheminform ; 7: 55, 2015.

Article in English | MEDLINE | ID: mdl-26583046

ABSTRACT

BACKGROUND: The enriched biological activity information of compounds in large and freely-accessible chemical databases like the PubChem Bioassay Database has become a powerful research resource for the scientific research community. Currently, 2D fingerprint based conventional similarity search (CSS) is the most widely used approach for database screening, but it does not typically incorporate the relative importance of fingerprint bits to biological activity. RESULTS: In this study, a large-scale similarity search investigation has been carried out on 208 well-defined compound activity classes extracted from the PubChem Bioassay Database. An analysis was performed to compare the search performance of three types of 2D similarity search approaches: 2D fingerprint based conventional similarity search approach (CSS), iterative similarity search approach with multiple active compounds as references (ISS), and fingerprint based iterative similarity search with classification (ISC), which can be regarded as the combination of iterative similarity search with active references and a reversed iterative similarity search with inactive references. Compared to the search results returned by CSS, ISS improves recall but not precision. Although ISC causes the false rejection of active hits, it improves the precision with statistical significance, and outperforms both ISS and CSS. In a second part of this study, we introduce the profile concept into the three types of searches. We find that the profile based non-iterative search can significantly improve the search performance by increasing the recall rate. We also find that profile based ISS (PBISS) and profile based ISC (PBISC) significantly decrease ISS search time without sacrificing search performance. CONCLUSIONS: On the basis of our large-scale investigation directed against a wide spectrum of pharmaceutical targets, we conclude that ISC and ISS searches perform better than 2D fingerprint similarity searching and that profile based versions of these algorithms do nearly as well in less time. We also suggest that the profile versions of the iterative similarity searches are both better performing and potentially quicker than the standard algorithm.
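The difference between a single-reference search (CSS) and a search with multiple active references, as contrasted in the abstract above, can be illustrated with a small sketch. Several things here are assumptions rather than details from the source: the Tanimoto coefficient as the similarity measure, the max-over-references scoring rule, the fixed threshold, and the toy fingerprints, which are sets of "on" bit positions. The iterative re-seeding of ISS and the inactive-reference filter of ISC are not shown.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(references, library, threshold):
    """Return (name, score) hits, scoring each compound by its best reference."""
    hits = []
    for name, fingerprint in library.items():
        score = max(tanimoto(ref, fingerprint) for ref in references)
        if score >= threshold:
            hits.append((name, score))
    # Highest-scoring hits first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy library of hypothetical compounds.
library = {
    "A": {1, 2, 3, 5},
    "B": {7, 8, 9},
    "C": {1, 2, 3, 4, 6},
}
query = {1, 2, 3, 4}
second_active = {7, 8, 9, 10}

single_ref_hits = similarity_search([query], library, threshold=0.5)
multi_ref_hits = similarity_search([query, second_active], library, threshold=0.5)
print([name for name, _ in single_ref_hits])
print([name for name, _ in multi_ref_hits])
```

In this toy case the second active reference recovers compound B, which the single query misses. That mirrors the abstract's observation that multiple references improve recall, though it says nothing about precision.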

7.

CDD: NCBI's conserved domain database.

Marchler-Bauer, Aron; Derbyshire, Myra K; Gonzales, Noreen R; Lu, Shennan; Chitsaz, Farideh; Geer, Lewis Y; Geer, Renata C; He, Jane; Gwadz, Marc; Hurwitz, David I; Lanczycki, Christopher J; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Bryant, Stephen H.

Nucleic Acids Res ; 43(Database issue): D222-6, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25414356

ABSTRACT

NCBI's CDD, the Conserved Domain Database, enters its 15th year as a public resource for the annotation of proteins with the location of conserved domain footprints. Going forward, we strive to improve the coverage and consistency of domain annotation provided by CDD. We maintain a live search system as well as an archive of pre-computed domain annotation for sequences tracked in NCBI's Entrez protein database, which can be retrieved for single sequences or in bulk. We also maintain import procedures so that CDD contains domain models and domain definitions provided by several collections available in the public domain, as well as those produced by an in-house curation effort. The curation effort aims at increasing coverage and providing finer-grained classifications of common protein domains, for which a wealth of functional and structural data has become available. CDD curation generates alignment models of representative sequence fragments, which are in agreement with domain boundaries as observed in protein 3D structure, and which model the structurally conserved cores of domain families as well as annotate conserved features. CDD can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subjects

Databases, Protein; Protein Structure, Tertiary; Amino Acid Motifs; Amino Acid Sequence; Conserved Sequence; Data Curation

8.

CDD: conserved domains and protein three-dimensional structure.

Marchler-Bauer, Aron; Zheng, Chanjuan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Lewis Y; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lanczycki, Christopher J; Lu, Fu; Lu, Shennan; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Zhang, Dachuan; Bryant, Stephen H.

Nucleic Acids Res ; 41(Database issue): D348-52, 2013 Jan.

Article in English | MEDLINE | ID: mdl-23197659

ABSTRACT

CDD, the Conserved Domain Database, is part of NCBI's Entrez query and retrieval system and is also accessible via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. CDD provides annotation of protein sequences with the location of conserved domain footprints and functional sites inferred from these footprints. Pre-computed annotation is available via Entrez, and interactive search services accept single protein or nucleotide queries, as well as batch submissions of protein query sequences, utilizing RPS-BLAST to rapidly identify putative matches. CDD incorporates several protein domain and full-length protein model collections, and maintains an active curation effort that aims at providing fine grained classifications for major and well-characterized protein domain families, as supported by available protein three-dimensional (3D) structure and the published literature. To this date, the majority of protein 3D structures are represented by models tracked by CDD, and CDD curators are characterizing novel families that emerge from protein structure determination efforts.

Subjects

Databases, Protein; Protein Conformation; Protein Structure, Tertiary; Amino Acid Sequence; Conserved Sequence; Internet; Models, Molecular; Molecular Sequence Annotation; Proteins/chemistry; Proteins/classification; Proteins/genetics; Sequence Analysis, Protein

9.

Analysis of the acidic proteome with negative electron-transfer dissociation mass spectrometry.

McAlister, Graeme C; Russell, Jason D; Rumachik, Neil G; Hebert, Alexander S; Syka, John E P; Geer, Lewis Y; Westphall, Michael S; Pagliarini, David J; Coon, Joshua J.

Anal Chem ; 84(6): 2875-82, 2012 Mar 20.

Article in English | MEDLINE | ID: mdl-22335612

ABSTRACT

We describe the first implementation of negative electron-transfer dissociation (NETD) on a hybrid ion trap-orbitrap mass spectrometer and its application to high-throughput sequencing of peptide anions. NETD, coupled with high pH separations, negative electrospray ionization (ESI), and an NETD compatible version of OMSSA, is part of a complete workflow that includes the formation, interrogation, and sequencing of peptide anions. Together these interlocking pieces facilitated the identification of more than 2000 unique peptides from Saccharomyces cerevisiae representing the most comprehensive analysis of peptide anions by tandem mass spectrometry to date. The same S. cerevisiae samples were interrogated using traditional, positive modes of peptide LC-MS/MS analysis (e.g., acidic LC separations, positive ESI, and collision activated dissociation), and the resulting peptide identifications of the different workflows were compared. Due to a decreased flux of peptide anions and a tendency to produce lowly charged precursors, the NETD-based LC-MS/MS workflow was not as sensitive as the positive mode methods. However, the use of NETD readily permits access to underrepresented acidic portions of the proteome by identifying peptides that tend to have lower pI values. As such, NETD improves sequence coverage, filling out the acidic portions of proteins that are often overlooked by the other methods.

Subjects

Fungal Proteins/analysis; Peptides/analysis; Proteome/analysis; Proteomics/methods; Saccharomyces cerevisiae/chemistry; Spectrometry, Mass, Electrospray Ionization/methods; Amino Acid Sequence; Chromatography, Liquid/methods; Hydrogen-Ion Concentration; Molecular Sequence Data

10.

Increasing peptide identifications and decreasing search times for ETD spectra by pre-processing and calculation of parent precursor charge.

Sridhara, Viswanadham; Bai, Dina L; Chi, An; Shabanowitz, Jeffrey; Hunt, Donald F; Bryant, Stephen H; Geer, Lewis Y.

Proteome Sci ; 10(1): 8, 2012 Feb 09.

Article in English | MEDLINE | ID: mdl-22321509

ABSTRACT

BACKGROUND: Electron Transfer Dissociation [ETD] can dissociate multiply charged precursor polypeptides, providing extensive peptide backbone cleavage. ETD spectra contain charge reduced precursor peaks, usually of high intensity, whose pattern is dependent on the parent precursor charge. These charge reduced precursor peaks and associated neutral loss peaks should be removed before these spectra are searched for peptide identifications. ETD spectra can also contain ion-types other than c and z•. Modifying search strategies to accommodate these ion-types may aid in increased peptide identifications. Additionally, if the precursor mass is measured using a lower resolution instrument such as a linear ion trap, the charge of the precursor is often not known, reducing sensitivity and increasing search times. We implemented algorithms to remove these precursor peaks, accommodate new ion-types in the noise filtering routine in OMSSA and to estimate any unknown precursor charge, using Linear Discriminant Analysis [LDA]. RESULTS: Spectral pre-processing to remove precursor peaks and their associated neutral losses prior to protein sequence library searches resulted in a 9.8% increase in peptide identifications at a 1% False Discovery Rate [FDR] compared to the previous OMSSA filter. Modifications to the OMSSA noise filter to accommodate various ion-types resulted in a further 4.2% increase in peptide identifications at 1% FDR. Moreover, searching ETD spectra with charge states obtained from the precursor charge determination algorithm is shown to be up to 3.5 times faster than the general range search method, with a minor 3.8% increase in sensitivity. CONCLUSION: Overall, there is an 18.8% increase in peptide identifications at 1% FDR by incorporating the new precursor filter, noise filter and by using the charge determination algorithm, when compared to previous versions of OMSSA.
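The precursor-peak removal step described in the abstract above can be illustrated with a simplified sketch. It is not the OMSSA filter: it assumes that a charge-reduced precursor keeps approximately the same total mass as the original precursor, so that a precursor at m/z `p` with charge `z` reappears near `p * z / k` for each lower charge `k`. It uses an arbitrary m/z tolerance and ignores the associated neutral-loss peaks. All names and peak values are hypothetical.

```python
def charge_reduced_mzs(precursor_mz, charge):
    """Approximate m/z of charge-reduced precursors for charges 1..charge-1."""
    total = precursor_mz * charge  # roughly conserved when only electrons are added
    return [total / k for k in range(1, charge)]


def remove_precursor_peaks(peaks, precursor_mz, charge, tol=2.0):
    """Drop peaks within `tol` m/z of the precursor or a charge-reduced form."""
    targets = [precursor_mz] + charge_reduced_mzs(precursor_mz, charge)
    return [
        (mz, intensity)
        for mz, intensity in peaks
        if all(abs(mz - target) > tol for target in targets)
    ]


# Hypothetical ETD spectrum as (m/z, intensity) pairs; precursor at m/z 500, charge 3.
spectrum = [
    (300.2, 10.0),
    (500.1, 900.0),   # unreacted precursor
    (750.3, 800.0),   # charge-reduced to 2+
    (820.4, 25.0),
    (1499.5, 400.0),  # charge-reduced to 1+
]
filtered = remove_precursor_peaks(spectrum, precursor_mz=500.0, charge=3)
print(filtered)
```

When the precursor charge is unknown, the target list cannot be computed directly. This is one reason the abstract pairs the filter with a charge-determination step.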

11.

MMDB: 3D structures and macromolecular interactions.

Madej, Thomas; Addess, Kenneth J; Fong, Jessica H; Geer, Lewis Y; Geer, Renata C; Lanczycki, Christopher J; Liu, Chunlei; Lu, Shennan; Marchler-Bauer, Aron; Panchenko, Anna R; Chen, Jie; Thiessen, Paul A; Wang, Yanli; Zhang, Dachuan; Bryant, Stephen H.

Nucleic Acids Res ; 40(Database issue): D461-4, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22135289

ABSTRACT

Close to 60% of protein sequences tracked in comprehensive databases can be mapped to a known three-dimensional (3D) structure by standard sequence similarity searches. Potentially, a great deal can be learned about proteins or protein families of interest from considering 3D structure, and to this day 3D structure data may remain an underutilized resource. Here we present enhancements in the Molecular Modeling Database (MMDB) and its data presentation, specifically pertaining to biologically relevant complexes and molecular interactions. MMDB is tightly integrated with NCBI's Entrez search and retrieval system, and mirrors the contents of the Protein Data Bank. It links protein 3D structure data with sequence data, sequence classification resources and PubChem, a repository of small-molecule chemical structures and their biological activities, facilitating access to 3D structure data not only for structural biologists, but also for molecular biologists and chemists. MMDB provides a complete set of detailed and pre-computed structural alignments obtained with the VAST algorithm, and provides visualization tools for 3D structure and structure/sequence alignment via the molecular graphics viewer Cn3D. MMDB can be accessed at http://www.ncbi.nlm.nih.gov/structure.

Subjects

Databases, Protein; Models, Molecular; Protein Conformation; Sequence Analysis, Protein

12.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; Dicuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Krasnov, Sergey; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Karsch-Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian.

Nucleic Acids Res ; 40(Database issue): D13-25, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22140104

ABSTRACT

In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.

Subjects

Databases as Topic; Databases, Genetic; Databases, Protein; Gene Expression; Genomics; Internet; Models, Molecular; National Library of Medicine (U.S.); Periodicals as Topic; PubMed; Sequence Alignment; Sequence Analysis, DNA; Sequence Analysis, Protein; Sequence Analysis, RNA; Small Molecule Libraries; United States

13.

Automated annotation of chemical names in the literature with tunable accuracy.

Zhang, Jun D; Geer, Lewis Y; Bolton, Evan E; Bryant, Stephen H.

J Cheminform ; 3(1): 52, 2011 Nov 22.

Article in English | MEDLINE | ID: mdl-22107874

ABSTRACT

BACKGROUND: A significant portion of the biomedical and chemical literature refers to small molecules. The accurate identification and annotation of compound names that are relevant to the topic of the given literature can establish links between scientific publications and various chemical and life science databases. Manual annotation is the preferred method for these works because well-trained indexers can understand the paper topics as well as recognize key terms. However, considering the hundreds of thousands of new papers published annually, an automatic annotation system with high precision and relevance can be a useful complement to manual annotation. RESULTS: An automated chemical name annotation system, MeSH Automated Annotations (MAA), was developed to annotate small molecule names in scientific abstracts with tunable accuracy. This system aims to reproduce the MeSH term annotations on biomedical and chemical literature that would be created by indexers. When automated free text matching was compared with manual indexing of 26 thousand MEDLINE abstracts, more than 40% of the annotations were false-positive (FP) cases. To reduce the FP rate, MAA incorporated several filters to remove "incorrect" annotations caused by nonspecific, partial, and low relevance chemical names. In part, relevance was measured by the position of the chemical name in the text. Tunable accuracy was obtained by adding or restricting the sections of the text scanned for chemical names. The best precision obtained was 96% with a 28% recall rate. The best performance of MAA, as measured with the F statistic, was 66%, which compares favorably to other chemical name annotation systems. CONCLUSIONS: Accurate chemical name annotation can help researchers not only identify important chemical names in abstracts, but also match unindexed and unstructured abstracts to chemical records. The current work is tested against MEDLINE, but the algorithm is not specific to this corpus and it is possible that the algorithm can be applied to papers from chemical physics, material, polymer and environmental science, as well as patents, biological assay descriptions and other textual data.
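The trade-off between precision and recall reported in the abstract above can be made concrete with the F-measure. The sketch below assumes that the "F statistic" is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not say which variant was used, and the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Operating point with the best precision reported in the abstract.
f_at_best_precision = f_measure(0.96, 0.28)
print(f"F1 at 96% precision / 28% recall: {f_at_best_precision:.3f}")
```

Under this assumption the best-precision setting scores well below the reported best F of 66%. That is consistent with the abstract's description of tunable accuracy: the best F value would come from a different, more balanced setting of the scanned text sections.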

14.

Automatic annotation of experimentally derived, evolutionarily conserved post-translational modifications onto multiple genomes.

Sridhara, Viswanadham; Marchler-Bauer, Aron; Bryant, Stephen H; Geer, Lewis Y.

Database (Oxford) ; 2011: bar019, 2011.

Article in English | MEDLINE | ID: mdl-21571812

ABSTRACT

New generation sequencing technologies have resulted in significant increases in the number of complete genomes. Functional characterization of these genomes, such as by high-throughput proteomics, is an important but challenging task due to the difficulty of scaling up existing experimental techniques. By use of comparative genomics techniques, experimental results can be transferred from one genome to another, while at the same time minimizing errors by requiring discovery in multiple genomes. In this study, protein phosphorylation, an essential component of many cellular processes, is studied using data from large-scale proteomics analyses of the phosphoproteome. Phosphorylation sites from Homo sapiens, Mus musculus and Drosophila melanogaster phosphopeptide data sets were mapped onto conserved domains in NCBI's manually curated portion of the Conserved Domain Database (CDD). In this subset, 25 phosphorylation sites are found to be evolutionarily conserved between the three species studied. Transfer of phosphorylation annotation of these conserved sites onto sequences sharing the same conserved domains yields 3253 phosphosite annotations for proteins from coelomata, the taxonomic division that spans H. sapiens, M. musculus and D. melanogaster. The method scales automatically, so as the amount of experimental phosphoproteomics data increases, more conserved phosphorylation sites may be revealed.

Subjects

Automation; Conserved Sequence/genetics; Evolution, Molecular; Genome/genetics; Molecular Sequence Annotation/methods; Protein Processing, Post-Translational/genetics; Algorithms; Amino Acid Sequence; Animals; Humans; Molecular Sequence Data; Protein Structure, Tertiary; Proteins/chemistry; Proteins/metabolism

15.

CDD: a Conserved Domain Database for the functional annotation of proteins.

Marchler-Bauer, Aron; Lu, Shennan; Anderson, John B; Chitsaz, Farideh; Derbyshire, Myra K; DeWeese-Scott, Carol; Fong, Jessica H; Geer, Lewis Y; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Jackson, John D; Ke, Zhaoxi; Lanczycki, Christopher J; Lu, Fu; Marchler, Gabriele H; Mullokandov, Mikhail; Omelchenko, Marina V; Robertson, Cynthia L; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Zhang, Dachuan; Zhang, Naigong; Zheng, Chanjuan; Bryant, Stephen H.

Nucleic Acids Res ; 39(Database issue): D225-9, 2011 Jan.

Article in English | MEDLINE | ID: mdl-21109532

ABSTRACT

NCBI's Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints. CDD includes manually curated domain models that make use of protein 3D structure to refine domain models and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. As CDD also imports domain family models from a variety of external sources, it is a partially redundant collection. To simplify protein annotation, redundant models and models describing homologous families are clustered into superfamilies. By default, domain footprints are annotated with the corresponding superfamily designation, on top of which specific annotation may indicate high-confidence assignment of family membership. Pre-computed domain annotation is available for proteins in the Entrez/Protein dataset, and a novel interface, Batch CD-Search, allows the computation and download of annotation for large sets of protein queries. CDD can be accessed via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subjects

Databases, Protein; Protein Structure, Tertiary; Amino Acid Sequence; Conserved Sequence; Models, Biological; Proteins/classification; Sequence Analysis, Protein

16.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; DiCuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian.

Nucleic Acids Res ; 39(Database issue): D38-51, 2011 Jan.

Article in English | MEDLINE | ID: mdl-21097890

ABSTRACT

In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Electronic PCR, OrfFinder, Splign, ProSplign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), IBIS, Biosystems, Peptidome, OMSSA, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.

Subjects

Databases, Genetic; Databases, Protein; Gene Expression; Genomics; National Library of Medicine (U.S.); Protein Structure, Tertiary; PubMed; Sequence Alignment; Sequence Analysis, DNA; Sequence Analysis, RNA; Software; Systems Integration; United States

17.

The NCBI BioSystems database.

Geer, Lewis Y; Marchler-Bauer, Aron; Geer, Renata C; Han, Lianyi; He, Jane; He, Siqian; Liu, Chunlei; Shi, Wenyao; Bryant, Stephen H.

Nucleic Acids Res ; 38(Database issue): D492-6, 2010 Jan.

Article in English | MEDLINE | ID: mdl-19854944

ABSTRACT

The NCBI BioSystems database, found at http://www.ncbi.nlm.nih.gov/biosystems/, centralizes and cross-links existing biological systems databases, increasing their utility and target audience by integrating their pathways and systems into NCBI resources. This integration allows users of NCBI's Entrez databases to quickly categorize proteins, genes and small molecules by metabolic pathway, disease state or other BioSystem type, without requiring time-consuming inference of biological relationships from the literature or multiple experimental datasets.

Subjects

Computational Biology/methods; Databases, Genetic; Databases, Nucleic Acid; Systems Biology; Animals; Cell Membrane/metabolism; Computational Biology/trends; Databases, Protein; Genes; Genomics; Humans; Information Storage and Retrieval/methods; Internet; National Library of Medicine (U.S.); Software; United States

18.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; DiCuccio, Michael; Federhen, Scott; Feolo, Michael; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian.

Nucleic Acids Res ; 38(Database issue): D5-16, 2010 Jan.

Article in English | MEDLINE | ID: mdl-19910364

ABSTRACT

In addition to maintaining the GenBank nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, Reference Sequence, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Entrez Probe, GENSAT, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Peptidome, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.

Subjects

Computational Biology/methods; Databases, Genetic; Databases, Nucleic Acid; Algorithms; Animals; Computational Biology/trends; Databases, Protein; Genome, Bacterial; Genome, Viral; Humans; Information Storage and Retrieval/methods; Internet; National Institutes of Health (U.S.); National Library of Medicine (U.S.); Software; United States

19.

CDD: specific functional annotation with the Conserved Domain Database.

Marchler-Bauer, Aron; Anderson, John B; Chitsaz, Farideh; Derbyshire, Myra K; DeWeese-Scott, Carol; Fong, Jessica H; Geer, Lewis Y; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; He, Siqian; Hurwitz, David I; Jackson, John D; Ke, Zhaoxi; Lanczycki, Christopher J; Liebert, Cynthia A; Liu, Chunlei; Lu, Fu; Lu, Shennan; Marchler, Gabriele H; Mullokandov, Mikhail; Song, James S; Tasneem, Asba; Thanki, Narmada; Yamashita, Roxanne A; Zhang, Dachuan; Zhang, Naigong; Bryant, Stephen H.

Nucleic Acids Res ; 37(Database issue): D205-10, 2009 Jan.

Article in English | MEDLINE | ID: mdl-18984618

ABSTRACT

NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution. The collection can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Starting with the latest version of CDD, v2.14, information from redundant and homologous domain models is summarized at a superfamily level, and domain annotation on proteins is flagged as either 'specific' (identifying molecular function with high confidence) or as 'non-specific' (identifying superfamily membership only).

Subjects

Databases, Protein; Protein Structure, Tertiary; Amino Acid Sequence; Conserved Sequence; Proteins/classification; Sequence Alignment; Sequence Analysis, Protein

20.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; DiCuccio, Michael; Edgar, Ron; Federhen, Scott; Feolo, Michael; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Landsman, David; Lipman, David J; Madden, Thomas L; Maglott, Donna R; Miller, Vadim; Mizrachi, Ilene; Ostell, James; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Yaschenko, Eugene; Ye, Jian.

Nucleic Acids Res ; 37(Database issue): D5-15, 2009 Jan.

Article in English | MEDLINE | ID: mdl-18940862

ABSTRACT

In addition to maintaining the GenBank nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs), Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART) and the PubChem suite of small molecule databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.

Subjects

Databases, Genetic; Gene Expression; Genes; Genomics; Genotype; National Library of Medicine (U.S.); Phenotype; Protein Structure, Tertiary; Proteomics; PubMed; Sequence Homology; Systems Integration; United States
