Results 1 - 16 of 16
1.
Nucleic Acids Res ; 51(D1): D384-D388, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36477806

ABSTRACT

NLM's conserved domain database (CDD) is a collection of protein domain and protein family models constructed as multiple sequence alignments. Its main purpose is to provide annotation for protein and translated nucleotide sequences with the location of domain footprints and associated functional sites, and to define protein domain architecture as a basis for assigning gene product names and putative/predicted function. CDD has been available publicly for over 20 years and has grown substantially during that time. Maintaining an archive of pre-computed annotation continues to be a challenge and has slowed down the cadence of CDD releases. CDD curation staff builds hierarchical classifications of large protein domain families, adds models for novel domain families via surveillance of the protein 'dark matter' that currently lacks annotation, and now spends considerable effort on providing names and attribution for conserved domain architectures. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.


Subject(s)
Databases, Protein , Proteins , Humans , Amino Acid Sequence , Conserved Sequence , Protein Structure, Tertiary , Proteins/chemistry , Proteins/genetics , Protein Domains
2.
Nucleic Acids Res ; 49(D1): D1020-D1028, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33270901

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genome, Archaeal/genetics , Genome, Bacterial/genetics , Molecular Sequence Annotation/methods , Proteins/genetics , Data Curation/methods , Data Mining/methods , Genomics/methods , Internet , Proteins/classification , User-Computer Interface
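
The coverage figures in the abstract above are rounded, so a quick arithmetic check can help when citing them. The sketch below uses only the two numbers stated in the abstract (more than 122 million named proteins, 79%) plus the rounded total of 150 million; the function names are illustrative and not part of any NCBI tool.

```python
def implied_total(named_proteins: float, named_fraction: float) -> float:
    """Total protein count implied by a named count and the fraction it represents."""
    return named_proteins / named_fraction


def share_of(named_proteins: float, total_proteins: float) -> float:
    """Fraction of proteins that are named, given an explicit total."""
    return named_proteins / total_proteins


if __name__ == "__main__":
    # 122 million at 79% implies a total a little above the rounded "150 million".
    print(round(implied_total(122e6, 0.79) / 1e6, 1), "million proteins implied")
    # Using the rounded total directly overstates the share slightly.
    print(round(share_of(122e6, 150e6), 3), "share if the total were exactly 150 million")
```

The two figures agree once "150 million" is read as a rounded value of roughly 154 million.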
3.
Curr Protoc Bioinformatics ; 69(1): e90, 2020 03.
Article in English | MEDLINE | ID: mdl-31851420

ABSTRACT

The Conserved Domain Database (CDD) is a freely available resource for the annotation of sequences with the locations of conserved protein domain footprints, as well as functional sites and motifs inferred from these footprints. It includes protein domain and protein family models curated in house by CDD staff, as well as imported from a variety of other sources. The latest CDD release (v3.17, April 2019) contains more than 57,000 domain models, of which almost 15,000 were curated by CDD staff. The CDD curation effort increases coverage and provides finer-grained classifications of common and widely distributed protein domain families, for which a wealth of functional and structural data have become available. The CDD maintains both live search capabilities and an archive of pre-computed domain annotations for a selected subset of sequences tracked by the NCBI's Entrez protein database. These can be retrieved or computed for a single sequence using CD-Search or in bulk using Batch CD-Search, or computed via standalone RPS-BLAST plus the rpsbproc software package. The CDD can be accessed via https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The three protocols listed here describe how to perform a CD-Search (Basic Protocol 1), a Batch CD-Search (Basic Protocol 2), and a Standalone RPS-BLAST and rpsbproc (Basic Protocol 3). © 2019 The Authors.


Subject(s)
Computational Biology/methods , Conserved Sequence , Databases, Protein , Proteins/chemistry , Amino Acid Sequence , Guidelines as Topic , Phylogeny , Protein Domains
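
The standalone route named in the abstract above (RPS-BLAST followed by rpsbproc) can be sketched as two command lines built in Python. This is a minimal sketch, not the published protocol. The file names and the database name `Cdd` are placeholders. The `rpsblast` options shown are standard BLAST+ options. The `rpsbproc` options (`-i`, `-o`, `-e`, `-m`) are written from memory of its README and should be treated as assumptions to verify against the installed version.

```python
import shlex


def build_rpsblast_cmd(query_fasta, cdd_db, out_archive, evalue=0.01):
    """Build the argv list for an RPS-BLAST run that writes a BLAST archive (outfmt 11)."""
    return [
        "rpsblast",
        "-query", str(query_fasta),
        "-db", str(cdd_db),
        "-evalue", str(evalue),
        "-outfmt", "11",          # BLAST archive format, used here as rpsbproc input
        "-out", str(out_archive),
    ]


def build_rpsbproc_cmd(in_archive, out_table, evalue=0.01, mode="rep"):
    """Build the argv list for post-processing with rpsbproc (flags are assumptions)."""
    return [
        "rpsbproc",
        "-i", str(in_archive),
        "-o", str(out_table),
        "-e", str(evalue),
        "-m", mode,               # assumed data mode selector
    ]


if __name__ == "__main__":
    # Placeholder paths; nothing is executed, the commands are only printed.
    for cmd in (
        build_rpsblast_cmd("query.fasta", "Cdd", "query.asn"),
        build_rpsbproc_cmd("query.asn", "query.domains.txt"),
    ):
        print(shlex.join(cmd))
```

To run the commands, each list can be passed to `subprocess.run(cmd, check=True)` once both programs and a local CDD database are installed.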
4.
Nucleic Acids Res ; 48(D1): D265-D268, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31777944

ABSTRACT

As NLM's Conserved Domain Database (CDD) enters its 20th year of operations as a publicly available resource, CDD curation staff continues to develop hierarchical classifications of widely distributed protein domain families, and to record conserved sites associated with molecular function, so that they can be mapped onto user queries in support of hypothesis-driven biomolecular research. CDD offers both an archive of pre-computed domain annotations as well as live search services for both single protein or nucleotide queries and larger sets of protein query sequences. CDD staff has continued to characterize protein families via conserved domain architectures and has built up a significant corpus of curated domain architectures in support of naming bacterial proteins in RefSeq. These architecture definitions are available via SPARCLE, the Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.


Subject(s)
Databases, Protein , Protein Domains , Amino Acid Sequence , Conserved Sequence
5.
Nucleic Acids Res ; 46(D1): D851-D860, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29112715

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule: BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.


Subject(s)
Data Curation , Databases, Nucleic Acid , Genome , Molecular Sequence Annotation , Prokaryotic Cells , Archaea/genetics , Bacteria/genetics , Databases, Protein , Eukaryota/genetics , Forecasting , Humans , Sequence Homology , Software , Viruses/genetics
6.
Nucleic Acids Res ; 45(D1): D200-D203, 2017 01 04.
Article in English | MEDLINE | ID: mdl-27899674

ABSTRACT

NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.


Subject(s)
Computational Biology/methods , Databases, Protein , Protein Interaction Domains and Motifs , Proteins , Information Dissemination , Internet , Proteins/chemistry , Proteins/classification , Proteins/genetics
7.
Nucleic Acids Res ; 43(Database issue): D222-6, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25414356

ABSTRACT

NCBI's CDD, the Conserved Domain Database, enters its 15th year as a public resource for the annotation of proteins with the location of conserved domain footprints. Going forward, we strive to improve the coverage and consistency of domain annotation provided by CDD. We maintain a live search system as well as an archive of pre-computed domain annotation for sequences tracked in NCBI's Entrez protein database, which can be retrieved for single sequences or in bulk. We also maintain import procedures so that CDD contains domain models and domain definitions provided by several collections available in the public domain, as well as those produced by an in-house curation effort. The curation effort aims at increasing coverage and providing finer-grained classifications of common protein domains, for which a wealth of functional and structural data has become available. CDD curation generates alignment models of representative sequence fragments, which are in agreement with domain boundaries as observed in protein 3D structure, and which model the structurally conserved cores of domain families as well as annotate conserved features. CDD can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Motifs , Amino Acid Sequence , Conserved Sequence , Data Curation
8.
Bioinformatics ; 31(1): 134-6, 2015 Jan 01.
Article in English | MEDLINE | ID: mdl-25212755

ABSTRACT

MOTIVATION: cddApp is a Cytoscape extension that supports the annotation of protein networks with information about domains and specific functional sites from the National Center for Biotechnology Information's conserved domain database (CDD). CDD information is loaded for nodes annotated with NCBI numbers or UniProt identifiers and (optionally) Protein Data Bank structures. cddApp integrates with the Cytoscape apps structureViz2 and enhancedGraphics. Together, these three apps provide powerful tools to annotate nodes with CDD domain and site information and visualize that information in both network and structural contexts. AVAILABILITY AND IMPLEMENTATION: cddApp is written in Java and freely available for download from the Cytoscape app store (http://apps.cytoscape.org). Documentation is provided at http://www.rbvi.ucsf.edu/cytoscape, and the source is publicly available from GitHub at http://github.com/RBVI/cddApp.


Subject(s)
Bacterial Proteins/metabolism , Computational Biology/instrumentation , Metabolic Networks and Pathways , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Bacillus , Bacterial Proteins/chemistry , Conserved Sequence , Databases, Protein , Humans , Protein Conformation , Protein Interaction Mapping
9.
Nucleic Acids Res ; 41(Database issue): D348-52, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23197659

ABSTRACT

CDD, the Conserved Domain Database, is part of NCBI's Entrez query and retrieval system and is also accessible via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. CDD provides annotation of protein sequences with the location of conserved domain footprints and functional sites inferred from these footprints. Pre-computed annotation is available via Entrez, and interactive search services accept single protein or nucleotide queries, as well as batch submissions of protein query sequences, utilizing RPS-BLAST to rapidly identify putative matches. CDD incorporates several protein domain and full-length protein model collections, and maintains an active curation effort that aims at providing fine-grained classifications for major and well-characterized protein domain families, as supported by available protein three-dimensional (3D) structure and the published literature. To date, the majority of protein 3D structures are represented by models tracked by CDD, and CDD curators are characterizing novel families that emerge from protein structure determination efforts.


Subject(s)
Databases, Protein , Protein Conformation , Protein Structure, Tertiary , Amino Acid Sequence , Conserved Sequence , Internet , Models, Molecular , Molecular Sequence Annotation , Proteins/chemistry , Proteins/classification , Proteins/genetics , Sequence Analysis, Protein
10.
Nucleic Acids Res ; 39(Database issue): D225-9, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21109532

ABSTRACT

NCBI's Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints. CDD includes manually curated domain models that make use of protein 3D structure to refine domain models and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. As CDD also imports domain family models from a variety of external sources, it is a partially redundant collection. To simplify protein annotation, redundant models and models describing homologous families are clustered into superfamilies. By default, domain footprints are annotated with the corresponding superfamily designation, on top of which specific annotation may indicate high-confidence assignment of family membership. Pre-computed domain annotation is available for proteins in the Entrez/Protein dataset, and a novel interface, Batch CD-Search, allows the computation and download of annotation for large sets of protein queries. CDD can be accessed via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Sequence , Conserved Sequence , Models, Biological , Proteins/classification , Sequence Analysis, Protein
11.
Nucleic Acids Res ; 37(Database issue): D205-10, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18984618

ABSTRACT

NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution. The collection can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Starting with the latest version of CDD, v2.14, information from redundant and homologous domain models is summarized at a superfamily level, and domain annotation on proteins is flagged as either 'specific' (identifying molecular function with high confidence) or as 'non-specific' (identifying superfamily membership only).


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Sequence , Conserved Sequence , Proteins/classification , Sequence Alignment , Sequence Analysis, Protein
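
The 'specific' versus 'non-specific' flagging described in the abstract above can be illustrated with a small sketch. The abstract does not say how the decision is made; the per-model bit-score threshold used here is an assumption for illustration, and all model and superfamily names are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    model: str                 # hypothetical domain model name
    superfamily: str           # hypothetical superfamily the model belongs to
    bit_score: float           # score of the query against the model
    specific_threshold: float  # assumed per-model cutoff for a 'specific' call


def label_hit(hit: DomainHit) -> tuple[str, str]:
    """Return (hit type, reported name) for one domain hit.

    A hit at or above the assumed threshold is reported under the model itself;
    otherwise only superfamily membership is reported.
    """
    if hit.bit_score >= hit.specific_threshold:
        return ("specific", hit.model)
    return ("non-specific", hit.superfamily)


if __name__ == "__main__":
    hits = [
        DomainHit("model_A", "superfamily_X", bit_score=180.0, specific_threshold=150.0),
        DomainHit("model_B", "superfamily_X", bit_score=60.0, specific_threshold=150.0),
    ]
    for hit in hits:
        print(hit.model, label_hit(hit))
```

Reporting the superfamily name for weaker hits keeps the annotation conservative: it claims common ancestry without asserting a particular molecular function.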
12.
Nucleic Acids Res ; 35(Database issue): D237-40, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17135202

ABSTRACT

The conserved domain database (CDD) is part of NCBI's Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. Entrez's global query interface can be accessed at http://www.ncbi.nlm.nih.gov/Entrez and will search CDD and many other databases. Domain annotation for proteins in Entrez has been pre-computed and is readily available in the form of 'Conserved Domain' links. Novel protein sequences can be scanned against CDD using the CD-Search service; this service searches databases of CDD-derived profile models with protein sequence queries using BLAST heuristics, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Protein query sequences submitted to NCBI's protein BLAST search service are scanned for conserved domain signatures by default. The CDD collection contains models imported from Pfam, SMART and COG, as well as domain models curated at NCBI. NCBI curated models are organized into hierarchies of domains related by common descent. Here we report on the status of the curation effort and present a novel helper application, CDTree, which enables users of the CDD resource to examine curated hierarchies. More importantly, CDD and CDTree, used in concert, serve as a powerful tool in protein classification, as they allow users to analyze protein sequences in the context of domain family hierarchies.


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Sequence , Animals , Conserved Sequence , Internet , Phylogeny , Protein Structure, Tertiary/genetics , Proteins/classification , Sequence Analysis, Protein , User-Computer Interface
13.
Nucleic Acids Res ; 33(Database issue): D192-6, 2005 Jan 01.
Article in English | MEDLINE | ID: mdl-15608175

ABSTRACT

The Conserved Domain Database (CDD) is the protein classification component of NCBI's Entrez query and retrieval system. CDD is linked to other Entrez databases such as Proteins, Taxonomy and PubMed, and can be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. CD-Search, which is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi, is a fast, interactive tool to identify conserved domains in new protein sequences. CD-Search results for protein sequences in Entrez are pre-computed to provide links between proteins and domain models, and computational annotation visible upon request. Protein-protein queries submitted to NCBI's BLAST search service at http://www.ncbi.nlm.nih.gov/BLAST are scanned for the presence of conserved domains by default. While CDD started out as essentially a mirror of publicly available domain alignment collections, such as SMART, Pfam and COG, we have continued an effort to update, and in some cases replace these models with domain hierarchies curated at the NCBI. Here, we report on the progress of the curation effort and associated improvements in the functionality of the CDD information retrieval system.


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Proteins/classification , Amino Acid Sequence , Conserved Sequence , Phylogeny , Sequence Alignment , Sequence Analysis, Protein , User-Computer Interface
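
The abstract above points to CDD as an Entrez database (`db=cdd`). The legacy `query.fcgi` address is quoted as published. As an assumption not taken from the abstract, the sketch below builds a query URL for NCBI's E-utilities ESearch endpoint instead, which accepts the same database name. The search term is an arbitrary example, and no request is sent.

```python
from urllib.parse import urlencode

# E-utilities ESearch endpoint (an assumption here; the abstract cites query.fcgi).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def cdd_esearch_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that looks up CDD records matching a text term."""
    params = {
        "db": "cdd",        # Entrez database name, as in the abstract's URL
        "term": term,       # free-text query, e.g. a domain name
        "retmax": retmax,   # maximum number of record IDs to return
        "retmode": "json",  # ask for a JSON response instead of XML
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(cdd_esearch_url("zinc finger"))
```

The resulting URL can be fetched with `urllib.request.urlopen`; NCBI's usage guidelines on request rates and API keys apply to any automated use.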
14.
Nucleic Acids Res ; 31(1): 383-7, 2003 Jan 01.
Article in English | MEDLINE | ID: mdl-12520028

ABSTRACT

The Conserved Domain Database (CDD) is now indexed as a separate database within the Entrez system and linked to other Entrez databases such as MEDLINE®. This allows users to search for domain types by name, for example, or to view the domain architecture of any protein in Entrez's sequence database. CDD can be accessed on the World Wide Web at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. Users may also employ the CD-Search service to identify conserved domains in new sequences, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. CD-Search results, and pre-computed links from Entrez's protein database, are calculated using the RPS-BLAST algorithm and Position Specific Score Matrices (PSSMs) derived from CDD alignments. CD-Searches are also run by default for protein-protein queries submitted to BLAST® at http://www.ncbi.nlm.nih.gov/BLAST. CDD mirrors the publicly available domain alignment collections SMART and PFAM, and now also contains alignment models curated at NCBI. Structure information is used to identify the core substructure likely to be present in all family members, and to produce sequence alignments consistent with structure conservation. This alignment model allows NCBI curators to annotate 'columns' corresponding to functional sites conserved among family members.


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Sequence , Animals , Conserved Sequence , Information Storage and Retrieval , Models, Molecular , Sequence Alignment
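
To make the idea of a Position Specific Score Matrix concrete, the sketch below scores every ungapped window of a sequence against a toy three-column PSSM and reports the best one. It illustrates position-specific scoring only. It is not the RPS-BLAST algorithm, which adds search heuristics and gapped alignment. The matrix values and the default score for unlisted residues are invented for the example.

```python
def score_window(pssm, window, default=-1):
    """Sum the position-specific scores of a window as long as the PSSM."""
    if len(window) != len(pssm):
        raise ValueError("window length must equal the number of PSSM columns")
    return sum(column.get(residue, default) for column, residue in zip(pssm, window))


def best_window(pssm, sequence, default=-1):
    """Return (score, start index) of the highest-scoring ungapped window."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    scored = [
        (score_window(pssm, sequence[start:start + width], default), start)
        for start in range(len(sequence) - width + 1)
    ]
    # key on the score only, so the earliest window wins ties
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Toy PSSM: one dict per alignment column, residue -> score (invented values).
    toy_pssm = [
        {"G": 5, "A": 1},
        {"K": 4, "R": 2},
        {"S": 3, "T": 3},
    ]
    print(best_window(toy_pssm, "MAGKTL"))  # -> (12, 2)
```

Each column carries its own scores, so the same residue can be rewarded at one position and penalized at another. This is what lets a PSSM capture the conserved 'columns' that curators annotate as functional sites.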
15.
Nucleic Acids Res ; 31(1): 474-7, 2003 Jan 01.
Article in English | MEDLINE | ID: mdl-12520055

ABSTRACT

Three-dimensional structures are now known within most protein families and it is likely, when searching a sequence database, that one will identify a homolog of known structure. The goal of Entrez's 3D-structure database is to make structure information and the functional annotation it can provide easily accessible to molecular biologists. To this end, Entrez's search engine provides several powerful features: (i) links between databases, for example between a protein's sequence and structure; (ii) pre-computed sequence and structure neighbors; and (iii) structure and sequence/structure alignment visualization. Here, we focus on a new feature of Entrez's Molecular Modeling Database (MMDB): Graphical summaries of the biological annotation available for each 3D structure, based on the results of automated comparative analysis. MMDB is available at: http://www.ncbi.nlm.nih.gov/Entrez/structure.html.


Subject(s)
Databases, Protein , Models, Molecular , Structural Homology, Protein , Animals , Computer Graphics , Imaging, Three-Dimensional , Protein Structure, Tertiary , Proteins/chemistry
16.
Nucleic Acids Res ; 30(1): 249-52, 2002 Jan 01.
Article in English | MEDLINE | ID: mdl-11752307

ABSTRACT

Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features. (i) Sequence and structure neighbors; one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases; one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization; identifying a homolog with known structure, one may view molecular-graphic and alignment displays, to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.


Subject(s)
Databases, Protein , Proteins/chemistry , Animals , Computer Graphics , Humans , Imaging, Three-Dimensional , Information Storage and Retrieval , Internet , National Library of Medicine (U.S.) , Phylogeny , Protein Structure, Tertiary , Proteins/genetics , Sequence Alignment , Sequence Homology, Amino Acid , United States