Results 1 - 20 of 32
1.
bioRxiv ; 2024 Mar 25.
Article in English | MEDLINE | ID: mdl-38712272

ABSTRACT

Tens of thousands of influenza sequences are deposited into the GenBank database each year. The software tool FLAN has been used by GenBank since 2007 to validate and annotate incoming influenza sequence submissions, and has been publicly available as a webserver but not as a standalone tool. VADR is a general sequence validation and annotation software package used by GenBank for Norovirus, Dengue virus and SARS-CoV-2 virus sequence processing that is available as a standalone tool. We have created VADR influenza models based on the FLAN reference sequences and adapted VADR to accurately annotate influenza sequences. VADR and FLAN show consistent results on the vast majority of influenza sequences, and when they disagree VADR is usually correct. VADR can also accurately process influenza D sequences as well as influenza A H17, H18, H19, N10 and N11 subtype sequences, which FLAN cannot. VADR 1.6.3 and the associated influenza models are now freely available for users to download and use.

2.
Genome Res ; 34(3): 498-513, 2024 04 25.
Article in English | MEDLINE | ID: mdl-38508693

ABSTRACT

Hydractinia is a colonial marine hydroid that shows remarkable biological properties, including the capacity to regenerate its entire body throughout its lifetime, a process made possible by its adult migratory stem cells, known as i-cells. Here, we provide an in-depth characterization of the genomic structure and gene content of two Hydractinia species, Hydractinia symbiolongicarpus and Hydractinia echinata, placing them in a comparative evolutionary framework with other cnidarian genomes. We also generated and annotated a single-cell transcriptomic atlas for adult male H. symbiolongicarpus and identified cell-type markers for all major cell types, including key i-cell markers. Orthology analyses based on the markers revealed that Hydractinia's i-cells are highly enriched in genes that are widely shared amongst animals, a striking finding given that Hydractinia has a higher proportion of phylum-specific genes than any of the other 41 animals in our orthology analysis. These results indicate that Hydractinia's stem cells and early progenitor cells may use a toolkit shared with all animals, making it a promising model organism for future exploration of stem cell biology and regenerative medicine. The genomic and transcriptomic resources for Hydractinia presented here will enable further studies of their regenerative capacity, colonial morphology, and ability to distinguish self from nonself.


Subject(s)
Genome , Hydrozoa , Animals , Hydrozoa/genetics , Evolution, Molecular , Transcriptome , Stem Cells/metabolism , Male , Phylogeny , Single-Cell Analysis/methods
3.
bioRxiv ; 2023 Aug 27.
Article in English | MEDLINE | ID: mdl-37786714

ABSTRACT

Hydractinia is a colonial marine hydroid that exhibits remarkable biological properties, including the capacity to regenerate its entire body throughout its lifetime, a process made possible by its adult migratory stem cells, known as i-cells. Here, we provide an in-depth characterization of the genomic structure and gene content of two Hydractinia species, H. symbiolongicarpus and H. echinata, placing them in a comparative evolutionary framework with other cnidarian genomes. We also generated and annotated a single-cell transcriptomic atlas for adult male H. symbiolongicarpus and identified cell type markers for all major cell types, including key i-cell markers. Orthology analyses based on the markers revealed that Hydractinia's i-cells are highly enriched in genes that are widely shared amongst animals, a striking finding given that Hydractinia has a higher proportion of phylum-specific genes than any of the other 41 animals in our orthology analysis. These results indicate that Hydractinia's stem cells and early progenitor cells may use a toolkit shared with all animals, making it a promising model organism for future exploration of stem cell biology and regenerative medicine. The genomic and transcriptomic resources for Hydractinia presented here will enable further studies of their regenerative capacity, colonial morphology, and ability to distinguish self from non-self.

4.
NAR Genom Bioinform ; 5(1): lqad002, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36685728

ABSTRACT

In 2020 and 2021, >1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from validating and annotating them accurately. VADR has been updated to more accurately and rapidly annotate SARS-CoV-2 sequences. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64 GB to 2 GB per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. VADR is now nearly 1000 times faster than it was in early 2020 for processing SARS-CoV-2 sequences. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month.
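
The N-replacement step described in this abstract can be illustrated with a minimal sketch. It is not VADR's implementation: it assumes the query is already in the same coordinates as a reference sequence (no insertions or deletions), and the function names and the `min_len` threshold are hypothetical choices made for illustration.

```python
import re

def fill_n_runs(seq, reference, min_len=5):
    """Temporarily replace runs of N with the reference bases at the same
    positions. Assumes seq and reference share coordinates (an assumption
    of this sketch, not something real submissions guarantee)."""
    filled = list(seq)
    runs = []
    # Find every run of at least min_len consecutive N characters.
    for m in re.finditer(r"N{%d,}" % min_len, seq.upper()):
        start, end = m.span()
        if end <= len(reference):
            filled[start:end] = reference[start:end]
            runs.append((start, end))
    return "".join(filled), runs

def restore_n_runs(seq, runs):
    """Put the Ns back after processing, using the recorded run coordinates."""
    restored = list(seq)
    for start, end in runs:
        restored[start:end] = "N" * (end - start)
    return "".join(restored)

filled, runs = fill_n_runs("ACGTNNNNNNACGT", "ACGTTTGGCCACGT")
original = restore_n_runs(filled, runs)
```

Recording the run coordinates is what makes the replacement temporary: downstream steps see a sequence without long ambiguous stretches, and the Ns are restored before anything is reported.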

5.
bioRxiv ; 2022 Apr 27.
Article in English | MEDLINE | ID: mdl-35547842

ABSTRACT

Background: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from validating and annotating them accurately. Results: VADR has been updated to more accurately and rapidly annotate SARS-CoV-2 sequences. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64 GB to 2 GB per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. Conclusion: VADR is now nearly 1000 times faster than it was in early 2020 for processing SARS-CoV-2 sequences submitted to GenBank. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month. Version 1.4.1 is freely available (https://github.com/ncbi/vadr) for local installation and use.

6.
Database (Oxford) ; 2022, 2022 03 01.
Article in English | MEDLINE | ID: mdl-35230423

ABSTRACT

Rapid response to the current coronavirus disease 2019 (COVID-19) pandemic requires fast dissemination of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomic sequence data in order to align diagnostic tests and vaccines with the natural evolution of the virus as it spreads through the world. To facilitate this, the National Library of Medicine's National Center for Biotechnology Information developed an automated pipeline for the deposition and quick processing of SARS-CoV-2 genome assemblies into GenBank for the user community. The pipeline ensures the collection of contextual information about the virus source, assesses sequence quality and annotates descriptive biological features, such as protein-coding regions and mature peptides. The process promotes standardized nomenclature and creates and publishes fully processed GenBank files within minutes of deposition. The software has processed and published 982 454 annotated SARS-CoV-2 sequences, as of 21 October 2021. This development addresses the needs of the scientific community as the sequencing of SARS-CoV-2 genomes increases and will facilitate unrestricted access to and usability of SARS-CoV-2 genomic sequence data, providing important reagents for scientific and public health activities in response to the COVID-19 pandemic. Database URL https://submit.ncbi.nlm.nih.gov/sarscov2/genbank/.


Subject(s)
COVID-19 , SARS-CoV-2 , COVID-19/epidemiology , COVID-19/genetics , Databases, Nucleic Acid , Genome, Viral/genetics , Humans , Pandemics , SARS-CoV-2/genetics
7.
BMC Bioinformatics ; 22(1): 400, 2021 Aug 12.
Article in English | MEDLINE | ID: mdl-34384346

ABSTRACT

BACKGROUND: The DNA sequences encoding ribosomal RNA genes (rRNAs) are commonly used as markers to identify species, including in metagenomics samples that may combine many organismal communities. The 16S small subunit ribosomal RNA (SSU rRNA) gene is typically used to identify bacterial and archaeal species. The nuclear 18S SSU rRNA gene and 28S large subunit (LSU) rRNA gene have been used as DNA barcodes and for phylogenetic studies in different eukaryote taxonomic groups. Because of their popularity, the National Center for Biotechnology Information (NCBI) receives a disproportionate number of rRNA sequence submissions and BLAST queries. These sequences vary in quality, length, origin (nuclear, mitochondrial, plastid), and organism source and can represent any region of the ribosomal cistron. RESULTS: To improve the timely verification of quality, origin and loci boundaries, we developed Ribovore, a software package for sequence analysis of rRNA sequences. The ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal SSU rRNA. The ribodbmaker program is used to create high-quality datasets of rRNAs from different taxonomic groups. Key algorithmic steps include comparing candidate sequences against rRNA sequence profile hidden Markov models (HMMs) and covariance models of rRNA sequence and secondary-structure conservation, as well as other tests. Nine freely available blastn rRNA databases created and maintained with Ribovore are used for checking incoming GenBank submissions and used by the blastn browser interface at NCBI. Since 2018, Ribovore has been used to analyze more than 50 million prokaryotic SSU rRNA sequences submitted to GenBank, and to select at least 10,435 fungal rRNA RefSeq records from type material of 8350 taxa. CONCLUSION: Ribovore combines single-sequence and profile-based methods to improve GenBank processing and analysis of rRNA sequences. It is a standalone, portable, and extensible software package for the alignment, classification and validation of rRNA sequences. Researchers planning on submitting SSU rRNA sequences to GenBank are encouraged to download and use Ribovore to analyze their sequences prior to submission to determine which sequences are likely to be automatically accepted into GenBank.


Subject(s)
Databases, Nucleic Acid , RNA, Ribosomal , DNA, Ribosomal , Phylogeny , RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 18S/genetics , Sequence Analysis, RNA
8.
Nat Commun ; 12(1): 3494, 2021 06 09.
Article in English | MEDLINE | ID: mdl-34108470

ABSTRACT

Non-coding RNAs (ncRNA) are essential for all life, and their functions often depend on their secondary (2D) and tertiary structure. Despite the abundance of software for the visualisation of ncRNAs, few automatically generate consistent and recognisable 2D layouts, which makes it challenging for users to construct, compare and analyse structures. Here, we present R2DT, a method for predicting and visualising a wide range of RNA structures in standardised layouts. R2DT is based on a library of 3,647 templates representing the majority of known structured RNAs. R2DT has been applied to ncRNA sequences from the RNAcentral database and produced >13 million diagrams, creating the world's largest RNA 2D structure dataset. The software is amenable to community expansion and is freely available at https://github.com/rnacentral/R2DT; a web server is available at https://rnacentral.org/r2dt.


Subject(s)
Computational Biology/methods , RNA/chemistry , Databases, Nucleic Acid , Nucleic Acid Conformation , RNA, Untranslated/chemistry , Reproducibility of Results , Sequence Analysis, RNA , Software
9.
Brief Bioinform ; 22(2): 642-663, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@uni-jena.de.


Subject(s)
COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Genome, Viral , Humans , Pandemics , SARS-CoV-2/genetics
10.
Nucleic Acids Res ; 49(D1): D192-D200, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33211869

ABSTRACT

Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.


Subject(s)
Databases, Nucleic Acid , Metagenome , MicroRNAs/genetics , RNA, Bacterial/genetics , RNA, Untranslated/genetics , RNA, Viral/genetics , Bacteria/genetics , Bacteria/metabolism , Base Pairing , Base Sequence , Humans , Internet , MicroRNAs/classification , MicroRNAs/metabolism , Molecular Sequence Annotation , Nucleic Acid Conformation , RNA, Bacterial/classification , RNA, Bacterial/metabolism , RNA, Untranslated/classification , RNA, Untranslated/metabolism , RNA, Viral/classification , RNA, Viral/metabolism , Sequence Alignment , Sequence Analysis, RNA , Software , Viruses/genetics , Viruses/metabolism
11.
BMC Bioinformatics ; 21(1): 211, 2020 May 24.
Article in English | MEDLINE | ID: mdl-32448124

ABSTRACT

BACKGROUND: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions. RESULTS: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of "alerts" that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank's submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally. CONCLUSION: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.


Subject(s)
Betacoronavirus , Coronavirus Infections , Databases, Nucleic Acid , Molecular Sequence Annotation , Pandemics , Pneumonia, Viral , Software , Betacoronavirus/genetics , COVID-19 , Coronavirus Infections/genetics , DNA Viruses , Genomics , Humans , Molecular Sequence Annotation/standards , Pneumonia, Viral/genetics , SARS-CoV-2 , Viruses
12.
Curr Protoc Bioinformatics ; 62(1): e51, 2018 06.
Article in English | MEDLINE | ID: mdl-29927072

ABSTRACT

Rfam is a database of non-coding RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. Using a combination of manual and literature-based curation and a custom software pipeline, Rfam converts descriptions of RNA families found in the scientific literature into computational models that can be used to annotate RNAs belonging to those families in any DNA or RNA sequence. Valuable research outputs that are often locked up in figures and supplementary information files are encapsulated in Rfam entries and made accessible through the Rfam Web site. The data produced by Rfam have a broad application, from genome annotation to providing training sets for algorithm development. This article gives an overview of how to search and navigate the Rfam Web site, and how to annotate sequences with RNA families. The Rfam database is freely available at http://rfam.org. © 2018 by John Wiley & Sons, Inc.


Subject(s)
Databases, Nucleic Acid , RNA, Untranslated/genetics , Base Sequence , Genome, Human , Humans , Molecular Sequence Annotation , Nucleic Acid Conformation , RNA, Untranslated/chemistry , Riboswitch/genetics , Sequence Alignment , Sequence Analysis, RNA
13.
Nucleic Acids Res ; 46(15): 7970-7976, 2018 09 06.
Article in English | MEDLINE | ID: mdl-29788499

ABSTRACT

Group I catalytic introns have been found in bacterial, viral, organellar, and some eukaryotic genomes, but not in archaea. All known archaeal introns are bulge-helix-bulge (BHB) introns, with the exception of a few group II introns. It has been proposed that BHB introns arose from extinct group I intron ancestors, much like eukaryotic spliceosomal introns are thought to have descended from group II introns. However, group I introns have little sequence conservation, making them difficult to detect with standard sequence similarity searches. Taking advantage of recent improvements in a computational homology search method that accounts for both conserved sequence and RNA secondary structure, we have identified 39 group I introns in a wide range of archaeal phyla, including examples of group I introns and BHB introns in the same host gene.


Subject(s)
Archaea/genetics , Introns/genetics , RNA, Archaeal/genetics , RNA, Catalytic/genetics , Archaea/classification , Archaea/enzymology , Base Sequence , Nucleic Acid Conformation , Phylogeny , RNA, Archaeal/chemistry , RNA, Archaeal/classification , RNA, Catalytic/chemistry , RNA, Catalytic/classification , Species Specificity
14.
Bioinformatics ; 34(5): 755-759, 2018 03 01.
Article in English | MEDLINE | ID: mdl-29069347

ABSTRACT

Motivation: Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches. Results: A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines for retrospective analysis of contamination in GenBank and for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions. Availability and implementation: Data on the sources of UniVec entries were included in release 10.0 (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy. Contact: aschaffe@helix.nih.gov. Supplementary information: Supplementary data are available at Bioinformatics online.
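
The taxonomy-based reasoning in this abstract can be sketched as a simple filter. This is only an illustration under stated assumptions, not the authors' pipeline: the `Match` record, the lineage tuples, and the `min_shared` threshold are all hypothetical, and lineages are assumed to be listed root first.

```python
from dataclasses import dataclass

@dataclass
class Match:
    seq_id: str
    seq_lineage: tuple     # taxonomy of the submitted sequence, root first
    vector_lineage: tuple  # taxonomy of the vector segment's source, root first

def shared_depth(a, b):
    """Count how many leading taxa two root-first lineages have in common."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth

def likely_false_positive(match, min_shared=4):
    """Flag a match when the sequence and the vector segment come from
    closely related organisms. min_shared is an arbitrary illustrative
    cutoff, not a value taken from the paper."""
    return shared_depth(match.seq_lineage, match.vector_lineage) >= min_shared

ECOLI = ("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae", "Escherichia",
         "Escherichia coli")
HUMAN = ("Eukaryota", "Metazoa", "Chordata", "Mammalia", "Primates",
         "Hominidae", "Homo", "Homo sapiens")

same_source = Match("seq1", ECOLI, ECOLI)   # E. coli sequence, E. coli-derived vector segment
distant = Match("seq2", HUMAN, ECOLI)       # human sequence, E. coli-derived vector segment
flags = [likely_false_positive(same_source), likely_false_positive(distant)]
```

A match between an E. coli sequence and an E. coli-derived vector segment is flagged as a probable false positive, while the same vector segment found in a human sequence is left as a candidate for real contamination.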


Subject(s)
Databases, Nucleic Acid/standards , Sequence Analysis, DNA/methods , Software , Bacteria , Eukaryota
15.
Nucleic Acids Res ; 46(D1): D335-D342, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29112718

ABSTRACT

The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.


Subject(s)
Databases, Nucleic Acid , Genome , RNA, Untranslated/chemistry , RNA, Untranslated/genetics , Humans , Molecular Sequence Annotation , Nucleic Acid Conformation , RNA, Untranslated/classification , Sequence Alignment , Sequence Analysis, RNA
16.
Cell Rep ; 19(8): 1723-1738, 2017 05 23.
Article in English | MEDLINE | ID: mdl-28538188

ABSTRACT

The MALAT1 (Metastasis-Associated Lung Adenocarcinoma Transcript 1) gene encodes a noncoding RNA that is processed into a long nuclear retained transcript (MALAT1) and a small cytoplasmic tRNA-like transcript (mascRNA). Using an RNA sequence- and structure-based covariance model, we identified more than 130 genomic loci in vertebrate genomes containing the MALAT1 3' end triple-helix structure and its immediate downstream tRNA-like structure, including 44 in the green lizard Anolis carolinensis. Structural and computational analyses revealed a co-occurrence of components of the 3' end module. MALAT1-like genes in Anolis carolinensis are highly expressed in adult testis, thus we named them testis-abundant long noncoding RNAs (tancRNAs). MALAT1-like loci also produce multiple small RNA species, including PIWI-interacting RNAs (piRNAs), from the antisense strand. The 3' ends of tancRNAs serve as potential targets for the PIWI-piRNA complex. Thus, we have identified an evolutionarily conserved class of long noncoding RNAs (lncRNAs) with similar structural constraints, post-transcriptional processing, and subcellular localization and a distinct function in spermatocytes.


Subject(s)
Genetic Loci , Genome, Human , RNA, Long Noncoding/genetics , Animals , Base Sequence , Cell Nucleus/metabolism , Humans , Lizards/genetics , Male , Nucleic Acid Conformation , Organ Specificity/genetics , RNA, Long Noncoding/chemistry , RNA, Small Interfering/genetics , Spermatocytes/metabolism
17.
Nucleic Acids Res ; 45(D1): D482-D490, 2017 01 04.
Article in English | MEDLINE | ID: mdl-27899678

ABSTRACT

The Virus Variation Resource is a value-added viral sequence data resource hosted by the National Center for Biotechnology Information. The resource is located at http://www.ncbi.nlm.nih.gov/genome/viruses/variation/ and includes modules for seven viral groups: influenza virus, Dengue virus, West Nile virus, Ebolavirus, MERS coronavirus, Rotavirus A and Zika virus. Each module is supported by pipelines that scan newly released GenBank records, annotate genes and proteins and parse sample descriptors and then map them to controlled vocabulary. These processes in turn support a purpose-built search interface where users can select sequences based on standardized gene, protein and metadata terms. Once sequences are selected, a suite of tools for downloading data, multi-sequence alignment and tree building supports a variety of user-directed activities. This manuscript describes a series of features and functionalities recently added to the Virus Variation Resource.


Subject(s)
Computational Biology/methods , Disease Outbreaks , Genetic Variation , Software , Virus Diseases/epidemiology , Virus Diseases/virology , Viruses/genetics , Databases, Genetic
18.
Nucleic Acids Res ; 44(14): 6614-24, 2016 08 19.
Article in English | MEDLINE | ID: mdl-27342282

ABSTRACT

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, NCBI's new Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.


Subject(s)
Genome, Bacterial , Molecular Sequence Annotation , Prokaryotic Cells/metabolism , Bacteria/genetics , Bacterial Proteins/chemistry , Databases, Nucleic Acid , Genes, Bacterial
19.
Nucleic Acids Res ; 43(Database issue): D130-7, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25392425

ABSTRACT

The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.


Subject(s)
Databases, Nucleic Acid , RNA, Untranslated/chemistry , Genomics , Internet , Molecular Sequence Annotation , Nucleic Acid Conformation , Nucleotide Motifs , RNA, Long Noncoding/chemistry , RNA, Untranslated/classification , Software
20.
Methods Mol Biol ; 1097: 163-97, 2014.
Article in English | MEDLINE | ID: mdl-24639160

ABSTRACT

Many different types of functional non-coding RNAs participate in a wide range of important cellular functions, but the large majority of these RNAs are not routinely annotated in published genomes. Several programs have been developed for identifying RNAs, including specific tools tailored to a particular RNA family as well as more general ones designed to work for any family. Many of these tools utilize covariance models (CMs), statistical models of the conserved sequence and structure of an RNA family. In this chapter, as an illustrative example, the Infernal software package and CMs from the Rfam database are used to identify RNAs in the genome of the archaeon Methanobrevibacter ruminantium, uncovering some additional RNAs not present in the genome's initial annotation. Analysis of the results and comparison with family-specific methods demonstrate some important strengths and weaknesses of this general approach.


Subject(s)
Computational Biology/methods , Genomics/methods , Molecular Sequence Annotation/methods , RNA/chemistry , RNA/genetics , Sequence Analysis, RNA/methods , Software