Results 1 - 20 of 53
1.
Nat Methods ; 21(6): 994-1002, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38755321

ABSTRACT

Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them using the informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that for a wide range of query lengths, Pebblescout provides a data-driven way for finding relevant subsets of large nucleotide resources, reducing the effort for downstream analysis substantially. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.


Subject(s)
Software , Nucleotides/genetics , Humans , Databases, Genetic , Computational Biology/methods , Databases, Nucleic Acid , Algorithms
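
The index-then-search idea described in the abstract above can be illustrated with a toy inverted index. This is a minimal sketch and not Pebblescout's actual method: the k-mer length, the hash-based sampling rule, the inverse-frequency weighting, and all function names are assumptions chosen only for illustration.

```python
import hashlib
import math
from collections import defaultdict

K = 5  # assumed toy k-mer length


def kmers(seq, k=K):
    """Yield every k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sampled(kmer, sample_mod):
    """Assumed sampling rule: keep a k-mer when its hash byte is divisible by sample_mod."""
    digest = hashlib.md5(kmer.encode()).digest()
    return digest[0] % sample_mod == 0


def build_index(subjects, sample_mod=1):
    """Map each sampled k-mer to the set of subject IDs that contain it."""
    index = defaultdict(set)
    for subject_id, seq in subjects.items():
        for km in kmers(seq):
            if sampled(km, sample_mod):
                index[km].add(subject_id)
    return index


def search(index, query, n_subjects):
    """Rank subjects by the summed weight of k-mers shared with the query."""
    scores = defaultdict(float)
    for km in set(kmers(query)):
        hits = index.get(km, ())
        for subject_id in hits:
            # Assumption: a k-mer found in fewer subjects is more informative.
            scores[subject_id] += math.log(1 + n_subjects / len(hits))
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With `sample_mod=1` every k-mer is kept; larger values thin the index at the cost of possibly missing short matches, which is the trade-off any sampling scheme has to bound.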
2.
PLoS One ; 19(1): e0291406, 2024.
Article in English | MEDLINE | ID: mdl-38241320

ABSTRACT

Candida auris is a newly emerged multidrug-resistant fungus capable of causing invasive infections with high mortality. Despite intense efforts to understand how this pathogen rapidly emerged and spread worldwide, its environmental reservoirs are poorly understood. Here, we present a collaborative effort between the U.S. Centers for Disease Control and Prevention, the National Center for Biotechnology Information, and GridRepublic (a volunteer computing platform) to identify C. auris sequences in publicly available metagenomic datasets. We developed the MetaNISH pipeline that uses SRPRISM to align sequences to a set of reference genomes and computes a score for each reference genome. We used MetaNISH to scan ~300,000 SRA metagenomic runs from 2010 onwards and identified five datasets containing C. auris reads. Finally, GridRepublic has implemented a prospective C. auris molecular monitoring system using MetaNISH and volunteer computing.


Subject(s)
Candida , Candidiasis , Humans , Candida/genetics , Candidiasis/microbiology , Candida auris , Prospective Studies , Metagenomics , Antifungal Agents/therapeutic use
3.
J Food Prot ; 85(5): 755-772, 2022 05 01.
Article in English | MEDLINE | ID: mdl-35259246

ABSTRACT

This multiagency report developed by the Interagency Collaboration for Genomics for Food and Feed Safety provides an overview of the use of and transition to whole genome sequencing (WGS) technology for detection and characterization of pathogens transmitted commonly by food and for identification of their sources. We describe foodborne pathogen analysis, investigation, and harmonization efforts among the following federal agencies: National Institutes of Health; Department of Health and Human Services, Centers for Disease Control and Prevention (CDC) and U.S. Food and Drug Administration (FDA); and the U.S. Department of Agriculture, Food Safety and Inspection Service, Agricultural Research Service, and Animal and Plant Health Inspection Service. We describe single nucleotide polymorphism, core-genome, and whole genome multilocus sequence typing data analysis methods as used in the PulseNet (CDC) and GenomeTrakr (FDA) networks, underscoring the complementary nature of the results for linking genetically related foodborne pathogens during outbreak investigations while allowing flexibility to meet the specific needs of Interagency Collaboration partners. We highlight how we apply WGS to pathogen characterization (virulence and antimicrobial resistance profiles) and source attribution efforts and increase transparency by making the sequences and other data publicly available through the National Center for Biotechnology Information. We also highlight the impact of current trends in the use of culture-independent diagnostic tests for human diagnostic testing on analytical approaches related to food safety and what is next for the use of WGS in the area of food safety.


Subject(s)
Foodborne Diseases , Animals , Disease Outbreaks/prevention & control , Food Safety , Foodborne Diseases/epidemiology , Foodborne Diseases/prevention & control , Genomics , United States , Whole Genome Sequencing
4.
BMC Bioinformatics ; 22(1): 375, 2021 Jul 21.
Article in English | MEDLINE | ID: mdl-34289805

ABSTRACT

BACKGROUND: Illumina is the dominant sequencing technology at this time. Short length, short insert size, some systematic biases, and low-level carryover contamination in Illumina reads continue to make assembly of repeated regions a challenging problem. Some applications also require finding multiple well supported variants for assembled regions. RESULTS: To facilitate assembly of repeat regions and to report multiple well supported variants when a user can provide target sequences to assist the assembly, we propose SAUTE and SAUTE_PROT assemblers. Both assemblers use a de Bruijn graph on reads. Targets can be transcripts or proteins for RNA-seq reads and transcripts, proteins, or genomic regions for genomic reads. Target sequences are nucleotide and protein sequences for SAUTE and SAUTE_PROT, respectively. CONCLUSIONS: For RNA-seq, comparisons with TRINITY, RNASPADES, SPALIGNER, and SPADES assembly of reads aligned to target proteins by DIAMOND show that SAUTE_PROT finds more coding sequences that translate to benchmark proteins. Using AMRFINDERPLUS calls, we find SAUTE has higher sensitivity and precision than SPADES, PLASMIDSPADES, SPALIGNER, and SPADES assembly of reads aligned to target regions by HISAT2. It also has better sensitivity than SKESA but worse precision.


Subject(s)
Genomics , High-Throughput Nucleotide Sequencing , Algorithms , Genome , RNA-Seq , Sequence Analysis, DNA
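
For readers unfamiliar with the data structure named in the abstract above, a de Bruijn graph on reads can be sketched as follows. This is a generic textbook construction under assumed names, not the SAUTE implementation: the target-guided logic is omitted, and the simple forward walk stands in for real contig extension.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer seen in a read adds one weighted edge."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1  # edge weight = k-mer count
    return graph


def extend_unambiguous(graph, start):
    """Walk forward from start while the current node has exactly one successor."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, {})) == 1:
        (nxt,) = graph[node].keys()
        if nxt in seen:  # stop on a cycle, e.g. inside a repeat
            break
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path
```

A node with more than one successor marks a branch point, which is where repeats and variants make assembly ambiguous and where extra evidence such as a target sequence can help choose a path.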
5.
Gigascience ; 9(4)2020 04 01.
Article in English | MEDLINE | ID: mdl-32315028

ABSTRACT

BACKGROUND: Alignment of sequence reads generated by next-generation sequencing is an integral part of most pipelines analyzing next-generation sequencing data. A number of tools designed to quickly align a large volume of sequences are already available. However, most existing tools lack explicit guarantees about their output. They also do not support searching genome assemblies, such as the human genome assembly GRCh38, that include primary and alternate sequences and placement information for alternate sequences to primary sequences in the assembly. FINDINGS: This paper describes SRPRISM (Single Read Paired Read Indel Substitution Minimizer), an alignment tool for aligning reads without splices. SRPRISM has features not available in most tools, such as (i) support for searching genome assemblies with alternate sequences, (ii) partial alignment of reads with a specified region of reads to be included in the alignment, (iii) choice of ranking schemes for alignments, and (iv) explicit criteria for search sensitivity. We compare the performance of SRPRISM to GEM, Kart, STAR, BWA-MEM, Bowtie2, Hobbes, and Yara using benchmark sets for paired and single reads of lengths 100 and 250 bp generated using DWGSIM. SRPRISM found the best results for most benchmark sets with error rate of up to ∼2.5% and GEM performed best for higher error rates. SRPRISM was also more sensitive than other tools even when sensitivity was reduced to improve run time performance. CONCLUSIONS: We present SRPRISM as a flexible read mapping tool that provides explicit guarantees on results.


Subject(s)
Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , INDEL Mutation/genetics , Sequence Alignment/methods , Algorithms , Humans , Sequence Analysis, DNA , Software
6.
Nucleic Acids Res ; 47(D1): D23-D28, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30395293

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. New resources released in the past year include PubMed Labs and a new sequence database search. Resources that were updated in the past year include PubMed, PMC, Bookshelf, genome data viewer, Assembly, prokaryotic genomes, Genome, BioProject, dbSNP, dbVar, BLAST databases, igBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.


Subject(s)
Biotechnology/organization & administration , Databases, Genetic , Animals , Biotechnology/methods , Databases, Chemical , Humans , Software , United States/epidemiology , Web Browser
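
As a small illustration of the E-utilities programming interface mentioned in the abstract above, the sketch below only builds an ESearch request URL and does not send it. The base URL and the `db`, `term`, `retmax`, and `retmode` parameters come from NCBI's public E-utilities documentation rather than from the abstract; the helper name `esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the E-utilities, per NCBI's public documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch request URL for an Entrez database and a query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("pubmed", "SKESA assembler", retmax=5))
```

Fetching the URL with any HTTP client returns the matching record identifiers; NCBI's usage guidelines on request rates and API keys apply to real queries.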
7.
Genome Biol ; 19(1): 153, 2018 10 04.
Article in English | MEDLINE | ID: mdl-30286803

ABSTRACT

SKESA is a DeBruijn graph-based de-novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. Comparison with SPAdes and MegaHit shows that SKESA produces assemblies that have high sequence quality and contiguity, handles low-level contamination in reads, is fast, and produces an identical assembly for the same input when assembled multiple times with the same or different compute resources. SKESA has been used for assembling over 272,000 read sets in the Sequence Read Archive at NCBI and for real-time pathogen detection. Source code for SKESA is freely available at https://github.com/ncbi/SKESA/releases.


Subject(s)
Sequence Analysis, DNA/methods , Software , Algorithms , Base Pairing/genetics , Base Sequence , Time Factors
8.
PeerJ ; 5: e3893, 2017.
Article in English | MEDLINE | ID: mdl-29372115

ABSTRACT

BACKGROUND: As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. METHODS: We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. RESULTS: Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. 
DISCUSSION: These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these to our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.

9.
Genome Announc ; 4(3)2016 May 05.
Article in English | MEDLINE | ID: mdl-27151797

ABSTRACT

Pseudomonas fluorescens is a well-known plant growth-promoting rhizobacterium (PGPR). We report here the first whole-genome sequence of PGPR P. fluorescens evaluated in Colombian banana plants. The genome sequence contains genes involved in plant growth and defense, including bacteriocins, 1-aminocyclopropane-1-carboxylic acid (ACC) deaminase, and genes that provide resistance to toxic compounds.

10.
Genome Announc ; 4(2)2016 Mar 17.
Article in English | MEDLINE | ID: mdl-26988047

ABSTRACT

Campylobacter coli is considered one of the main causes of food-borne illness worldwide. We report here the whole-genome sequence of multidrug-resistant Campylobacter coli strain COL B1-266, isolated from the Colombian poultry chain. The genome encodes a variety of antimicrobial resistance genes, including genes conferring resistance to aminoglycosides, β-lactams, lincosamides, fluoroquinolones, and tetracyclines.

11.
Genome Announc ; 4(2)2016 Mar 17.
Article in English | MEDLINE | ID: mdl-26988048

ABSTRACT

Campylobacter coli, along with Campylobacter jejuni, is a major agent of gastroenteritis and acute enterocolitis in humans. We report the whole-genome sequences of two multidrug-resistant C. coli strains, isolated from the Colombian poultry chain. The isolates contain a variety of antimicrobial resistance genes for aminoglycosides, lincosamides, fluoroquinolones, and tetracycline.

12.
BMC Genomics ; 17: 37, 2016 Jan 07.
Article in English | MEDLINE | ID: mdl-26742787

ABSTRACT

BACKGROUND: Xiphophorus fishes are represented by 26 live-bearing species of tropical fish that express many attributes (e.g., viviparity, genetic and phenotypic variation, ecological adaptation, varied sexual developmental mechanisms, ability to produce fertile interspecies hybrids) that have made them attractive research models for over 85 years. Use of various interspecies hybrids to investigate the genetics underlying spontaneous and induced tumorigenesis has resulted in the development and maintenance of pedigreed Xiphophorus lines specifically bred for research. The recent availability of the X. maculatus reference genome assembly now provides unprecedented opportunities for novel and exciting comparative research studies among Xiphophorus species. RESULTS: We present sequencing, assembly and annotation of two new genomes representing Xiphophorus couchianus and Xiphophorus hellerii. The final X. couchianus and X. hellerii assemblies have total sizes of 708 Mb and 734 Mb and correspond to 98% and 102% of the X. maculatus Jp 163 A genome size, respectively. The rates of single nucleotide change range from 1 per 52 bp to 1 per 69 bp among the three genomes, and the impact of putatively damaging variants is presented. In addition, a survey of transposable elements allowed us to deduce an ancestral TE landscape, uncover potentially active TEs, and document a recent burst of TEs during evolution of this genus. CONCLUSIONS: Two new Xiphophorus genomes and their corresponding transcriptomes were efficiently assembled, the former using a novel guided assembly approach. Three assembled genome sequences within this single vertebrate order of new world live-bearing fishes will accelerate our understanding of the relationship between environmental adaptation and genome evolution. In addition, these genome resources provide the capability to determine allele-specific gene regulation among interspecies hybrids produced by crossing any of the three species that are known to produce progeny predisposed to tumor development.


Subject(s)
Cyprinodontiformes/genetics , Genetic Variation , Genome , Transcriptome/genetics , Animals , Gene Expression Regulation , Genomics , Species Specificity
13.
Genome Announc ; 3(6)2015 Nov 25.
Article in English | MEDLINE | ID: mdl-26607897

ABSTRACT

Bacillus amyloliquefaciens is an important plant growth-promoting rhizobacterium (PGPR). We report the first whole-genome sequence of PGPR Bacillus amyloliquefaciens evaluated in Colombian banana plants. The genome sequences encode genes involved in plant growth and defense, including bacteriocins, ribosomally synthesized antibacterial peptides, in addition to genes that provide resistance to toxic compounds.

14.
Genome Announc ; 3(5)2015 Oct 22.
Article in English | MEDLINE | ID: mdl-26494672

ABSTRACT

Salmonella enterica is a pathogen of significant public health importance that is frequently associated with foodborne illness. We report the whole-genome sequences of four multidrug-resistant Salmonella enterica serovar Paratyphi B and Heidelberg strains, isolated from the Colombian poultry chain. The isolates contain a variety of antimicrobial resistance genes for aminoglycosides, β-lactams, fluoroquinolones, sulfonamides, tetracycline, and trimethoprim.

15.
Gastroenterology ; 149(1): 67-78, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25865046

ABSTRACT

BACKGROUND & AIMS: Small intestinal carcinoids are rare and difficult to diagnose and patients often present with advanced incurable disease. Although the disease occurs sporadically, there have been reports of family clusters. Hereditary small intestinal carcinoid has not been recognized and genetic factors have not been identified. We performed a genetic analysis of families with small intestinal carcinoids to establish a hereditary basis and find genes that might cause this cancer. METHODS: We performed a prospective study of 33 families with at least 2 cases of small intestinal carcinoids. Affected members were characterized clinically and asymptomatic relatives were screened and underwent exploratory laparotomy for suspected tumors. Disease-associated mutations were sought using linkage analysis, whole-exome sequencing, and copy number analyses of germline and tumor DNA collected from members of a single large family. We assessed expression of mutant protein, protein activity, and regulation of apoptosis and senescence in lymphoblasts derived from the cases. RESULTS: Familial and sporadic carcinoids are clinically indistinguishable except for the multiple synchronous primary tumors observed in most familial cases. Nearly 34% of asymptomatic relatives older than age 50 were found to have occult tumors; the tumors were cleared surgically from 87% of these individuals (20 of 23). Linkage analysis and whole-exome sequencing identified a germline 4-bp deletion in the gene inositol polyphosphate multikinase (IPMK), which truncates the protein. This mutation was detected in all 11 individuals with small intestinal carcinoids and in 17 of 35 family members whose carcinoid status was unknown. Mutant IPMK had reduced kinase activity and nuclear localization, compared with the full-length protein. This reduced activation of p53 and increased cell survival. CONCLUSIONS: We found that small intestinal carcinoids can occur as an inherited autosomal-dominant disease. The familial form is characterized by multiple synchronous primary tumors, which might account for 22%-35% of cases previously considered sporadic. Relatives of patients with familial carcinoids should be screened to detect curable early stage disease. IPMK haploinsufficiency promotes carcinoid tumorigenesis.


Subject(s)
Carcinoid Tumor/genetics , Germ-Line Mutation , Intestinal Neoplasms/genetics , Phosphotransferases (Alcohol Group Acceptor)/genetics , Adolescent , Adult , Aged , Aged, 80 and over , Carcinoid Tumor/diagnosis , Carcinoid Tumor/pathology , Family , Female , Humans , Intestinal Neoplasms/diagnosis , Intestinal Neoplasms/pathology , Laparotomy , Male , Middle Aged , Pedigree , Prospective Studies , Young Adult
16.
Genome Res ; 24(12): 2066-76, 2014 12.
Article in English | MEDLINE | ID: mdl-25373144

ABSTRACT

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.


Subject(s)
Genome, Human , Haplotypes , Hydatidiform Mole/genetics , Alleles , Chromosome Mapping , Chromosomes, Artificial, Bacterial , Computational Biology/methods , Female , Genomics/methods , Heterozygote , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide , Pregnancy , Repetitive Sequences, Nucleic Acid , Segmental Duplications, Genomic , Sequence Analysis, DNA
17.
Immunogenetics ; 65(10): 749-62, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23925440

ABSTRACT

We report on the analyses of genes encoding immunoglobulin heavy and light chains in the rabbit 6.51× whole genome assembly. This OryCun2.0 assembly confirms previous mapping of the duplicated IGK1 and IGK2 loci to chromosome 2 and the IGL lambda light chain locus to chromosome 21. The most frequently rearranged and expressed IGHV1 that is closest to IG DH and IGHJ genes encodes rabbit VHa allotypes. The partially inbred Thorbecke strain rabbit used for whole-genome sequencing was homozygous at the IGK but heterozygous with the IGHV1a1 allele in one of 79 IGHV-containing unplaced scaffolds and IGHV1a2, IGHM, IGHG, and IGHE sequences in another. Some IGKV, IGLV, and IGHA genes are also in other unplaced scaffolds. By fluorescence in situ hybridization, we assigned the previously unmapped IGH locus to the q-telomeric region of rabbit chromosome 20. An approximately 3-Mb segment of human chromosome 14 including IGH genes predicted to map to this telomeric region based on synteny analysis could not be located on assembled chromosome 20. Unplaced scaffold chrUn0053 contains some of the genes that comparative mapping predicts to be missing. We identified discrepancies between previous targeted studies and the OryCun2.0 assembly and some new BAC clones with IGH sequences that can guide other studies to further sequence and improve the OryCun2.0 assembly. Complete knowledge of gene sequences encoding variable regions of rabbit heavy, kappa, and lambda chains will lead to better understanding of how and why rabbits produce antibodies of high specificity and affinity through gene conversion and somatic hypermutation.


Subject(s)
Chromosomes, Mammalian/genetics , Computational Biology/methods , Genome , Immunoglobulin Heavy Chains/genetics , Immunoglobulins/genetics , Animals , Chromosome Mapping , Chromosomes, Artificial, Bacterial/genetics , Female , Humans , Immunoglobulin Allotypes/blood , Immunoglobulin Allotypes/genetics , Immunoglobulin Variable Region/genetics , Immunoglobulin kappa-Chains/genetics , Immunoglobulin lambda-Chains/genetics , In Situ Hybridization, Fluorescence , Male , Rabbits , Reproducibility of Results
19.
Biol Direct ; 7: 12, 2012 Apr 17.
Article in English | MEDLINE | ID: mdl-22510480

ABSTRACT

BACKGROUND: BLAST is a commonly-used software package for comparing a query sequence to a database of known sequences; in this study, we focus on protein sequences. Position-specific-iterated BLAST (PSI-BLAST) iteratively searches a protein sequence database, using the matches in round i to construct a position-specific score matrix (PSSM) for searching the database in round i + 1. Biegert and Söding developed Context-sensitive BLAST (CS-BLAST), which combines information from searching the sequence database with information derived from a library of short protein profiles to achieve better homology detection than PSI-BLAST, which builds its PSSMs from scratch. RESULTS: We describe a new method, called domain enhanced lookup time accelerated BLAST (DELTA-BLAST), which searches a database of pre-constructed PSSMs before searching a protein-sequence database, to yield better homology detection. For its PSSMs, DELTA-BLAST employs a subset of NCBI's Conserved Domain Database (CDD). On a test set derived from ASTRAL, with one round of searching, DELTA-BLAST achieves a ROC5000 of 0.270 vs. 0.116 for CS-BLAST. The performance advantage diminishes in iterated searches, but DELTA-BLAST continues to achieve better ROC scores than CS-BLAST. CONCLUSIONS: DELTA-BLAST is a useful program for the detection of remote protein homologs. It is available under the "Protein BLAST" link at http://blast.ncbi.nlm.nih.gov.


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Search Engine/methods , Software , Algorithms , Computational Biology/methods , Internet , ROC Curve , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Time Factors
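
The position-specific score matrix (PSSM) that the abstract above refers to can be sketched from a small ungapped alignment. This is a simplified illustration, not the PSI-BLAST or DELTA-BLAST procedure: a uniform background, a fixed pseudocount, and no sequence weighting are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq, aligned column by column."""
    return sum(col[aa] for col, aa in zip(pssm, seq))
```

Residues that are conserved in a column receive positive scores and rarely seen residues receive negative ones, so a candidate sequence that follows the family's pattern scores higher than an unrelated one.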