Search | VHL Regional Portal

Finding Candida auris in public metagenomic repositories.

Mario-Vasquez, Jorge E; Bagal, Ujwal R; Lowe, Elijah; Morgulis, Aleksandr; Phan, John; Sexton, D Joseph; Shiryev, Sergey; Slatkevicius, Rytis; Welsh, Rory; Litvintseva, Anastasia P; Blumberg, Matthew; Agarwala, Richa; Chow, Nancy A.

PLoS One ; 19(1): e0291406, 2024.

Article in English | MEDLINE | ID: mdl-38241320

ABSTRACT

Candida auris is a newly emerged multidrug-resistant fungus capable of causing invasive infections with high mortality. Despite intense efforts to understand how this pathogen rapidly emerged and spread worldwide, its environmental reservoirs are poorly understood. Here, we present a collaborative effort between the U.S. Centers for Disease Control and Prevention, the National Center for Biotechnology Information, and GridRepublic (a volunteer computing platform) to identify C. auris sequences in publicly available metagenomic datasets. We developed the MetaNISH pipeline that uses SRPRISM to align sequences to a set of reference genomes and computes a score for each reference genome. We used MetaNISH to scan ~300,000 SRA metagenomic runs from 2010 onwards and identified five datasets containing C. auris reads. Finally, GridRepublic has implemented a prospective C. auris molecular monitoring system using MetaNISH and volunteer computing.

Subject(s)

Candida , Candidiasis , Humans , Candida/genetics , Candidiasis/microbiology , Candida auris , Prospective Studies , Metagenomics , Antifungal Agents/therapeutic use

SRPRISM (Single Read Paired Read Indel Substitution Minimizer): an efficient aligner for assemblies with explicit guarantees.

Morgulis, Aleksandr; Agarwala, Richa.

Gigascience ; 9(4), 2020 04 01.

Article in English | MEDLINE | ID: mdl-32315028

ABSTRACT

BACKGROUND: Alignment of sequence reads generated by next-generation sequencing is an integral part of most pipelines analyzing next-generation sequencing data. A number of tools designed to quickly align a large volume of sequences are already available. However, most existing tools lack explicit guarantees about their output. They also do not support searching genome assemblies, such as the human genome assembly GRCh38, that include primary and alternate sequences and placement information for alternate sequences to primary sequences in the assembly. FINDINGS: This paper describes SRPRISM (Single Read Paired Read Indel Substitution Minimizer), an alignment tool for aligning reads without splices. SRPRISM has features not available in most tools, such as (i) support for searching genome assemblies with alternate sequences, (ii) partial alignment of reads with a specified region of reads to be included in the alignment, (iii) choice of ranking schemes for alignments, and (iv) explicit criteria for search sensitivity. We compare the performance of SRPRISM to GEM, Kart, STAR, BWA-MEM, Bowtie2, Hobbes, and Yara using benchmark sets for paired and single reads of lengths 100 and 250 bp generated using DWGSIM. SRPRISM found the best results for most benchmark sets with error rate of up to ~2.5% and GEM performed best for higher error rates. SRPRISM was also more sensitive than other tools even when sensitivity was reduced to improve run time performance. CONCLUSIONS: We present SRPRISM as a flexible read mapping tool that provides explicit guarantees on results.

Subject(s)

Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , INDEL Mutation/genetics , Sequence Alignment/methods , Algorithms , Humans , Sequence Analysis, DNA , Software

Single haplotype assembly of the human genome from a hydatidiform mole.

Steinberg, Karyn Meltz; Schneider, Valerie A; Graves-Lindsay, Tina A; Fulton, Robert S; Agarwala, Richa; Huddleston, John; Shiryev, Sergey A; Morgulis, Aleksandr; Surti, Urvashi; Warren, Wesley C; Church, Deanna M; Eichler, Evan E; Wilson, Richard K.

Genome Res ; 24(12): 2066-76, 2014 12.

Article in English | MEDLINE | ID: mdl-25373144

ABSTRACT

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

Subject(s)

Genome, Human , Haplotypes , Hydatidiform Mole/genetics , Alleles , Chromosome Mapping , Chromosomes, Artificial, Bacterial , Computational Biology/methods , Female , Genomics/methods , Heterozygote , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide , Pregnancy , Repetitive Sequences, Nucleic Acid , Segmental Duplications, Genomic , Sequence Analysis, DNA

Database indexing for production MegaBLAST searches.

Morgulis, Aleksandr; Coulouris, George; Raytselis, Yan; Madden, Thomas L; Agarwala, Richa; Schäffer, Alejandro A.

Bioinformatics ; 24(16): 1757-64, 2008 Aug 15.

Article in English | MEDLINE | ID: mdl-18567917

ABSTRACT

MOTIVATION: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar. RESULTS: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new 'indexed MegaBLAST' is faster than the 'non-indexed' version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases. AVAILABILITY: The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast [corrected]. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
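
The idea of preprocessing the database rather than the query can be illustrated with a toy k-mer index. The sketch below is only an illustration under assumed names: `build_index`, `find_seeds`, and the seed length `k=4` are hypothetical and do not reflect the data structure that makembindex actually writes or the seed length that MegaBLAST uses.

```python
from collections import defaultdict


def build_index(db, k=4):
    """Preprocess the database once: map each k-mer to its (sequence, offset) hits.

    Hypothetical helper for illustration; not the makembindex format.
    """
    index = defaultdict(list)
    for name, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return dict(index)


def find_seeds(query, index, k=4):
    """Look up every k-mer of the query in the prebuilt index.

    Returns (query_offset, sequence_name, database_offset) tuples that an
    aligner could then try to extend into full matches.
    """
    seeds = []
    for q in range(len(query) - k + 1):
        for name, d in index.get(query[q:q + k], []):
            seeds.append((q, name, d))
    return seeds


if __name__ == "__main__":
    database = {"s1": "ACGTACGT", "s2": "TTTTACGA"}
    idx = build_index(database, k=4)
    print(find_seeds("TACG", idx, k=4))
```

The index is built once and reused for every query, which is where a speedup over per-query lookup tables would be expected to come from when many queries hit the same database.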

Subject(s)

Database Management Systems , Databases, Protein , Information Storage and Retrieval/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Software , User-Computer Interface , Amino Acid Sequence , Molecular Sequence Data , Sequence Alignment/methods

A fast and symmetric DUST implementation to mask low-complexity DNA sequences.

Morgulis, Aleksandr; Gertz, E Michael; Schäffer, Alejandro A; Agarwala, Richa.

J Comput Biol ; 13(5): 1028-40, 2006 Jun.

Article in English | MEDLINE | ID: mdl-16796549

ABSTRACT

The DUST module has been used within BLAST for many years to mask low-complexity sequences. In this paper, we present a new implementation of the DUST module that uses the same function to assign a complexity score to a sequence, but uses a different rule by which high-scoring sequences are masked. The new rule masks every nucleotide masked by the old rule and occasionally masks more. The new masking rule corrects two related deficiencies with the old rule. First, the new rule is symmetric with respect to reversing the sequence. Second, the new rule is not context sensitive; the decision to mask a subsequence does not depend on what sequences flank it. The new implementation is at least four times faster than the old on the human genome. We show that both the percentage of additional bases masked and the effect on MegaBLAST outputs are very small.
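
As a rough illustration of the complexity score that both implementations share, the sketch below computes a triplet-based score of the kind usually attributed to DUST: count each overlapping triplet, sum c(c-1)/2 over the counts, and divide by the number of triplets minus one. The function name `dust_score` is hypothetical, and neither the old nor the new masking rule is reproduced here; only the scoring step is shown, under the assumption that this is the formula in use.

```python
from collections import Counter


def dust_score(seq):
    """Triplet-based complexity score; higher means lower complexity.

    Assumed formula: sum over triplets t of c_t * (c_t - 1) / 2,
    divided by (number of triplets - 1).
    """
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    n = len(triplets)
    if n < 2:
        return 0.0
    counts = Counter(triplets)
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)


if __name__ == "__main__":
    print(dust_score("AAAAAAAAAA"))  # 4.0 (a homopolymer scores high)
    print(dust_score("ACGTACGT"))    # 0.4 (a more varied sequence scores low)
```

A masking rule would then compare such scores against a threshold over windows of the sequence; how that comparison is made is exactly where the old and new rules differ.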

Subject(s)

Genome, Human/genetics , Pattern Recognition, Automated , Sequence Alignment , Sequence Analysis, DNA , Software , Humans

WindowMasker: window-based masker for sequenced genomes.

Morgulis, Aleksandr; Gertz, E Michael; Schäffer, Alejandro A; Agarwala, Richa.

Bioinformatics ; 22(2): 134-41, 2006 Jan 15.

Article in English | MEDLINE | ID: mdl-16287941

ABSTRACT

MOTIVATION: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. RESULTS: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. AVAILABILITY: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. SUPPLEMENTARY INFORMATION: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
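
The notion of masking repeats from the genome sequence alone, in a small number of passes, can be sketched as follows. This is a simplified illustration and not the WindowMasker algorithm: the function name `mask_repeats`, the k-mer length, and the count threshold are all assumptions, and the window-based scoring that gives the tool its name is omitted.

```python
from collections import Counter


def mask_repeats(genome, k=3, threshold=3):
    """Soft-mask (lowercase) bases covered by k-mers that occur often.

    Pass 1 counts every k-mer in the genome itself; pass 2 masks each
    occurrence of a k-mer whose count reaches the threshold. No repeat
    library is consulted. Hypothetical parameters for illustration.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    masked = list(genome)
    for i in range(len(genome) - k + 1):
        if counts[genome[i:i + k]] >= threshold:
            for j in range(i, i + k):
                masked[j] = genome[j].lower()
    return "".join(masked)


if __name__ == "__main__":
    print(mask_repeats("ACGACGACGTTGCA", k=3, threshold=3))  # acgacgacgTTGCA
```

For a fixed k, both passes touch each position a constant number of times, which is the sense in which scans of this kind are linear in genome length.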

Subject(s)

Chromosome Mapping/methods , DNA/chemistry , DNA/genetics , Repetitive Sequences, Nucleic Acid/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Base Sequence , Humans , Molecular Sequence Data , Pattern Recognition, Automated/methods , Programming Languages

Protein database searches using compositionally adjusted substitution matrices.

Altschul, Stephen F; Wootton, John C; Gertz, E Michael; Agarwala, Richa; Morgulis, Aleksandr; Schäffer, Alejandro A; Yu, Yi-Kuo.

FEBS J ; 272(20): 5101-9, 2005 Oct.

Article in English | MEDLINE | ID: mdl-16218944

ABSTRACT

Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of blast.

Subject(s)

Computational Biology/methods , Databases, Protein , Sequence Alignment/methods , Algorithms , Internet , Proteins/chemistry , Proteins/genetics , ROC Curve , Sequence Alignment/statistics & numerical data , Sequence Homology, Amino Acid , Software
