Search | BVS Regional Portal (test)

Finding Candida auris in public metagenomic repositories.

Mario-Vasquez, Jorge E; Bagal, Ujwal R; Lowe, Elijah; Morgulis, Aleksandr; Phan, John; Sexton, D Joseph; Shiryev, Sergey; Slatkevicius, Rytis; Welsh, Rory; Litvintseva, Anastasia P; Blumberg, Matthew; Agarwala, Richa; Chow, Nancy A.

PLoS One ; 19(1): e0291406, 2024.

Article in English | MEDLINE | ID: mdl-38241320

ABSTRACT

Candida auris is a newly emerged multidrug-resistant fungus capable of causing invasive infections with high mortality. Despite intense efforts to understand how this pathogen rapidly emerged and spread worldwide, its environmental reservoirs are poorly understood. Here, we present a collaborative effort between the U.S. Centers for Disease Control and Prevention, the National Center for Biotechnology Information, and GridRepublic (a volunteer computing platform) to identify C. auris sequences in publicly available metagenomic datasets. We developed the MetaNISH pipeline that uses SRPRISM to align sequences to a set of reference genomes and computes a score for each reference genome. We used MetaNISH to scan ~300,000 SRA metagenomic runs from 2010 onwards and identified five datasets containing C. auris reads. Finally, GridRepublic has implemented a prospective C. auris molecular monitoring system using MetaNISH and volunteer computing.

Subjects

Candida , Candidiasis , Humans , Candida/genetics , Candidiasis/microbiology , Candida auris , Prospective Studies , Metagenomics , Antifungal Agents/therapeutic use

SRPRISM (Single Read Paired Read Indel Substitution Minimizer): an efficient aligner for assemblies with explicit guarantees.

Morgulis, Aleksandr; Agarwala, Richa.

Gigascience ; 9(4), 2020 04 01.

Article in English | MEDLINE | ID: mdl-32315028

ABSTRACT

BACKGROUND: Alignment of sequence reads generated by next-generation sequencing is an integral part of most pipelines analyzing next-generation sequencing data. A number of tools designed to quickly align a large volume of sequences are already available. However, most existing tools lack explicit guarantees about their output. They also do not support searching genome assemblies, such as the human genome assembly GRCh38, that include primary and alternate sequences and placement information for alternate sequences to primary sequences in the assembly. FINDINGS: This paper describes SRPRISM (Single Read Paired Read Indel Substitution Minimizer), an alignment tool for aligning reads without splices. SRPRISM has features not available in most tools, such as (i) support for searching genome assemblies with alternate sequences, (ii) partial alignment of reads with a specified region of reads to be included in the alignment, (iii) choice of ranking schemes for alignments, and (iv) explicit criteria for search sensitivity. We compare the performance of SRPRISM to GEM, Kart, STAR, BWA-MEM, Bowtie2, Hobbes, and Yara using benchmark sets for paired and single reads of lengths 100 and 250 bp generated using DWGSIM. SRPRISM found the best results for most benchmark sets with error rate of up to ~2.5% and GEM performed best for higher error rates. SRPRISM was also more sensitive than other tools even when sensitivity was reduced to improve run time performance. CONCLUSIONS: We present SRPRISM as a flexible read mapping tool that provides explicit guarantees on results.

Subjects

Human Genome/genetics , High-Throughput Nucleotide Sequencing/methods , INDEL Mutation/genetics , Sequence Alignment/methods , Algorithms , Humans , DNA Sequence Analysis , Software

Single haplotype assembly of the human genome from a hydatidiform mole.

Steinberg, Karyn Meltz; Schneider, Valerie A; Graves-Lindsay, Tina A; Fulton, Robert S; Agarwala, Richa; Huddleston, John; Shiryev, Sergey A; Morgulis, Aleksandr; Surti, Urvashi; Warren, Wesley C; Church, Deanna M; Eichler, Evan E; Wilson, Richard K.

Genome Res ; 24(12): 2066-76, 2014 12.

Article in English | MEDLINE | ID: mdl-25373144

ABSTRACT

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

Subjects

Human Genome , Haplotypes , Hydatidiform Mole/genetics , Alleles , Chromosome Mapping , Bacterial Artificial Chromosomes , Computational Biology/methods , Female , Genomics/methods , Heterozygote , High-Throughput Nucleotide Sequencing , Humans , Single Nucleotide Polymorphism , Pregnancy , Repetitive Nucleic Acid Sequences , Genomic Segmental Duplications , DNA Sequence Analysis

Database indexing for production MegaBLAST searches.

Morgulis, Aleksandr; Coulouris, George; Raytselis, Yan; Madden, Thomas L; Agarwala, Richa; Schäffer, Alejandro A.

Bioinformatics ; 24(16): 1757-64, 2008 Aug 15.

Article in English | MEDLINE | ID: mdl-18567917

ABSTRACT

MOTIVATION: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar. RESULTS: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new 'indexed MegaBLAST' is faster than the 'non-indexed' version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases. AVAILABILITY: The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast [corrected] SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Database Management Systems , Protein Databases , Information Storage and Retrieval/methods , Proteins/chemistry , Protein Sequence Analysis/methods , Software , User-Computer Interface , Amino Acid Sequence , Molecular Sequence Data , Sequence Alignment/methods
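
Illustrative sketch (not taken from the paper): the indexed MegaBLAST abstract above describes preprocessing the database into an index so that short seeds are found by lookup rather than by scanning. The Python below shows that general idea with a plain dictionary mapping k-mers to database positions. The function names `build_index` and `find_seeds`, the seed length `k=8`, and the data layout are assumptions made for illustration; they do not reflect the file format written by makembindex or the internals of blastn.

```python
from collections import defaultdict


def build_index(database, k=8):
    """Map every k-mer in the database to the (sequence id, offset) pairs where it occurs.

    `database` is a dict of sequence id -> nucleotide string (hypothetical layout).
    """
    index = defaultdict(list)
    for seq_id, seq in database.items():
        seq = seq.upper()
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(index, query, k=8):
    """Return exact k-mer seed hits as (query offset, sequence id, database offset)."""
    query = query.upper()
    hits = []
    for q_pos in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict on a miss.
        for seq_id, db_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, seq_id, db_pos))
    return hits


if __name__ == "__main__":
    db = {"chr1": "TTTTACGTACGTGGGG"}
    idx = build_index(db, k=8)
    print(find_seeds(idx, "ACGTACGT", k=8))  # [(0, 'chr1', 4)]
```

The index is built once and reused for every query, which is where a database-side index can save time. Extending seeds into alignments is left out of this sketch.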

A fast and symmetric DUST implementation to mask low-complexity DNA sequences.

Morgulis, Aleksandr; Gertz, E Michael; Schäffer, Alejandro A; Agarwala, Richa.

J Comput Biol ; 13(5): 1028-40, 2006 Jun.

Article in English | MEDLINE | ID: mdl-16796549

ABSTRACT

The DUST module has been used within BLAST for many years to mask low-complexity sequences. In this paper, we present a new implementation of the DUST module that uses the same function to assign a complexity score to a sequence, but uses a different rule by which high-scoring sequences are masked. The new rule masks every nucleotide masked by the old rule and occasionally masks more. The new masking rule corrects two related deficiencies with the old rule. First, the new rule is symmetric with respect to reversing the sequence. Second, the new rule is not context sensitive; the decision to mask a subsequence does not depend on what sequences flank it. The new implementation is at least four times faster than the old on the human genome. We show that both the percentage of additional bases masked and the effect on MegaBLAST outputs are very small.

Subjects

Human Genome/genetics , Automated Pattern Recognition , Sequence Alignment , DNA Sequence Analysis , Software , Humans
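
Illustrative sketch (not taken from the abstract): the DUST abstract above refers to a function that assigns a complexity score to a sequence but does not state it. The Python below uses the triplet-based score commonly given for DUST, namely the sum over distinct triplets of c(c-1)/2 divided by (number of triplets - 1), where c is how often a triplet occurs. Treat this formula and the name `dust_score` as assumptions. The masking rule, which is what the paper actually changes, is not reproduced here.

```python
from collections import Counter


def dust_score(seq):
    """Triplet-repetition score: higher values mean lower sequence complexity.

    Assumed formula: sum(c * (c - 1) / 2 over triplet counts c) / (n_triplets - 1).
    """
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if len(triplets) < 2:
        return 0.0  # too short to contain a repeated triplet
    counts = Counter(triplets)
    repeated_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return repeated_pairs / (len(triplets) - 1)


if __name__ == "__main__":
    print(dust_score("AAAAAAAAAA"))  # 4.0 (eight identical triplets)
    print(dust_score("ACGTTGCA"))    # 0.0 (all six triplets distinct)
```

Under this formula the score of a sequence equals the score of its reversal, because reversing only renames the triplets and leaves their counts unchanged. The symmetry discussed in the abstract concerns the masking rule built on top of the score, so this property alone does not demonstrate it.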

WindowMasker: window-based masker for sequenced genomes.

Morgulis, Aleksandr; Gertz, E Michael; Schäffer, Alejandro A; Agarwala, Richa.

Bioinformatics ; 22(2): 134-41, 2006 Jan 15.

Article in English | MEDLINE | ID: mdl-16287941

ABSTRACT

MOTIVATION: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. RESULTS: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. AVAILABILITY: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. SUPPLEMENTARY INFORMATION: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf

Subjects

Chromosome Mapping/methods , DNA/chemistry , DNA/genetics , Repetitive Nucleic Acid Sequences/genetics , Sequence Alignment/methods , DNA Sequence Analysis/methods , Software , Algorithms , Base Sequence , Humans , Molecular Sequence Data , Automated Pattern Recognition/methods , Programming Languages
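
Illustrative sketch (not taken from the paper): the WindowMasker abstract above describes masking repeats with a few linear-time scans that use only the genome sequence itself. The Python below shows a much simpler relative of that idea: one pass counts k-mers and a second pass soft-masks (lowercases) every base covered by a k-mer seen at least `threshold` times. The name `mask_repeats` and the values `k=4` and `threshold=3` are assumptions chosen for a toy input; WindowMasker's actual window scoring and threshold selection are not reproduced.

```python
from collections import Counter


def mask_repeats(genome, k=4, threshold=3):
    """Soft-mask bases covered by any k-mer that occurs at least `threshold` times."""
    genome = genome.upper()
    n_kmers = len(genome) - k + 1
    if n_kmers <= 0:
        return genome

    # Scan 1: count every k-mer in the genome.
    counts = Counter(genome[i:i + k] for i in range(n_kmers))

    # Scan 2: flag the positions covered by frequent k-mers.
    masked = [False] * len(genome)
    for i in range(n_kmers):
        if counts[genome[i:i + k]] >= threshold:
            for j in range(i, i + k):
                masked[j] = True

    return "".join(base.lower() if flag else base
                   for base, flag in zip(genome, masked))


if __name__ == "__main__":
    print(mask_repeats("ACACACACACGTTCAGGCTA"))  # acacacacacGTTCAGGCTA
```

No repeat library is consulted at any point, which mirrors the abstract's claim that masking can work on newly sequenced genomes that lack a curated library.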

Protein database searches using compositionally adjusted substitution matrices.

Altschul, Stephen F; Wootton, John C; Gertz, E Michael; Agarwala, Richa; Morgulis, Aleksandr; Schäffer, Alejandro A; Yu, Yi-Kuo.

FEBS J ; 272(20): 5101-9, 2005 Oct.

Article in English | MEDLINE | ID: mdl-16218944

ABSTRACT

Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing, compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of BLAST.

Subjects

Computational Biology/methods , Protein Databases , Sequence Alignment/methods , Algorithms , Internet , Proteins/chemistry , Proteins/genetics , ROC Curve , Sequence Alignment/statistics & numerical data , Amino Acid Sequence Homology , Software
