Pesquisa | Portal Regional da BVS

CIndex: compressed indexes for fast retrieval of FASTQ files.

Huo, Hongwei; Liu, Pengfei; Wang, Chenhui; Jiang, Hongbo; Vitter, Jeffrey Scott.

Bioinformatics ; 38(2): 335-343, 2022 01 03.

Artigo em Inglês | MEDLINE | ID: mdl-34524416

RESUMO

MOTIVATION: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. RESULTS: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and RÎ³, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66% points less space and provides a speedup of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times. For extracting records in FASTQ files, our method uses 2.86-14.88% points less space and provides a speedup of 3.13-20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. AVAILABILITY AND IMPLEMENTATION: The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Compressão de Dados , Software , Genômica/métodos , Genoma , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Algoritmos , Análise de Sequência de DNA/métodos , Compressão de Dados/métodos

Efficient Compression and Indexing for Highly Repetitive DNA Sequence Collections.

Huo, Hongwei; Chen, Xiaoyang; Guo, Xu; Vitter, Jeffrey Scott.

IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2394-2408, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-31985436

RESUMO

In this paper, we focus upon the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection G of t sequences Si of length n each, we can represent G succinctly in 2nHk(T) + O(n' loglogn) + o(q n') + o(tn) bits using O(t n2 + q n') time, where Hk(T) is the kth-order empirical entropy of the sequence T ∈ G that is used as the reference sequence, n' is the total number of variations between T and the sequences in G, and q is a small fixed constant. We can restore any length len substring S[ sp, ..., sp + len-1] of S ∈ G in O(ns' + len(logn)2 / loglogn) time and report all positions where P occurs in G in O(m ·t + occ ·t ·(logn)2/loglogn ) time. In addition, we propose a dynamic programming method to find the variations between T and the sequences in G in a space-efficient way, with which we can build succinct structures to enable efficient search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods. The source code is available online.

Assuntos

Algoritmos , Compressão de Dados/métodos , Sequências Repetitivas de Ácido Nucleico/genética , Análise de Sequência de DNA/métodos , Biologia Computacional/métodos

RefSelect: a reference sequence selection algorithm for planted (l, d) motif search.

Yu, Qiang; Huo, Hongwei; Zhao, Ruixing; Feng, Dazheng; Vitter, Jeffrey Scott; Huan, Jun.

BMC Bioinformatics ; 17 Suppl 9: 266, 2016 Jul 19.

Artigo em Inglês | MEDLINE | ID: mdl-27454113

RESUMO

BACKGROUND: The planted (l, d) motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use k out of t input sequences as reference sequences to generate candidate motifs, and they can find all the (l, d) motifs in the input sequences. However, most of them simply take the first k sequences in the input as reference sequences without elaborate selection processes, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets. RESULTS: In this paper, we build the reference sequence selection problem and propose a method named RefSelect to quickly solve it by evaluating the number of candidate motifs for the reference sequences. RefSelect can bring a practical time improvement of the state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily in an efficient way, (2) particularly, makes them achieve a speedup of up to about 100× on the protein data, and (3) is also suitable for large data sets which contain hundreds or more sequences. CONCLUSIONS: The proposed algorithm RefSelect can be used to solve the problem that many pattern-driven PMS algorithms present execution time instability. RefSelect requires a small amount of storage space and is capable of selecting reference sequences efficiently and effectively. Also, the parallel version of RefSelect is provided for handling large data sets.

Assuntos

Biologia Computacional/métodos , Proteínas/química , Algoritmos , Motivos de Aminoácidos , Domínios Proteicos , Proteínas/genética , Análise de Sequência de Proteína , Software

An Efficient Exact Algorithm for the Motif Stem Search Problem over Large Alphabets.

Yu, Qiang; Huo, Hongwei; Vitter, Jeffrey Scott; Huan, Jun; Nekrich, Yakov.

IEEE/ACM Trans Comput Biol Bioinform ; 12(2): 384-97, 2015.

Artigo em Inglês | MEDLINE | ID: mdl-26357225

RESUMO

In recent years, there has been an increasing interest in planted (l, d) motif search (PMS) with applications to discovering significant segments in biological sequences. However, there has been little discussion about PMS over large alphabets. This paper focuses on motif stem search (MSS), which is recently introduced to search motifs on large-alphabet inputs. A motif stem is an l-length string with some wildcards. The goal of the MSS problem is to find a set of stems that represents a superset of all (l , d) motifs present in the input sequences, and the superset is expected to be as small as possible. The three main contributions of this paper are as follows: (1) We build motif stem representation more precisely by using regular expressions. (2) We give a method for generating all possible motif stems without redundant wildcards. (3) We propose an efficient exact algorithm, called StemFinder, for solving the MSS problem. Compared with the previous MSS algorithms, StemFinder runs much faster and reports fewer stems which represent a smaller superset of all (l, d) motifs. StemFinder is freely available at http://sites.google.com/site/feqond/stemfinder.

Assuntos

Motivos de Aminoácidos , Biologia Computacional/métodos , Reconhecimento Automatizado de Padrão/métodos , Análise de Sequência de Proteína/métodos , Algoritmos , Simulação por Computador , Proteínas/química

An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets.

Yu, Qiang; Huo, Hongwei; Chen, Xiaoyang; Guo, Haitao; Vitter, Jeffrey Scott; Huan, Jun.

IEEE Trans Nanobioscience ; 14(5): 535-44, 2015 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-25872217

RESUMO

The planted (l,d) motif discovery has been successfully used to locate transcription factor binding sites in dozens of promoter sequences over the past decade. However, there has not been enough work done in identifying (l,d) motifs in the next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and thereby bring new challenge to make a good identification in reasonable time. To cater this need, we propose a new planted (l,d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specially, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings distributedly. Experimental results on the simulated data show that i) MCES is able to identify (l,d) motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than the state-of-the-art (l,d) motif discovery algorithms, such as F-motif and TraverStringsR; ii) MCES is able to identify motifs without known lengths, and has a better identification accuracy than the competing algorithm CisFinder. Also, the validity of MCES is tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.

Assuntos

Algoritmos , Imunoprecipitação da Cromatina/métodos , DNA/química , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Motivos de Nucleotídeos/genética , Análise de Sequência de DNA/métodos , Biologia Computacional , DNA/genética

Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree.

Külekci, M Oguzhan; Vitter, Jeffrey Scott; Xu, Bojian.

IEEE/ACM Trans Comput Biol Bioinform ; 9(2): 421-9, 2012.

Artigo em Inglês | MEDLINE | ID: mdl-21968959

RESUMO

Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19--50 times the text size with the best engineering efforts, prohibiting their usability on massive data. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage is less than double the sequence size. Our method is also orders of magnitude faster than the prior methods for processing massive texts, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.

Assuntos

Algoritmos , Biologia Computacional/métodos , Processamento de Linguagem Natural , Análise de Sequência de Proteína/métodos , Bases de Dados de Proteínas , Genoma , Proteínas/química , Proteínas/genética

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA