Search | BVS Regional Portal

1.

An efficient Burrows-Wheeler transform-based aligner for short read mapping.

Guo, Lilu; Huo, Hongwei.

Comput Biol Chem ; 110: 108050, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38447272

ABSTRACT

Read mapping, a foundation of computational biology, has become a bottleneck as sequencing throughput explodes. In this work, we present an efficient Burrows-Wheeler transform-based aligner for next-generation sequencing (NGS) short reads. First, we propose a difference-aware classification strategy that assigns specific reads to computationally cheaper search modes, and we present several acceleration techniques, such as a seed pruning method based on the maximum coverage interval property, which reduces redundant locating of candidate regions, and a redesigned LF calculation that supports fast queries. Then, we propose a heuristic verification step to determine the best mapping from a large number of flanking sequences. With low-distortion string embedding, most dissimilar sequences are filtered out cheaply, and the remaining highly similar sequences suit the wavefront alignment algorithm well. We provide a full-spectrum benchmark with different read lengths. The results show that our method is 1.3-1.4 times faster than state-of-the-art Burrows-Wheeler transform-based methods (including bowtie2, bwa-MEM, and hisat2) on 101bp reads and 1.5-13 times faster on 750bp to 1000bp reads; meanwhile, our method has comparable memory usage and accuracy. However, hash-based methods (including Strobealign, Minimap2, and Accel-Align) are significantly faster, in part because Burrows-Wheeler transform-based methods compute in the compressed space. The source code is available at https://github.com/Lilu-guo/Effaln.
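As a rough illustration of the LF calculation that Burrows-Wheeler transform-based aligners rely on, the following minimal Python sketch counts exact pattern occurrences by backward search. It is not the Effaln implementation: the function names, the naive rotation-sort construction, and the uncompressed `occ` tables are all assumptions made for readability.

```python
from collections import Counter


def bwt_from_text(text):
    """Return the BWT of text + '$' using a naive sorted-rotations build."""
    s = text + "$"  # '$' is assumed not to occur in text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_fm_tables(bwt):
    """Build C (number of smaller symbols) and Occ (prefix counts) tables."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (1 if sym == ch else 0)
    return c_table, occ


def count_occurrences(pattern, c_table, occ, n):
    """Backward search: each step applies LF(i, ch) = C[ch] + Occ(ch, i)."""
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

A production index would replace the dense `occ` lists with sampled or compressed rank structures; the dense tables here only keep the LF step easy to follow.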

2.

CIndex: compressed indexes for fast retrieval of FASTQ files.

Huo, Hongwei; Liu, Pengfei; Wang, Chenhui; Jiang, Hongbo; Vitter, Jeffrey Scott.

Bioinformatics ; 38(2): 335-343, 2022 Jan 03.

Article in English | MEDLINE | ID: mdl-34524416

ABSTRACT

MOTIVATION: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. RESULTS: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and Rγ, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. AVAILABILITY AND IMPLEMENTATION: The software is available on GitHub: https://github.com/Hongweihuo-Lab/CIndex. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Data Compression; Software; Genomics/methods; Genome; High-Throughput Nucleotide Sequencing/methods; Algorithms; Sequence Analysis, DNA/methods; Data Compression/methods
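To make the role of the wavelet tree more concrete, here is a small Python sketch of a pointer-based wavelet tree that answers rank queries, the primitive a BWT-based index uses during search. It is an illustration under simplifying assumptions, not CIndex's data structure: the class name is hypothetical, and plain prefix-sum lists stand in for the succinct bitvectors a real index would use.

```python
class WaveletTree:
    """Pointer-based wavelet tree supporting rank(symbol, i) over a sequence."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        self.ones = None  # ones[i] = number of 1-bits among the first i positions
        if len(self.alphabet) <= 1:
            return  # leaf: every position holds the same symbol
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # Bit 0 routes a symbol to the left child, bit 1 to the right child.
        self.ones = [0]
        for ch in seq:
            self.ones.append(self.ones[-1] + (0 if ch in self.left_syms else 1))
        self.left = WaveletTree(
            [c for c in seq if c in self.left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in seq if c not in self.left_syms], self.alphabet[mid:])

    def rank(self, symbol, i):
        """Return how many times symbol occurs in the first i positions."""
        if symbol not in self.alphabet:
            return 0
        if self.ones is None:
            return i  # leaf: all i positions carry this symbol
        if symbol in self.left_syms:
            return self.left.rank(symbol, i - self.ones[i])
        return self.right.rank(symbol, self.ones[i])
```

Each rank query descends one level per halving of the alphabet, which is why the structure pairs naturally with the backward search used for count queries.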

3.

Efficient Compression and Indexing for Highly Repetitive DNA Sequence Collections.

Huo, Hongwei; Chen, Xiaoyang; Guo, Xu; Vitter, Jeffrey Scott.

IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2394-2408, 2021.

Article in English | MEDLINE | ID: mdl-31985436

ABSTRACT

In this paper, we focus on the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection $G$ of $t$ sequences $S_i$ of length $n$ each, we can represent $G$ succinctly in $2nH_k(T) + O(n' \log\log n) + o(qn') + o(tn)$ bits using $O(tn^2 + qn')$ time, where $H_k(T)$ is the $k$th-order empirical entropy of the sequence $T \in G$ that is used as the reference sequence, $n'$ is the total number of variations between $T$ and the sequences in $G$, and $q$ is a small fixed constant. We can restore any length-$len$ substring $S[sp, \ldots, sp + len - 1]$ of $S \in G$ in $O(n_s' + len \cdot (\log n)^2 / \log\log n)$ time and report all positions where $P$ occurs in $G$ in $O(m \cdot t + occ \cdot t \cdot (\log n)^2 / \log\log n)$ time. In addition, we propose a dynamic programming method to find the variations between $T$ and the sequences in $G$ in a space-efficient way, with which we can build succinct structures to enable efficient search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods. The source code is available online.

Subjects

Algorithms; Data Compression/methods; Repetitive Sequences, Nucleic Acid/genetics; Sequence Analysis, DNA/methods; Computational Biology/methods
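The idea of storing a collection as one reference sequence plus per-sequence variations can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the paper's method: it assumes equal-length sequences and substitutions only, whereas the abstract describes a dynamic programming step for finding variations, and the function names are hypothetical.

```python
def diff_against_reference(reference, sequence):
    """Record (position, base) substitutions; assumes equal lengths."""
    if len(reference) != len(sequence):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sequence)) if a != b]


def restore_substring(reference, variations, start, length):
    """Rebuild sequence[start:start + length] from the reference plus variations."""
    window = list(reference[start:start + length])
    for pos, base in variations:
        # Apply only the substitutions that fall inside the requested window.
        if start <= pos < start + len(window):
            window[pos - start] = base
    return "".join(window)
```

When sequences are highly repetitive, the variation list is short compared with the sequence itself, which is the source of the space savings the abstract reports.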

4.

A new algorithm for DNA motif discovery using multiple sample sequence sets.

Yu, Qiang; Zhao, Xiang; Huo, Hongwei.

J Bioinform Comput Biol ; 17(4): 1950021, 2019 Aug.

Article in English | MEDLINE | ID: mdl-31617465

ABSTRACT

DNA motif discovery plays an important role in understanding the mechanisms of gene regulation. Most existing motif discovery algorithms can identify motifs in an efficient and effective manner when dealing with small datasets. However, large datasets generated by high-throughput sequencing technologies pose a huge challenge: it is too time-consuming to process the entire dataset, but if only a small sample sequence set is processed, it is difficult to identify infrequent motifs. In this paper, we propose a new DNA motif discovery algorithm: first divide the input dataset into multiple sample sequence sets, then refine initial motifs of each sample sequence set with the expectation maximization method, and finally combine all the results from each sample sequence set. Besides, we design a new initial motif generation method with the utilization of the entire dataset, which helps to identify infrequent motifs. The experimental results on the simulated data show that the proposed algorithm has better time performance for large datasets and better accuracy of identifying infrequent motifs than the compared algorithms. Also, we have verified the validity of the proposed algorithm on the real data.

Subjects

Algorithms; Computational Biology/methods; Databases, Genetic; Nucleotide Motifs; Chromatin Immunoprecipitation; High-Throughput Nucleotide Sequencing

5.

SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

Yu, Qiang; Wei, Dingbang; Huo, Hongwei.

BMC Bioinformatics ; 19(1): 228, 2018 Jun 18.

Article in English | MEDLINE | ID: mdl-29914360

ABSTRACT

BACKGROUND: Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. RESULTS: We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. CONCLUSIONS: We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.

Subjects

Algorithms; Computational Biology/methods; DNA/analysis; DNA/genetics; Nucleotide Motifs; Sequence Analysis, DNA/methods; DNA/chemistry; Humans
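The qPMS definition in the abstract above can be written down directly as a brute-force Python check. This sketch only restates the problem; it enumerates every candidate l-mer and is exponential in l, so it is not SamSelect or any of the qPMS algorithms mentioned. The function names and the small floating-point tolerance are assumptions of this sketch.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, sequence, d):
    """True if some window of sequence is within d mismatches of candidate."""
    l = len(candidate)
    return any(hamming(candidate, sequence[i:i + l]) <= d
               for i in range(len(sequence) - l + 1))


def quorum_planted_motifs(sequences, l, d, q, alphabet="ACGT"):
    """Return all l-mers that occur, with up to d mismatches, in at least q*t sequences."""
    threshold = q * len(sequences)
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        hits = sum(occurs_within(candidate, s, d) for s in sequences)
        if hits + 1e-9 >= threshold:  # tolerance guards against float rounding in q*t
            motifs.append(candidate)
    return motifs
```

The inner count over all t sequences hints at why a large t slows these algorithms down, which is the observation the sampling approach builds on.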

6.

A New Algorithm for Identifying Cis-Regulatory Modules Based on Hidden Markov Model.

Guo, Haitao; Huo, Hongwei.

Biomed Res Int ; 2017: 6274513, 2017.

Article in English | MEDLINE | ID: mdl-28497059

ABSTRACT

The discovery of cis-regulatory modules (CRMs) is the key to understanding mechanisms of transcription regulation. Since CRMs have specific regulatory structures that are the basis for the regulation of gene expression, how to model the regulatory structure of CRMs has a considerable impact on the performance of CRM identification. The paper proposes a CRM discovery algorithm called ComSPS. ComSPS builds a regulatory structure model of CRMs based on HMM by exploring the rules of CRM transcriptional grammar that governs the internal motif site arrangement of CRMs. We test ComSPS on three benchmark datasets and compare it with five existing methods. Experimental results show that ComSPS performs better than them.

Subjects

Algorithms; Regulatory Elements, Transcriptional; Sequence Analysis, DNA/methods; Transcription, Genetic; Markov Chains

7.

PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets.

Yu, Qiang; Huo, Hongwei; Feng, Dazheng.

Biomed Res Int ; 2016: 4986707, 2016.

Article in English | MEDLINE | ID: mdl-27843946

ABSTRACT

Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. With hundreds or more sequences contained, the high-throughput sequencing data set is helpful to improve the identification accuracy of motif discovery but requires an even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed by extracting and combining pairs of l-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.

Subjects

Algorithms; Chromatin Immunoprecipitation; Databases, Genetic; Nucleotide Motifs/genetics; Sequence Analysis, DNA; Animals; Base Sequence; Computer Simulation; Mice; Mouse Embryonic Stem Cells/metabolism; Probability; Time Factors
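Extracting pairs of l-mers with small Hamming distance can be illustrated with a naive all-pairs scan in Python. This is only a baseline for the idea: the abstract describes a rapid extraction method, which this quadratic sketch does not reproduce. Restricting pairs to l-mers drawn from two different input sequences is an assumption of this sketch, as are the function names.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def lmers(sequence, l):
    """All (offset, l-mer) windows of a sequence."""
    return [(i, sequence[i:i + l]) for i in range(len(sequence) - l + 1)]


def close_lmer_pairs(sequences, l, max_dist):
    """Naive scan for l-mer pairs from different sequences within max_dist."""
    pairs = []
    for (si, a), (sj, b) in combinations(enumerate(sequences), 2):
        for i, x in lmers(a, l):
            for j, y in lmers(b, l):
                dist = hamming(x, y)
                if dist <= max_dist:
                    # Each hit records (sequence index, offset) for both l-mers.
                    pairs.append(((si, i), (sj, j), dist))
    return pairs
```

The cost grows with the square of the total input length, which shows why a faster extraction step matters for data sets with hundreds or more sequences.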

8.

SMCis: An Effective Algorithm for Discovery of Cis-Regulatory Modules.

Guo, Haitao; Huo, Hongwei; Yu, Qiang.

PLoS One ; 11(9): e0162968, 2016.

Article in English | MEDLINE | ID: mdl-27637070

ABSTRACT

The discovery of cis-regulatory modules (CRMs) is a challenging problem in computational biology. Limited by the difficulty of using an HMM to model dependent features in transcriptional regulatory sequences (TRSs), the probabilistic modeling methods based on HMMs cannot accurately represent the distance between regulatory elements in TRSs and are cumbersome to model the prevailing dependencies between motifs within CRMs. We propose a probabilistic modeling algorithm called SMCis, which builds a more powerful CRM discovery model based on a hidden semi-Markov model. Our model characterizes the regulatory structure of CRMs and effectively models dependencies between motifs at a higher level of abstraction based on segments rather than nucleotides. Experimental results on three benchmark datasets indicate that our method performs better than the compared algorithms.

Subjects

Algorithms; Animals; Computational Biology; Drosophila; Probability

9.

RefSelect: a reference sequence selection algorithm for planted (l, d) motif search.

Yu, Qiang; Huo, Hongwei; Zhao, Ruixing; Feng, Dazheng; Vitter, Jeffrey Scott; Huan, Jun.

BMC Bioinformatics ; 17 Suppl 9: 266, 2016 Jul 19.

Article in English | MEDLINE | ID: mdl-27454113

ABSTRACT

BACKGROUND: The planted (l, d) motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use k out of t input sequences as reference sequences to generate candidate motifs, and they can find all the (l, d) motifs in the input sequences. However, most of them simply take the first k sequences in the input as reference sequences without elaborate selection processes, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets. RESULTS: In this paper, we build the reference sequence selection problem and propose a method named RefSelect to quickly solve it by evaluating the number of candidate motifs for the reference sequences. RefSelect can bring a practical time improvement of the state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily in an efficient way, (2) particularly, makes them achieve a speedup of up to about 100× on the protein data, and (3) is also suitable for large data sets which contain hundreds or more sequences. CONCLUSIONS: The proposed algorithm RefSelect can be used to solve the problem that many pattern-driven PMS algorithms present execution time instability. RefSelect requires a small amount of storage space and is capable of selecting reference sequences efficiently and effectively. Also, the parallel version of RefSelect is provided for handling large data sets.

Subjects

Computational Biology/methods; Proteins/chemistry; Algorithms; Amino Acid Motifs; Protein Domains; Proteins/genetics; Sequence Analysis, Protein; Software

10.

The components of rice and watermelon root exudates and their effects on pathogenic fungus and watermelon defense.

Ren, Lixuan; Huo, Hongwei; Zhang, Fang; Hao, Wenya; Xiao, Liang; Dong, Caixia; Xu, Guohua.

Plant Signal Behav ; 11(6): e1187357, 2016 Jun 02.

Article in English | MEDLINE | ID: mdl-27217091

ABSTRACT

Watermelon (Citrullus lanatus) is susceptible to wilt disease caused by the fungus Fusarium oxysporum f. sp. niveum (FON). Intercropping management of watermelon/aerobic rice (Oryza sativa) alleviates watermelon wilt disease, because some unidentified component(s) in rice root exudates suppress FON sporulation and spore germination. Here, we show that the phenolic acid p-coumaric acid is present in rice root exudates only, and it inhibits FON spore germination and sporulation. We found that exogenously applied p-coumaric acid up-regulated the expression of ClPR3 in roots, as well as increased chitinase activity in leaves. Furthermore, exogenously applied p-coumaric acid increased β-1,3-glucanase activity in watermelon roots. By contrast, we found that ferulic acid was secreted by watermelon roots, but not by rice roots, and that it stimulated spore germination and sporulation of FON. Exogenous application of ferulic acid down-regulated ClPR3 expression and inhibited chitinase activity in watermelon leaves. Salicylic acid was detected in both watermelon and rice root exudates; it stimulated FON spore germination at low concentrations and suppressed spore germination at high concentrations. Exogenously applied salicylic acid did not alter ClPR3 expression, but did increase chitinase and β-1,3-glucanase activities in watermelon leaves. Together, our results show that the phenolic acids in root exudates differ between rice and watermelon, which leads to their distinct ecological roles in affecting the pathogenic fungus and watermelon defense.

Subjects

Citrullus/immunology; Citrullus/microbiology; Oryza/chemistry; Plant Exudates/pharmacology; Plant Roots/chemistry; Chitinases/metabolism; Citrullus/enzymology; Citrullus/genetics; Disease Resistance/immunology; Fusarium/drug effects; Fusarium/physiology; Gene Expression Regulation, Plant/drug effects; Glucan 1,3-beta-Glucosidase/metabolism; Hydroxybenzoates/pharmacology; Plant Diseases/microbiology; Plant Leaves/drug effects; Plant Leaves/enzymology; Plant Proteins/genetics; Plant Proteins/metabolism; Plant Roots/drug effects; Plant Roots/enzymology; Real-Time Polymerase Chain Reaction; Spores, Fungal/drug effects; Spores, Fungal/physiology

11.

An Affinity Propagation-Based DNA Motif Discovery Algorithm.

Sun, Chunxiao; Huo, Hongwei; Yu, Qiang; Guo, Haitao; Sun, Zhigang.

Biomed Res Int ; 2015: 853461, 2015.

Article in English | MEDLINE | ID: mdl-26347887

ABSTRACT

The planted (l, d) motif search (PMS) is one of the fundamental problems in bioinformatics, which plays an important role in locating transcription factor binding sites (TFBSs) in DNA sequences. Nowadays, identifying weak motifs and reducing the effect of local optimum are still important but challenging tasks for motif discovery. To solve the tasks, we propose a new algorithm, APMotif, which first applies the Affinity Propagation (AP) clustering in DNA sequences to produce informative and good candidate motifs and then employs Expectation Maximization (EM) refinement to obtain the optimal motifs from the candidate motifs. Experimental results both on simulated data sets and real biological data sets show that APMotif usually outperforms four other widely used algorithms in terms of high prediction accuracy.

Subjects

Algorithms; DNA/genetics; Nucleotide Motifs; Response Elements; Sequence Analysis, DNA/methods

12.

An Efficient Exact Algorithm for the Motif Stem Search Problem over Large Alphabets.

Yu, Qiang; Huo, Hongwei; Vitter, Jeffrey Scott; Huan, Jun; Nekrich, Yakov.

IEEE/ACM Trans Comput Biol Bioinform ; 12(2): 384-97, 2015.

Article in English | MEDLINE | ID: mdl-26357225

ABSTRACT

In recent years, there has been an increasing interest in planted (l, d) motif search (PMS) with applications to discovering significant segments in biological sequences. However, there has been little discussion about PMS over large alphabets. This paper focuses on motif stem search (MSS), which is recently introduced to search motifs on large-alphabet inputs. A motif stem is an l-length string with some wildcards. The goal of the MSS problem is to find a set of stems that represents a superset of all (l , d) motifs present in the input sequences, and the superset is expected to be as small as possible. The three main contributions of this paper are as follows: (1) We build motif stem representation more precisely by using regular expressions. (2) We give a method for generating all possible motif stems without redundant wildcards. (3) We propose an efficient exact algorithm, called StemFinder, for solving the MSS problem. Compared with the previous MSS algorithms, StemFinder runs much faster and reports fewer stems which represent a smaller superset of all (l, d) motifs. StemFinder is freely available at http://sites.google.com/site/feqond/stemfinder.

Subjects

Amino Acid Motifs; Computational Biology/methods; Pattern Recognition, Automated/methods; Sequence Analysis, Protein/methods; Algorithms; Computer Simulation; Proteins/chemistry
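A motif stem, described above as an l-length string with some wildcards, maps naturally onto a regular expression. The Python sketch below shows one possible encoding; the `?` wildcard symbol, the function names, and the explicit expansion are assumptions for illustration and are not taken from StemFinder.

```python
import re
from itertools import product

WILDCARD = "?"  # hypothetical wildcard symbol chosen for this sketch


def stem_to_regex(stem, alphabet):
    """Compile a stem into a regex where each wildcard matches any alphabet symbol."""
    char_class = "[" + re.escape(alphabet) + "]"
    parts = (char_class if ch == WILDCARD else re.escape(ch) for ch in stem)
    return re.compile("".join(parts))


def expand_stem(stem, alphabet):
    """List every l-mer that the stem represents."""
    slots = [alphabet if ch == WILDCARD else ch for ch in stem]
    return ["".join(p) for p in product(*slots)]
```

A stem with w wildcards over an alphabet of size s stands for s^w strings, so fewer wildcards mean a smaller represented superset, in line with the goal stated in the abstract.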

13.

An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets.

Yu, Qiang; Huo, Hongwei; Chen, Xiaoyang; Guo, Haitao; Vitter, Jeffrey Scott; Huan, Jun.

IEEE Trans Nanobioscience ; 14(5): 535-44, 2015 Jul.

Article in English | MEDLINE | ID: mdl-25872217

ABSTRACT

The planted (l,d) motif discovery has been successfully used to locate transcription factor binding sites in dozens of promoter sequences over the past decade. However, there has not been enough work done in identifying (l,d) motifs in the next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and thereby bring the new challenge of making a good identification in reasonable time. To meet this need, we propose a new planted (l,d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed manner. Experimental results on the simulated data show that i) MCES is able to identify (l,d) motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than the state-of-the-art (l,d) motif discovery algorithms, such as F-motif and TraverStringsR; ii) MCES is able to identify motifs without known lengths, and has a better identification accuracy than the competing algorithm CisFinder. Also, the validity of MCES is tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.

Subjects

Algorithms; Chromatin Immunoprecipitation/methods; DNA/chemistry; High-Throughput Nucleotide Sequencing/methods; Nucleotide Motifs/genetics; Sequence Analysis, DNA/methods; Computational Biology; DNA/genetics

14.

A heuristic cluster-based EM algorithm for the planted (l, d) problem.

Zhang, Yipu; Huo, Hongwei; Yu, Qiang.

J Bioinform Comput Biol ; 11(4): 1350009, 2013 Aug.

Article in English | MEDLINE | ID: mdl-23859273

ABSTRACT

The planted motif search problem arises from locating the transcription factor binding sites (TFBSs), which are crucial for understanding gene regulatory relationships. Many attempts to use expectation maximization (EM) for TFBS discovery have been successful in the past. However, identifying highly degenerate motifs and reducing the effect of local optima remain arduous tasks. To alleviate the vulnerability of EM to local optima trapping, we present a heuristic cluster-based EM algorithm, CEM, which refines the cluster subsets in the EM method to explore the best local optimal solution. Based on experiments using both synthetic and real datasets, our algorithm demonstrates significant improvements in identifying the motif instances and performs better than current widely used algorithms. CEM is a novel planted motif finding algorithm that is able to solve challenging instances and is easy to parallelize, since the process of solving each cluster subset is independent.

Subjects

Algorithms; Transcription Factors/chemistry; Binding Sites; Cluster Analysis; Transcription Factors/metabolism

15.

PairMotif+: a fast and effective algorithm for de novo motif discovery in DNA sequences.

Yu, Qiang; Huo, Hongwei; Zhang, Yipu; Guo, Hongzhi; Guo, Haitao.

Int J Biol Sci ; 9(4): 412-24, 2013.

Article in English | MEDLINE | ID: mdl-23678291

ABSTRACT

The planted (l, d) motif search is one of the most widely studied problems in bioinformatics, which plays an important role in the identification of transcription factor binding sites in DNA sequences. However, it is still a challenging task to identify highly degenerate motifs, since current algorithms either output the exact results with a high computational cost or accomplish the computation in a short time but very often fall into a local optimum. In order to make a better trade-off between accuracy and efficiency, we propose a new pattern-driven algorithm, named PairMotif+. At first, some pairs of l-mers are extracted from input sequences according to probabilistic analysis and statistical method so that one or more pairs of motif instances are included in them. Then an approximate strategy for refining pairs of l-mers with high accuracy is adopted in order to avoid the verification of most candidate motifs. Experimental results on the simulated data show that PairMotif+ can solve various (l, d) problems within an hour on a PC with 2.67 GHz processor, and has a better identification accuracy than the compared algorithms MEME, AlignACE and VINE. Also, the validity of the proposed algorithm is tested on multiple real data sets.

Subjects

Algorithms; Computational Biology/methods; Base Sequence/genetics

16.

PairMotif: A new pattern-driven algorithm for planted (l, d) DNA motif search.

Yu, Qiang; Huo, Hongwei; Zhang, Yipu; Guo, Hongzhi.

PLoS One ; 7(10): e48442, 2012.

Article in English | MEDLINE | ID: mdl-23119020

ABSTRACT

Motif search is a fundamental problem in bioinformatics with an important application in locating transcription factor binding sites (TFBSs) in DNA sequences. The exact algorithms can report all (l, d) motifs and find the best one under a specific objective function. However, it is still a challenging task to identify weak motifs, since either a large amount of memory or execution time is required by current exact algorithms. A new exact algorithm, PairMotif, is proposed for planted (l, d) motif search (PMS) in this paper. To effectively reduce both candidate motifs and scanned l-mers, multiple pairs of l-mers with relatively large distances are selected from input sequences to restrict the search space. Comparisons with several recently proposed algorithms show that PairMotif requires less storage space and runs faster on most PMS instances. Particularly, among the algorithms compared, only PairMotif can solve the weak instance (27, 9) within 10 hours. Moreover, the performance of PairMotif is stable over the sequence length, which allows it to identify motifs in longer sequences. For the real biological data, experimental results demonstrate the validity of the proposed algorithm.

Subjects

Algorithms; DNA/chemistry; Nucleotide Motifs; Software; Binding Sites; Computational Biology/methods; Computer Simulation; Pattern Recognition, Automated; Transcription Factors/metabolism

17.

A quantum-inspired genetic algorithm based on probabilistic coding for multiple sequence alignment.

Huo, Hong-Wei; Stojkovic, Vojislav; Xie, Qiao-Luan.

J Bioinform Comput Biol ; 8(1): 59-75, 2010 Feb.

Article in English | MEDLINE | ID: mdl-20183874

ABSTRACT

Quantum parallelism arises from the ability of a quantum memory register to exist in a superposition of base states. Since the number of possible base states is $2^n$, where n is the number of qubits in the quantum memory register, one operation on a quantum computer performs what an exponential number of operations on a classical computer performs. The power of quantum algorithms comes from taking advantage of quantum parallelism. Quantum algorithms are exponentially faster than classical algorithms. Genetic optimization algorithms are stochastic search algorithms which are used to search large, nonlinear spaces where expert knowledge is lacking or difficult to encode. QGMALIGN, a probabilistic-coding-based quantum-inspired genetic algorithm for multiple sequence alignment, is presented. A quantum rotation gate as a mutation operator is used to guide the quantum state evolution. Six genetic operators are designed on the coding basis to improve the solution during the evolutionary process. The experimental results show that QGMALIGN can compete with the popular methods, such as CLUSTALX and SAGA, and performs well on the presented biological data. Moreover, the addition of genetic operators to the quantum-inspired algorithm lowers the cost of overall running time.

Subjects

Algorithms; Sequence Alignment/statistics & numerical data; Computational Biology; Evolution, Molecular; Models, Genetic; Models, Statistical; Mutation; Stochastic Processes
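The quantum rotation gate mentioned in the abstract is, in quantum-inspired genetic algorithms generally, a 2x2 rotation applied to a qubit's amplitude pair. The Python sketch below shows that generic update and a probabilistic observation step. It is not QGMALIGN's operator: how the rotation angle is chosen is not stated in the abstract, and the function names are hypothetical.

```python
import math
import random


def rotate(alpha, beta, theta):
    """Apply the rotation gate [[cos, -sin], [sin, cos]] to amplitudes (alpha, beta)."""
    c, s = math.cos(theta), math.sin(theta)
    return c * alpha - s * beta, s * alpha + c * beta


def observe(alpha, beta, rng=random):
    """Collapse a qubit to a classical bit: 1 with probability beta**2."""
    return 1 if rng.random() < beta ** 2 else 0
```

Because the gate is a rotation, alpha**2 + beta**2 stays equal to 1, so repeated updates shift probability between the two outcomes without any renormalization.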

18.

Repeats identification using improved suffix trees.

Huo, Hongwei; Wang, Xiaowu; Stojkovic, Vojislav.

Int J Comput Biol Drug Des ; 2(3): 264-77, 2009.

Article in English | MEDLINE | ID: mdl-20090164

ABSTRACT

The suffix tree data structure plays an important role in the efficient implementations of some querying algorithms. This paper presents the fast Rep(eats)Seeker algorithm for repeats identification based on the improvements of suffix tree construction. The leaf nodes and the branch nodes are numbered in different ways during the construction of a suffix tree and extra information is added to the branch nodes. The experimental results show that improvements reduce the running time of the RepSeeker algorithm without losing the accuracy. The experimental results coincide with the theoretical expectations.

Subjects

Algorithms; Computational Biology/methods
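Repeat identification can be illustrated without building a suffix tree at all. The Python sketch below swaps in a simpler technique, sorting suffix start positions and comparing neighbors, to find the longest substring that occurs at least twice. It is not the RepSeeker algorithm and makes no claim about its node numbering or running time; the function name is hypothetical.

```python
def longest_repeated_substring(text):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    # Sorting suffix start positions places suffixes with shared prefixes side by side.
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0  # length of the common prefix of the two adjacent suffixes
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best
```

A suffix tree exposes the same information through its internal (branch) nodes, each of which corresponds to a substring shared by at least two suffixes.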

19.

A probabilistic coding based quantum genetic algorithm for multiple sequence alignment.

Huo, Hongwei; Xie, Qiaoluan; Shen, Xubang; Stojkovic, Vojislav.

Comput Syst Bioinformatics Conf ; 7: 15-26, 2008.

Article in English | MEDLINE | ID: mdl-19642265

ABSTRACT

This paper presents an original Quantum Genetic algorithm for Multiple sequence ALIGNment (QGMALIGN) that combines a genetic algorithm and a quantum algorithm. A quantum probabilistic coding is designed for representing the multiple sequence alignment. A quantum rotation gate as a mutation operator is used to guide the quantum state evolution. Six genetic operators are designed on the coding basis to improve the solution during the evolutionary process. The features of implicit parallelism and state superposition in quantum mechanics and the global search capability of the genetic algorithm are exploited to get efficient computation. A set of well known test cases from BAliBASE2.0 is used as reference to evaluate the efficiency of the QGMALIGN optimization. The QGMALIGN results have been compared with the most popular methods (CLUSTALX, SAGA, DIALIGN, SB_PIMA, and QGMALIGN) results. The QGMALIGN results show that QGMALIGN performs well on the presenting biological data. The addition of genetic operators to the quantum algorithm lowers the cost of overall running time.

Subjects

Algorithms; Artificial Intelligence; Models, Genetic; Pattern Recognition, Automated/methods; Sequence Alignment/methods; Sequence Analysis/methods; Computer Simulation; Data Interpretation, Statistical
