Exact global alignment using A* with chaining seed heuristic and match pruning.

Groot Koerkamp, Ragnar; Ivanov, Pesho.

Bioinformatics; 40(3), 2024 Mar 04.

Article in English | MEDLINE | ID: mdl-38265119

ABSTRACT

MOTIVATION: Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linear-like time. RESULTS: We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically. On random sequences of divergence d = 4% and length n, the empirical runtime of A*PA scales near-linearly with length (best fit n^1.06, n ≤ 10^7 bp). A similar scaling remains up to d = 12% (best fit n^1.24, n ≤ 10^7 bp). For n = 10^7 bp and d = 4%, A*PA reaches >500× speedup compared to the leading exact aligners Edlib and BiWFA. The performance of A*PA is highly influenced by long gaps. On long (n > 500 kb) ONT reads of a human sample it efficiently aligns sequences with d < 10%, leading to 3× median speedup compared to Edlib and BiWFA. When the sequences come from different human samples, A*PA performs 1.7× faster than Edlib and BiWFA. AVAILABILITY AND IMPLEMENTATION: github.com/RagnarGrootKoerkamp/astar-pairwise-aligner.
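
To make the core idea concrete, here is a minimal Python sketch of A* over the edit graph guided by a plain seed heuristic. It is an illustration under stated assumptions, not the A*PA implementation: it uses unit edit costs, non-overlapping seeds of length k, and none of the chaining, gap-cost, inexact-match, pruning, or diagonal-transition extensions described in the abstract. The function names `seed_heuristic_table` and `astar_edit_distance` are hypothetical.

```python
import heapq


def seed_heuristic_table(a, b, k):
    """Return h where h[i] counts the non-overlapping length-k seeds of `a`
    that start at or after position i and occur nowhere in `b`.
    Each such seed forces at least one edit, so h is a lower bound
    on the remaining edit cost (an assumption-light, basic seed heuristic)."""
    kmers_b = {b[j:j + k] for j in range(len(b) - k + 1)}
    h = [0] * (len(a) + 1)
    for start in range(0, len(a) - k + 1, k):
        if a[start:start + k] not in kmers_b:
            h[start] = 1  # mark an unmatched seed at its start position
    for i in range(len(a) - 1, -1, -1):
        h[i] += h[i + 1]  # turn the marks into suffix counts
    return h


def astar_edit_distance(a, b, k=3):
    """Exact unit-cost edit distance via A*.
    Returns (distance, number of expanded states)."""
    h = seed_heuristic_table(a, b, k)
    start, target = (0, 0), (len(a), len(b))
    best = {start: 0}
    heap = [(h[0], 0, start)]  # entries are (f = g + h, g, state)
    expanded = 0
    while heap:
        _, g, (i, j) = heapq.heappop(heap)
        if g > best.get((i, j), float("inf")):
            continue  # stale entry: a cheaper path to this state was found
        if (i, j) == target:
            return g, expanded
        expanded += 1
        steps = []
        if i < len(a) and j < len(b):
            steps.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))  # match or substitution
        if i < len(a):
            steps.append((i + 1, j, 1))  # deletion from a
        if j < len(b):
            steps.append((i, j + 1, 1))  # insertion from b
        for ni, nj, cost in steps:
            ng = g + cost
            # This heuristic is admissible but not consistent, so states are
            # re-queued whenever a strictly cheaper path to them appears.
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(heap, (ng + h[ni], ng, (ni, nj)))
    raise RuntimeError("target not reached")  # cannot happen on the edit graph


if __name__ == "__main__":
    distance, expanded = astar_edit_distance("ACGTACGTAC", "ACGTTCGTAC")
    print(distance, expanded)
```

Because the heuristic only ever underestimates the remaining cost, the first time the target state is popped its cost is the exact edit distance. The sketch stores every visited state in a dictionary, so it is meant for short inputs only.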

Subjects

Heuristics; Software; Humans; Sequence Analysis, DNA/methods; Algorithms; Seeds

Aligning distant sequences to graphs using long seed sketches.

Joudaki, Amir; Meterez, Alexandru; Mustafa, Harun; Groot Koerkamp, Ragnar; Kahles, André; Rätsch, Gunnar.

Genome Res; 33(7): 1208-1217, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37072187

ABSTRACT

Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text]. For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.

Subjects

Algorithms; Computational Biology; Computational Biology/methods; Sequence Alignment; Sequence Analysis, DNA/methods
