Search | VHL Regional Portal

Show: 20 | 50 | 100

Results 1 - 10 de 10

Filter

Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets.

Martayan, Igor; Cazaux, Bastien; Limasset, Antoine; Marchet, Camille.

Bioinformatics ; 40(Supplement_1): i48-i57, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940123

ABSTRACT

SUMMARY: In this article, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage's concept, CBL innovatively employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with Elias-Fano's scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its different combined abilities position it as a flexible Swiss knife structure for k-mer set management. AVAILABILITY AND IMPLEMENTATION: https://github.com/imartayan/CBL.

Subject(s)

Software , Algorithms , Computational Biology/methods

A survey of mapping algorithms in the long-reads era.

Sahlin, Kristoffer; Baudeau, Thomas; Cazaux, Bastien; Marchet, Camille.

Genome Biol ; 24(1): 133, 2023 06 01.

Article in English | MEDLINE | ID: mdl-37264447

ABSTRACT

It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings ( http://bcazaux.polytech-lille.net/Minimap2/ ).

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms

dipwmsearch: a Python package for searching di-PWM motifs.

Mille, Marie; Ripoll, Julie; Cazaux, Bastien; Rivals, Eric.

Bioinformatics ; 39(4)2023 04 03.

Article in English | MEDLINE | ID: mdl-37010504

ABSTRACT

MOTIVATION: Seeking probabilistic motifs in a sequence is a common task to annotate putative transcription factor binding sites or other RNA/DNA binding sites. Useful motif representations include position weight matrices (PWMs), dinucleotide PWMs (di-PWMs), and hidden Markov models (HMMs). Dinucleotide PWMs not only combine the simplicity of PWMs-a matrix form and a cumulative scoring function-but also incorporate dependency between adjacent positions in the motif (unlike PWMs which disregard any dependency). For instance to represent binding sites, the HOCOMOCO database provides di-PWM motifs derived from experimental data. Currently, two programs, SPRy-SARUS and MOODS, can search for occurrences of di-PWMs in sequences. RESULTS: We propose a Python package called dipwmsearch, which provides an original and efficient algorithm for this task (it first enumerates matching words for the di-PWM, and then searches these all at once in the sequence, even if the latter contains IUPAC codes). The user benefits from an easy installation via Pypi or conda, a comprehensive documentation, and executable scripts that facilitate the use of di-PWMs. AVAILABILITY AND IMPLEMENTATION: dipwmsearch is available at https://pypi.org/project/dipwmsearch/ and https://gite.lirmm.fr/rivals/dipwmsearch/ under Cecill license.

Subject(s)

Algorithms , Computational Biology , Binding Sites , Protein Binding , Position-Specific Scoring Matrices

HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets.

Kirsch-Gerweck, Benedikt; Bohnenkämper, Leonard; Henrichs, Michel T; Alanko, Jarno N; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Burger, Joachim; Stoye, Jens; Diekmann, Yoan.

Mol Biol Evol ; 40(3)2023 03 04.

Article in English | MEDLINE | ID: mdl-36790822

ABSTRACT

Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.

Subject(s)

Genetics, Population , Metagenomics , Genomics , Genome , Haplotypes

Founder reconstruction enables scalable and seamless pangenomic analysis.

Norri, Tuukka; Cazaux, Bastien; Dönges, Saska; Valenzuela, Daniel; Mäkinen, Veli.

Bioinformatics ; 37(24): 4611-4619, 2021 12 11.

Article in English | MEDLINE | ID: mdl-34260702

ABSTRACT

MOTIVATION: Variant calling workflows that utilize a single reference sequence are the de facto standard elementary genomic analysis routine for resequencing projects. Various ways to enhance the reference with pangenomic information have been proposed, but scalability combined with seamless integration to existing workflows remains a challenge. RESULTS: We present PanVC with founder sequences, a scalable and accurate variant calling workflow based on a multiple alignment of reference sequences. Scalability is achieved by removing duplicate parts up to a limit into a founder multiple alignment, that is then indexed using a hybrid scheme that exploits general purpose read aligners. Our implemented workflow uses GATK or BCFtools for variant calling, but the various steps of our workflow (e.g. vcf2multialign tool, founder reconstruction) can be of independent interest as a basis for creating novel pangenome analysis workflows beyond variant calling. AVAILABILITY AND IMPLEMENTATION: Our open access tools and instructions how to reproduce our experiments are available at the following address: https://github.com/algbio/panvc-founders. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genomics , Software , Sequence Analysis, DNA , Genome , Workflow

Finding all maximal perfect haplotype blocks in linear time.

Alanko, Jarno; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Stoye, Jens.

Algorithms Mol Biol ; 15: 2, 2020.

Article in English | MEDLINE | ID: mdl-32055252

ABSTRACT

Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings, 2018. 10.1007/978-3-030-01722-4_3) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.

Linear time minimum segmentation enables scalable founder reconstruction.

Norri, Tuukka; Cazaux, Bastien; Kosolobov, Dmitry; Mäkinen, Veli.

Algorithms Mol Biol ; 14: 12, 2019.

Article in English | MEDLINE | ID: mdl-31131017

ABSTRACT

BACKGROUND: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set R = { R 1 , , R m } of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1, n] into set P of disjoint segments such that each segment [ a , b ] ∈ P has length at least L and the number d ( a , b ) = | { R i [ a , b ] : 1 ≤ i ≤ m } | of distinct substrings at segment [a, b] is minimized over [ a , b ] ∈ P . The distinct substrings in the segments represent founder blocks that can be concatenated to form max { d ( a , b ) : [ a , b ] ∈ P } founder sequences representing the original R such that crossovers happen only at segment boundaries. RESULTS: We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier O ( m n 2 ) . CONCLUSIONS: Our improvement enables to apply the formulation on an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences.

AQUAPONY: visualization and interpretation of phylogeographic information on phylogenetic trees.

Cazaux, Bastien; Castel, Guillaume; Rivals, Eric.

Bioinformatics ; 35(17): 3163-3165, 2019 09 01.

Article in English | MEDLINE | ID: mdl-30649190

ABSTRACT

MOTIVATION: The visualization and interpretation of evolutionary spatiotemporal scenarios is broadly and increasingly used in infectious disease research, ecology or agronomy. Using probabilistic frameworks, well-known tools can infer from molecular data ancestral traits for internal nodes in a phylogeny, and numerous phylogenetic rendering tools can display such evolutionary trees. However, visualizing such ancestral information and its uncertainty on the tree remains tedious. For instance, ancestral nodes can be associated to several geographical annotations with close probabilities and thus, several migration or transmission scenarios exist. RESULTS: We expose a web-based tool, named AQUAPONY, that facilitates such operations. Given an evolutionary tree with ancestral (e.g. geographical) annotations, the user can easily control the display of ancestral information on the entire tree or a subtree, and can view alternative phylogeographic scenarios along a branch according to a chosen uncertainty threshold. AQUAPONY interactively visualizes the tree and eases the objective interpretation of evolutionary scenarios. AQUAPONY's implementation makes it highly responsive to user interaction, and instantaneously updates the tree visualizations even for large trees (which can be exported as image files). AVAILABILITY AND IMPLEMENTATION: AQUAPONY is coded in JavaScript/HTML, available under Cecill license, and can be freely used at http://www.atgc-montpellier.fr/aquapony/.

Subject(s)

Phylogeny , Software , Phenotype , Phylogeography

Read mapping on de Bruijn graphs.

Limasset, Antoine; Cazaux, Bastien; Rivals, Eric; Peterlongo, Pierre.

BMC Bioinformatics ; 17(1): 237, 2016 Jun 16.

Article in English | MEDLINE | ID: mdl-27306641

ABSTRACT

BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph, however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitigs sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22 % more reads can be mapped on the graph but not on the contig set. CONCLUSIONS: Although mapping reads on a de Bruijn graph is complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping even for complex eukaryotic data.

Subject(s)

Escherichia coli/genetics , Genome, Human , Genomics/methods , Algorithms , Contig Mapping , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA

10.

Colib'read on galaxy: a tools suite dedicated to biological information extraction from raw NGS reads.

Le Bras, Yvan; Collin, Olivier; Monjeaud, Cyril; Lacroix, Vincent; Rivals, Éric; Lemaitre, Claire; Miele, Vincent; Sacomoto, Gustavo; Marchet, Camille; Cazaux, Bastien; Zine El Aabidine, Amal; Salmela, Leena; Alves-Carvalho, Susete; Andrieux, Alexan; Uricaru, Raluca; Peterlongo, Pierre.

Gigascience ; 5: 9, 2016.

Article in English | MEDLINE | ID: mdl-26870323

ABSTRACT

BACKGROUND: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, needing large amounts of computing resources, and potentially removing or modifying parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools. FINDINGS: Dedicated to 'whole-genome assembly-free' treatments, the Colib'read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Based on the use of a de Bruijn graph and bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tools dissemination, we developed Galaxy tools and tool shed repositories. CONCLUSIONS: With the Colib'read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach allows the maximum biological information to be retained in the data, and uses a very low memory footprint.

Subject(s)

Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Information Storage and Retrieval/methods , Software , Base Sequence , Cluster Analysis , Genome/genetics , Genomics/methods , Molecular Sequence Data , Reproducibility of Results

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL