Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 39
Filter
Add more filters










Publication year range
1.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38676570

ABSTRACT

MOTIVATION: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. RESULTS: In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination-we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.


Subject(s)
Algorithms , Genome, Bacterial , Recombination, Genetic , Sequence Alignment , Sequence Alignment/methods , Humans , Software , Sequence Analysis, DNA/methods , Genomics/methods
2.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37688560

ABSTRACT

MOTIVATION: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as µ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the µ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, µ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. µ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.


Subject(s)
Biological Specimen Banks , Haplotypes , Whole Genome Sequencing , United Kingdom
3.
Nat Methods ; 20(4): 550-558, 2023 04.
Article in English | MEDLINE | ID: mdl-36550274

ABSTRACT

Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS)-a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.


Subject(s)
Genomics , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Genome, Human , Repetitive Sequences, Nucleic Acid
4.
Gigascience ; 122022 12 28.
Article in English | MEDLINE | ID: mdl-36576129

ABSTRACT

BACKGROUND: Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in the history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences, and it is continuously expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade. RESULTS: In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences that we implement into CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of the GISAID, CouGaR-g achieved an $96.29\%$ overall accuracy, while a similar tool, Covidex, obtained a $77,12\%$ overall accuracy. As far as we know, our method is the first using deep learning and FCGR for intraspecies classification. Furthermore, by using some feature importance methods, CouGaR-g allows to identify k-mers that match SARS-CoV-2 marker variants. CONCLUSIONS: By combining FCGR and CNNs, we develop a method that achieves a better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, also thanks to our training on a much larger dataset, with comparable running times. Our method implemented in CouGaR-g is able to detect k-mers that capture relevant biological information that distinguishes the clades, known as marker variants. AVAILABILITY: The trained models can be tested online providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.


Subject(s)
COVID-19 , Deep Learning , Puma , Animals , SARS-CoV-2/genetics , Puma/genetics , Genome, Viral
5.
BMC Bioinformatics ; 22(Suppl 15): 625, 2022 Apr 19.
Article in English | MEDLINE | ID: mdl-35439933

ABSTRACT

BACKGROUND: Being able to efficiently call variants from the increasing amount of sequencing data daily produced from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of the viral strains across the globe. RESULTS: We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as the SARS-CoV-2. MALVIRUS allows to: (1) construct a variant catalog consisting in a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is the SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations. CONCLUSIONS: Tests on Illumina and Nanopore samples proved the efficiency and the effectiveness of MALVIRUS in analyzing SARS-CoV-2 strain samples with respect to publicly available data provided by NCBI and the more complete dataset provided by GISAID. A comparison with state-of-the-art tools showed that MALVIRUS is always more precise and often have a better recall.


Subject(s)
COVID-19 , Genome, Viral , High-Throughput Nucleotide Sequencing , Humans , Mutation , Pandemics , Phylogeny , SARS-CoV-2/genetics
6.
J Comput Biol ; 29(2): 188-194, 2022 02.
Article in English | MEDLINE | ID: mdl-35041518

ABSTRACT

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in [Formula: see text] space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in [Formula: see text] space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.


Subject(s)
Algorithms , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Databases, Genetic/statistics & numerical data , Genome, Human , Genomics/statistics & numerical data , Humans , Sequence Analysis, DNA/statistics & numerical data
7.
Nat Comput ; 21(1): 81-108, 2022 Mar.
Article in English | MEDLINE | ID: mdl-36969737

ABSTRACT

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations-thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.

8.
IEEE J Biomed Health Inform ; 25(11): 4068-4078, 2021 11.
Article in English | MEDLINE | ID: mdl-34003758

ABSTRACT

Single cell sequencing (SCS) technologies provide a level of resolution that makes it indispensable for inferring from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes it difficult, sometimes infeasible using current approaches and tools. One possible solution is to reduce the size of an SCS instance - usually represented as a matrix of presence, absence, and uncertainty of the mutations found in the different sequenced cells - and to infer the tree from this reduced-size instance. In this work, we present a new clustering procedure aimed at clustering such categorical vector, or matrix data - here representing SCS instances, called celluloid. We show that celluloid clusters mutations with high precision: never pairing too many mutations that are unrelated in the ground truth, but also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. We demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime, raising considerably the upper bound on the size of SCS instances which can be solved in practice. Our approach, celluloid: clustering single cell sequencing data around centroids is available at https://github.com/AlgoLab/celluloid/ under an MIT license, as well as on the Python Package Index (PyPI) at https://pypi.org/project/celluloid-clust/.


Subject(s)
Algorithms , Neoplasms , Cluster Analysis , Humans , Mutation/genetics , Neoplasms/genetics , Phylogeny , Software
9.
Bioinformatics ; 37(4): 464-472, 2021 05 01.
Article in English | MEDLINE | ID: mdl-32926128

ABSTRACT

MOTIVATION: Recent advances in high-throughput RNA-Seq technologies allow to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. RESULTS: We introduce a novel computational problem, called gene assignment and we propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists in extracting from the sample, the reads that most probably were sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results. AVAILABILITY AND IMPLEMENTATION: The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Sharks , Alternative Splicing , Animals , RNA-Seq , Sequence Analysis, RNA , Sharks/genetics , Software
10.
Bioinform Adv ; 1(1): vbab005, 2021.
Article in English | MEDLINE | ID: mdl-36700094

ABSTRACT

Motivation: Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants). Results: We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome ('samples-specific' strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach will give us the ability to perform comparative genome analysis without the need to map the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data). Availability and implementation: Data, code and instructions for reproducing the results presented in this manuscript are publicly available at https://github.com/Parsoa/PingPong. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

11.
Bioinformatics ; 37(3): 326-333, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32805010

ABSTRACT

MOTIVATION: In recent years, the well-known Infinite Sites Assumption has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions. However, recent studies leveraging single-cell sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. While there exist established computational methods that infer phylogenies with mutation losses, there remain some advancements to be made. RESULTS: We present Simulated Annealing Single-Cell inference (SASC): a new and robust approach based on simulated annealing for the inference of cancer progression from SCS datasets. In particular, we introduce an extension of the model of evolution where mutations are only accumulated, by allowing also a limited amount of mutation loss in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SASC achieves high levels of accuracy when tested on both simulated and real datasets and in comparison with some other available methods. AVAILABILITY AND IMPLEMENTATION: The SASC tool is open source and available at https://github.com/sciccolella/sasc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Neoplasms , Single-Cell Analysis , Humans , Mutation , Neoplasms/genetics , Phylogeny , Sequence Analysis , Software
12.
Bioinformatics ; 37(2): 178-184, 2021 04 19.
Article in English | MEDLINE | ID: mdl-32730595

ABSTRACT

MOTIVATION: The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarities have been proposed in the literature, but none of them has emerged as the golden standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. RESULTS: To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to classify correctly similar and dissimilar trees, both on simulated and on real data. AVAILABILITY AND IMPLEMENTATION: An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Trees , Biological Evolution , Phylogeny , Sequence Analysis , Software
13.
BMC Bioinformatics ; 21(Suppl 1): 413, 2020 Dec 09.
Article in English | MEDLINE | ID: mdl-33297943

ABSTRACT

BACKGROUND: Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer: this phenomenon is currently widely ignored. RESULTS: We present a new tool, gpps, that reconstructs a tumor phylogeny from Single Cell Sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps . CONCLUSIONS: gpps provides new insights to the analysis of intra-tumor heterogeneity by proposing a new progression model to the field of cancer phylogeny reconstruction on Single Cell data.


Subject(s)
Computational Biology/methods , DNA Mutational Analysis , Disease Progression , Mutation , Neoplasms/genetics , Neoplasms/pathology , Base Sequence , Evolution, Molecular , Humans , Phylogeny , Single-Cell Analysis
14.
Bioinformatics ; 36(5): 1622-1624, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31589304

ABSTRACT

SUMMARY: Retroviruses and their vector derivatives integrate semi-randomly in the genome of host cells and are inherited by their progeny as stable genetic marks. The retrieval and mapping of the sequences flanking the virus-host DNA junctions allows the identification of insertion sites in gene therapy or virally infected patients, essential for monitoring the evolution of genetically modified cells in vivo. However, since ∼30% of insertions land in low complexity or repetitive regions of the host cell genome, they cannot be correctly assigned and are currently discarded, limiting the accuracy and predictive power of clonal tracking studies. Here, we present γ-TRIS, a new graph-based genome-free alignment tool for identifying insertion sites even if embedded in low complexity regions. By using γ-TRIS to reanalyze clinical studies, we observed improvements in clonal quantification and tracking. AVAILABILITY AND IMPLEMENTATION: Source code at https://bitbucket.org/bereste/g-tris. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Genomics , Algorithms , Base Sequence , Humans , Software
15.
IEEE/ACM Trans Comput Biol Bioinform ; 16(5): 1410-1423, 2019.
Article in English | MEDLINE | ID: mdl-31603766

ABSTRACT

Most of the evolutionary history reconstruction approaches are based on the infinite sites assumption, which states that mutations appear once in the evolutionary history. The Perfect Phylogeny model is the result of the infinite sites assumption and has been widely used to infer cancer evolution. Nonetheless, recent results show that recurrent and back mutations are present in the evolutionary history of tumors, hence the Perfect Phylogeny model might be too restrictive. We propose an approach that allows losing previously acquired mutations and multiple acquisitions of a character. Moreover, we provide an ILP formulation for the evolutionary tree reconstruction problem. Our formulation allows us to tackle both the Incomplete Directed Phylogeny problem and the Clonal Reconstruction problem when general evolutionary models are considered. The latter problem is fundamental in cancer genomics, the goal is to study the evolutionary history of a tumor considering as input data the fraction of cells having a certain mutation in a set of cancer samples. For the Clonal Reconstruction problem, an experimental analysis shows the advantage of allowing mutation losses. Namely, by analyzing real and simulated datasets, our ILP approach provides a better interpretation of the evolutionary history than a Perfect Phylogeny. The software is at https://github.com/AlgoLab/gppf.


Subject(s)
Genomics/methods , Neoplasms , Software , Algorithms , Humans , Mutation/genetics , Neoplasms/classification , Neoplasms/genetics , Neoplasms/metabolism , Phylogeny
16.
iScience ; 18: 20-27, 2019 Aug 30.
Article in English | MEDLINE | ID: mdl-31352182

ABSTRACT

The amount of genetic variation discovered in human populations is growing rapidly leading to challenging computational tasks, such as variant calling. Standard methods for addressing this problem include read mapping, a computationally expensive procedure; thus, mapping-free tools have been proposed in recent years. These tools focus on isolated, biallelic SNPs, providing limited support for multi-allelic SNPs and short insertions and deletions of nucleotides (indels). Here we introduce MALVA, a mapping-free method to genotype an individual from a sample of reads. MALVA is the first mapping-free tool able to genotype multi-allelic SNPs and indels, even in high-density genomic regions, and to effectively handle a huge number of variants. MALVA requires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels, MALVA provides even better results than the most widely adopted variant discovery tools.

17.
J Comput Biol ; 26(9): 948-961, 2019 09.
Article in English | MEDLINE | ID: mdl-31140836

ABSTRACT

Indexing huge collections of strings, such as those produced by the widespread sequencing technologies, heavily relies on multistring generalizations of the Burrows-Wheeler transform (BWT) and the longest common prefix (LCP) array, since solving efficiently both problems are essential ingredients of several algorithms on a collection of strings, such as those for genome assembly. In this article, we explore a multithread computational strategy for building the BWT and LCP array. Our algorithm applies a divide and conquer approach that leads to parallel computation of multistring BWT and LCP array.


Subject(s)
Algorithms , Computational Biology/methods , Sequence Analysis/methods
18.
BMC Bioinformatics ; 19(1): 444, 2018 Nov 20.
Article in English | MEDLINE | ID: mdl-30458725

ABSTRACT

BACKGROUND: While the reconstruction of transcripts from a sample of RNA-Seq data is a computationally expensive and complicated task, the detection of splicing events from RNA-Seq data and a gene annotation is computationally feasible. This latter task, which is adequate for many transcriptome analyses, is usually achieved by aligning the reads to a reference genome, followed by comparing the alignments with a gene annotation, often implicitly represented by a graph: the splicing graph. RESULTS: We present ASGAL (Alternative Splicing Graph ALigner): a tool for mapping RNA-Seq data to the splicing graph, with the specific goal of detecting novel splicing events, involving either annotated or unannotated splice sites. ASGAL takes as input the annotated transcripts of a gene and a RNA-Seq sample, and computes (1) the spliced alignments of each read in input, and (2) a list of novel events with respect to the gene annotation. CONCLUSIONS: An experimental analysis shows that ASGAL allows to enrich the annotation with novel alternative splicing events even when genes in an experiment express at most one isoform. Compared with other tools which use the spliced alignment of reads against a reference genome for differential analysis, ASGAL better predicts events that use splice sites which are novel with respect to a splicing graph, showing a higher accuracy. To the best of our knowledge, ASGAL is the first tool that detects novel alternative splicing events by directly aligning reads to a splicing graph. AVAILABILITY: Source code, documentation, and data are available for download at http://asgal.algolab.eu .


Subject(s)
Alternative Splicing/genetics , RNA Splicing/genetics , RNA/genetics , Sequence Analysis, RNA/methods , Humans
19.
BMC Bioinformatics ; 19(1): 252, 2018 07 03.
Article in English | MEDLINE | ID: mdl-29970002

ABSTRACT

BACKGROUND: Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes since their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. RESULTS: Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, allowing to consider datasets composed of higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60 × coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. CONCLUSIONS: Our method leverages the long-range information of sequencing reads that allows to obtain assembled haplotypes fragmented in a lower number of unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result. AVAILABILITY: HapCHAT is available at http://hapchat.algolab.eu under the GNU Public License (GPL).


Subject(s)
Haplotypes/genetics , Sequence Analysis, DNA/methods , Algorithms , Humans
20.
J Comput Biol ; 24(10): 953-968, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28715269

ABSTRACT

The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to compute the string graph, based on the FM-index and Burrows and Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the string graph assembler (SGA) as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.


Subject(s)
Algorithms , Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Genome, Human , Humans
SELECTION OF CITATIONS
SEARCH DETAIL
...