Search | VHL Regional Portal

1.

Learning locality-sensitive bucketing functions.

Yuan, Xin; Chen, Ke; Li, Xiang; Shi, Qian; Shao, Mingfu.

Bioinformatics ; 40(Supplement_1): i318-i327, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940133

ABSTRACT

MOTIVATION: Many tasks in sequence analysis ask to identify biologically related sequences in a large set. The edit distance, being a sensible model for both evolution and sequencing error, is widely used in these tasks as a measure. The resulting computational problem-to recognize all pairs of sequences within a small edit distance-turns out to be exceedingly difficult, since the edit distance is known to be notoriously expensive to compute and that all-versus-all comparison is simply not acceptable with millions or billions of sequences. Among many attempts, we recently proposed the locality-sensitive bucketing (LSB) functions to meet this challenge. Formally, a (d1,d2)-LSB function sends sequences into multiple buckets with the guarantee that pairs of sequences of edit distance at most d1 can be found within a same bucket while those of edit distance at least d2 do not share any. LSB functions generalize the locality-sensitive hashing (LSH) functions and admit favorable properties, with a notable highlight being that optimal LSB functions for certain (d1,d2) exist. LSB functions hold the potential of solving above problems optimally, but the existence of LSB functions for more general (d1,d2) remains unclear, let alone constructing them for practical use. RESULTS: In this work, we aim to utilize machine learning techniques to train LSB functions. With the development of a novel loss function and insights in the neural network structures that can potentially extend beyond this specific task, we obtained LSB functions that exhibit nearly perfect accuracy for certain (d1,d2), matching our theoretical results, and high accuracy for many others. Comparing to the state-of-the-art LSH method Order Min Hash, the trained LSB functions achieve a 2- to 5-fold improvement on the sensitivity of recognizing similar sequences. An experiment on analyzing erroneous cell barcode data is also included to demonstrate the application of the trained LSB functions. AVAILABILITY AND IMPLEMENTATION: The code for the training process and the structure of trained models are freely available at https://github.com/Shao-Group/lsb-learn.

Subject(s)

Algorithms , Computational Biology/methods , Machine Learning

2.

Accurate assembly of multiple RNA-seq samples with Aletsch.

Shi, Qian; Zhang, Qimin; Shao, Mingfu.

Bioinformatics ; 40(Supplement_1): i307-i317, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940157

ABSTRACT

MOTIVATION: High-throughput RNA sequencing has become indispensable for decoding gene activities, yet the challenge of reconstructing full-length transcripts persists. Traditional single-sample assemblers frequently produce fragmented transcripts, especially in single-cell RNA-seq data. While algorithms designed for assembling multiple samples exist, they encounter various limitations. RESULTS: We present Aletsch, a new assembler for multiple bulk or single-cell RNA-seq samples. Aletsch incorporates several algorithmic innovations, including a "bridging" system that can effectively integrate multiple samples to restore missed junctions in individual samples, and a new graph-decomposition algorithm that leverages "supporting" information across multiple samples to guide the decomposition of complex vertices. A standout feature of Aletsch is its application of a random forest model with 50 well-designed features for scoring transcripts. We demonstrate its robust adaptability across different chromosomes, datasets, and species. Our experiments, conducted on RNA-seq data from several protocols, firmly demonstrate Aletsch's significant outperformance over existing meta-assemblers. As an example, when measured with the partial area under the precision-recall curve (pAUC, constrained by precision), Aletsch surpasses the leading assemblers TransMeta by 22.9%-62.1% and PsiCLASS by 23.0%-175.5% on human datasets. AVAILABILITY AND IMPLEMENTATION: Aletsch is freely available at https://github.com/Shao-Group/aletsch. Scripts that reproduce the experimental results of this manuscript is available at https://github.com/Shao-Group/aletsch-test.

Subject(s)

Algorithms , RNA-Seq , Software , RNA-Seq/methods , Humans , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods

3.

Efficient Seeding for Error-Prone Sequences with SubseqHash2.

Li, Xiang; Chen, Ke; Shao, Mingfu.

bioRxiv ; 2024 Jun 03.

Article in English | MEDLINE | ID: mdl-38895288

ABSTRACT

Seeding is an essential preparatory step for large-scale sequence comparisons. Substring-based seeding methods such as kmers are ideal for sequences with low error rates but struggle to achieve high sensitivity while maintaining a reasonable precision for error-prone long reads. SubseqHash, a novel subsequence-based seeding method we recently developed, achieves superior accuracy to substring-based methods in seeding sequences with high mutation/error rates, while the only drawback is its computation speed. In this paper, we propose SubseqHash2, an improved algorithm that can compute multiple sets of seeds in one run by defining k orders over all length- k subsequences and identifying the optimal subsequence under each of the k orders in a single dynamic programming framework. The algorithm is further accelerated using SIMD instructions. SubseqHash2 achieves a 10-50× speedup over repeating SubseqHash while maintaining the high accuracy of seeds. We demonstrate that SubseqHash2 drastically outperforms popular substring-based methods including kmers, minimizers, syncmers, and Strobemers for three fundamental applications. In read mapping, SubseqHash2 can generate adequate seed-matches for aligning hard reads that minimap2 fails on. In sequence alignment, SubseqHash2 achieves high coverage of correct seeds and low coverage of incorrect seeds. In overlap detection, seeds produced by SubseqHash2 lead to more correct overlapping pairs at the same false-positive rate. With all the algorithmic breakthroughs of SubseqHash2, we clear the path for the wide adoption of subsequence-based seeds in long-read analysis. SubseqHash2 is available at https://github.com/Shao-Group/SubseqHash2.

4.

Accurate Assembly of Circular RNAs with TERRACE.

Zahin, Tasfia; Shi, Qian; Zang, Xiaofei Carl; Shao, Mingfu.

bioRxiv ; 2024 Feb 11.

Article in English | MEDLINE | ID: mdl-38370635

ABSTRACT

Circular RNA (circRNA) is a class of RNA molecules that forms a closed loop with its 5' and 3' ends covalently bonded. Due to this specific structure circRNAs are more stable than linear RNAs, admit distinct biological properties and functions, and have been proven to be promising biomarkers. Circular RNAs were severely overlooked previously owing to the biases in the RNA-seq protocols and in the detection algorithms, but recently gained tremendous attentions in both aspects. However, most existing methods for assembling circRNAs heavily rely on the annotated transcriptomes, and hence exhibit unsatisfactory accuracy when a high-quality transcriptome is unavailable. Here we present TERRACE, a new algorithm for full-length assembly of circRNAs from paired-end total RNA-seq data. TERRACE uses the splice graph as the underlying data structure to organize the splicing and coverage information. We transform the problem of assembling circRNAs into finding two paths that "bridge" the three fragments in the splice graph induced by back-spliced reads. To solve this formulation, we adopted a definition for optimal bridging paths and a dynamic programming algorithm to calculate such paths, an approach that was proven useful for assembling linear RNAs. TERRACE features an efficient algorithm to detect back-spliced reads that are missed by RNA-seq aligners, contributing to its much improved sensitivity. It also incorporates a new machine-learning approach that is trained to assign a confidence score to each assembled circRNA, which is shown superior to using abundance for scoring. TERRACE is compared with leading circRNA detection methods on both simulations and biological datasets. Our method consistently outperforms by a large margin in sensitivity while maintaining better or comparable precision. In particular, when the annotations are not provided, TERRACE can assemble 123%-412% more correct circRNAs than state-of-the-art methods on human tissues. TERRACE presents a major leap on assembling full-length circRNAs from RNA-seq data, and we expect it to be widely used in the downstream research on circRNAs.

5.

On de novo Bridging Paired-end RNA-seq Data.

Li, Xiang; Shao, Mingfu.

ACM BCB ; 20232023 Sep.

Article in English | MEDLINE | ID: mdl-38045531

ABSTRACT

The high-throughput short-reads RNA-seq protocols often produce paired-end reads, with the middle portion of the fragments being unsequenced. We explore if the full-length fragments can be computationally reconstructed from the sequenced two ends in the absence of the reference genome-a problem here we refer to as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads, and benefits downstream RNA-seq analysis such as transcript assembly, expression quantification, and splicing differential analysis. However, de novo bridging is a challenging and complicated task owing to alternative splicing, transcript noises, and sequencing errors. It remains unclear if the data provides sufficient information for accurate bridging, let alone efficient algorithms that determine the true bridges. Methods have been proposed to bridge paired-end reads in the presence of reference genome (called reference-based bridging), but the algorithms are far away from scaling for de novo bridging as the underlying compacted de Bruijn graph (cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra's algorithm from scratch for all vertices for further speeding up. These innovative techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.

6.

On the Maximal Independent Sets of k-mers with the Edit Distance.

Ma, Leran; Chen, Ke; Shao, Mingfu.

ACM BCB ; 20232023 Sep.

Article in English | MEDLINE | ID: mdl-38050580

ABSTRACT

In computational biology, k-mers and edit distance are fundamental concepts. However, little is known about the metric space of all k-mers equipped with the edit distance. In this work, we explore the structure of the k-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all k-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly those with high error-rates. Finding an MIS is a challenging problem, as the size of a k-mer space grows geometrically with respect to k. We propose three algorithms for this problem. The first and the most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality-property of the k-mer space and the estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented and the calculated MISs of k-mer spaces and their statistical properties are reported and analyzed for k up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace.

7.

Transcript assembly and annotations: Bias and adjustment.

Zhang, Qimin; Shao, Mingfu.

PLoS Comput Biol ; 19(12): e1011734, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38127855

ABSTRACT

Transcript annotations play a critical role in gene expression analysis as they serve as a reference for quantifying isoform-level expression. The two main sources of annotations are RefSeq and Ensembl/GENCODE, but discrepancies between their methodologies and information resources can lead to significant differences. It has been demonstrated that the choice of annotation can have a significant impact on gene expression analysis. Furthermore, transcript assembly is closely linked to annotations, as assembling large-scale available RNA-seq data is an effective data-driven way to construct annotations, and annotations are often served as benchmarks to evaluate the accuracy of assembly methods. However, the influence of different annotations on transcript assembly is not yet fully understood. We investigate the impact of annotations on transcript assembly. Surprisingly, we observe that opposite conclusions can arise when evaluating assemblers with different annotations. To understand this striking phenomenon, we compare the structural similarity of annotations at various levels and find that the primary structural difference across annotations occurs at the intron-chain level. Next, we examine the biotypes of annotated and assembled transcripts and uncover a significant bias towards annotating and assembling transcripts with intron retentions, which explains above the contradictory conclusions. We develop a standalone tool, available at https://github.com/Shao-Group/irtool, that can be combined with an assembler to generate an assembly without intron retentions. We evaluate the performance of such a pipeline and offer guidance to select appropriate assembling tools for different application scenarios.

Subject(s)

Gene Expression Profiling , Molecular Sequence Annotation , Protein Isoforms/genetics , RNA-Seq

8.

Locality-sensitive bucketing functions for the edit distance.

Chen, Ke; Shao, Mingfu.

Algorithms Mol Biol ; 18(1): 7, 2023 Jul 24.

Article in English | MEDLINE | ID: mdl-37488600

ABSTRACT

BACKGROUND: Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. RESULTS: In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be [Formula: see text]-sensitive if any two sequences within an edit distance of [Formula: see text] are mapped into at least one shared bucket, and any two sequences with distance at least [Formula: see text] are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of [Formula: see text] and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. CONCLUSION: These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.

9.

Seeding with minimized subsequence.

Li, Xiang; Shi, Qian; Chen, Ke; Shao, Mingfu.

Bioinformatics ; 39(39 Suppl 1): i232-i241, 2023 06 30.

Article in English | MEDLINE | ID: mdl-37387132

ABSTRACT

MOTIVATION: Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. RESULTS: We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. AVAILABILITY AND IMPLEMENTATION: SubseqHash is freely available at https://github.com/Shao-Group/subseqhash.

Subject(s)

Algorithms , Mutation Rate , Probability , Sequence Alignment

10.

Transcript Assembly and Annotations: Bias and Adjustment.

Zhang, Qimin; Shao, Mingfu.

bioRxiv ; 2023 Apr 21.

Article in English | MEDLINE | ID: mdl-37131680

ABSTRACT

Motivation: Transcript annotations play a critical role in gene expression analysis as they serve as a reference for quantifying isoform-level expression. The two main sources of annotations are RefSeq and Ensembl/GENCODE, but discrepancies between their methodologies and information resources can lead to significant differences. It has been demonstrated that the choice of annotation can have a significant impact on gene expression analysis. Furthermore, transcript assembly is closely linked to annotations, as assembling large-scale available RNA-seq data is an effective data-driven way to construct annotations, and annotations are often served as benchmarks to evaluate the accuracy of assembly methods. However, the influence of different annotations on transcript assembly is not yet fully understood. Results: We investigate the impact of annotations on transcript assembly. We observe that conflicting conclusions can arise when evaluating assemblers with different annotations. To understand this striking phenomenon, we compare the structural similarity of annotations at various levels and find that the primary structural difference across annotations occurs at the intron-chain level. Next, we examine the biotypes of annotated and assembled transcripts and uncover a significant bias towards annotating and assembling transcripts with intron retentions, which explains above the contradictory conclusions. We develop a standalone tool, available at https://github.com/Shao-Group/irtool, that can be combined with an assembler to generate an assembly without intron retentions. We evaluate the performance of such a pipeline and offer guidance to select appropriate assembling tools for different application scenarios.

11.

On de novo Bridging Paired-end RNA-seq Data.

Li, Xiang; Shao, Mingfu.

ArXiv ; 2023 Mar 27.

Article in English | MEDLINE | ID: mdl-37033458

ABSTRACT

The high-throughput short-reads RNA-seq protocols often produce paired-end reads, with the middle portion of the fragments being unsequenced. We explore if the full-length fragments can be computationally reconstructed from the sequenced two ends in the absence of the reference genome - a problem here we refer to as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads, and benefits downstream RNA-seq analysis such as transcript assembly, expression quantification, and splicing differential analysis. However, de novo bridging is a challenging and complicated task owing to alternative splicing, transcript noises, and sequencing errors. It remains unclear if the data provides sufficient information for accurate bridging, let alone efficient algorithms that determine the true bridges. Methods have been proposed to bridge paired-end reads in the presence of reference genome (called reference-based bridging), but the algorithms are far away from scaling for de novo bridging as the underlying compacted de Bruijn graph(cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra's algorithm from scratch for all vertices for further speeding up. These innovative techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.

12.

Accurate assembly of multi-end RNA-seq data with Scallop2.

Zhang, Qimin; Shi, Qian; Shao, Mingfu.

Nat Comput Sci ; 2(3): 148-152, 2022 Mar.

Article in English | MEDLINE | ID: mdl-36713932

ABSTRACT

Modern RNA-sequencing protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core of Scallop2 consists of three steps: (1) using an algorithm to "bridge" multi-end reads into single-end phasing paths in the context of a splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge, and (3) piping the refined splice graph and the bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on 10 Illumina paired-end RNA-seq samples, Scallop2 substantially improves the assembly accuracy compared to two popular assemblers StringTie2 and Scallop.

13.

Context-aware seeds for read mapping.

Xin, Hongyi; Shao, Mingfu; Kingsford, Carl.

Algorithms Mol Biol ; 15: 10, 2020.

Article in English | MEDLINE | ID: mdl-32489399

ABSTRACT

MOTIVATION: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows, this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers. RESULTS: We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed in the reference. We prove that all valid mappings can be found if the sum of confidence radii of seeds are greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We experiment CAS with E. coli genome and show that CAS significantly reduces seed frequencies when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver. AVAILABILITY: https://github.com/Kingsford-Group/CAS_code.

14.

Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads.

Tung, Laura H; Shao, Mingfu; Kingsford, Carl.

Genome Biol ; 20(1): 287, 2019 12 18.

Article in English | MEDLINE | ID: mdl-31849338

ABSTRACT

Single-molecule long-read sequencing has been used to improve mRNA isoform identification. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and sequencing length limits. This drives a need for long-read transcript assembly. By adding long-read-specific optimizations to Scallop, we developed Scallop-LR, a reference-based long-read transcript assembler. Analyzing 26 PacBio samples, we quantified the benefit of performing transcript assembly on long reads. We demonstrate Scallop-LR identifies more known transcripts and potentially novel isoforms for the human transcriptome than Iso-Seq Analysis and StringTie, indicating that long-read transcript assembly by Scallop-LR can reveal a more complete human transcriptome.

Subject(s)

Genetic Techniques , Single Molecule Imaging , Transcriptome , Algorithms , Animals , Humans , Mice , Software

15.

Theory and A Heuristic for the Minimum Path Flow Decomposition Problem.

Shao, Mingfu; Kingsford, Carl.

IEEE/ACM Trans Comput Biol Bioinform ; 16(2): 658-670, 2019.

Article in English | MEDLINE | ID: mdl-29990201

ABSTRACT

Motivated by multiple genome assembly problems and other applications, we study the following minimum path flow decomposition problem: Given a directed acyclic graph $G=(V,E)$G=(V,E) with source $s$s and sink $t$t and a flow $f$f, compute a set of $s$s-$t$t paths $P$P and assign weight $w(p)$w(p) for $p\in P$p∈P such that $f(e) = \sum _{p\in P: e\in p} w(p)$f(e)=∑p∈P:e∈pw(p), $\forall e\in E$∀e∈E, and $|P|$|P| is minimized. We develop some fundamental theory for this problem, upon which we design an efficient heuristic. Specifically, we prove that the gap between the optimal number of paths and a known upper bound is determined by the nontrivial equations within the flow values. This result gives rise to the framework of our heuristic: to iteratively reduce the gap through identifying such equations. We also define an operation on certain independent substructures of the graph, and prove that this operation does not affect the optimality but can transform the graph into one with desired property that facilitates reducing the gap. We apply and test our algorithm on both simulated random instances and perfect splice graph instances, and also compare it with the existing state-of-art algorithm for flow decomposition. The results illustrate that our algorithm can achieve very high accuracy on these instances, and also that our algorithm significantly improves on the previous algorithms. An implementation of our algorithm is freely available at https://github.com/Kingsford-Group/catfish.

Subject(s)

Heuristics , Sequence Alignment/methods , Algorithms , Animals , Genomics/methods , Humans , RNA/genetics , Sequence Analysis, RNA

16.

SQUID: transcriptomic structural variation detection from RNA-seq.

Ma, Cong; Shao, Mingfu; Kingsford, Carl.

Genome Biol ; 19(1): 52, 2018 04 12.

Article in English | MEDLINE | ID: mdl-29650026

ABSTRACT

Transcripts are frequently modified by structural variations, which lead to fused transcripts of either multiple genes, known as a fusion gene, or a gene and a previously non-transcribed sequence. Detecting these modifications, called transcriptomic structural variations (TSVs), especially in cancer tumor sequencing, is an important and challenging computational problem. We introduce SQUID, a novel algorithm to predict both fusion-gene and non-fusion-gene TSVs accurately from RNA-seq alignments. SQUID unifies both concordant and discordant read alignments into one model and doubles the precision on simulation data compared to other approaches. Using SQUID, we identify novel non-fusion-gene TSVs on TCGA samples.

Subject(s)

Algorithms , Gene Expression Profiling/methods , Genomic Structural Variation , Sequence Analysis, RNA/methods , Cell Line, Tumor , Gene Fusion , Genes, Tumor Suppressor , Humans , Sequence Alignment , Software

17.

Accurate assembly of transcripts through phase-preserving graph decomposition.

Shao, Mingfu; Kingsford, Carl.

Nat Biotechnol ; 35(12): 1167-1169, 2017 Dec.

Article in English | MEDLINE | ID: mdl-29131147

ABSTRACT

We introduce Scallop, an accurate reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Scallop preserves long-range phasing paths extracted from reads, while producing a parsimonious set of transcripts and minimizing coverage deviation. On 10 human RNA-seq samples, Scallop produces 34.5% and 36.3% more correct multi-exon transcripts than StringTie and TransComb, and respectively identifies 67.5% and 52.3% more lowly expressed transcripts. Scallop achieves higher sensitivity and precision than previous approaches over a wide range of coverage thresholds.

Subject(s)

Algorithms , RNA , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Software , Humans , RNA/analysis , RNA/genetics , RNA/metabolism

18.

DeepBound: accurate identification of transcript boundaries via deep convolutional neural fields.

Shao, Mingfu; Ma, Jianzhu; Wang, Sheng.

Bioinformatics ; 33(14): i267-i273, 2017 Jul 15.

Article in English | MEDLINE | ID: mdl-28881999

ABSTRACT

MOTIVATION: Reconstructing the full-length expressed transcripts ( a.k.a. the transcript assembly problem) from the short sequencing reads produced by RNA-seq protocol plays a central role in identifying novel genes and transcripts as well as in studying gene expressions and gene functions. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the reads alignment. In contrast to the splicing junctions that can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging, due to the fact that the signal related to boundaries is noisy and weak. RESULTS: We present DeepBound, an effective approach to identify boundaries of expressed transcripts from RNA-seq reads alignment. In its core DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we novelly incorporate the AUC (area under the curve) score into the optimizing objective function. To address the issue that deep probabilistic graphical models requires large number of labeled training samples, we propose to use simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulation datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods. AVAILABILITY AND IMPLEMENTATION: DeepBound is freely available at https://github.com/realbigws/DeepBound . CONTACT: mingfu.shao@cs.cmu.edu or realbigws@gmail.com.

Subject(s)

RNA Splicing , Sequence Analysis, RNA/methods , Software , Algorithms , Area Under Curve , Computer Simulation , Exons , Humans , Introns , Models, Genetic

19.

On Computing Breakpoint Distances for Genomes with Duplicate Genes.

Shao, Mingfu; Moret, Bernard M E.

J Comput Biol ; 24(6): 571-580, 2017 Jun.

Article in English | MEDLINE | ID: mdl-27788022

ABSTRACT

A fundamental problem in comparative genomics is to compute the distance between two genomes in terms of its higher level organization (given by genes or syntenic blocks). For two genomes without duplicate genes, we can easily define (and almost always efficiently compute) a variety of distance measures, but the problem is NP-hard under most models when genomes contain duplicate genes. To tackle duplicate genes, three formulations (exemplar, maximum matching, and any matching) have been proposed, all of which aim to build a matching between homologous genes so as to minimize some distance measure. Of the many distance measures, the breakpoint distance (the number of nonconserved adjacencies) was the first one to be studied and remains of significant interest because of its simplicity and model-free property. The three breakpoint distance problems corresponding to the three formulations have been widely studied. Although we provided last year a solution for the exemplar problem that runs very fast on full genomes, computing optimal solutions for the other two problems has remained challenging. In this article, we describe very fast, exact algorithms for these two problems. Our algorithms rely on a compact integer-linear program that we further simplify by developing an algorithm to remove variables, based on new results on the structure of adjacencies and matchings. Through extensive experiments using both simulations and biological data sets, we show that our algorithms run very fast (in seconds) on mammalian genomes and scale well beyond. We also apply these algorithms (as well as the classic orthology tool MSOAR) to create orthology assignment, then compare their quality in terms of both accuracy and coverage. We find that our algorithm for the "any matching" formulation significantly outperforms other methods in terms of accuracy while achieving nearly maximum coverage.

Subject(s)

Algorithms , Genes, Duplicate , Genome , Genomics/methods , Mammals/genetics , Models, Genetic , Animals , Biological Evolution

20.

A Fast and Exact Algorithm for the Exemplar Breakpoint Distance.

Shao, Mingfu; Moret, Bernard M E.

J Comput Biol ; 23(5): 337-46, 2016 05.

Article in English | MEDLINE | ID: mdl-26953781

ABSTRACT

A fundamental problem in comparative genomics is to compute the distance between two genomes. For two genomes without duplicate genes, we can easily compute a variety of distance measures in linear time, but the problem is NP-hard under most models when genomes contain duplicate genes. Sankoff proposed the use of exemplars to tackle the problem of duplicate genes and gene families: each gene family is represented by a single gene (the exemplar for that family), chosen so as to optimize some metric. Unfortunately, choosing exemplars is itself an NP-hard problem. In this article, we propose a very fast and exact algorithm to compute the exemplar breakpoint distance, based on new insights in the underlying structure of genome rearrangements and exemplars. We evaluate the performance of our algorithm on simulation data and compare its performance to the best effort to date (a divide-and-conquer approach), showing that our algorithm runs much faster and scales much better. We also devise a new algorithm for the intermediate breakpoint distance problem, which can then be applied to assign orthologs. We compare our algorithm with the state-of-the-art method MSOAR by assigning orthologs among five well annotated mammalian genomes, showing that our algorithm runs much faster and is slightly more accurate than MSOAR.

Subject(s)

Genomics/methods , Mammals/genetics , Algorithms , Animals , Genome , Models, Genetic , Multigene Family , Software

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL