Search | VHL Regional Portal

1.

Combining DNA and protein alignments to improve genome annotation with LiftOn.

Chao, Kuan-Hao; Heinz, Jakob M; Hoh, Celine; Mao, Alan; Shumate, Alaina; Pertea, Mihaela; Salzberg, Steven L.

bioRxiv ; 2024 May 17.

Article in English | MEDLINE | ID: mdl-38798552

ABSTRACT

As the number and variety of assembled genomes continue to grow, the number of annotated genomes is falling behind, particularly for eukaryotes. DNA-based mapping tools help to address this challenge, but they are only able to transfer annotation between closely related species. Here we introduce LiftOn, a homology-based software tool that integrates DNA and protein alignments to enhance the accuracy of genome-scale annotation and to allow mapping between relatively distant species. LiftOn's protein-centric algorithm considers both types of alignments, chooses optimal open reading frames, resolves overlapping gene loci, and finds additional gene copies where they exist. LiftOn can reliably transfer annotation between genomes representing members of the same species, as we demonstrate on human, mouse, honey bee, rice, and Arabidopsis thaliana. It can further map annotation effectively across species pairs as far apart as mouse and rat or Drosophila melanogaster and D. erecta.

2.

EASTR: Identifying and eliminating systematic alignment errors in multi-exon genes.

Shinder, Ida; Hu, Richard; Ji, Hyun Joo; Chao, Kuan-Hao; Pertea, Mihaela.

Nat Commun ; 14(1): 7223, 2023 11 09.

Article in English | MEDLINE | ID: mdl-37940654

ABSTRACT

Accurate alignment of transcribed RNA to reference genomes is a critical step in the analysis of gene expression, which in turn has broad applications in biomedical research and in the basic sciences. We reveal that widely used splice-aware aligners, such as STAR and HISAT2, can introduce erroneous spliced alignments between repeated sequences, leading to the inclusion of falsely spliced transcripts in RNA-seq experiments. In some cases, the 'phantom' introns resulting from these errors make their way into widely-used genome annotation databases. To address this issue, we present EASTR (Emending Alignments of Spliced Transcript Reads), a software tool that detects and removes falsely spliced alignments or transcripts from alignment and annotation files. EASTR improves the accuracy of spliced alignments across diverse species, including human, maize, and Arabidopsis thaliana, by detecting sequence similarity between intron-flanking regions. We demonstrate that applying EASTR before transcript assembly substantially reduces false positive introns, exons, and transcripts, improving the overall accuracy of assembled transcripts. Additionally, we show that EASTR's application to reference annotation databases can detect and correct likely cases of mis-annotated transcripts.

Subject(s)

Arabidopsis , Software , Humans , Exons/genetics , Genome , RNA , Sequence Analysis, RNA/methods , Arabidopsis/genetics , Introns/genetics
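The idea in the EASTR abstract of comparing the sequence around the two ends of an intron can be illustrated with a minimal sketch. This is not EASTR's implementation: the function names, the window size `k`, the `threshold`, and the use of Python's `difflib` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def flank(seq, pos, k):
    """Return up to 2*k bases centred on position `pos` (0-based)."""
    return seq[max(0, pos - k): pos + k]


def flank_similarity(seq, donor, acceptor, k=50):
    """Similarity (0.0 to 1.0) between the windows around the donor and acceptor sites."""
    a = flank(seq, donor, k)
    b = flank(seq, acceptor, k)
    # autojunk=False so that frequent bases are not discarded as "junk"
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def looks_spurious(seq, donor, acceptor, k=50, threshold=0.9):
    """Flag an intron whose two flanking windows are nearly identical (hypothetical cutoff)."""
    return flank_similarity(seq, donor, acceptor, k) >= threshold
```

A spliced alignment that joins two copies of a repeat would score close to 1.0 here, whereas the flanks of a genuine intron are usually unrelated.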

3.

CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure.

Varabyou, Ales; Sommer, Markus J; Erdogdu, Beril; Shinder, Ida; Minkin, Ilia; Chao, Kuan-Hao; Park, Sukhwan; Heinz, Jakob; Pockrandt, Christopher; Shumate, Alaina; Rincon, Natalia; Puiu, Daniela; Steinegger, Martin; Salzberg, Steven L; Pertea, Mihaela.

Genome Biol ; 24(1): 249, 2023 10 30.

Article in English | MEDLINE | ID: mdl-37904256

ABSTRACT

CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess .

Subject(s)

Genome, Human , Proteins , Humans , Phylogeny , Proteins/genetics , Algorithms , Software , Molecular Sequence Annotation

4.

Splam: a deep-learning-based splice site predictor that improves spliced alignments.

Chao, Kuan-Hao; Mao, Alan; Salzberg, Steven L; Pertea, Mihaela.

bioRxiv ; 2023 Jul 29.

Article in English | MEDLINE | ID: mdl-37546880

ABSTRACT

The process of splicing messenger RNA to remove introns plays a central role in creating genes and gene variants. Here we describe Splam, a novel method for predicting splice junctions in DNA based on deep residual convolutional neural networks. Unlike some previous models, Splam looks at a relatively limited window of 400 base pairs flanking each splice site, motivated by the observation that the biological process of splicing relies primarily on signals within this window. Additionally, Splam introduces the idea of training the network on donor and acceptor pairs together, based on the principle that the splicing machinery recognizes both ends of each intron at once. We compare Splam's accuracy to recent state-of-the-art splice site prediction methods, particularly SpliceAI, another method that uses deep neural networks. Our results show that Splam is consistently more accurate than SpliceAI, with an overall accuracy of 96% at predicting human splice junctions. Splam generalizes even to non-human species, including distant ones like the flowering plant Arabidopsis thaliana. Finally, we demonstrate the use of Splam on a novel application: processing the spliced alignments of RNA-seq data to identify and eliminate errors. We show that when used in this manner, Splam yields substantial improvements in the accuracy of downstream transcriptome analysis of both poly(A) and ribo-depleted RNA-seq libraries. Overall, Splam offers a faster and more accurate approach to detecting splice junctions, while also providing a reliable and efficient solution for cleaning up erroneous spliced alignments.
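As a rough illustration of the paired-window input described in the Splam abstract, the sketch below cuts a fixed window around a donor site and an acceptor site and concatenates them. The coordinate convention, the even split of the 400 bp window into 200 bases on each side, the `N` padding, and the function names are assumptions, not details confirmed by the abstract.

```python
def splice_site_window(seq, site, flank=200):
    """Return the 2*flank bases centred on `site`, padded with 'N' at sequence ends."""
    start, end = site - flank, site + flank
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad


def junction_input(seq, donor, acceptor, flank=200):
    """Concatenate the donor and acceptor windows so both intron ends are seen together."""
    return splice_site_window(seq, donor, flank) + splice_site_window(seq, acceptor, flank)
```

Here `donor` is taken to be the 0-based index of the first intron base and `acceptor` the index of the first base after the intron; a model would then encode the returned string (for example, one-hot) before scoring the junction.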

5.

WGT: Tools and algorithms for recognizing, visualizing, and generating Wheeler graphs.

Chao, Kuan-Hao; Chen, Pei-Wei; Seshia, Sanjit A; Langmead, Ben.

iScience ; 26(8): 107402, 2023 Aug 18.

Article in English | MEDLINE | ID: mdl-37575187

ABSTRACT

A Wheeler graph represents a collection of strings in a way that is particularly easy to index and query. Such a graph is a practical choice for representing a graph-shaped pangenome, and it is the foundation for current graph-based pangenome indexes. However, there are no practical tools to visualize or to check graphs that may have the Wheeler properties. Here, we present Wheelie, an algorithm that combines a renaming heuristic with a permutation solver (Wheelie-PR) or a Satisfiability Modulo Theories (SMT) solver (Wheelie-SMT) to check whether a given graph has the Wheeler properties, a problem that is NP-complete in general. Wheelie can check a variety of random and real-world graphs in far less time than any algorithm proposed to date. It can check a graph with thousands of nodes in seconds. We implement these algorithms together with complementary visualization tools in the WGT toolkit, available as open source software at https://github.com/Kuanhao-Chao/Wheeler_Graph_Toolkit.
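To make the Wheeler properties concrete, the following sketch checks them by brute force over all node orderings. It is a naive illustration of the standard definition, not the Wheelie algorithm: it has no renaming heuristic or SMT encoding and is only practical for very small graphs. The function names and the edge format `(source, target, label)` are assumptions.

```python
from itertools import permutations


def is_wheeler_order(order, edges):
    """Check whether `order` satisfies the Wheeler properties for labelled `edges`."""
    pos = {node: i for i, node in enumerate(order)}
    has_incoming = {v for _, v, _ in edges}

    # Property 1: nodes without incoming edges precede all nodes with incoming edges.
    seen_incoming = False
    for node in order:
        if node in has_incoming:
            seen_incoming = True
        elif seen_incoming:
            return False

    for u1, v1, a1 in edges:
        for u2, v2, a2 in edges:
            # Property 2: a smaller label implies a strictly earlier target.
            if a1 < a2 and not pos[v1] < pos[v2]:
                return False
            # Property 3: same label and earlier source implies a target that is not later.
            if a1 == a2 and pos[u1] < pos[u2] and not pos[v1] <= pos[v2]:
                return False
    return True


def find_wheeler_order(nodes, edges):
    """Return the first valid Wheeler ordering found, or None (exponential time)."""
    for order in permutations(nodes):
        if is_wheeler_order(order, edges):
            return list(order)
    return None
```

For example, `find_wheeler_order([0, 1, 2], [(0, 1, "a"), (0, 2, "b"), (1, 2, "b")])` finds a valid ordering, while a node that receives edges with two different labels, as in `[(0, 1, "a"), (0, 1, "b")]`, can never satisfy the second property.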

6.

A feature extraction free approach for protein interactome inference from co-elution data.

Chen, Yu-Hsin; Chao, Kuan-Hao; Wong, Jin Yung; Liu, Chien-Fu; Leu, Jun-Yi; Tsai, Huai-Kuang.

Brief Bioinform ; 24(4), 2023 07 20.

Article in English | MEDLINE | ID: mdl-37328692

ABSTRACT

Protein complexes are key functional units in cellular processes. High-throughput techniques, such as co-fractionation coupled with mass spectrometry (CF-MS), have advanced protein complex studies by enabling global interactome inference. However, dealing with complex fractionation characteristics to define true interactions is not a simple task, since CF-MS is prone to false positives due to the co-elution of non-interacting proteins by chance. Several computational methods have been designed to analyze CF-MS data and construct probabilistic protein-protein interaction (PPI) networks. Current methods usually first infer PPIs based on handcrafted CF-MS features, and then use clustering algorithms to form potential protein complexes. While powerful, these methods may be biased by handcrafted features, which are based on domain knowledge, and tend to overfit because PPI data are severely imbalanced. To address these issues, we present a balanced end-to-end learning architecture, Software for Prediction of Interactome with Feature-extraction Free Elution Data (SPIFFED), to integrate feature representation from raw CF-MS data and interactome prediction by convolutional neural network. SPIFFED outperforms the state-of-the-art methods in predicting PPIs under the conventional imbalanced training. When trained with balanced data, SPIFFED had greatly improved sensitivity for true PPIs. Moreover, the ensemble SPIFFED model provides different voting schemes to integrate predicted PPIs from multiple CF-MS data. Using the clustering software ClusterONE, SPIFFED allows users to infer high-confidence protein complexes depending on the CF-MS experimental designs. The source code of SPIFFED is freely available at: https://github.com/bio-it-station/SPIFFED.

Subject(s)

Protein Interaction Mapping , Proteins , Protein Interaction Mapping/methods , Proteins/chemistry , Algorithms , Protein Interaction Maps , Software

7.

The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual.

Chao, Kuan-Hao; Zimin, Aleksey V; Pertea, Mihaela; Salzberg, Steven L.

G3 (Bethesda) ; 13(3), 2023 03 09.

Article in English | MEDLINE | ID: mdl-36630290

ABSTRACT

We used long-read DNA sequencing to assemble the genome of a Southern Han Chinese male. We organized the sequence into chromosomes and filled in gaps using the recently completed T2T-CHM13 genome as a guide, yielding a gap-free genome, Han1, containing 3,099,707,698 bases. Using the T2T-CHM13 annotation as a reference, we mapped all genes onto the Han1 genome and identified additional gene copies, generating a total of 60,708 putative genes, of which 20,003 are protein-coding. A comprehensive comparison between the genes revealed that 235 protein-coding genes were substantially different between the individuals, with frameshifts or truncations affecting the protein-coding sequence. Most of these were heterozygous variants in which one gene copy was unaffected. This represents the first gene-level comparison between two finished, annotated individual human genomes.

Subject(s)

East Asian People , Genome, Human , Humans , Male , East Asian People/genetics , Molecular Sequence Annotation , Sequence Analysis, DNA

8.

sangeranalyseR: Simple and Interactive Processing of Sanger Sequencing Data in R.

Chao, Kuan-Hao; Barton, Kirston; Palmer, Sarah; Lanfear, Robert.

Genome Biol Evol ; 13(3), 2021 03 01.

Article in English | MEDLINE | ID: mdl-33591316

ABSTRACT

sangeranalyseR is a feature-rich, free, and open-source R package for processing Sanger sequencing data. It allows users to go from loading reads to saving aligned contigs in a few lines of R code by using sensible defaults for most actions. It also provides complete flexibility for determining how individual reads and contigs are processed, both at the command line in R and via interactive Shiny applications. sangeranalyseR provides a wide range of options for all steps in Sanger processing pipelines including trimming reads, detecting secondary peaks, viewing chromatograms, detecting indels and stop codons, aligning contigs, estimating phylogenetic trees, and more. Input data can be in either ABIF or FASTA format. sangeranalyseR comes with extensive online documentation and outputs aligned and unaligned reads and contigs in FASTA format, along with detailed interactive HTML reports. sangeranalyseR supports the use of colorblind-friendly palettes for viewing alignments and chromatograms. It is released under an MIT licence and available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR, last accessed February 22, 2021) and on GitHub (https://github.com/roblanf/sangeranalyseR, last accessed February 22, 2021).

Subject(s)

Computational Biology/methods , Sequence Analysis, DNA/methods , Software , DNA , Phylogeny , Sequence Alignment , User-Computer Interface , Web Browser

9.

RNASeqR: An R Package for Automated Two-Group RNA-Seq Analysis Workflow.

Chao, Kuan-Hao; Hsiao, Yi-Wen; Lee, Yi-Fang; Lee, Chien-Yueh; Lai, Liang-Chuan; Tsai, Mong-Hsun; Lu, Tzu-Pin; Chuang, Eric Y.

IEEE/ACM Trans Comput Biol Bioinform ; 18(5): 2023-2031, 2021.

Article in English | MEDLINE | ID: mdl-31796413

ABSTRACT

RNA-Seq analysis has revolutionized researchers' understanding of the transcriptome in biological research. Assessing the differences in transcriptomic profiles between tissue samples or patient groups enables researchers to explore the underlying biological impact of transcription. RNA-Seq analysis requires multiple processing steps and huge computational capabilities. There are many well-developed R packages for individual steps; however, there are few R/Bioconductor packages that integrate existing software tools into a comprehensive RNA-Seq analysis and provide fundamental end-to-end results in a pure R environment so that researchers can quickly and easily get fundamental information from big sequencing data. To address this need, we have developed the open source R/Bioconductor package, RNASeqR. It allows users to run an automated RNA-Seq analysis with only six steps, producing essential tabular and graphical results for further biological interpretation. The features of RNASeqR include: six-step analysis, comprehensive visualization, background execution version, and the integration of both R and command-line software. RNASeqR provides a fast, lightweight, and easy-to-run RNA-Seq analysis pipeline in a pure R environment. It allows users to efficiently utilize popular software tools, including both R/Bioconductor and command-line tools, without predefining the resources or environments. RNASeqR is freely available for Linux and macOS operating systems from Bioconductor (https://bioconductor.org/packages/release/bioc/html/RNASeqR.html).

Subject(s)

Computational Biology/methods , RNA-Seq/methods , Data Visualization , Humans , Software
