Search | VHL Regional Portal

1.

Quality assessment of splice site annotation based on conservation across multiple species.

Minkin, Ilia; Salzberg, Steven L.

bioRxiv ; 2023 Dec 02.

Article in English | MEDLINE | ID: mdl-38076842

ABSTRACT

Despite many improvements over the years, the annotation of the human genome remains imperfect, and even the best annotations of the human reference genome sometimes contradict one another. Hence, refinement of the human genome annotation is an important challenge. The use of evolutionarily conserved sequences provides a strategy for addressing this problem, and the rapidly growing number of genomes from other species increases the power of an evolution-driven approach. Using the latest large-scale whole genome alignment data, we found that splice sites from protein-coding genes in the high-quality MANE annotation are consistently conserved across more than 400 species. We also studied splice sites from the RefSeq, GENCODE, and CHESS databases that are not present in MANE, from both protein-coding genes and lncRNAs. We trained a logistic regression classifier to distinguish between the conservation patterns exhibited by splice sites from MANE versus sites that were flanked by the standard GT-AG dinucleotides, but that were chosen randomly from a sequence not under selection. We found that up to 70% of splice sites from annotated protein-coding transcripts outside of MANE exhibit conservation patterns closer to random sequence as opposed to highly-conserved splice sites from MANE. Our study highlights potentially erroneous splice sites that might require further scrutiny.
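As a rough illustration of the classification step described in this abstract, the sketch below trains a logistic regression model to separate highly conserved sites from background sites. It is not the authors' implementation: the two features (fraction of aligned species preserving the donor and the acceptor dinucleotide), the synthetic training data, and all function names are assumptions made only for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that a site resembles the conserved (label 1) class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Synthetic, hypothetical data: each row is
# [fraction of species conserving the donor, fraction conserving the acceptor].
rng = random.Random(0)
conserved = [[rng.uniform(0.8, 1.0), rng.uniform(0.8, 1.0)] for _ in range(50)]
background = [[rng.uniform(0.0, 0.4), rng.uniform(0.0, 0.4)] for _ in range(50)]
X = conserved + background
y = [1] * 50 + [0] * 50

w, b = train_logistic(X, y)
print(predict_proba(w, b, [0.95, 0.95]))  # candidate site with high conservation
print(predict_proba(w, b, [0.10, 0.10]))  # candidate site with low conservation
```

A site scored below 0.5 by such a model would have a conservation pattern closer to the background class, which is the kind of signal the abstract uses to flag sites for further scrutiny.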

2.

CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure.

Varabyou, Ales; Sommer, Markus J; Erdogdu, Beril; Shinder, Ida; Minkin, Ilia; Chao, Kuan-Hao; Park, Sukhwan; Heinz, Jakob; Pockrandt, Christopher; Shumate, Alaina; Rincon, Natalia; Puiu, Daniela; Steinegger, Martin; Salzberg, Steven L; Pertea, Mihaela.

Genome Biol ; 24(1): 249, 2023 10 30.

Article in English | MEDLINE | ID: mdl-37904256

ABSTRACT

CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess.

Subject(s)

Genome, Human , Proteins , Humans , Phylogeny , Proteins/genetics , Algorithms , Software , Molecular Sequence Annotation

3.

Structure-guided isoform identification for the human transcriptome.

Sommer, Markus J; Cha, Sooyoung; Varabyou, Ales; Rincon, Natalia; Park, Sukhwan; Minkin, Ilia; Pertea, Mihaela; Steinegger, Martin; Salzberg, Steven L.

Elife ; 11, 2022 12 15.

Article in English | MEDLINE | ID: mdl-36519529

ABSTRACT

Recently developed methods to predict three-dimensional protein structure with high accuracy have opened new avenues for genome and proteome research. We explore a new hypothesis in genome annotation, namely whether computationally predicted structures can help to identify which of multiple possible gene isoforms represents a functional protein product. Guided by protein structure predictions, we evaluated over 230,000 isoforms of human protein-coding genes assembled from over 10,000 RNA sequencing experiments across many human tissues. From this set of assembled transcripts, we identified hundreds of isoforms with more confidently predicted structure and potentially superior function in comparison to canonical isoforms in the latest human gene database. We illustrate our new method with examples where structure provides a guide to function in combination with expression and evolutionary evidence. Additionally, we provide the complete set of structures as a resource to better understand the function of human genes and their isoforms. These results demonstrate the promise of protein structure prediction as a genome annotation tool, allowing us to refine even the most highly curated catalog of human proteins. More generally we demonstrate a practical, structure-guided approach that can be used to enhance the annotation of any genome.

Subject(s)

Genome , Transcriptome , Humans , Molecular Sequence Annotation , Protein Isoforms/genetics , Sequence Analysis, RNA

4.

Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ.

Minkin, Ilia; Medvedev, Paul.

Nat Commun ; 11(1): 6327, 2020 12 10.

Article in English | MEDLINE | ID: mdl-33303762

ABSTRACT

Multiple whole-genome alignment is a challenging problem in bioinformatics. Despite many successes, current methods are not able to keep up with the growing number, length, and complexity of assembled genomes, especially when computational resources are limited. Approaches based on compacted de Bruijn graphs to identify and extend anchors into locally collinear blocks have potential for scalability, but current methods do not scale to mammalian genomes. We present an algorithm, SibeliaZ-LCB, for identifying collinear blocks in closely related genomes based on analysis of the de Bruijn graph. We further incorporate this into a multiple whole-genome alignment pipeline called SibeliaZ. SibeliaZ shows run-time improvements over other methods while maintaining accuracy. On sixteen recently-assembled strains of mice, SibeliaZ runs in under 16 hours on a single machine, while other tools did not run to completion for eight mice within a week. SibeliaZ makes a significant step towards improving scalability of multiple whole-genome alignment and collinear block reconstruction algorithms on a single machine.

Subject(s)

Algorithms , Genome , Animals , Base Sequence , Computer Simulation , Databases, Genetic , Genetic Variation , Mice , Nucleotides/genetics , Time Factors

5.

Scalable Pairwise Whole-Genome Homology Mapping of Long Genomes with BubbZ.

Minkin, Ilia; Medvedev, Paul.

iScience ; 23(6): 101224, 2020 Jun 26.

Article in English | MEDLINE | ID: mdl-32563153

ABSTRACT

Pairwise whole-genome homology mapping is the problem of finding all pairs of homologous intervals between a pair of genomes. As the number of available whole genomes has been rising dramatically in the last few years, there has been a need for more scalable homology mappers. In this paper, we develop an algorithm (BubbZ) for computing whole-genome pairwise homology mappings, especially in the context of all-to-all comparison for multiple genomes. BubbZ is based on an algorithm for computing chains in compacted de Bruijn graphs. We evaluate BubbZ on simulated datasets, a dataset composed of 16 long mouse genomes, and a large dataset of 1,600 Salmonella genomes. We show up to approximately an order of magnitude speed improvement, compared with MashMap2 and Minimap2, while retaining similar accuracy.

6.

A strategy for building and using a human reference pangenome.

Llamas, Bastien; Narzisi, Giuseppe; Schneider, Valerie; Audano, Peter A; Biederstedt, Evan; Blauvelt, Lon; Bradbury, Peter; Chang, Xian; Chin, Chen-Shan; Fungtammasan, Arkarachai; Clarke, Wayne E; Cleary, Alan; Ebler, Jana; Eizenga, Jordan; Sibbesen, Jonas A; Markello, Charles J; Garrison, Erik; Garg, Shilpa; Hickey, Glenn; Lazo, Gerard R; Lin, Michael F; Mahmoud, Medhat; Marschall, Tobias; Minkin, Ilia; Monlong, Jean; Musunuri, Rajeeva L; Sagayaradj, Sagayamary; Novak, Adam M; Rautiainen, Mikko; Regier, Allison; Sedlazeck, Fritz J; Siren, Jouni; Souilmi, Yassine; Wagner, Justin; Wrightsman, Travis; Yokoyama, Toshiyuki T; Zeng, Qiandong; Zook, Justin M; Paten, Benedict; Busby, Ben.

F1000Res ; 8: 1751, 2019.

Article in English | MEDLINE | ID: mdl-34386196

ABSTRACT

In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed is reported in this manuscript.

7.

TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes.

Minkin, Ilia; Pham, Son; Medvedev, Paul.

Bioinformatics ; 33(24): 4024-4032, 2017 Dec 15.

Article in English | MEDLINE | ID: mdl-27659452

ABSTRACT

MOTIVATION: de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). RESULTS: In this article, we present TwoPaCo, a simple and scalable low memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes. AVAILABILITY AND IMPLEMENTATION: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo. CONTACT: ium125@psu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Genomics/methods , Animals , Genome, Human , Humans , Primates/genetics , Software
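For readers unfamiliar with the data structure named in this record, the following is a minimal sketch of a compacted de Bruijn graph built naively in memory. It is not the TwoPaCo algorithm, which is designed for low memory use on large genomes. The function names are hypothetical, and the sketch assumes forward-strand k-mers only (no reverse complements) and does not report cycles that contain no junction.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Collect k-mer vertices and the (k+1)-mer adjacencies between them."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    nodes = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            nodes.add(seq[i:i + k])
        for i in range(len(seq) - k):
            a, b = seq[i:i + k], seq[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)
    return nodes, dict(succ), dict(pred)


def compacted_edges(sequences, k):
    """Return the non-branching paths between junction k-mers as strings.

    A junction is any k-mer whose in-degree or out-degree differs from 1.
    Junctions are the vertices of the compacted graph; each returned string
    spells one of its edges.
    """
    nodes, succ, pred = build_de_bruijn(sequences, k)

    def is_junction(v):
        return len(pred.get(v, ())) != 1 or len(succ.get(v, ())) != 1

    edges = []
    for v in sorted(nodes):
        if not is_junction(v):
            continue
        for w in sorted(succ.get(v, ())):
            path = [v, w]
            # Walk through 1-in-1-out k-mers until the next junction.
            while not is_junction(w):
                w = next(iter(succ[w]))
                path.append(w)
            # Spell the path: first k-mer plus the last base of each later one.
            edges.append(path[0] + "".join(p[-1] for p in path[1:]))
    return edges


if __name__ == "__main__":
    for edge in compacted_edges(["ACGTACGA", "ACGTTCGA"], k=3):
        print(edge)
```

Compaction matters at scale because long stretches shared by many genomes collapse into single edges, so the number of vertices depends on the branching points rather than on the total sequence length.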
