Search | VHL Regional Portal

1.

Systems biology dissection of PTSD and MDD across brain regions, cell types, and blood.

Daskalakis, Nikolaos P; Iatrou, Artemis; Chatzinakos, Chris; Jajoo, Aarti; Snijders, Clara; Wylie, Dennis; DiPietro, Christopher P; Tsatsani, Ioulia; Chen, Chia-Yen; Pernia, Cameron D; Soliva-Estruch, Marina; Arasappan, Dhivya; Bharadwaj, Rahul A; Collado-Torres, Leonardo; Wuchty, Stefan; Alvarez, Victor E; Dammer, Eric B; Deep-Soboslay, Amy; Duong, Duc M; Eagles, Nick; Huber, Bertrand R; Huuki, Louise; Holstein, Vincent L; Logue, Mark W; Lugenbühl, Justina F; Maihofer, Adam X; Miller, Mark W; Nievergelt, Caroline M; Pertea, Geo; Ross, Deanna; Sendi, Mohammad S E; Sun, Benjamin B; Tao, Ran; Tooke, James; Wolf, Erika J; Zeier, Zane; Berretta, Sabina; Champagne, Frances A; Hyde, Thomas; Seyfried, Nicholas T; Shin, Joo Heon; Weinberger, Daniel R; Nemeroff, Charles B; Kleinman, Joel E; Ressler, Kerry J.

Science ; 384(6698): eadh3707, 2024 May 24.

Article in English | MEDLINE | ID: mdl-38781393

ABSTRACT

The molecular pathology of stress-related disorders remains elusive. Our brain multiregion, multiomic study of posttraumatic stress disorder (PTSD) and major depressive disorder (MDD) included the central nucleus of the amygdala, hippocampal dentate gyrus, and medial prefrontal cortex (mPFC). Genes and exons within the mPFC carried most disease signals replicated across two independent cohorts. Pathways pointed to immune function, neuronal and synaptic regulation, and stress hormones. Multiomic factor and gene network analyses provided the underlying genomic structure. Single nucleus RNA sequencing in dorsolateral PFC revealed dysregulated (stress-related) signals in neuronal and non-neuronal cell types. Analyses of brain-blood intersections in >50,000 UK Biobank participants were conducted along with fine-mapping of the results of PTSD and MDD genome-wide association studies to distinguish risk from disease processes. Our data suggest shared and distinct molecular pathology in both disorders and propose potential therapeutic targets and biomarkers.

Subject(s)

Brain , Depressive Disorder, Major , Genetic Loci , Stress Disorders, Post-Traumatic , Female , Humans , Male , Amygdala/metabolism , Biomarkers/metabolism , Brain/metabolism , Depressive Disorder, Major/genetics , Gene Regulatory Networks , Genome-Wide Association Study , Neurons/metabolism , Prefrontal Cortex/metabolism , Stress Disorders, Post-Traumatic/genetics , Systems Biology , Single-Cell Gene Expression Analysis , Chromosome Mapping

2.

Analysis of gene expression in the postmortem brain of neurotypical Black Americans reveals contributions of genetic ancestry.

Benjamin, Kynon J M; Chen, Qiang; Eagles, Nicholas J; Huuki-Myers, Louise A; Collado-Torres, Leonardo; Stolz, Joshua M; Pertea, Geo; Shin, Joo Heon; Paquola, Apuã C M; Hyde, Thomas M; Kleinman, Joel E; Jaffe, Andrew E; Han, Shizhong; Weinberger, Daniel R.

Nat Neurosci ; 27(6): 1064-1074, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38769152

ABSTRACT

Ancestral differences in genomic variation affect the regulation of gene expression; however, most gene expression studies have been limited to European ancestry samples or adjusted to identify ancestry-independent associations. Here, we instead examined the impact of genetic ancestry on gene expression and DNA methylation in the postmortem brain tissue of admixed Black American neurotypical individuals to identify ancestry-dependent and ancestry-independent contributions. Ancestry-associated differentially expressed genes (DEGs), transcripts and gene networks, while notably not implicating neurons, are enriched for genes related to the immune response and vascular tissue and explain up to 26% of heritability for ischemic stroke, 27% of heritability for Parkinson disease and 30% of heritability for Alzheimer's disease. Ancestry-associated DEGs also show general enrichment for the heritability of diverse immune-related traits but depletion for psychiatric-related traits. We also compared Black and non-Hispanic white Americans, confirming most ancestry-associated DEGs. Our results delineate the extent to which genetic ancestry affects differences in gene expression in the human brain and the implications for brain illness risk.

Subject(s)

Black or African American , Brain , DNA Methylation , Humans , Black or African American/genetics , Brain/metabolism , Female , Male , White People/genetics , Autopsy , Gene Expression/genetics , Alzheimer Disease/genetics , Alzheimer Disease/metabolism , Alzheimer Disease/ethnology , Aged , Middle Aged

3.

Sex affects transcriptional associations with schizophrenia across the dorsolateral prefrontal cortex, hippocampus, and caudate nucleus.

Benjamin, Kynon J M; Arora, Ria; Feltrin, Arthur S; Pertea, Geo; Giles, Hunter H; Stolz, Joshua M; D'Ignazio, Laura; Collado-Torres, Leonardo; Shin, Joo Heon; Ulrich, William S; Hyde, Thomas M; Kleinman, Joel E; Weinberger, Daniel R; Paquola, Apuã C M; Erwin, Jennifer A.

Nat Commun ; 15(1): 3980, 2024 May 10.

Article in English | MEDLINE | ID: mdl-38730231

ABSTRACT

Schizophrenia is a complex neuropsychiatric disorder with sexually dimorphic features, including differential symptomatology, drug responsiveness, and male incidence rate. Prior large-scale transcriptome analyses for sex differences in schizophrenia have focused on the prefrontal cortex. Analyzing BrainSeq Consortium data (caudate nucleus: n = 399, dorsolateral prefrontal cortex: n = 377, and hippocampus: n = 394), we identified 831 unique genes that exhibit sex differences across brain regions, enriched for immune-related pathways. We observed X-chromosome dosage reduction in the hippocampus of male individuals with schizophrenia. Our sex interaction model revealed 148 junctions dysregulated in a sex-specific manner in schizophrenia. Sex-specific schizophrenia analysis identified dozens of differentially expressed genes, notably enriched in immune-related pathways. Finally, our sex-interacting expression quantitative trait loci analysis revealed 704 unique genes, nine associated with schizophrenia risk. These findings emphasize the importance of sex-informed analysis of sexually dimorphic traits, inform personalized therapeutic strategies in schizophrenia, and highlight the need for increased female samples for schizophrenia analyses.

Subject(s)

Caudate Nucleus , Dorsolateral Prefrontal Cortex , Hippocampus , Quantitative Trait Loci , Schizophrenia , Sex Characteristics , Humans , Schizophrenia/genetics , Schizophrenia/metabolism , Female , Male , Hippocampus/metabolism , Caudate Nucleus/metabolism , Dorsolateral Prefrontal Cortex/metabolism , Adult , Transcriptome , Gene Expression Profiling , Sex Factors , Chromosomes, Human, X/genetics , Prefrontal Cortex/metabolism

4.

Genetic and environmental contributions to ancestry differences in gene expression in the human brain.

Benjamin, Kynon J M; Chen, Qiang; Eagles, Nicholas J; Huuki-Myers, Louise A; Collado-Torres, Leonardo; Stolz, Joshua M; Pertea, Geo; Shin, Joo Heon; Paquola, Apuã C M; Hyde, Thomas M; Kleinman, Joel E; Jaffe, Andrew E; Han, Shizhong; Weinberger, Daniel R.

bioRxiv ; 2023 Oct 05.

Article in English | MEDLINE | ID: mdl-37034760

ABSTRACT

Ancestral differences in genomic variation are determining factors in gene regulation; however, most gene expression studies have been limited to European ancestry samples or adjusted for ancestry to identify ancestry-independent associations. We instead examined the impact of genetic ancestry on gene expression and DNA methylation (DNAm) in admixed African/Black American neurotypical individuals to untangle effects of genetic and environmental factors. Ancestry-associated differentially expressed genes (DEGs), transcripts, and gene networks, while notably not implicating neurons, are enriched for genes related to immune response and vascular tissue and explain up to 26% of heritability for ischemic stroke, 27% of heritability for Parkinson's disease, and 30% of heritability for Alzheimer's disease. Ancestry-associated DEGs also show general enrichment for heritability of diverse immune-related traits but depletion for psychiatric-related traits. The cell-type enrichments and direction of effects vary by brain region. These DEGs are less evolutionarily constrained and are largely explained by genetic variations; roughly 15% are predicted by DNAm variation implicating environmental exposures. We also compared Black and White Americans, confirming most of these ancestry-associated DEGs. Our results highlight how environment and genetic background affect genetic ancestry differences in gene expression in the human brain and affect risk for brain illness.

5.

Improved transcriptome assembly using a hybrid of long and short reads with StringTie.

Shumate, Alaina; Wong, Brandon; Pertea, Geo; Pertea, Mihaela.

PLoS Comput Biol ; 18(6): e1009730, 2022 06.

Article in English | MEDLINE | ID: mdl-35648784

ABSTRACT

Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.

Subject(s)

High-Throughput Nucleotide Sequencing , Transcriptome , Algorithms , Animals , Exons , Humans , Mice , Sequence Analysis, DNA , Sequence Analysis, RNA , Software , Transcriptome/genetics

6.

TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets.

Varabyou, Ales; Pertea, Geo; Pockrandt, Christopher; Pertea, Mihaela.

Bioinformatics ; 37(20): 3650-3651, 2021 Oct 25.

Article in English | MEDLINE | ID: mdl-33964128

ABSTRACT

SUMMARY: Although the ability to programmatically summarize and visually inspect sequencing data is an integral part of genome analysis, currently available methods are not capable of handling large numbers of samples. In particular, making a visual comparison of transcriptional landscapes between two sets of thousands of RNA-seq samples is limited by available computational resources, which can be overwhelmed due to the sheer size of the data. In this work, we present TieBrush, a software package designed to process very large sequencing datasets (RNA, whole-genome, exome, etc.) into a form that enables quick visual and computational inspection. TieBrush can also be used as a method for aggregating data for downstream computational analysis, and is compatible with most software tools that take aligned reads as input. AVAILABILITY AND IMPLEMENTATION: TieBrush is provided as a C++ package under the MIT License. Precompiled binaries, source code and example data are available on GitHub (https://github.com/alevar/tiebrush). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

7.

GFF Utilities: GffRead and GffCompare.

Pertea, Geo; Pertea, Mihaela.

F1000Res ; 9, 2020.

Article in English | MEDLINE | ID: mdl-32489650

ABSTRACT

GTF (Gene Transfer Format) and GFF (General Feature Format) are popular file formats used by bioinformatics programs to represent and exchange information about various genomic features, such as gene and transcript locations and structure. GffRead and GffCompare are open source programs that provide extensive and efficient solutions to manipulate files in a GTF or GFF format. While GffRead can convert, sort, filter, transform, or cluster genomic features, GffCompare can be used to compare and merge different gene annotations. Availability and implementation: GFF utilities are implemented in C++ for Linux and OS X and released as open source under an MIT license (https://github.com/gpertea/gffread, https://github.com/gpertea/gffcompare).

Subject(s)

Computational Biology , Genomics , Software , Genome , Molecular Sequence Annotation
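
Note on record 7: GTF and GFF share a tab-separated, nine-column layout (sequence ID, source, feature type, start, end, score, strand, phase, attributes), and differ mainly in how the attribute column is written. The Python sketch below illustrates that layout only; it is not GffRead or GffCompare code. The names Feature, parse_attributes and parse_line, and the two sample lines, are hypothetical and chosen for illustration, and the attribute parsing is deliberately simplified (it assumes no '=' or ';' inside quoted values).

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """One record from a GTF/GFF file (hypothetical container for this sketch)."""
    seqid: str
    source: str
    feature_type: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_attributes(field: str) -> dict:
    """Parse column 9 written either GFF3-style (key=value) or GTF-style (key "value")."""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after a trailing ';'
        if "=" in part:
            key, value = part.split("=", 1)       # GFF3 style: ID=tx1
        else:
            key, _, value = part.partition(" ")   # GTF style: gene_id "gene1"
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def parse_line(line: str) -> Feature:
    """Split one non-comment line into its nine tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    return Feature(seqid, source, ftype, int(start), int(end),
                   score, strand, phase, parse_attributes(attrs))


# Made-up sample lines describing the same transcript in both dialects.
gff3_line = "chr1\texample\ttranscript\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1"
gtf_line = 'chr1\texample\ttranscript\t100\t900\t.\t+\t.\tgene_id "gene1"; transcript_id "tx1";'

gff3_feature = parse_line(gff3_line)
gtf_feature = parse_line(gtf_line)
```

Both sample lines yield the same coordinates and strand; only the attribute keys differ (ID/Parent versus gene_id/transcript_id), which is the kind of difference a format converter has to reconcile.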

8.

Genome assembly and characterization of a complex zfBED-NLR gene-containing disease resistance locus in Carolina Gold Select rice with Nanopore sequencing.

Read, Andrew C; Moscou, Matthew J; Zimin, Aleksey V; Pertea, Geo; Meyer, Rachel S; Purugganan, Michael D; Leach, Jan E; Triplett, Lindsay R; Salzberg, Steven L; Bogdanove, Adam J.

PLoS Genet ; 16(1): e1008571, 2020 01.

Article in English | MEDLINE | ID: mdl-31986137

ABSTRACT

Long-read sequencing facilitates assembly of complex genomic regions. In plants, loci containing nucleotide-binding, leucine-rich repeat (NLR) disease resistance genes are an important example of such regions. NLR genes constitute one of the largest gene families in plants and are often clustered, evolving via duplication, contraction, and transposition. We recently mapped the Xo1 locus for resistance to bacterial blight and bacterial leaf streak, found in the American heirloom rice variety Carolina Gold Select, to a region that in the Nipponbare reference genome is NLR gene-rich. Here, toward identification of the Xo1 gene, we combined Nanopore and Illumina reads and generated a high-quality Carolina Gold Select genome assembly. We identified 529 complete or partial NLR genes and discovered, relative to Nipponbare, an expansion of NLR genes at the Xo1 locus. One of these has high sequence similarity to the cloned, functionally similar Xa1 gene. Both harbor an integrated zfBED domain, and the repeats within each protein are nearly perfect. Across diverse Oryzeae, we identified two sub-clades of NLR genes with these features, varying in the presence of the zfBED domain and the number of repeats. The Carolina Gold Select genome assembly also uncovered at the Xo1 locus a rice blast resistance gene and a gene encoding a polyphenol oxidase (PPO). PPO activity has been used as a marker for blast resistance at the locus in some varieties; however, the Carolina Gold Select sequence revealed a loss-of-function mutation in the PPO gene that breaks this association. Our results demonstrate that whole genome sequencing combining Nanopore and Illumina reads effectively resolves NLR gene loci. Our identification of an Xo1 candidate is an important step toward mechanistic characterization, including the role(s) of the zfBED domain. Finally, the Carolina Gold Select genome assembly will facilitate identification of other useful traits in this historically important variety.

Subject(s)

Disease Resistance , NLR Proteins/genetics , Oryza/genetics , Plant Proteins/genetics , Molecular Sequence Annotation , NLR Proteins/chemistry , NLR Proteins/metabolism , Nanopore Sequencing/methods , Oryza/immunology , Plant Proteins/chemistry , Plant Proteins/metabolism , Whole Genome Sequencing/methods , Zinc Fingers

9.

Transcriptome assembly from long-read RNA-seq alignments with StringTie2.

Kovaka, Sam; Zimin, Aleksey V; Pertea, Geo M; Razaghi, Roham; Salzberg, Steven L; Pertea, Mihaela.

Genome Biol ; 20(1): 278, 2019 12 16.

Article in English | MEDLINE | ID: mdl-31842956

ABSTRACT

RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.

Subject(s)

Genetic Techniques , Genomics/methods , Transcriptome , Animals , Arabidopsis , Humans , Sequence Analysis, RNA , Software , Zea mays

10.

Applying Rapid Whole-Genome Sequencing To Predict Phenotypic Antimicrobial Susceptibility Testing Results among Carbapenem-Resistant Klebsiella pneumoniae Clinical Isolates.

Tamma, Pranita D; Fan, Yunfan; Bergman, Yehudit; Pertea, Geo; Kazmi, Abida Q; Lewis, Shawna; Carroll, Karen C; Schatz, Michael C; Timp, Winston; Simner, Patricia J.

Antimicrob Agents Chemother ; 63(1), 2019 01.

Article in English | MEDLINE | ID: mdl-30373801

ABSTRACT

Standard antimicrobial susceptibility testing (AST) approaches lead to delays in the selection of optimal antimicrobial therapy. Here, we sought to determine the accuracy of antimicrobial resistance (AMR) determinants identified by Nanopore whole-genome sequencing in predicting AST results. Using a cohort of 40 clinical isolates (21 carbapenemase-producing carbapenem-resistant Klebsiella pneumoniae, 10 non-carbapenemase-producing carbapenem-resistant K. pneumoniae, and 9 carbapenem-susceptible K. pneumoniae isolates), three separate sequencing and analysis pipelines were performed, as follows: (i) a real-time Nanopore analysis approach identifying acquired AMR genes, (ii) an assembly-based Nanopore approach identifying acquired AMR genes and chromosomal mutations, and (iii) an approach using short-read correction of Nanopore assemblies. The short-read correction of Nanopore assemblies served as the reference standard to determine the accuracy of Nanopore sequencing results. With the real-time analysis approach, full annotation of acquired AMR genes occurred within 8 h from subcultured isolates. Assemblies sufficient for full resistance gene and single-nucleotide polymorphism annotation were available within 14 h from subcultured isolates. The overall agreement of genotypic results and anticipated AST results for the 40 K. pneumoniae isolates was 77% (range, 30% to 100%) and 92% (range, 80% to 100%) for the real-time approach and the assembly approach, respectively. Evaluating the patients contributing the 40 isolates, the real-time approach and assembly approach could shorten the median time to effective antibiotic therapy by 20 h and 26 h, respectively, compared to standard AST. Nanopore sequencing offers a rapid approach to both accurately identify resistance mechanisms and to predict AST results for K. pneumoniae isolates. Bioinformatics improvements enabling real-time alignment, coupled with rapid extraction and library preparation, will further enhance the accuracy and workflow of the Nanopore real-time approach.

Subject(s)

Bacterial Proteins/genetics , Drug Resistance, Multiple, Bacterial/genetics , Genome, Bacterial , Klebsiella pneumoniae/genetics , Phenotype , Whole Genome Sequencing/methods , beta-Lactamases/genetics , Anti-Bacterial Agents/metabolism , Anti-Bacterial Agents/pharmacology , Bacterial Proteins/metabolism , Carbapenems/metabolism , Carbapenems/pharmacology , Cohort Studies , Computational Biology/methods , Gene Expression , Gene Library , Humans , Klebsiella Infections/drug therapy , Klebsiella Infections/microbiology , Klebsiella pneumoniae/drug effects , Klebsiella pneumoniae/enzymology , Klebsiella pneumoniae/isolation & purification , Microbial Sensitivity Tests , Polymorphism, Single Nucleotide , Whole Genome Sequencing/instrumentation , beta-Lactamases/metabolism

11.

CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise.

Pertea, Mihaela; Shumate, Alaina; Pertea, Geo; Varabyou, Ales; Breitwieser, Florian P; Chang, Yu-Chi; Madugundu, Anil K; Pandey, Akhilesh; Salzberg, Steven L.

Genome Biol ; 19(1): 208, 2018 11 28.

Article in English | MEDLINE | ID: mdl-30486838

ABSTRACT

We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess.

Subject(s)

Databases, Genetic , Sequence Analysis, RNA , Transcription, Genetic , Amino Acid Sequence , Animals , Female , Humans , Introns , Male

12.

Genome sequence of the progenitor of the wheat D genome Aegilops tauschii.

Luo, Ming-Cheng; Gu, Yong Q; Puiu, Daniela; Wang, Hao; Twardziok, Sven O; Deal, Karin R; Huo, Naxin; Zhu, Tingting; Wang, Le; Wang, Yi; McGuire, Patrick E; Liu, Shuyang; Long, Hai; Ramasamy, Ramesh K; Rodriguez, Juan C; Van, Sonny L; Yuan, Luxia; Wang, Zhenzhong; Xia, Zhiqiang; Xiao, Lichan; Anderson, Olin D; Ouyang, Shuhong; Liang, Yong; Zimin, Aleksey V; Pertea, Geo; Qi, Peng; Bennetzen, Jeffrey L; Dai, Xiongtao; Dawson, Matthew W; Müller, Hans-Georg; Kugler, Karl; Rivarola-Duarte, Lorena; Spannagl, Manuel; Mayer, Klaus F X; Lu, Fu-Hao; Bevan, Michael W; Leroy, Philippe; Li, Pingchuan; You, Frank M; Sun, Qixin; Liu, Zhiyong; Lyons, Eric; Wicker, Thomas; Salzberg, Steven L; Devos, Katrien M; Dvorák, Jan.

Nature ; 551(7681): 498-502, 2017 11 23.

Article in English | MEDLINE | ID: mdl-29143815

ABSTRACT

Aegilops tauschii is the diploid progenitor of the D genome of hexaploid wheat (Triticum aestivum, genomes AABBDD) and an important genetic resource for wheat. The large size and highly repetitive nature of the Ae. tauschii genome has until now precluded the development of a reference-quality genome sequence. Here we use an array of advanced technologies, including ordered-clone genome sequencing, whole-genome shotgun sequencing, and BioNano optical genome mapping, to generate a reference-quality genome sequence for Ae. tauschii ssp. strangulata accession AL8/78, which is closely related to the wheat D genome. We show that compared to other sequenced plant genomes, including a much larger conifer genome, the Ae. tauschii genome contains unprecedented amounts of very similar repeated sequences. Our genome comparisons reveal that the Ae. tauschii genome has a greater number of dispersed duplicated genes than other sequenced genomes and its chromosomes have been structurally evolving an order of magnitude faster than those of other grass genomes. The decay of colinearity with other grass genomes correlates with recombination rates along chromosomes. We propose that the vast amounts of very similar repeated sequences cause frequent errors in recombination and lead to gene duplications and structural chromosome changes that drive fast genome evolution.

Subject(s)

Genome, Plant , Phylogeny , Poaceae/genetics , Triticum/genetics , Chromosome Mapping , Diploidy , Evolution, Molecular , Gene Duplication , Genes, Plant/genetics , Genomics/standards , Poaceae/classification , Recombination, Genetic/genetics , Sequence Analysis, DNA/standards , Triticum/classification

13.

First Draft Genome Sequence of the Pathogenic Fungus Lomentospora prolificans (Formerly Scedosporium prolificans).

Luo, Ruibang; Zimin, Aleksey; Workman, Rachael; Fan, Yunfan; Pertea, Geo; Grossman, Nina; Wear, Maggie P; Jia, Bei; Miller, Heather; Casadevall, Arturo; Timp, Winston; Zhang, Sean X; Salzberg, Steven L.

G3 (Bethesda) ; 7(11): 3831-3836, 2017 11 06.

Article in English | MEDLINE | ID: mdl-28963165

ABSTRACT

Here we describe the sequencing and assembly of the pathogenic fungus Lomentospora prolificans using a combination of short, highly accurate Illumina reads and additional coverage in very long Oxford Nanopore reads. The resulting assembly is highly contiguous, containing a total of 37,627,092 bp with over 98% of the sequence in just 26 scaffolds. Annotation identified 8896 protein-coding genes. Pulsed-field gel analysis suggests that this organism contains at least 7 and possibly 11 chromosomes, the two longest of which have sizes corresponding closely to the sizes of the longest scaffolds, at 6.6 and 5.7 Mb.

Subject(s)

Genome, Fungal , Molecular Sequence Annotation , Scedosporium/genetics , Fungal Proteins/genetics , Whole Genome Sequencing

14.

The Douglas-Fir Genome Sequence Reveals Specialization of the Photosynthetic Apparatus in Pinaceae.

Neale, David B; McGuire, Patrick E; Wheeler, Nicholas C; Stevens, Kristian A; Crepeau, Marc W; Cardeno, Charis; Zimin, Aleksey V; Puiu, Daniela; Pertea, Geo M; Sezen, U Uzay; Casola, Claudio; Koralewski, Tomasz E; Paul, Robin; Gonzalez-Ibeas, Daniel; Zaman, Sumaira; Cronn, Richard; Yandell, Mark; Holt, Carson; Langley, Charles H; Yorke, James A; Salzberg, Steven L; Wegrzyn, Jill L.

G3 (Bethesda) ; 7(9): 3157-3167, 2017 09 07.

Article in English | MEDLINE | ID: mdl-28751502

ABSTRACT

A reference genome sequence for Pseudotsuga menziesii var. menziesii (Mirb.) Franco (Coastal Douglas-fir) is reported, thus providing a reference sequence for a third genus of the family Pinaceae. The contiguity and quality of the genome assembly far exceeds that of other conifer reference genome sequences (contig N50 = 44,136 bp and scaffold N50 = 340,704 bp). Incremental improvements in sequencing and assembly technologies are in part responsible for the higher quality reference genome, but it may also be due to a slightly lower exact repeat content in Douglas-fir vs. pine and spruce. Comparative genome annotation with angiosperm species reveals gene-family expansion and contraction in Douglas-fir and other conifers which may account for some of the major morphological and physiological differences between the two major plant groups. Notable differences in the size of the NDH-complex gene family and genes underlying the functional basis of shade tolerance/intolerance were observed. This reference genome sequence not only provides an important resource for Douglas-fir breeders and geneticists but also sheds additional light on the evolutionary processes that have led to the divergence of modern angiosperms from the more ancient gymnosperms.

Subject(s)

Genome, Plant , Photosynthesis/genetics , Pinaceae/genetics , Pinaceae/metabolism , Pseudotsuga/genetics , Pseudotsuga/metabolism , Whole Genome Sequencing , Adaptation, Biological/genetics , Computational Biology , Evolution, Molecular , Gene Duplication , Gene Regulatory Networks , Genomics , Molecular Sequence Annotation , Multigene Family , Phylogeny , Pinaceae/classification , Proteomics/methods , Pseudotsuga/classification , Repetitive Sequences, Nucleic Acid
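
Note on record 14: the abstract reports assembly contiguity as contig N50 and scaffold N50. N50 is conventionally the length L such that sequences of length L or longer together account for at least half of the total assembly length. The Python sketch below computes it under that standard definition; the function name n50 and the toy length list are hypothetical and unrelated to the Douglas-fir data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sort lengths from longest to shortest, accumulate them, and return the
    length at which the running sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # integer comparison avoids float rounding
            return length


# Toy example: total length is 54, so half is 27; 10 + 9 + 8 reaches 27.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)   # 8
```

Because N50 depends only on the length distribution, it says nothing about correctness of the assembled sequence, which is why abstracts such as this one discuss contiguity and quality separately.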

15.

Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.

Pertea, Mihaela; Kim, Daehwan; Pertea, Geo M; Leek, Jeffrey T; Salzberg, Steven L.

Nat Protoc ; 11(9): 1650-67, 2016 09.

Article in English | MEDLINE | ID: mdl-27560171

ABSTRACT

High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.

Subject(s)

Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Software , Statistics as Topic/methods , Molecular Sequence Annotation , RNA, Messenger/genetics , RNA, Messenger/metabolism , User-Computer Interface

16.

Ballgown bridges the gap between transcriptome assembly and expression analysis.

Frazee, Alyssa C; Pertea, Geo; Jaffe, Andrew E; Langmead, Ben; Salzberg, Steven L; Leek, Jeffrey T.

Nat Biotechnol ; 33(3): 243-6, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25748911

Subject(s)

Gene Expression Profiling/methods , Gene Expression Regulation , Software , Transcriptome/genetics , Female , Humans , Male , Quantitative Trait Loci/genetics , RNA, Messenger/genetics , RNA, Messenger/metabolism

17.

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.

Pertea, Mihaela; Pertea, Geo M; Antonescu, Corina M; Chang, Tsung-Cheng; Mendell, Joshua T; Salzberg, Steven L.

Nat Biotechnol ; 33(3): 290-5, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25690850

ABSTRACT

Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.

Subject(s)

Sequence Analysis, RNA/methods , Software , Transcriptome/genetics , Algorithms , HEK293 Cells , Humans , RNA, Messenger/genetics , RNA, Messenger/metabolism
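
Note on record 17: the relative improvements quoted in the abstract follow from the transcript counts it gives, using percent increase = 100 x (new - baseline) / baseline. The short Python check below reproduces both figures from the abstract's own numbers; the helper name percent_increase is hypothetical.

```python
def percent_increase(new, baseline):
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Counts taken from the abstract: StringTie vs. Cufflinks.
real_data = percent_increase(10990, 7187)   # human blood, 90 million reads
simulated = percent_increase(7559, 6310)    # simulated data set

print(f"real data: {real_data:.1f}%")       # rounds to the quoted 53%
print(f"simulated: {simulated:.1f}%")       # rounds to the quoted 20%
```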

18.

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.

Kim, Daehwan; Pertea, Geo; Trapnell, Cole; Pimentel, Harold; Kelley, Ryan; Salzberg, Steven L.

Genome Biol ; 14(4): R36, 2013 Apr 25.

Article in English | MEDLINE | ID: mdl-23618408

ABSTRACT

TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.

Subject(s)

Gene Duplication , Gene Fusion , Mutagenesis, Insertional , Sequence Alignment/methods , Software , Humans , Sensitivity and Specificity , Sequence Analysis, RNA/methods , Transcriptome

19.

Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Trapnell, Cole; Roberts, Adam; Goff, Loyal; Pertea, Geo; Kim, Daehwan; Kelley, David R; Pimentel, Harold; Salzberg, Steven L; Rinn, John L; Pachter, Lior.

Nat Protoc ; 7(3): 562-78, 2012 Mar 01.

Article in English | MEDLINE | ID: mdl-22383036

ABSTRACT

Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ~1 h of hands-on time.

Subject(s)

DNA, Complementary/genetics , Gene Expression Profiling/methods , Genetic Association Studies/methods , Genomics/methods , Sequence Analysis, DNA/methods , Software

20.

Detection of lineage-specific evolutionary changes among primate species.

Pertea, Mihaela; Pertea, Geo M; Salzberg, Steven L.

BMC Bioinformatics ; 12: 274, 2011 Jul 04.

Article in English | MEDLINE | ID: mdl-21726447

ABSTRACT

BACKGROUND: Comparison of the human genome with other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection. RESULTS: We have developed a new method, called DivE, specifically designed to find regions that have evolved either more or less rapidly than expected, for any clade within a set of very closely related species. Unlike some previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have potentially undergone positive selection in one or more primate species. Most of these regions occur outside of protein-coding genes, although we identified 20 proteins that have experienced positive selection. CONCLUSIONS: DivE provides an easy-to-use method to predict both positive and negative selection in noncoding DNA, that is particularly well-suited to detecting lineage-specific selection in large genomes.

Subject(s)

Phylogeny , Primates/genetics , Software , Animals , Biological Evolution , Genome , Genome, Human , Humans
