Search | VHL Regional Portal

Chromosome-scale assembly of the wild wheat relative Aegilops umbellulata.

Abrouk, Michael; Wang, Yajun; Cavalet-Giorsa, Emile; Troukhan, Maxim; Kravchuk, Maksym; Krattinger, Simon G.

Sci Data ; 10(1): 739, 2023 10 25.

Article in English | MEDLINE | ID: mdl-37880246

ABSTRACT

Wild wheat relatives have been explored in plant breeding to increase the genetic diversity of bread wheat, one of the most important food crops. Aegilops umbellulata is a diploid U genome-containing grass species that serves as a genetic reservoir for wheat improvement. In this study, we report the construction of a chromosome-scale reference assembly of Ae. umbellulata accession TA1851 based on corrected PacBio HiFi reads and chromosome conformation capture. The total assembly size was 4.25 Gb with a contig N50 of 17.7 Mb. In total, 36,268 gene models were predicted. We benchmarked the performance of hifiasm and LJA, two of the most widely used assemblers using standard and corrected HiFi reads, revealing a positive effect of corrected input reads. Comparative genome analysis confirmed substantial chromosome rearrangements in Ae. umbellulata compared to bread wheat. In summary, the Ae. umbellulata assembly provides a resource for comparative genomics in Triticeae and for the discovery of agriculturally important genes.
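Note on the contig N50 quoted above: N50 is the contig length at which contigs of that length or longer together cover at least half of the assembly. The sketch below illustrates the statistic on made-up lengths. The function name `n50` and the toy values are assumptions for illustration only and are not taken from the article.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards and stop once the
    # cumulative length reaches half of the total.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), not data from the paper.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

In this toy set the lengths sum to 300, and the two longest contigs (80 + 70) already reach 150, so the reported N50 is 70.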

Subject(s)

Aegilops , Triticum , Aegilops/genetics , Chromosomes, Plant , Genome, Plant , Plant Breeding , Poaceae/genetics , Triticum/genetics

Pan-genome inversion index reveals evolutionary insights into the subpopulation structure of Asian rice.

Zhou, Yong; Yu, Zhichao; Chebotarov, Dmytro; Chougule, Kapeel; Lu, Zhenyuan; Rivera, Luis F; Kathiresan, Nagarajan; Al-Bader, Noor; Mohammed, Nahed; Alsantely, Aseel; Mussurova, Saule; Santos, João; Thimma, Manjula; Troukhan, Maxim; Fornasiero, Alice; Green, Carl D; Copetti, Dario; Kudrna, David; Llaca, Victor; Lorieux, Mathias; Zuccolo, Andrea; Ware, Doreen; McNally, Kenneth; Zhang, Jianwei; Wing, Rod A.

Nat Commun ; 14(1): 1567, 2023 03 21.

Article in English | MEDLINE | ID: mdl-36944612

ABSTRACT

Understanding and exploiting genetic diversity is a key factor for the productive and stable production of rice. Here, we utilize 73 high-quality genomes that encompass the subpopulation structure of Asian rice (Oryza sativa), plus the genomes of two wild relatives (O. rufipogon and O. punctata), to build a pan-genome inversion index of 1769 non-redundant inversions that span an average of ~29% of the O. sativa cv. Nipponbare reference genome sequence. Using this index, we estimate an inversion rate of ~700 inversions per million years in Asian rice, which is 16 to 50 times higher than previously estimated for plants. Detailed analyses of these inversions show evidence of their effects on gene expression, recombination rate, and linkage disequilibrium. Our study uncovers the prevalence and scale of large inversions (≥100 bp) across the pan-genome of Asian rice and hints at their largely unexplored role in functional biology and crop performance.

Subject(s)

Oryza , Oryza/genetics , Sequence Analysis, DNA , Genome, Plant/genetics , Biological Evolution , Phylogeny

Genome-Wide Prediction of Transcription Start Sites in Conifers.

Bondar, Eugeniya I; Troukhan, Maxim E; Krutovsky, Konstantin V; Tatarinova, Tatiana V.

Int J Mol Sci ; 23(3), 2022 Feb 03.

Article in English | MEDLINE | ID: mdl-35163661

ABSTRACT

The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.

Subject(s)

Genome, Plant , Tracheophyta/genetics , Transcription Initiation Site , Base Composition/genetics , Binding Sites , DNA, Plant/genetics , Exons/genetics , Molecular Sequence Annotation , Nucleotide Motifs/genetics , Nucleotides/metabolism , Open Reading Frames/genetics , Promoter Regions, Genetic , Transcription Factors/metabolism

High resolution genetic mapping by genome sequencing reveals genome duplication and tetraploid genetic structure of the diploid Miscanthus sinensis.

Ma, Xue-Feng; Jensen, Elaine; Alexandrov, Nickolai; Troukhan, Maxim; Zhang, Liping; Thomas-Jones, Sian; Farrar, Kerrie; Clifton-Brown, John; Donnison, Iain; Swaller, Timothy; Flavell, Richard.

PLoS One ; 7(3): e33821, 2012.

Article in English | MEDLINE | ID: mdl-22439001

ABSTRACT

We have created a high-resolution linkage map of Miscanthus sinensis, using genotyping-by-sequencing (GBS), identifying all 19 linkage groups for the first time. The result is technically significant since Miscanthus has a very large and highly heterozygous genome, but little or no genomics information has been available to date. The composite linkage map containing markers from both parental linkage maps is composed of 3,745 SNP markers spanning 2,396 cM on 19 linkage groups with a 0.64 cM average resolution. Comparative genomics analyses of the M. sinensis composite linkage map to the genomes of sorghum, maize, rice, and Brachypodium distachyon indicate that sorghum has the closest syntenic relationship to Miscanthus compared to other species. The comparative results revealed that each pair of the 19 M. sinensis linkage groups aligned to one sorghum chromosome, except for LG8, which mapped to two sorghum chromosomes (4 and 7), presumably due to a chromosome fusion event after genome duplication. The data also revealed several other chromosome rearrangements relative to sorghum, including two telomere-centromere inversions of the sorghum syntenic chromosome 7 in LG8 of M. sinensis and two paracentric inversions of sorghum syntenic chromosome 4 in LG7 and LG8 of M. sinensis. The results clearly demonstrate, for the first time, that the diploid M. sinensis is of tetraploid origin, consisting of two sub-genomes. This complete and high-resolution composite linkage map will not only serve as a useful resource for novel QTL discoveries, but also enable informed deployment of the wealth of existing genomics resources of other species to the improvement of Miscanthus as a high biomass energy crop. In addition, it has utility as a reference for genome sequence assembly for the forthcoming whole genome sequencing of the Miscanthus genus.

Subject(s)

Poaceae/genetics , Biofuels , Chromosome Mapping , Chromosomes, Plant/genetics , Diploidy , Gene Duplication , Genetic Markers , Genome, Plant , Poaceae/classification , Polymorphism, Single Nucleotide , Sorghum/classification , Sorghum/genetics , Species Specificity , Tetraploidy

Genome-wide discovery of cis-elements in promoter sequences using gene expression.

Troukhan, Maxim; Tatarinova, Tatiana; Bouck, John; Flavell, Richard B; Alexandrov, Nickolai N.

OMICS ; 13(2): 139-51, 2009 Apr.

Article in English | MEDLINE | ID: mdl-19231992

ABSTRACT

The availability of complete or nearly complete genome sequences, a large number of 5' expressed sequence tags, and significant public expression data allow for a more accurate identification of cis-elements regulating gene expression. We have implemented a global approach that takes advantage of available expression data, genomic sequences, and transcript information to predict cis-elements associated with specific expression patterns. The key components of our approach are: (1) precise identification of transcription start sites, (2) specific locations of cis-elements relative to the transcription start site, and (3) assessment of statistical significance for all sequence motifs. By applying our method to promoters of Arabidopsis thaliana and Mus musculus, we have identified motifs that affect gene expression under specific environmental conditions or in certain tissues. We also found that the presence of the TATA box is associated with increased variability of gene expression. Strong correlation between our results and experimentally determined motifs shows that the method is capable of predicting new functionally important cis-elements in promoter sequences.

Subject(s)

Gene Expression , Genome-Wide Association Study , Promoter Regions, Genetic , Algorithms , Animals , Arabidopsis/genetics , Mice

Insights into corn genes derived from large-scale cDNA sequencing.

Alexandrov, Nickolai N; Brover, Vyacheslav V; Freidin, Stanislav; Troukhan, Maxim E; Tatarinova, Tatiana V; Zhang, Hongyu; Swaller, Timothy J; Lu, Yu-Ping; Bouck, John; Flavell, Richard B; Feldmann, Kenneth A.

Plant Mol Biol ; 69(1-2): 179-94, 2009 Jan.

Article in English | MEDLINE | ID: mdl-18937034

ABSTRACT

We present a large portion of the transcriptome of Zea mays, including ESTs representing 484,032 cDNA clones from 53 libraries and 36,565 fully sequenced cDNA clones, out of which 31,552 clones are non-redundant. These and other previously sequenced transcripts have been aligned with available genome sequences and have provided new insights into the characteristics of gene structures and promoters within this major crop species. We found that although the average number of introns per gene is about the same in corn and Arabidopsis, corn genes have more alternatively spliced isoforms. Examination of the nucleotide composition of coding regions reveals that corn genes, as well as genes of other Poaceae (Grass family), can be divided into two classes according to the GC content at the third position in the amino acid encoding codons. Many of the transcripts that have lower GC content at the third position have dicot homologs but the high GC content transcripts tend to be more specific to the grasses. The high GC content class is also enriched with intronless genes. Together this suggests that an identifiable class of genes in plants is associated with the Poaceae divergence. Furthermore, because many of these genes appear to be derived from ancestral genes that do not contain introns, this evolutionary divergence may be the result of horizontal gene transfer from species not only with different codon usage but possibly that did not have introns, perhaps outside of the plant kingdom. By comparing the cDNAs described herein with the non-redundant set of corn mRNAs in GenBank, we estimate that there are about 50,000 different protein coding genes in Zea. All of the sequence data from this study have been submitted to DDBJ/GenBank/EMBL under accession numbers EU940701-EU977132 (FLI cDNA) and FK944382-FL482108 (EST).
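Note on the GC content at the third codon position (often called GC3) mentioned above: it is the fraction of codons in a coding sequence whose third base is G or C. The sketch below is a minimal illustration, assuming an in-frame coding sequence that starts at the first codon. The function name `gc3` and the example sequences are hypothetical and are not taken from the article.

```python
def gc3(cds):
    """Fraction of codons whose third base is G or C.

    Assumes `cds` is an in-frame coding sequence; a trailing
    partial codon (1 or 2 extra bases) is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any incomplete final codon
    if usable == 0:
        raise ValueError("sequence shorter than one codon")
    third_bases = cds[2:usable:3]  # every third base, starting at index 2
    gc = sum(1 for base in third_bases if base in "GC")
    return gc / len(third_bases)


# Hypothetical toy sequences, not data from the paper.
print(gc3("ATGGCC"))  # third bases: G, C
print(gc3("ATGAAA"))  # third bases: G, A
```

Computing this value per gene and plotting its distribution is one way to look for the kind of two-class split the abstract describes. The abstract does not state how the authors set the boundary between classes, so none is assumed here.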

Subject(s)

DNA, Complementary/genetics , Genes, Plant , Zea mays/genetics , Alternative Splicing , Base Sequence , DNA Primers , Expressed Sequence Tags , Promoter Regions, Genetic , Transcription, Genetic

Features of Arabidopsis genes and genome discovered using full-length cDNAs.

Alexandrov, Nickolai N; Troukhan, Maxim E; Brover, Vyacheslav V; Tatarinova, Tatiana; Flavell, Richard B; Feldmann, Kenneth A.

Plant Mol Biol ; 60(1): 69-85, 2006 Jan.

Article in English | MEDLINE | ID: mdl-16463100

ABSTRACT

Arabidopsis is currently the reference genome for higher plants. A new, more detailed statistical analysis of Arabidopsis gene structure is presented including intron and exon lengths, intergenic distances, features of promoters, and variant 5'-ends of mRNAs transcribed from the same transcription unit. We also provide a statistical characterization of Arabidopsis transcripts in terms of their size, UTR lengths, 3'-end cleavage sites, splicing variants, and coding potential. These analyses were facilitated by scrutiny of our collection of sequenced full-length cDNAs and much larger collection of 5'-ESTs, together with another set of full-length cDNAs from Salk/Stanford/Plant Gene Expression Center/RIKEN. Examples of alternative splicing are observed for transcripts from 7% of the genes and many of these genes display multiple spliced isoforms. Most splicing variants lie in non-coding regions of the transcripts. Non-canonical splice sites constitute less than 1% of all splice sites. Genes with fewer than four introns display reduced average mRNA levels. Putative alternative transcription start sites were observed in 30% of highly expressed genes and in more than 50% of the genes with low expression. Transcription start sites correlate remarkably well with a CG skew peak in the DNA sequences. The intergenic distances vary considerably, those where genes are transcribed towards one another being significantly shorter. New transcripts, missing in the current TIGR genome annotation and ESTs that are non-coding, including those antisense to known genes, are derived and cataloged in the Supplementary Material. They identify 148 new loci in the Arabidopsis genome. The conclusions drawn provide a better understanding of the Arabidopsis genome and how the gene transcripts are processed. The results also allow better predictions to be made for, as yet, poorly defined genes and provide a reference for comparisons with other plant genomes whose complete sequences are currently being determined. Some comparisons with rice are included in this paper.

Subject(s)

Arabidopsis/genetics , DNA, Complementary/genetics , Genes, Plant/genetics , Genome, Plant , Alternative Splicing , Base Sequence , DNA, Intergenic , DNA, Plant/genetics , Exons/genetics , Gene Expression Profiling , Gene Expression Regulation, Plant , Introns/genetics , Transcription Initiation Site

Skew in CG content near the transcription start site in Arabidopsis thaliana.

Tatarinova, Tatiana; Brover, Vyacheslav; Troukhan, Maxim; Alexandrov, Nickolai.

Bioinformatics ; 19 Suppl 1: i313-4, 2003.

Article in English | MEDLINE | ID: mdl-12855475

ABSTRACT

We have discovered a novel statistical feature of the Arabidopsis thaliana genome, the CG skew peak, that correlates remarkably well with the position of the transcription start site. We hypothesize that the phenomenon can be explained by the higher mutability of unprotected cytosines.
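Note on CG skew: the abstract does not spell out the formula, so the sketch below assumes the commonly used definition, (C - G) / (C + G), computed in a sliding window along one strand. The function name `cg_skew`, the window and step sizes, and the example sequence are all illustrative assumptions, not the authors' implementation.

```python
def cg_skew(seq, window=20, step=1):
    """Sliding-window CG skew, assumed here to be (C - G) / (C + G).

    Returns one value per window start position. Windows that
    contain no C or G are reported as 0.0.
    """
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        values.append((c - g) / (c + g) if (c + g) else 0.0)
    return values


# Hypothetical toy sequence: a C-rich stretch followed by a G-rich one.
toy = "CCCCACCCTA" + "GGGTGGGGAG"
profile = cg_skew(toy, window=10, step=5)
print(profile)
```

In practice such profiles would be computed on sequences aligned at annotated transcription start sites and averaged across genes, so that a peak near position 0 becomes visible. That averaging step is not shown here.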

Subject(s)

Arabidopsis/genetics , Cytosine , Guanine , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Transcription Initiation Site , Transcription, Genetic/genetics , Base Composition , DNA Mutational Analysis , Genetic Variation , Genome, Plant

Full-length messenger RNA sequences greatly improve genome annotation.

Haas, Brian J; Volfovsky, Natalia; Town, Christopher D; Troukhan, Maxim; Alexandrov, Nickolai; Feldmann, Kenneth A; Flavell, Richard B; White, Owen; Salzberg, Steven L.

Genome Biol ; 3(6): RESEARCH0029, 2002.

Article in English | MEDLINE | ID: mdl-12093376

ABSTRACT

BACKGROUND: Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism. RESULTS: Approximately 35% of the transcripts indicated that previously annotated genes needed modification, and 5% of the transcripts represented newly discovered genes. We also discovered that multiple transcription initiation sites appear to be much more common than previously known, and we report numerous cases of alternative mRNA splicing. We include a comparison of different alignment software and an analysis of how the transcript data improved the previously published annotation. CONCLUSIONS: Our results demonstrate that sequencing of large numbers of full-length transcripts followed by computational mapping greatly improves identification of the complete exon structures of eukaryotic genes. In addition, we are able to find numerous introns in the untranslated regions of the genes.

Subject(s)

Arabidopsis/genetics , Genome, Plant , RNA, Messenger/genetics , Alternative Splicing/genetics , Computational Biology , Databases, Genetic , Exons/genetics , Genes, Plant/genetics , RNA Splicing/genetics , RNA, Messenger/classification , RNA, Plant/classification , RNA, Plant/genetics
