Results 1 - 20 of 29
1.
PLoS One ; 19(4): e0302271, 2024.
Article in English | MEDLINE | ID: mdl-38630664

ABSTRACT

We provide new algorithms for two tasks relating to heterogeneous tabular datasets: clustering, and synthetic data generation. Tabular datasets typically consist of heterogeneous data types (numerical, ordinal, categorical) in columns, but may also have hidden cluster structure in their rows: for example, they may be drawn from heterogeneous (geographical, socioeconomic, methodological) sources, such that the outcome variable they describe (such as the presence of a disease) may depend not only on the other variables but on the cluster context. Moreover, sharing of biomedical data is often hindered by patient confidentiality laws, and there is current interest in algorithms to generate synthetic tabular data from real data, for example via deep learning. We demonstrate a novel EM-based clustering algorithm, MMM ("Madras Mixture Model"), that outperforms standard algorithms in determining clusters in synthetic heterogeneous data, and recovers structure in real data. Based on this, we demonstrate a synthetic tabular data generation algorithm, MMMsynth, that pre-clusters the input data, and generates cluster-wise synthetic data assuming cluster-specific data distributions for the input columns. We benchmark this algorithm by testing the performance of standard ML algorithms when they are trained on synthetic data and tested on real published datasets. Our synthetic data generation algorithm outperforms other literature tabular-data generators, and approaches the performance of training purely with real data.


Subject(s)
Algorithms , Humans , India , Cluster Analysis
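
The EM-based clustering idea in the abstract above can be illustrated with a generic mixture model over mixed column types. The following is a minimal sketch, not the MMM algorithm itself: it assumes only one numeric column (modelled as Gaussian per cluster) and one categorical column (modelled as a smoothed categorical distribution per cluster), treats the columns as independent within a cluster, and fixes the number of clusters in advance. The function name `em_mixed_mixture`, the quantile-based initialisation, and the Laplace smoothing are illustrative assumptions, not details from the paper.

```python
import math


def em_mixed_mixture(rows, n_categories, k=2, n_iter=50):
    """Fit a k-component mixture to rows of (numeric_value, category_index).

    Within a component the numeric column is Gaussian, the categorical
    column follows a smoothed categorical distribution, and the two columns
    are assumed independent. Returns (weights, means, variances, cat_probs,
    resp), where resp[i][j] is the posterior probability that row i belongs
    to component j.
    """
    n = len(rows)
    # Deterministic start (an illustrative choice): means at evenly spaced
    # quantiles of the numeric column, with a shared overall variance.
    ordered = sorted(x for x, _ in rows)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(ordered) / n
    overall_var = max(sum((x - grand_mean) ** 2 for x in ordered) / n, 1e-6)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    cat_probs = [[1.0 / n_categories] * n_categories for _ in range(k)]
    resp = [[1.0 / k] * k for _ in range(n)]

    for _ in range(n_iter):
        # E-step: posterior component probabilities for every row,
        # computed in log space for numerical stability.
        for i, (x, c) in enumerate(rows):
            log_post = []
            for j in range(k):
                log_gauss = (-0.5 * math.log(2.0 * math.pi * variances[j])
                             - (x - means[j]) ** 2 / (2.0 * variances[j]))
                log_post.append(math.log(weights[j]) + log_gauss
                                + math.log(cat_probs[j][c]))
            top = max(log_post)
            unnorm = [math.exp(v - top) for v in log_post]
            total = sum(unnorm)
            resp[i] = [u / total for u in unnorm]

        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, (x, _) in zip(resp, rows)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2
                         for r, (x, _) in zip(resp, rows)) / nj
            variances[j] = max(spread, 1e-6)
            # Laplace smoothing keeps every category probability above zero.
            counts = [1.0] * n_categories
            for r, (_, c) in zip(resp, rows):
                counts[c] += r[j]
            total_count = sum(counts)
            cat_probs[j] = [v / total_count for v in counts]
    return weights, means, variances, cat_probs, resp
```

A hard cluster label for row `i` can be read off as the index of the largest entry in `resp[i]`. Sampling new rows cluster by cluster from the fitted `means`, `variances`, and `cat_probs` would be one simple way to mimic the cluster-wise generation step the abstract describes, again only as an illustration.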
2.
Med Mycol ; 62(3), 2024 Mar 07.
Article in English | MEDLINE | ID: mdl-38414264

ABSTRACT

Candida auris poses threats to the global medical community due to its multidrug resistance, ability to cause nosocomial outbreaks, and resistance to common sterilization agents. Different variants that emerged in different geographical zones were classified as clades. Clade typing is necessary to track its spread, to detect the possible emergence of new clades, and to predict the properties that exhibit a clade bias. We previously reported a colony polymerase chain reaction (PCR)-based clade-identification method employing whole genome alignments and identification of clade-specific sequences of four major geographical clades. Here, we expand the panel by identifying clade 5, which was later isolated in Iran, using specific primers designed through in silico analyses.


Candida auris, a multidrug-resistant fungal pathogen, evolves as distinct geographical clades. We describe the identification of a clade 5-specific DNA sequence, which was used to design primers that distinguish clade 5 from other clades, adding to the panel of the clade-identification system.


Subject(s)
Candida , Candidiasis , Animals , Candida/genetics , Candidiasis/epidemiology , Candidiasis/veterinary , Candida auris , Polymerase Chain Reaction/veterinary , Genome, Fungal , Antifungal Agents/pharmacology , Microbial Sensitivity Tests/veterinary
3.
R Soc Open Sci ; 11(1): 231088, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38269075

ABSTRACT

Transcription factor binding sites (TFBS), like other DNA sequence, evolve via mutation and selection relating to their function. Models of nucleotide evolution describe DNA evolution via single-nucleotide mutation. A stationary vector of such a model is the long-term distribution of nucleotides, unchanging under the model. Neutrally evolving sites may have uniform stationary vectors, but one expects that sites within a TFBS instead have stationary vectors reflective of the fitness of various nucleotides at those positions. We introduce 'position-specific stationary vectors' (PSSVs), the collection of stationary vectors at each site in a TFBS locus, analogous to the position weight matrix (PWM) commonly used to describe TFBS. We infer PSSVs for human TFs using two evolutionary models (Felsenstein 1981 and Hasegawa-Kishino-Yano 1985). We find that PSSVs reflect the nucleotide distribution from PWMs, but with reduced specificity. We infer ancestral nucleotide distributions at individual positions and calculate 'conditional PSSVs' conditioned on specific choices of majority ancestral nucleotide. We find that certain ancestral nucleotides exert a strong evolutionary pressure on neighbouring sequence while others have a negligible effect. Finally, we present a fast likelihood calculation for the F81 model on moderate-sized trees that makes this approach feasible for large-scale studies along these lines.
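
To make the notion of a stationary vector concrete, here is a small sketch for the Felsenstein 1981 (F81) model, whose transition probabilities have the closed form P_ij(t) = exp(-mu t) delta_ij + (1 - exp(-mu t)) pi_j. The nucleotide frequencies in `pi`, the rate `mu`, and the function names are made-up illustrative values and names, not quantities inferred in the paper; the rate normalisation is also an assumption, since conventions differ.

```python
import math


def f81_transition_matrix(pi, t, mu=1.0):
    """Return the F81 transition matrix P(t) for stationary vector pi.

    P[i][j] is the probability that nucleotide i has become nucleotide j
    after time t, using P_ij(t) = e^{-mu t} [i == j] + (1 - e^{-mu t}) pi_j.
    """
    decay = math.exp(-mu * t)
    n = len(pi)
    return [[decay * (1.0 if i == j else 0.0) + (1.0 - decay) * pi[j]
             for j in range(n)]
            for i in range(n)]


def propagate(dist, matrix):
    """Evolve a distribution over nucleotides by one application of matrix."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Hypothetical stationary vector for one site, ordered A, C, G, T.
pi = [0.1, 0.2, 0.3, 0.4]

# The stationary vector is unchanged by evolution under the model.
after_short_time = propagate(pi, f81_transition_matrix(pi, t=0.5))

# Any starting nucleotide relaxes towards pi over long times.
after_long_time = propagate([1.0, 0.0, 0.0, 0.0],
                            f81_transition_matrix(pi, t=50.0))
```

In this sketch `after_short_time` equals `pi` up to rounding, and `after_long_time` is numerically indistinguishable from `pi`, which is the sense in which the stationary vector is the long-term nucleotide distribution at a site.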

4.
iScience ; 26(10): 107846, 2023 Oct 20.
Article in English | MEDLINE | ID: mdl-37767000

ABSTRACT

Early onset of type 2 diabetes and cardiovascular disease are common complications for women diagnosed with gestational diabetes. Prediabetes refers to a condition in which blood glucose levels are higher than normal, but not yet high enough to be diagnosed as type 2 diabetes. Currently, there is no accurate way of knowing which women with gestational diabetes are likely to develop postpartum prediabetes. This study aims to predict the risk of postpartum prediabetes in women diagnosed with gestational diabetes. Our sparse logistic regression approach selects only two variables - antenatal fasting glucose at OGTT and HbA1c soon after the diagnosis of GDM - as relevant, but gives an area under the receiver operating characteristic curve of 0.72, outperforming all other methods. We envision this to be a practical solution, which coupled with a targeted follow-up of high-risk women, could yield better cardiometabolic outcomes in women with a history of GDM.
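
The area under the receiver operating characteristic curve quoted above has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The labels and scores are toy values invented for illustration and are unrelated to the study data; `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 0/1 outcomes and scores are predicted risks. The result is
    the fraction of (positive, negative) pairs in which the positive case
    has the higher score, counting ties as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered
# correctly, so the AUC is 0.75.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to scores that carry no ranking information, and 1.0 to perfect separation, so a value of 0.72 indicates moderate discrimination.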

5.
Heliyon ; 9(8): e18211, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37520992

ABSTRACT

Transcription factors (TFs) and their binding sites have evolved to interact cooperatively or competitively with each other. Here we examine in detail, across multiple cell lines, such cooperation or competition among TFs both in sequential and spatial proximity (using chromatin conformation capture assays), considering in vivo binding data as well as TF binding motifs in DNA. We ascertain significantly co-occurring ("attractive") or avoiding ("repulsive") TF pairs using robust randomized models that retain the essential characteristics of the experimental data. Across human cell lines TFs organize into two groups, with intra-group attraction and inter-group repulsion. This is true for both sequential and spatial proximity, and for both in vivo binding and sequence motifs. Attractive TF pairs exhibit significantly more physical interactions, suggesting an underlying mechanism. The two TF groups differ significantly in their genomic and network properties, as well as in their function: while one group regulates housekeeping function, the other potentially regulates lineage-specific functions that are disrupted in cancer. Weaker binding sites tend to occur in spatially interacting regions of the genome. Our results suggest that a complex pattern of spatial cooperativity of TFs and chromatin has evolved with the genome to support housekeeping and lineage-specific functions.

6.
PLoS One ; 17(3): e0264648, 2022.
Article in English | MEDLINE | ID: mdl-35255105

ABSTRACT

OBJECTIVE: The aim of the present study was to identify the factors associated with non-attendance at the immediate postpartum glucose test following a gestational diabetes mellitus (GDM) pregnancy, using a machine learning algorithm. METHOD: A retrospective cohort study of all women with GDM (n = 607) whose postpartum glucose test was due between January 2016 and December 2019 at the George Eliot Hospital NHS Trust, UK. RESULTS: Sixty-five percent of women attended the postpartum glucose test. Type 2 diabetes was diagnosed in 2.8% and 21.6% had persistent dysglycaemia at 6-13 weeks post-delivery. Those who did not attend the postpartum glucose test seemed to be younger, multiparous, and obese, and to have continued smoking during pregnancy. They also had higher fasting glucose at the antenatal oral glucose tolerance test. Our machine learning algorithm predicted postpartum glucose test non-attendance with an area under the receiver operating characteristic curve of 0.72. The model could achieve a sensitivity of 70% with 66% specificity at a risk score threshold of 0.46. A total of 233 (38.4%) women attended a subsequent glucose test at least once within the first two years of delivery and 24% had dysglycaemia. Compared to women who attended the postpartum glucose test, those who did not attend had a higher conversion rate to type 2 diabetes (2.5% vs 11.4%; p = 0.005). CONCLUSION: Postpartum screening following GDM is still poor. Women who did not attend postpartum screening appear to have higher metabolic risk and higher conversion to type 2 diabetes by two years post-delivery. A machine learning model can predict which women are unlikely to attend the postpartum glucose test using simple antenatal factors. Enhanced, personalised education of these women may improve postpartum glucose screening.


Subject(s)
Diabetes Mellitus, Type 2 , Diabetes, Gestational , Blood Glucose/metabolism , Diabetes Mellitus, Type 2/diagnosis , Diabetes Mellitus, Type 2/epidemiology , Diabetes, Gestational/diagnosis , Diabetes, Gestational/epidemiology , Diabetes, Gestational/metabolism , Female , Glucose , Humans , Machine Learning , Male , Postpartum Period , Pregnancy , Retrospective Studies
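
The abstract above reports sensitivity and specificity at a risk score threshold of 0.46. As a minimal sketch of how such figures are obtained from predicted risks, the function below counts true and false positives and negatives at a given cut-off. The labels and scores are invented toy values, the helper name is hypothetical, and treating a score equal to the threshold as a positive prediction is an assumption.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold count as positive.

    labels are 0/1 outcomes (1 = event, e.g. non-attendance) and scores are
    predicted risks between 0 and 1.
    """
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 3 events and 4 non-events with made-up risk scores.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.5, 0.3, 0.6, 0.4, 0.2, 0.1]
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores, 0.46)
```

With these toy values, two of the three events score at or above 0.46 (sensitivity 2/3) and three of the four non-events score below it (specificity 3/4). Lowering the threshold trades specificity for sensitivity, which is why a single operating point is usually reported alongside the AUC.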
7.
Microbiol Spectr ; 10(2): e0063422, 2022 04 27.
Article in English | MEDLINE | ID: mdl-35343775

ABSTRACT

Candida auris, the multidrug-resistant human fungal pathogen, emerged as four major distinct geographical clades (clade 1-clade 4) in the past decade. Though isolates of the same species, C. auris clinical strains exhibit clade-specific properties associated with virulence and drug resistance. In this study, we report the identification of unique DNA sequence junctions by mapping clade-specific regions through comparative analysis of whole-genome sequences of strains belonging to different clades. These unique DNA sequence stretches are used to identify C. auris isolates at the clade level in subsequent in silico and experimental analyses. We develop a colony PCR-based clade-identification system (ClaID), which is rapid and specific. In summary, we demonstrate a proof-of-concept for using unique DNA sequence junctions conserved in a clade-specific manner for the rapid identification of each of the four major clades of C. auris. IMPORTANCE: C. auris was first isolated in Japan in 2009 as an antifungal drug-susceptible pathogen causing localized infections. Within a decade, it simultaneously evolved in different parts of the world as distinct clades exhibiting resistance to antifungal drugs at varying levels. Recent studies hinted at the mixing of isolates belonging to different geographical clades in a single location, suggesting that the area of isolation alone may not indicate the clade status of an isolate. In this study, we compared the genomes of representative strains of the four major clades to identify clade-specific sequences, which were then used to design clade-specific primers. We propose the utilization of whole genome sequence data to extract clade-specific sequences for clade-typing. The colony PCR-based method employed can rapidly distinguish between the four major clades of C. auris, with scope for expanding the panel by adding more primer pairs.


Subject(s)
Antifungal Agents , Candida , Antifungal Agents/pharmacology , Antifungal Agents/therapeutic use , Candida/genetics , Candida auris , Humans , Japan , Microbial Sensitivity Tests , Virulence
8.
mBio ; 12(3), 2021 05 11.
Article in English | MEDLINE | ID: mdl-33975937

ABSTRACT

The thermotolerant multidrug-resistant ascomycete Candida auris rapidly emerged since 2009 causing systemic infections worldwide and simultaneously evolved in different geographical zones. The molecular events that orchestrated this sudden emergence of the killer fungus remain mostly elusive. Here, we identify centromeres in C. auris and related species, using a combined approach of chromatin immunoprecipitation and comparative genomic analyses. We find that C. auris and multiple other species in the Clavispora/Candida clade shared a conserved small regional GC-poor centromere landscape lacking pericentromeres or repeats. Further, a centromere inactivation event led to karyotypic alterations in this species complex. Interspecies genome analysis identified several structural chromosomal changes around centromeres. In addition, centromeres are found to be rapidly evolving loci among the different geographical clades of the same species of C. auris. Finally, we reveal an evolutionary trajectory of the unique karyotype associated with clade 2 that consists of the drug-susceptible isolates of C. auris. IMPORTANCE: Candida auris, the killer fungus, emerged as different geographical clades, exhibiting multidrug resistance and high karyotype plasticity. Chromosomal rearrangements are known to play key roles in the emergence of new species, virulence, and drug resistance in pathogenic fungi. Centromeres, the genomic loci where microtubules attach to separate the sister chromatids during cell division, are known to be hot spots of breaks and downstream rearrangements. We identified the centromeres in C. auris and related species to study their involvement in the evolution and karyotype diversity reported in C. auris. We report conserved centromere features in 10 related species and trace the events that occurred at the centromeres during evolution. We reveal a centromere inactivation-mediated chromosome number change in these closely related species. We also observe that one of the geographical clades, the East Asian clade, evolved along a unique trajectory, compared to the other clades and related species.


Subject(s)
Candida/genetics , Centromere/genetics , Centromere/metabolism , Chromosomes/genetics , Evolution, Molecular , Genome, Fungal , Antifungal Agents/pharmacology , Candida/classification , Candida/drug effects , Candidiasis/microbiology , Centromere/classification , Chromosomes/classification , Genomics , Virulence
9.
Genome Res ; 31(4): 607-621, 2021 04.
Article in English | MEDLINE | ID: mdl-33514624

ABSTRACT

The establishment of centromeric chromatin and its propagation by the centromere-specific histone CENPA is mediated by epigenetic mechanisms in most eukaryotes. DNA replication origins, origin binding proteins, and replication timing of centromere DNA are important determinants of centromere function. The epigenetically regulated regional centromeres in the budding yeast Candida albicans have unique DNA sequences that replicate earliest in every chromosome and are clustered throughout the cell cycle. In this study, the genome-wide occupancy of the replication initiation protein Orc4 reveals its abundance at all centromeres in C. albicans. Orc4 is associated with four different DNA sequence motifs, one of which coincides with tRNA genes (tDNA) that replicate early and cluster together in space. Hi-C combined with genome-wide replication timing analyses identify that early replicating Orc4-bound regions interact with themselves more strongly than with late replicating Orc4-bound regions. We simulate a polymer model of chromosomes of C. albicans and propose that the early replicating and highly enriched Orc4-bound sites preferentially localize around the clustered kinetochores. We also observe that Orc4 is constitutively localized to centromeres, and both Orc4 and the helicase Mcm2 are essential for cell viability and CENPA stability in C. albicans. Finally, we show that new molecules of CENPA are recruited to centromeres during late anaphase/telophase, which coincides with the stage at which the CENPA-specific chaperone Scm3 localizes to the kinetochore. We propose that the spatiotemporal localization of Orc4 within the nucleus, in collaboration with Mcm2 and Scm3, maintains centromeric chromatin stability and CENPA recruitment in C. albicans.


Subject(s)
Candida albicans , Centromere , Chromatin , Origin Recognition Complex/metabolism , Candida albicans/genetics , Centromere/genetics , Chromatin/chemistry , Chromatin/genetics , Chromatin/metabolism , Histones/metabolism , Kinetochores , Replication Origin/genetics
10.
PLoS One ; 15(11): e0242375, 2020.
Article in English | MEDLINE | ID: mdl-33211740

ABSTRACT

Vasoplegia observed post cardiopulmonary bypass (CPB) is associated with substantial morbidity, multiple organ failure and mortality. Circulating counts of hematopoietic stem cells (HSCs) and endothelial progenitor cells (EPC) are potential markers of neo-vascularization and vascular repair. However, the significance of changes in the circulating levels of these progenitors in perioperative CPB, and their association with post-CPB vasoplegia, are currently unexplored. We enumerated HSC and EPC counts, via flow cytometry, at different time-points during CPB in 19 individuals who underwent elective cardiac surgery. These 19 individuals were categorized into two groups based on severity of post-operative vasoplegia, a clinically insignificant vasoplegic Group 1 (G1) and a clinically significant vasoplegic Group 2 (G2). Differential changes in progenitor cell counts during different stages of surgery were compared across these two groups. Machine-learning classifiers (logistic regression and gradient boosting) were employed to determine if differential changes in progenitor counts could aid the classification of individuals into these groups. Enumerating progenitor cells revealed an early and significant increase in the circulating counts of CD34+ and CD34+CD133+ hematopoietic stem cells (HSC) in G1 individuals, while these counts were attenuated in G2 individuals. Additionally, EPCs (CD34+VEGFR2+) were lower in G2 individuals compared to G1. Gradient boosting outperformed logistic regression in assessing the vasoplegia grouping based on the fold change in circulating CD34+ levels. Our findings indicate that a lack of early response of CD34+ cells and CD34+CD133+ HSCs might serve as an early marker for development of clinically significant vasoplegia after CPB.


Subject(s)
Blood Cell Count , Cardiopulmonary Bypass/adverse effects , Endothelial Progenitor Cells , Hematopoietic Stem Cells , Vasoplegia/blood , Adrenergic beta-Antagonists/therapeutic use , Adult , Aged , Angiotensin II Type 1 Receptor Blockers/therapeutic use , Angiotensin-Converting Enzyme Inhibitors/therapeutic use , Anthropometry , Comorbidity , Elective Surgical Procedures , Female , Humans , Hydroxymethylglutaryl-CoA Reductase Inhibitors/therapeutic use , Intraoperative Period , Kinetics , Machine Learning , Male , Middle Aged , Pilot Projects , Postoperative Period , Severity of Illness Index , Vasoplegia/physiopathology
11.
Elife ; 9, 2020 01 20.
Article in English | MEDLINE | ID: mdl-31958060

ABSTRACT

Genomic rearrangements associated with speciation often result in variation in chromosome number among closely related species. Malassezia species show variable karyotypes ranging between six and nine chromosomes. Here, we experimentally identified all eight centromeres in M. sympodialis as 3-5-kb long kinetochore-bound regions that span an AT-rich core and are depleted of the canonical histone H3. Centromeres of similar sequence features were identified as CENP-A-rich regions in Malassezia furfur, which has seven chromosomes, and histone H3 depleted regions in Malassezia slooffiae and Malassezia globosa with nine chromosomes each. Analysis of synteny conservation across centromeres with newly generated chromosome-level genome assemblies suggests two distinct mechanisms of chromosome number reduction from an inferred nine-chromosome ancestral state: (a) chromosome breakage followed by loss of centromere DNA and (b) centromere inactivation accompanied by changes in DNA sequence following chromosome-chromosome fusion. We propose that AT-rich centromeres drive karyotype diversity in the Malassezia species complex through breakage and inactivation.


Millions of yeast, bacteria and other microbes live in or on the human body. A type of yeast known as Malassezia is one of the most abundant microbes living on our skin. Generally, Malassezia do not cause symptoms in humans but are associated with dandruff, dermatitis and other skin conditions in susceptible individuals. They have also been found in the human gut, where they exacerbate Crohn's disease and pancreatic cancer. There are 18 closely related species of Malassezia and all have an unusually small amount of genetic material compared with other types of yeast. In yeast, like in humans, the genetic material is divided among several chromosomes. The number of chromosomes in different Malassezia species varies between six and nine. A region of each chromosome known as the centromere is responsible for ensuring that equal numbers of chromosomes are passed on to their offspring. This means that any defects in centromeres can lead to the daughter yeast cells inheriting unequal numbers of chromosomes. Changes in chromosome number can drive the evolution of new species, but it remains unclear if and how centromere loss may have contributed to the evolution of Malassezia species. Sankaranarayanan et al. have now used biochemical, molecular genetic, and comparative genomic approaches to study the chromosomes of Malassezia species. The experiments revealed that nine Malassezia species had centromeres that shared common features such as being rich in adenine and thymine nucleotides, two of the building blocks of DNA. Sankaranarayanan et al. propose that these adenines and thymines make the centromeres more fragile, leading to occasional breaks. This may have contributed to the loss of centromeres in some Malassezia cells and helped new species to evolve with fewer chromosomes.
A better understanding of how Malassezia organize their genetic material should enable in-depth studies of how these yeasts interact with their human hosts and how they contribute to skin disease, cancer, Crohn's disease and other health conditions. More broadly, these findings may help scientists to better understand how changes in chromosomes cause new species to evolve.


Subject(s)
Centromere , Evolution, Molecular , Karyotyping , Malassezia/physiology , Chromosomes, Fungal , Malassezia/classification , Malassezia/genetics , Species Specificity
12.
PLoS Comput Biol ; 15(3): e1006921, 2019 03.
Article in English | MEDLINE | ID: mdl-30897079

ABSTRACT

ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a high-throughput technique to identify genomic regions that are bound in vivo by a particular protein, e.g., a transcription factor (TF). Biological factors, such as chromatin state, indirect and cooperative binding, as well as experimental factors, such as antibody quality, cross-linking, and PCR biases, are known to affect the outcome of ChIP-seq experiments. However, the relative impact of these factors on inferences made from ChIP-seq data is not entirely clear. Here, via a detailed ChIP-seq simulation pipeline, ChIPulate, we assess the impact of various biological and experimental sources of variation on several outcomes of a ChIP-seq experiment, viz., the recoverability of the TF binding motif, accuracy of TF-DNA binding detection, the sensitivity of inferred TF-DNA binding strength, and number of replicates needed to confidently infer binding strength. We find that the TF motif can be recovered despite poor and non-uniform extraction and PCR amplification efficiencies. The recovery of the motif is, however, affected to a larger extent by the fraction of sites that are either cooperatively or indirectly bound. Importantly, our simulations reveal that the number of ChIP-seq replicates needed to accurately measure in vivo occupancy at high-affinity sites is larger than the recommended community standards. Our results establish statistical limits on the accuracy of inferences of protein-DNA binding from ChIP-seq and suggest that increasing the mean extraction efficiency, rather than amplification efficiency, would better improve sensitivity. The source code and instructions for running ChIPulate can be found at https://github.com/vishakad/chipulate.


Subject(s)
Chromatin Immunoprecipitation/methods , Computational Biology/methods , Sequence Analysis, DNA/methods , Software , Transcription Factors , Binding Sites/genetics , Computer Simulation , DNA/chemistry , DNA/genetics , DNA/metabolism , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Escherichia coli/genetics , Escherichia coli Proteins/chemistry , Escherichia coli Proteins/genetics , Escherichia coli Proteins/metabolism , High-Throughput Nucleotide Sequencing , Protein Binding/genetics , Transcription Factors/chemistry , Transcription Factors/genetics , Transcription Factors/metabolism
13.
PLoS One ; 13(7): e0199771, 2018.
Article in English | MEDLINE | ID: mdl-30016330

ABSTRACT

Transcription factors (TFs) often work cooperatively, where the binding of one TF to DNA enhances the binding affinity of a second TF to a nearby location. Such cooperative binding is important for activating gene expression from promoters and enhancers in both prokaryotic and eukaryotic cells. Existing methods to detect cooperative binding of a TF pair rely on analyzing the sequence that is bound. We propose a method that uses, instead, only ChIP-seq peak intensities and an expectation maximization (CPI-EM) algorithm. We validate our method using ChIP-seq data from cells where one of a pair of TFs under consideration has been genetically knocked out. Our algorithm relies on our observation that cooperative TF-TF binding is correlated with weak binding of one of the TFs, which we demonstrate in a variety of cell types, including E. coli, S. cerevisiae and M. musculus cells. We show that this method performs significantly better than a predictor based only on the ChIP-seq peak distance of the TFs under consideration. This suggests that peak intensities contain information that can help detect the cooperative binding of a TF pair. CPI-EM also outperforms an existing sequence-based algorithm in detecting cooperative binding. The CPI-EM algorithm is available at https://github.com/vishakad/cpi-em.


Subject(s)
Chromatin Immunoprecipitation/methods , Protein Interaction Mapping/methods , Software , Transcription Factors/metabolism , Animals , Escherichia coli , Mice , Protein Binding , Saccharomyces cerevisiae
14.
Nucleic Acids Res ; 46(5): e29, 2018 03 16.
Article in English | MEDLINE | ID: mdl-29267972

ABSTRACT

We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical clustering approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1-2 h, on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as or more accurate than all other tested programs. THiCweed performs best with large 'window' sizes (≥50 bp), much longer than typical binding sites (7-15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.


Subject(s)
Algorithms , Computational Biology/methods , DNA/genetics , High-Throughput Nucleotide Sequencing/methods , Nucleotide Motifs/genetics , Binding Sites/genetics , Chromatin/genetics , Chromatin/metabolism , Chromatin Immunoprecipitation/methods , Cluster Analysis , DNA/chemistry , DNA/metabolism , Genomics/methods , Humans , Protein Binding , Reproducibility of Results , Transcription Factors/metabolism
16.
Nucleic Acids Res ; 45(5): 2629-2643, 2017 03 17.
Article in English | MEDLINE | ID: mdl-28100699

ABSTRACT

Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotation. Through long-read DNA sequencing, we obtained a gap-free genome assembly for M. sympodialis (ATCC 42132), comprising eight nuclear and one mitochondrial chromosome. We also sequenced and assembled four M. sympodialis clinical isolates, and showed their value for understanding Malassezia reproduction by confirming four alternative allele combinations at the two mating-type loci. Importantly, we demonstrated how proteomics data could be readily integrated with transcriptomics data in standard annotation tools. This increased the number of annotated protein-coding genes by 14% (from 3612 to 4113), compared to using transcriptomics evidence alone. Manual curation further increased the number of protein-coding genes by 9% (to 4493). All of these genes have RNA-seq evidence and 87% were confirmed by proteomics. The M. sympodialis genome assembly and annotation presented here is at a quality yet achieved only for a few eukaryotic organisms, and constitutes an important reference for future host-microbe interaction studies.


Subject(s)
Fungal Proteins/genetics , Genome, Fungal , Malassezia/genetics , Molecular Sequence Annotation/methods , Proteogenomics/methods , Genes, Fungal , Genome, Mitochondrial , Peptides/genetics , Protein Domains , Sequence Analysis, RNA
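
The percentage increases quoted in the abstract above can be checked with a short calculation. This is only an arithmetic illustration using the gene counts stated in the abstract; the helper name `percent_increase` is hypothetical.

```python
def percent_increase(old, new):
    """Relative increase from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


# Gene counts as stated in the abstract.
transcriptomics_only = 3612
with_proteomics = 4113
after_manual_curation = 4493

# Roughly 13.9%, reported as 14%.
proteomics_gain = percent_increase(transcriptomics_only, with_proteomics)

# Roughly 9.2%, reported as 9%, measured relative to the 4113 genes.
curation_gain = percent_increase(with_proteomics, after_manual_curation)

# Overall gain relative to the transcriptomics-only annotation, about 24.4%.
overall_gain = percent_increase(transcriptomics_only, after_manual_curation)
```

Note that the two reported percentages use different baselines, so they compound rather than add: the overall increase from 3612 to 4493 genes is slightly larger than 14% plus 9%.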
17.
PLoS Genet ; 12(2): e1005839, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26845548

ABSTRACT

The centromere, on which kinetochore proteins assemble, ensures precise chromosome segregation. Centromeres are largely specified by the histone H3 variant CENP-A (also known as Cse4 in yeasts). Structurally, centromere DNA sequences are highly diverse in nature. However, the evolutionary consequence of these structural diversities on de novo CENP-A chromatin formation remains elusive. Here, we report the identification of centromeres, as the binding sites of four evolutionarily conserved kinetochore proteins, in the human pathogenic budding yeast Candida tropicalis. Each of the seven centromeres comprises a 2 to 5 kb non-repetitive mid core flanked by 2 to 5 kb inverted repeats. The repeat-associated centromeres of C. tropicalis all share a high degree of sequence conservation with each other and are strikingly diverged from the unique and mostly non-repetitive centromeres of related Candida species: Candida albicans, Candida dubliniensis, and Candida lusitaniae. Using a plasmid-based assay, we further demonstrate that pericentric inverted repeats and the underlying DNA sequence provide a structural determinant in CENP-A recruitment in C. tropicalis, as opposed to epigenetically regulated CENP-A loading at centromeres in C. albicans. Thus, the centromere structure and its influence on de novo CENP-A recruitment have been significantly rewired in closely related Candida species. Strikingly, the centromere structural properties, along with the role of pericentric repeats in de novo CENP-A loading in C. tropicalis, are more reminiscent of those of the distantly related fission yeast Schizosaccharomyces pombe. Taken together, we demonstrate, for the first time, fission yeast-like repeat-associated centromeres in an ascomycetous budding yeast.


Subject(s)
Candida tropicalis/genetics , Centromere/genetics , Repetitive Sequences, Nucleic Acid/genetics , Autoantigens/metabolism , Base Pairing/genetics , Centromere Protein A , Chromatin Immunoprecipitation , Chromosomal Proteins, Non-Histone/metabolism , Chromosome Mapping , Chromosome Segregation/genetics , Chromosomes, Fungal/metabolism , Conserved Sequence , Evolution, Molecular , Gene Rearrangement/genetics , Genome, Fungal , Inverted Repeat Sequences/genetics , Kinetochores/metabolism , Mitosis , Schizosaccharomyces/genetics , Species Specificity
18.
Mol Cell Biol ; 34(9): 1547-63, 2014 May.
Article in English | MEDLINE | ID: mdl-24550006

ABSTRACT

A common function of the TFIID and SAGA complexes, which are recruited by transcriptional activators, is to deliver TBP to promoters to stimulate transcription. Neither the relative contributions of the five shared TBP-associated factor (TAF) subunits in TFIID and SAGA nor the requirement for different domains in shared TAFs for transcriptional activation is well understood. In this study, we uncovered the essential requirement for the highly conserved C-terminal region (CRD) of Taf9, a shared TAF, for transcriptional activation in yeast. Transcriptome profiling performed under Gcn4-activating conditions showed that the Taf9 CRD is required for induced expression of ∼9% of the yeast genome. The CRD was not essential for the Taf9-Taf6 interaction, TFIID or SAGA integrity, or Gcn4 interaction with SAGA in cell extracts. Microarray profiling of a SAGA mutant (spt20Δ) yielded a common set of genes induced by Spt20 and the Taf9 CRD. Chromatin immunoprecipitation (ChIP) assays showed that, although the Taf9 CRD mutation did not impair Gcn4 occupancy, the occupancies of TFIID, SAGA, and the preinitiation complex were severely impaired at several promoters. These results suggest a crucial role for the Taf9 CRD in genome-wide transcription and highlight the importance of conserved domains, other than histone fold domains, as a common determinant for TFIID and SAGA functions.


Subject(s)
Gene Expression Regulation, Fungal , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/genetics , TATA-Binding Protein Associated Factors/chemistry , TATA-Binding Protein Associated Factors/metabolism , Trans-Activators/metabolism , Transcription Factor TFIID/metabolism , Arginase/genetics , Basic-Leucine Zipper Transcription Factors/metabolism , Mutation , Promoter Regions, Genetic , Protein Interaction Maps , Protein Structure, Tertiary , Saccharomyces cerevisiae/chemistry , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , TATA-Binding Protein Associated Factors/genetics , Transcription Factor TFIID/chemistry , Transcription Factor TFIID/genetics , Transcriptional Activation
19.
BMC Bioinformatics ; 11: 464, 2010 Sep 16.
Article in English | MEDLINE | ID: mdl-20846408

ABSTRACT

BACKGROUND: While most multiple sequence alignment programs expect that all or most of their input is known to be homologous, and penalise insertions and deletions, this is not a reasonable assumption for non-coding DNA, which is much less strongly conserved than protein-coding genes. Arguing that the goal of sequence alignment should be the detection of homology and not similarity, we incorporate an evolutionary model into a previously published multiple sequence alignment program for non-coding DNA, Sigma, as a sensitive likelihood-based way to assess the significance of alignments. Version 1 of Sigma was successful in eliminating spurious alignments but exhibited relatively poor sensitivity on synthetic data. Sigma 1 used a p-value (the probability under the "null hypothesis" of non-homology) to assess the significance of alignments, and, optionally, a background model that captured short-range genomic correlations. Sigma version 2, described here, retains these features, but calculates the p-value using a sophisticated evolutionary model that we describe here, and also allows for a transition matrix for different substitution rates from and to different nucleotides. Our evolutionary model takes separate account of mutation and fixation, and can be extended to allow for locally differing functional constraints on sequence. RESULTS: We demonstrate that, on real and synthetic data, Sigma-2 significantly outperforms other programs in specificity to genuine homology (that is, it minimises alignment of spuriously similar regions that do not have a common ancestry) while it is now as sensitive as the best current programs. CONCLUSIONS: Comparing these results with an extrapolation of the best results from other available programs, we suggest that conservation rates in intergenic DNA are often significantly over-estimated. It is increasingly important to align non-coding DNA correctly, in regulatory genomics and in the context of whole-genome alignment, and Sigma-2 is an important step in that direction.


Subject(s)
DNA, Intergenic/chemistry , Evolution, Molecular , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, DNA , Software , Likelihood Functions
20.
PLoS One ; 5(3): e9722, 2010 Mar 22.
Article in English | MEDLINE | ID: mdl-20339533

ABSTRACT

BACKGROUND: Identifying transcription factor binding sites (TFBS) in silico is key in understanding gene regulation. TFBS are string patterns that exhibit some variability, commonly modelled as "position weight matrices" (PWMs). Though convenient, the PWM has significant limitations, in particular the assumed independence of positions within the binding motif; and predictions based on PWMs are usually not very specific to known functional sites. Analysis here on binding sites in yeast suggests that correlation of dinucleotides is not limited to near-neighbours, but can extend over considerable gaps. METHODOLOGY/PRINCIPAL FINDINGS: I describe a straightforward generalization of the PWM model, that considers frequencies of dinucleotides instead of individual nucleotides. Unlike previous efforts, this method considers all dinucleotides within an extended binding region, and does not make an attempt to determine a priori the significance of particular dinucleotide correlations. I describe how to use a "dinucleotide weight matrix" (DWM) to predict binding sites, dealing in particular with the complication that its entries are not independent probabilities. Benchmarks show, for many factors, a dramatic improvement over PWMs in precision of predicting known targets. In most cases, significant further improvement arises by extending the commonly defined "core motifs" by about 10 bp on either side. Though this flanking sequence shows no strong motif at the nucleotide level, the predictive power of the dinucleotide model suggests that the "signature" in DNA sequence of protein-binding affinity extends beyond the core protein-DNA contact region. CONCLUSION/SIGNIFICANCE: While computationally more demanding and slower than PWM-based approaches, this dinucleotide method is straightforward, both conceptually and in implementation, and can serve as a basis for future improvements.


Subject(s)
Computational Biology/methods , Nucleotides/genetics , Transcription Factors/chemistry , Algorithms , Binding Sites , Genes, Fungal , Models, Genetic , Models, Statistical , Pattern Recognition, Automated/methods , Position-Specific Scoring Matrices , Software , Transcription Factors/genetics
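
Illustrative note on record 20: the abstract contrasts a position weight matrix (PWM), which treats each position of a binding site independently, with a model built from dinucleotide frequencies over all pairs of positions. The Python sketch below is only a toy illustration of that contrast and is not the scoring method of the paper, which deals specifically with the fact that dinucleotide entries are not independent probabilities. The function names, the pseudocount, the uniform background of 0.25 and the averaging of pairwise log-odds are all assumptions made for this example.

```python
import math
from itertools import combinations

BASES = "ACGT"

def build_pwm(sites, pseudo=1.0):
    """Per-position nucleotide frequencies from equal-length aligned sites."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudo for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def score_pwm(pwm, seq, background=0.25):
    """Sum of per-position log-odds; positions are assumed independent."""
    return sum(math.log(pwm[i][b] / background) for i, b in enumerate(seq))

def build_pair_freqs(sites, pseudo=1.0):
    """Joint frequencies of nucleotide pairs for every pair of positions i < j."""
    freqs = {}
    for i, j in combinations(range(len(sites[0])), 2):
        counts = {(a, b): pseudo for a in BASES for b in BASES}
        for s in sites:
            counts[(s[i], s[j])] += 1
        total = sum(counts.values())
        freqs[(i, j)] = {pair: c / total for pair, c in counts.items()}
    return freqs

def score_pairs(freqs, seq, background=0.25):
    """Naive score: mean log-odds over all position pairs (an assumption,
    not the paper's handling of non-independent entries)."""
    total = sum(
        math.log(f[(seq[i], seq[j])] / background ** 2)
        for (i, j), f in freqs.items()
    )
    return total / len(freqs)

# Hypothetical sites in which positions 0 and 3 co-vary (A..T or T..A).
sites = ["ACGT", "TCGA", "ACGT", "TCGA"]
pwm = build_pwm(sites)
pairs = build_pair_freqs(sites)
for candidate in ("ACGT", "ACGA"):
    print(candidate, score_pwm(pwm, candidate), score_pairs(pairs, candidate))
```

In this toy data set the per-position frequencies cannot tell ACGT from ACGA, because A and T are equally common at both ends; only the pair-based score reflects that A at position 0 was observed together with T, never A, at position 3.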