Results 1 - 7 of 7
1.
NAR Genom Bioinform ; 6(2): lqae041, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38774514

ABSTRACT

Microbial genome sequences are rapidly accumulating, enabling large-scale studies of sequence variation. Existing studies primarily focus on coding regions to study amino acid substitution patterns in proteins. However, non-coding regulatory regions also play a distinct role in determining physiologic responses. To investigate intergenic sequence variation on a large scale, we identified non-coding regulatory region alleles across 2350 Escherichia coli strains. This 'alleleome' consists of 117 781 unique alleles for 1169 reference regulatory regions (transcribing 1975 genes) at single base-pair resolution. We find that 64% of nucleotide positions are invariant, and variant positions vary in a median of just 0.6% of strains. Additionally, non-coding alleles are sufficient to recover E. coli phylogroups. We find that core promoter elements and transcription factor binding sites are significantly conserved, especially those located upstream of essential or highly-expressed genes. However, variability in conservation of transcription factor binding sites is significant both within and across regulons. Finally, we contrast mutations acquired during adaptive laboratory evolution with wild-type variation, finding that the former preferentially alter positions that the latter conserves. Overall, this analysis elucidates the wealth of information found in E. coli non-coding sequence variation and expands pangenomic studies to non-coding regulatory regions at single-nucleotide resolution.
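
The per-position statistics quoted in this abstract (share of invariant positions, median variation among variant positions) can be illustrated with a minimal sketch. It assumes the alleles are already aligned to the reference region and have the same length; the function names, the toy reference, and the toy alleles are hypothetical and are not taken from the article.

```python
from statistics import median


def position_variation(reference: str, alleles: list[str]) -> list[float]:
    """Fraction of strains whose base differs from the reference, per position.

    Assumption: every allele is pre-aligned to the reference and has equal length.
    """
    if not alleles:
        raise ValueError("at least one allele is required")
    if any(len(a) != len(reference) for a in alleles):
        raise ValueError("alleles must be pre-aligned to the reference length")
    n = len(alleles)
    return [sum(a[i] != base for a in alleles) / n for i, base in enumerate(reference)]


def summarize(fractions: list[float]) -> tuple[float, float]:
    """Return (share of invariant positions, median variation over variant positions)."""
    variant = [f for f in fractions if f > 0]
    invariant_share = 1 - len(variant) / len(fractions)
    return invariant_share, (median(variant) if variant else 0.0)


# Toy data (hypothetical): one 6-bp region observed in four strains.
reference = "TTGACA"
alleles = ["TTGACA", "TTGACA", "TTGATA", "TTCACA"]

fractions = position_variation(reference, alleles)
invariant_share, median_variant = summarize(fractions)
print(fractions, invariant_share, median_variant)
```

In this toy region four of six positions are invariant, and each variant position differs from the reference in one of four strains.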

2.
PLoS Comput Biol ; 20(1): e1011824, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38252668

ABSTRACT

The transcriptional regulatory network (TRN) of E. coli consists of thousands of interactions between regulators and DNA sequences. Regulons are typically determined either from resource-intensive experimental measurement of functional binding sites, or inferred from analysis of high-throughput gene expression datasets. Recently, independent component analysis (ICA) of RNA-seq compendia has been shown to be a powerful method for inferring bacterial regulons. However, it remains unclear to what extent regulons predicted by ICA structure have a biochemical basis in promoter sequences. Here, we address this question by developing machine learning models that predict inferred regulon structures in E. coli based on promoter sequence features. Models were constructed successfully (cross-validation AUROC ≥ 0.8) for 85% (40/47) of ICA-inferred E. coli regulons. We found that: 1) the presence of a high-scoring regulator motif in the promoter region was sufficient to specify regulatory activity in 40% (19/47) of the regulons; 2) additional features, such as DNA shape and extended motifs that can account for regulator multimeric binding, helped to specify regulon structure for the remaining 60% of regulons (28/47); and 3) investigating regulons where initial machine learning models failed revealed new regulator-specific sequence features that improved model accuracy. Finally, we found that strong regulatory binding sequences underlie both the genes shared between ICA-inferred and experimental regulons as well as genes in the E. coli core pan-regulon of Fur. This work demonstrates that the structure of ICA-inferred regulons largely can be understood through the strength of regulator binding sites in promoter regions, reinforcing the utility of top-down inference for regulon discovery.


Subject(s)
Escherichia coli , Regulon , Regulon/genetics , Escherichia coli/genetics , Escherichia coli/metabolism , Bacteria/genetics , Binding Sites/genetics , Promoter Regions, Genetic/genetics , Gene Expression Regulation, Bacterial/genetics , Bacterial Proteins/metabolism
3.
Nucleic Acids Res ; 51(19): 10176-10193, 2023 Oct 27.
Article in English | MEDLINE | ID: mdl-37713610

ABSTRACT

Transcriptomic data is accumulating rapidly; thus, scalable methods for extracting knowledge from this data are critical. Here, we assembled a top-down expression and regulation knowledge base for Escherichia coli. The expression component is a 1035-sample, high-quality RNA-seq compendium consisting of data generated in our lab using a single experimental protocol. The compendium contains diverse growth conditions, including: 9 media; 39 supplements, including antibiotics; 42 heterologous proteins; and 76 gene knockouts. Using this resource, we elucidated global expression patterns. We used machine learning to extract 201 modules that account for 86% of known regulatory interactions, creating the regulatory component. With these modules, we identified two novel regulons and quantified systems-level regulatory responses. We also integrated 1675 curated, publicly-available transcriptomes into the resource. We demonstrated workflows for analyzing new data against this knowledge base via deconstruction of regulation during aerobic transition. This resource illuminates the E. coli transcriptome at scale and provides a blueprint for top-down transcriptomic analysis of non-model organisms.


Subject(s)
Escherichia coli , Knowledge Bases , Escherichia coli/genetics , Escherichia coli/metabolism , Escherichia coli Proteins/genetics , Escherichia coli Proteins/metabolism , Gene Expression Profiling , Gene Expression Regulation, Bacterial , Transcriptome
4.
mSystems ; 8(5): e0043723, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37638727

ABSTRACT

IMPORTANCE: Pseudomonas syringae pv. tomato DC3000 is a model plant pathogen that infects tomatoes and Arabidopsis thaliana. The current understanding of global transcriptional regulation in the pathogen is limited. Here, we applied iModulon analysis to a compendium of RNA-seq data to unravel its transcriptional regulatory network. We characterize each co-regulated gene set, revealing the activity of major regulators across diverse conditions. We provide new insights on the transcriptional dynamics in interactions with the plant immune system and with other bacterial species, such as AlgU-dependent regulation of flagellar genes during plant infection and downregulation of siderophore production in the presence of a siderophore cheater. This study demonstrates the novel application of iModulons in studying temporal dynamics during host-pathogen and microbe-microbe interactions, and reveals specific insights of interest.


Subject(s)
Arabidopsis , Microbiota , Pseudomonas syringae/genetics , Bacterial Proteins/genetics , Transcriptome/genetics , Arabidopsis/genetics , Machine Learning , Siderophores
5.
mSystems ; 8(3): e0024723, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37278526

ABSTRACT

Streptococcus pyogenes can cause a wide variety of acute infections throughout the body of its human host. An underlying transcriptional regulatory network (TRN) is responsible for altering the physiological state of the bacterium to adapt to each unique host environment. Consequently, an in-depth understanding of the comprehensive dynamics of the S. pyogenes TRN could inform new therapeutic strategies. Here, we compiled 116 existing high-quality RNA sequencing data sets of invasive S. pyogenes serotype M1 and estimated the TRN structure in a top-down fashion by performing independent component analysis (ICA). The algorithm computed 42 independently modulated sets of genes (iModulons). Four iModulons contained the nga-ifs-slo virulence-related operon, which allowed us to identify carbon sources that control its expression. In particular, dextrin utilization upregulated the nga-ifs-slo operon by activation of two-component regulatory system CovRS-related iModulons, altering bacterial hemolytic activity compared to glucose or maltose utilization. Finally, we show that the iModulon-based TRN structure can be used to simplify the interpretation of noisy bacterial transcriptome data at the infection site. IMPORTANCE: S. pyogenes is a pre-eminent human bacterial pathogen that causes a wide variety of acute infections throughout the body of its host. Understanding the comprehensive dynamics of its TRN could inform new therapeutic strategies. Since at least 43 S. pyogenes transcriptional regulators are known, it is often difficult to interpret transcriptomic data from regulon annotations. This study shows that the novel ICA-based framework for elucidating the underlying regulatory structure of S. pyogenes allows us to interpret the transcriptome profile using data-driven regulons (iModulons). Additionally, the observations of the iModulon architecture lead us to identify the multiple regulatory inputs governing the expression of a virulence-related operon. The iModulons identified in this study serve as a powerful guidepost to further our understanding of S. pyogenes TRN structure and dynamics.


Subject(s)
Streptococcus pyogenes , Toxins, Biological , Humans , Streptococcus pyogenes/genetics , Bacterial Proteins/genetics , Virulence/genetics , Toxins, Biological/metabolism , Transcriptome
6.
BMC Bioinformatics ; 22(1): 584, 2021 Dec 08.
Article in English | MEDLINE | ID: mdl-34879815

ABSTRACT

BACKGROUND: Independent component analysis is an unsupervised machine learning algorithm that separates a set of mixed signals into a set of statistically independent source signals. Applied to high-quality gene expression datasets, independent component analysis effectively reveals both the source signals of the transcriptome as co-regulated gene sets, and the activity levels of the underlying regulators across diverse experimental conditions. Two major variables that affect the final gene sets are the diversity of the expression profiles contained in the underlying data, and the user-defined number of independent components, or dimensionality, to compute. Availability of high-quality transcriptomic datasets has grown exponentially as high-throughput technologies have advanced; however, optimal dimensionality selection remains an open question. METHODS: We computed independent components across a range of dimensionalities for four gene expression datasets with varying dimensions (both in terms of number of genes and number of samples). We computed the correlation between independent components across different dimensionalities to understand how the overall structure evolves as the number of user-defined components increases. We then measured how well the resulting gene clusters reflected known regulatory mechanisms, and developed a set of metrics to assess the accuracy of the decomposition at a given dimension. RESULTS: We found that over-decomposition results in many independent components dominated by a single gene, whereas under-decomposition results in independent components that poorly capture the known regulatory structure. From these results, we developed a new method, called OptICA, for finding the optimal dimensionality that controls for both over- and under-decomposition. Specifically, OptICA selects the highest dimension that produces a low number of components that are dominated by a single gene. We show that OptICA outperforms two previously proposed methods for selecting the number of independent components across four transcriptomic databases of varying sizes. CONCLUSIONS: OptICA avoids both over-decomposition and under-decomposition of transcriptomic datasets resulting in the best representation of the organism's underlying transcriptional regulatory network.


Subject(s)
Gene Regulatory Networks , Transcriptome , Algorithms , Databases, Factual , Gene Expression Profiling
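
The selection rule stated in the abstract above (choose the highest dimension that still yields few single-gene components) might be sketched as follows. This is not the published OptICA implementation: the dominance test, the `dominance` and `max_single_gene` thresholds, and the toy component weights are all assumptions made for illustration, and the components are supplied directly rather than computed by ICA.

```python
def is_single_gene(weights, dominance=0.5):
    """Assumed criterion: one gene carries more than `dominance` of the squared weight."""
    total = sum(w * w for w in weights)
    if total == 0:
        return False
    return max(w * w for w in weights) / total > dominance


def select_dimension(components_by_dim, max_single_gene=1, dominance=0.5):
    """Highest dimensionality whose count of single-gene components stays within the limit."""
    acceptable = [
        dim
        for dim, components in components_by_dim.items()
        if sum(is_single_gene(c, dominance) for c in components) <= max_single_gene
    ]
    return max(acceptable) if acceptable else None


# Toy gene-weight vectors (hypothetical), keyed by the number of components requested.
components_by_dim = {
    2: [[0.5, 0.5, 0.5, 0.5], [0.6, -0.6, 0.3, 0.3]],
    3: [[0.5, 0.5, 0.5, 0.5], [0.6, -0.6, 0.3, 0.3], [0.9, 0.1, 0.1, 0.1]],
    4: [[0.5, 0.5, 0.5, 0.5], [0.9, 0.1, 0.1, 0.1], [0.1, 0.95, 0.1, 0.0], [0.0, 0.1, 0.1, 0.9]],
}

chosen = select_dimension(components_by_dim)
print(chosen)
```

With these toy weights, dimension 4 produces three single-gene components and is rejected, so the sketch falls back to dimension 3.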
7.
Nucleic Acids Res ; 48(18): 10157-10163, 2020 Oct 09.
Article in English | MEDLINE | ID: mdl-32976587

ABSTRACT

A genome contains the information underlying an organism's form and function. Yet, we lack a formal framework to represent and study this information. Here, we introduce the Bitome, a matrix composed of binary digits (bits) representing the genomic positions of genomic features. We form a Bitome for the genome of Escherichia coli K-12 MG1655. We find that: (i) genomic features are encoded unevenly, both spatially and categorically; (ii) coding and intergenic features are recapitulated at high resolution; (iii) adaptive mutations are skewed towards genomic positions with fewer features; and (iv) the Bitome enhances prediction of adaptively mutated and essential genes. The Bitome is a formal representation of a genome and may be used to study its fundamental organizational properties.


Subject(s)
Escherichia coli K12/genetics , Genome, Bacterial , Genomics
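
The idea of a binary feature-by-position matrix described in the abstract above can be shown with a small sketch. It assumes one row per feature type and half-open, 0-based intervals; the feature names, coordinates, and helper functions are hypothetical and do not reproduce the authors' encoding.

```python
def build_bitome(genome_length, features):
    """Build a {feature_name: [0/1 per position]} matrix from (name, start, end) intervals.

    Assumption: intervals are 0-based and half-open, i.e. they cover start..end-1.
    """
    matrix = {}
    for name, start, end in features:
        if not (0 <= start < end <= genome_length):
            raise ValueError(f"interval out of range for feature {name!r}")
        row = matrix.setdefault(name, [0] * genome_length)
        for pos in range(start, end):
            row[pos] = 1
    return matrix


def feature_density(matrix, genome_length):
    """Number of features annotated at each genomic position (column sums)."""
    return [sum(row[pos] for row in matrix.values()) for pos in range(genome_length)]


# Toy annotations (hypothetical) on a 10-bp genome.
toy_features = [("gene", 2, 8), ("promoter", 0, 2), ("tf_binding_site", 1, 3)]

bitome = build_bitome(10, toy_features)
density = feature_density(bitome, 10)
print(density)
```

Column sums of this kind give a per-position feature count, which is the sort of quantity needed to ask whether mutations fall at positions with fewer annotated features.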