Results 1 - 20 of 31
1.
BMC Genomics ; 22(1): 644, 2021 Sep 06.
Article in English | MEDLINE | ID: mdl-34488632

ABSTRACT

BACKGROUND: Inversion Symmetry is a generalization of the second Chargaff rule, stating that the count of a string of k nucleotides on a single chromosomal strand equals the count of its inverse (reverse-complement) k-mer. It holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2 Mbp to 200 Mbp. Building on this formalism we introduce the concept of k-mer distances between chromosomes. We formulate two k-mer distance measures, D1 and D2, which depend on k. D1 takes into account all k-mers (for a single k) appearing on single strands of the two compared chromosomes, whereas D2 takes into account both strands of each chromosome. Both measures reflect dissimilarities in global chromosomal structures. RESULTS: After defining the various distance measures and summarizing their properties, we also define proximities that rely on the existence of synteny blocks between chromosomes of different bacterial strains. Comparing pairs of strains of bacteria, we find negative correlations between synteny proximities and k-mer distances, thus establishing the meaning of the latter as measures of evolutionary distances among bacterial strains. The synteny measures we use are appropriate for closely related bacterial strains, where considerable sections of chromosomes demonstrate high direct or reversed equality. These measures are not appropriate for comparing different bacteria or eukaryotes. K-mer structural distances can be defined for all species. Because of the arbitrariness of strand choices, we employ only the D2 measure when comparing chromosomes of different species. The results for comparisons of various eukaryotes display interesting behavior which is partially consistent with conventional understanding of evolutionary genomics. In particular, we define ratios of minimal k-mer distances (KDR) between unmasked and masked chromosomes of two species, which correlate with both short and long evolutionary scales. CONCLUSIONS: k-mer distances reflect dissimilarities among global chromosomal structures. They carry information which aggregates all mutations. As such they can complement traditional evolution studies, which mainly concentrate on coding regions.
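The single-strand versus double-strand distinction described in this abstract can be sketched in a few lines. The abstract does not give the formulas for D1 and D2, so the L1 distance between normalized k-mer frequency vectors used here is an assumption for illustration only, and the function names are hypothetical.

```python
from collections import Counter

# Translation table for complementing nucleotides (assumes upper-case A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_freqs(seq: str, k: int, both_strands: bool = False) -> dict:
    """Normalized k-mer frequencies on one strand, or pooled over both strands."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if both_strands:
        rc = reverse_complement(seq)
        counts.update(rc[i:i + k] for i in range(len(rc) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def l1_distance(p: dict, q: dict) -> float:
    """L1 distance between two sparse frequency vectors (an assumed stand-in metric)."""
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))


def single_strand_distance(a: str, b: str, k: int) -> float:
    """Hypothetical D1-like distance: one strand of each sequence."""
    return l1_distance(kmer_freqs(a, k), kmer_freqs(b, k))


def double_strand_distance(a: str, b: str, k: int) -> float:
    """Hypothetical D2-like distance: both strands of each sequence."""
    return l1_distance(kmer_freqs(a, k, True), kmer_freqs(b, k, True))
```

Under this stand-in metric, a sequence and its reverse complement are maximally distant on single strands when they share no k-mers, yet have zero double-strand distance, which is why a strand-independent measure is the natural choice when strand labels are arbitrary.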


Subject(s)
Chromosomes , Genomics , Chromosome Inversion , Chromosomes/genetics , Eukaryota , Evolution, Molecular , Humans , Synteny
2.
Sci Rep ; 9(1): 12734, 2019 09 04.
Article in English | MEDLINE | ID: mdl-31484964

ABSTRACT

Genome conformation capture techniques permit a systematic investigation into the functional spatial organization of genomes, including functional aspects such as the co-localization of sets of genomic elements, for example the co-localization of genes targeted by a transcription factor (TF) within a transcription factory. We quantify spatial co-localization using a rigorous statistical model that measures the enrichment of a subset of elements in neighbourhoods inferred from Hi-C data. We also control for co-localization that can be attributed to genomic order. We systematically apply our open-sourced framework, spatial-mHG, to search for spatial co-localization phenomena in multiple unicellular Hi-C datasets with corresponding genomic annotations. Our biological findings shed new light on the functional spatial organization of genomes, including: In C. crescentus, DNA replication genes reside in two genomic clusters that are spatially co-localized. Furthermore, these clusters contain similar gene copies and lie in genomic vicinity to the ori and ter sequences. In S. cerevisiae, Ty5 retrotransposon family elements spatially co-localize at a spatially adjacent subset of telomeres. In N. crassa, both Proteasome lid subcomplex genes and protein refolding genes jointly spatially co-localize at a shared location. An implementation of our algorithms is available online.


Subject(s)
Bacillus subtilis/genetics , Caulobacter crescentus/genetics , Genome, Bacterial , Genome, Fungal , Neurospora crassa/genetics , Saccharomyces cerevisiae/genetics , Schizosaccharomyces/genetics , Genomics , Models, Genetic
3.
Bioinformatics ; 34(17): i638-i646, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423078

ABSTRACT

Motivation: The complexes formed by binding of proteins to RNAs play key roles in many biological processes, such as splicing, gene expression regulation, translation and viral replication. Understanding protein-RNA binding may thus provide important insights into the functionality and dynamics of many cellular processes. This has sparked substantial interest in exploring protein-RNA binding experimentally, and predicting it computationally. The key computational challenge is to efficiently and accurately infer protein-RNA binding models that will enable prediction of novel protein-RNA interactions to additional transcripts of interest. Results: We developed DLPRB (Deep Learning for Protein-RNA Binding), a new deep neural network (DNN) approach for learning intrinsic protein-RNA binding preferences and predicting novel interactions. We present two different network architectures: a convolutional neural network (CNN), and a recurrent neural network (RNN). The novelty of our network hinges upon two key aspects: (i) the joint analysis of both RNA sequence and structure, which is represented as a probability vector of different RNA structural contexts; (ii) novel features in the architecture of the networks, such as the application of RNNs to RNA-binding prediction, and the combination of hundreds of variable-length filters in the CNN. Our results in inferring accurate RNA-binding models from high-throughput in vitro data exhibit substantial improvements, compared to all previous approaches for protein-RNA binding prediction (both DNN and non-DNN based). A more modest, yet statistically significant, improvement is achieved for in vivo binding prediction. When experimentally measured RNA structure is incorporated instead of predicted structure, the improvement on in vivo data increases. By visualizing the binding specificities, we can gain biological insights underlying the mechanism of protein-RNA binding.
Availability and implementation: The source code is publicly available at https://github.com/ilanbb/dlprb. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Neural Networks, Computer , RNA-Binding Proteins/metabolism , RNA/metabolism , Software
4.
J Theor Biol ; 420: 318-323, 2017 05 07.
Article in English | MEDLINE | ID: mdl-28263816

ABSTRACT

Ancestral maximum likelihood (AML) is a phylogenetic tree reconstruction criterion that "lies between" maximum parsimony (MP) and maximum likelihood (ML). ML has long been known to be statistically consistent. On the other hand, Felsenstein (1978) showed that MP is statistically inconsistent, and even positively misleading: There are cases where the parsimony criterion, applied to data generated according to one tree topology, will be optimized on a different tree topology. The question of whether AML is statistically consistent or not has been open for a long time. Mossel et al. (2009) have shown that AML can "shrink" short tree edges, resulting in a star tree with no internal resolution, which yields a better AML score than the original (resolved) model. This result implies that AML is statistically inconsistent, but not that it is positively misleading, because the star tree is compatible with any other topology. We show that AML is confusingly misleading: For some simple, four taxa (resolved) tree, the ancestral likelihood optimization criterion is maximized on an incorrect (resolved) tree topology, as well as on a star tree (both with specific edge lengths), while the tree with the original, correct topology, has strictly lower ancestral likelihood. Interestingly, the two short edges in the incorrect, resolved tree topology are of length zero, and are not adjacent, so this resolved tree is in fact a simple path. While for MP, the underlying phenomenon can be described as long edge attraction, it turns out that here we have long edge repulsion.


Subject(s)
Biological Evolution , Biometry/methods , Models, Genetic , Phylogeny , Computer Simulation , Likelihood Functions
5.
Bioinformatics ; 32(17): i559-i566, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27587675

ABSTRACT

MOTIVATION: Complex interactions among alleles often drive differences in inherited properties including disease predisposition. Isolating the effects of these interactions requires phasing information that is difficult to measure or infer. Furthermore, prevalent sequencing technologies used in the essential first step of determining a haplotype limit the range of that step to the span of reads, namely hundreds of bases. With the advent of pseudo-long read technologies, observable partial haplotypes can span several orders of magnitude more. Yet, measuring whole-genome-single-individual haplotypes remains a challenge. A different view of whole-genome measurement addresses the 3D structure of the genome, with great development of Hi-C techniques in recent years. A shortcoming of current Hi-C, however, is the difficulty in inferring information that is specific to each of a pair of homologous chromosomes. RESULTS: In this work, we develop a robust algorithmic framework that takes two measurement-derived datasets, raw Hi-C and partial short-range haplotypes, and constructs the full-genome haplotype as well as phased diploid Hi-C maps. By analyzing both data sets together we thus bridge important gaps in both technologies: from short to long haplotypes and from unphased to phased Hi-C. We demonstrate that our method can recover ground truth haplotypes with high accuracy, using measured biological data as well as simulated data. We analyze the impact of noise, Hi-C sequencing depth and measured haplotype lengths on performance. Finally, we use the inferred 3D structure of a human genome to point at the nuclear co-localization of transcription factor targets. AVAILABILITY AND IMPLEMENTATION: The implementation is available at https://github.com/YakhiniGroup/SpectraPh CONTACT: zohar.yakhini@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Chromosomes , Genome, Human , Haplotypes , Molecular Conformation , Algorithms , Genetic Variation , Genome-Wide Association Study , Humans
6.
BMC Genomics ; 17: 696, 2016 08 31.
Article in English | MEDLINE | ID: mdl-27580854

ABSTRACT

BACKGROUND: The generalization of the second Chargaff rule states that counts of any string of nucleotides of length k on a single chromosomal strand equal the counts of its inverse (reverse-complement) k-mer. This Inversion Symmetry (IS) holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2 Mbp to 200 Mbp. The existence of IS has been demonstrated in the literature, and other pair-wise candidate symmetries (e.g. reverse or complement) have been ruled out. RESULTS: Studying IS in the human genome, we find that IS holds up to k = 10. It holds for complete chromosomes, also after applying the low complexity mask. We introduce a numerical IS criterion, and define the k-limit, KL, as the highest k for which this criterion is valid. We demonstrate that chromosomes of different species, as well as different human chromosomal sections, follow a universal logarithmic dependence of KL ~ 0.7 ln(L), where L is the length of the chromosome. We introduce a statistical IS-Poisson model that allows us to apply confidence measures to our numerical findings. We find good agreement for large k, where the variance of the Poisson distribution determines the outcome of the analysis. This model predicts the observed logarithmic increase of KL with length. The model allows us to conclude that for low k, e.g. k = 1 where IS becomes the 2nd Chargaff rule, IS violation, although extremely small, is significant. Studying this violation we come up with an unexpected observation for human chromosomes, finding a meaningful correlation with the excess of genes on particular strands. CONCLUSIONS: Our IS-Poisson model agrees well with genomic data, and accounts for the universal behavior of k-limits. For low k we point out minute, yet significant, deviations from the model, including excess of counts of nucleotides T vs A and G vs C on positive strands of human chromosomes. Interestingly, this correlates with a significant (but small) excess of genes on the same positive strands.
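A rough sketch of how Inversion Symmetry could be checked numerically is given below. The abstract does not state its numerical IS criterion, so the normalized asymmetry score here is a hypothetical stand-in. The second function assumes that the quoted relation KL ~ 0.7 ln(L) holds up to an additive constant, in which case only the difference in KL between two chromosome lengths is predicted.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (upper-case A/C/G/T assumed)."""
    return seq.translate(COMPLEMENT)[::-1]


def inversion_symmetry_gap(seq: str, k: int) -> float:
    """Hypothetical asymmetry score in [0, 1]; 0 means perfect inversion symmetry.

    Sums |count(w) - count(rc(w))| over every k-mer w seen on the strand
    (or whose reverse complement was seen) and normalizes by twice the
    total number of k-mers.
    """
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    words = set(counts) | {reverse_complement(w) for w in counts}
    diff = sum(abs(counts[w] - counts[reverse_complement(w)]) for w in words)
    return diff / (2 * total)


def predicted_kl_increase(length_a: float, length_b: float, slope: float = 0.7) -> float:
    """Predicted change in the k-limit between two lengths, assuming KL = slope * ln(L) + const."""
    return slope * math.log(length_b / length_a)
```

With the lengths quoted in the abstract, `predicted_kl_increase(2e6, 2e8)` evaluates to 0.7 ln(100), about 3.2, which is in line with the stated range of k growing from 7 to 10 over that span.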


Subject(s)
Chromosomes, Human/genetics , DNA/genetics , Humans , Models, Genetic , Poisson Distribution
7.
J Proteome Res ; 15(8): 2871-80, 2016 08 05.
Article in English | MEDLINE | ID: mdl-27354160

ABSTRACT

Modeling and simulation of biological networks is an effective and widely used research methodology. The Biological Network Simulator (BioNSi) is a tool for modeling biological networks and simulating their discrete-time dynamics, implemented as a Cytoscape App. BioNSi includes a visual representation of the network that enables researchers to construct the network, set its parameters, and observe its behavior under various conditions. To construct a network instance in BioNSi, only partial, qualitative biological data are needed. The tool is intended for use by experimental biologists and requires no prior computational or mathematical expertise. BioNSi is freely available at http://bionsi.wix.com/bionsi , where a complete user guide and a step-by-step manual can also be found.


Subject(s)
Models, Biological , Software , Computer Simulation , Internet
8.
J Comput Biol ; 23(6): 461-71, 2016 06.
Article in English | MEDLINE | ID: mdl-27058690

ABSTRACT

Clustered regularly interspaced short palindromic repeats (CRISPR) are structured regions in bacterial and archaeal genomes, which are part of an adaptive immune system against phages. CRISPRs are important for many microbial studies and are playing an essential role in current gene editing techniques. As such, they attract substantial research interest. The exponential growth in the amount of bacterial sequence data in recent years enables the exploration of CRISPR loci in more and more species. Most of the automated tools that detect CRISPR loci rely on fully assembled genomes. However, many assemblers do not handle repetitive regions successfully. The first tool to work directly on raw sequence data is Crass, which requires reads that are long enough to contain two copies of the same repeat. We present a method to identify CRISPR repeats from raw sequence data of short reads. The algorithm is based on an observation differentiating CRISPR repeats from other types of repeats, and it involves a series of partial constructions of the overlap graph. This enables us to avoid many of the difficulties that assemblers face, as we merely aim to identify the repeats that belong to CRISPR loci. A preliminary implementation of the algorithm shows good results and detects CRISPR repeats in cases where other existing tools fail to do so.


Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats , Computational Biology/methods , Sequence Analysis, DNA/methods , Algorithms , Archaea/genetics , Bacteria/genetics
9.
J Theor Biol ; 374: 54-9, 2015 Jun 07.
Article in English | MEDLINE | ID: mdl-25843219

ABSTRACT

The evolution of aligned DNA sequence sites is generally modeled by a Markov process operating along the edges of a phylogenetic tree. It is well known that the probability distribution on the site patterns at the tips of the tree determines the tree topology and its branch lengths. However, the number of patterns is typically much larger than the number of edges, suggesting considerable redundancy in the branch length estimation. In this paper we ask whether the probabilities of just the 'edge-specific' patterns (the ones that correspond to a change of state on a single edge) suffice to recover the branch lengths of the tree, under a symmetric 2-state Markov process. We first show that this holds provided the branch lengths are sufficiently short, by applying the inverse function theorem. We then consider whether this restriction to short branch lengths is necessary. We show that for trees with up to four leaves it can be lifted. This leaves open the interesting question of whether this holds in general. Our results also extend to certain Markov processes on more than two states, such as the Jukes-Cantor model.


Subject(s)
Evolution, Molecular , Models, Biological , Phylogeny , Algorithms , Markov Chains , Probability , Software
10.
PLoS Comput Biol ; 10(11): e1003897, 2014 Nov.
Article in English | MEDLINE | ID: mdl-25411839

ABSTRACT

We join the increasing call to take computational education of life science students a step further, beyond teaching mere programming and employing existing software tools. We describe a new course, focusing on enriching the curriculum of life science students with abstract, algorithmic, and logical thinking, and exposing them to the computational "culture." The design, structure, and content of our course are influenced by recent efforts in this area, collaborations with life scientists, and our own instructional experience. Specifically, we suggest that an effective course of this nature should: (1) devote time to explicitly reflect upon computational thinking processes, resisting the temptation to drift to purely practical instruction, (2) focus on discrete notions, rather than on continuous ones, and (3) have basic programming as a prerequisite, so students need not be preoccupied with elementary programming issues. We strongly recommend that the mere use of existing bioinformatics tools and packages should not replace hands-on programming. Yet, we suggest that programming will mostly serve as a means to practice computational thinking processes. This paper deals with the challenges and considerations of such computational education for life science students. It also describes a concrete implementation of the course and encourages its use by others.


Subject(s)
Biological Science Disciplines/education , Computational Biology/education , Information Science/education , Algorithms , Humans , Software
11.
Bioinformatics ; 30(24): 3515-23, 2014 Dec 15.
Article in English | MEDLINE | ID: mdl-25183486

ABSTRACT

MOTIVATION: New sequencing technologies generate larger amounts of short-read data at decreasing cost. De novo sequence assembly is the problem of combining these reads back to the original genome sequence, without relying on a reference genome. This presents algorithmic and computational challenges, especially for long and repetitive genome sequences. Most existing approaches to the assembly problem operate in the framework of de Bruijn graphs. Yet, a number of recent works use the string graph paradigm, using a variety of methods for storing and processing suffixes and prefixes, like suffix arrays, the Burrows-Wheeler transform or the FM index. Our work is motivated by a search for new approaches to constructing the string graph, using alternative yet simple data structures and algorithmic concepts. RESULTS: We introduce a novel hash-based method for constructing the string graph. We use incremental hashing, and specifically a modification of the Karp-Rabin fingerprint, and Bloom filters. Using these probabilistic methods might create false-positive and false-negative edges during the algorithm's execution, but these are all detected and corrected. The advantages of the proposed approach over existing methods are its simplicity and the incorporation of established probabilistic techniques in the context of de novo genome sequencing. Our preliminary implementation is favorably comparable with the first string graph construction of Simpson and Durbin (2010) (but not with subsequent improvements). Further research and optimizations will hopefully enable the algorithm to be incorporated, with noticeable performance improvement, in state-of-the-art string graph-based assemblers.
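To illustrate the incremental-hashing idea, the sketch below uses a plain Karp-Rabin style polynomial hash to find the longest suffix-prefix overlap between two reads, which is the basic relation behind string graph edges. It is not the authors' modified fingerprint and it omits Bloom filters; the base, modulus, and function names are assumptions chosen for this example. Each hash match is verified by direct string comparison, so false positives cannot survive.

```python
BASE = 4                      # one digit per nucleotide
MOD = (1 << 61) - 1           # large Mersenne prime modulus (an arbitrary choice)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def prefix_hashes(read: str) -> list:
    """h[i] is the hash of read[:i], built incrementally from left to right."""
    h = [0]
    for ch in read:
        h.append((h[-1] * BASE + CODE[ch]) % MOD)
    return h


def suffix_hashes(read: str) -> list:
    """s[i] is the hash of the last i characters of read, built from right to left."""
    s = [0]
    power = 1  # BASE ** (current suffix length - 1), modulo MOD
    for ch in reversed(read):
        s.append((CODE[ch] * power + s[-1]) % MOD)
        power = power * BASE % MOD
    return s


def longest_overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len).

    min_len is assumed to be at least 1.
    """
    sa, pb = suffix_hashes(a), prefix_hashes(b)
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        # Compare fingerprints first, then verify to rule out hash collisions.
        if sa[length] == pb[length] and a[-length:] == b[:length]:
            return length
    return 0
```

For example, `longest_overlap("ACGTAC", "TACGGA")` finds the shared `TAC` and returns 3. Only equal-length strings are compared, so the fact that leading `A` characters hash to zero does not cause mismatched lengths to collide.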


Subject(s)
Algorithms , Sequence Analysis, DNA/methods , Genomics/methods
12.
PLoS One ; 9(3): e90282, 2014.
Article in English | MEDLINE | ID: mdl-24594619

ABSTRACT

The availability of many complete, annotated proteomes enables the systematic study of the relationships between protein conservation and functionality. We explore this question based solely on the presence or absence of protein homologues (a.k.a. conservation profiles). We study 18 metazoans, from two distinct points of view: the human's and the fly's. Using the GOrilla gene ontology (GO) analysis tool, we explore functional enrichment of the "universal proteins", those with homologues in all 17 other species, and of the "non-universal proteins". A large number of GO terms are strongly enriched in both human and fly universal proteins. Most of these functions are known to be essential. A smaller number of GO terms, exhibiting markedly different properties, are enriched in both human and fly non-universal proteins. We further explore the non-universal proteins, whose conservation profiles are consistent with the "tree of life" (TOL consistent), as well as the TOL inconsistent proteins. Finally, we applied Quantum Clustering to the conservation profiles of the TOL consistent proteins. Each cluster is strongly associated with one or a small number of specific monophyletic clades in the tree of life. The proteins in many of these clusters exhibit strong functional enrichment associated with the "life style" of the related clades. Most previous approaches for studying function and conservation are "bottom up", studying protein families one by one, and separately assessing the conservation of each. By way of contrast, our approach is "top down". We globally partition the set of all proteins hierarchically, as described above, and then identify protein families enriched within different subdivisions. While supporting previous findings, our approach also provides a tool for discovering novel relations between protein conservation profiles, functionality, and evolutionary history as represented by the tree of life.


Subject(s)
Conserved Sequence , Proteins/genetics , Animals , Cluster Analysis , Gene Ontology , Genes, Essential , Gorilla gorilla , Humans , Mice , Phylogeny
13.
BMC Res Notes ; 6: 311, 2013 Aug 06.
Article in English | MEDLINE | ID: mdl-23915717

ABSTRACT

BACKGROUND: Bench biologists often do not take part in the development of computational models for their systems, and therefore, they frequently employ them as "black-boxes". Our aim was to construct and test a model that does not depend on the availability of quantitative data, and can be directly used without a need for intensive computational background. RESULTS: We present a discrete transition model. We used cell-cycle in budding yeast as a paradigm for a complex network, demonstrating phenomena such as sequential protein expression and activity, and cell-cycle oscillation. The structure of the network was validated by its response to computational perturbations such as mutations, and its response to mating-pheromone or nitrogen depletion. The model has a strong predictive capability, demonstrating how the activity of a specific transcription factor, Hcm1, is regulated, and what determines commitment of cells to enter and complete the cell-cycle. CONCLUSION: The model presented herein is intuitive, yet is expressive enough to elucidate the intrinsic structure and qualitative behavior of large and complex regulatory networks. Moreover, our model allowed us to examine multiple hypotheses in a simple and intuitive manner, giving rise to testable predictions. This methodology can be easily integrated as a useful approach for the study of networks, enriching experimental biology with computational insights.


Subject(s)
Cell Cycle , Models, Biological , Saccharomyces cerevisiae/cytology
14.
J Comput Biol ; 20(2): 63, 2013 Feb.
Article in English | MEDLINE | ID: mdl-23383993
15.
J Comput Biol ; 19(8): 945-56, 2012 Aug.
Article in English | MEDLINE | ID: mdl-22876786

ABSTRACT

Whole genome sequences are a rich source of molecular data, with a potential for the discovery of novel evolutionary information. Yet, many parts of these sequences are not known to be under evolutionary pressure and, thus, are not conserved. Furthermore, a good model for whole genome evolution does not exist. Consequently, it is not a priori clear if a meaningful phylogenetic signal exists and can be extracted from the sequences as a whole. Indeed, very few phylogenies were reconstructed based on these sequences. Prior to this work, only two reconstruction methods were applied to large eukaryotic genomes: the K(r) method (Haubold et al., 2009), which was applied to genomes of rather small diversity (Drosophila species), and the feature frequency profile method (Sims et al., 2009a), which was applied to genomes of moderate diversity (mammals). We investigate the whole genome-based phylogenetic reconstruction question with respect to a much wider taxonomic sample. We apply K(r), FFP, and an alternative alignment-free method, the average common subsequence (ACS) (Ulitsky et al., 2006), to 24 multicellular eukaryotes (vertebrates, invertebrates, and plants). We also apply ACS to the proteome sequences of these 24 taxa. We compare the resulting trees to a standard reference, the National Center for Biotechnology Information (NCBI) taxonomy tree. Trees produced by ACS(AA), based on proteomes, are in complete agreement with the NCBI tree. For the genome-based reconstruction, ACS(DNA) produces trees whose agreement with the NCBI tree is excellent to very good for divergence times up to 800 million years ago, medium at 1 billion years ago, and poor at 1.6 billion years ago. We conclude that whole genomes do carry a clear phylogenetic signal, yet this signal "saturates" with longer divergence times. Furthermore, from the few existing methods, ACS is best capable of detecting this signal.


Subject(s)
Computer Simulation , Eukaryota/genetics , Genome , Models, Genetic , Phylogeny , Algorithms , Animals , Genetic Speciation , Humans , Proteome/genetics , Reference Standards , Sequence Homology, Nucleic Acid
16.
Article in English | MEDLINE | ID: mdl-20431156

ABSTRACT

We study simple geometric properties of gene expression data sets, where samples are taken from two distinct classes (e.g., two types of cancer). Specifically, the problem of linear separability for pairs of genes is investigated. If a pair of genes exhibits linear separation with respect to the two classes, then the joint expression level of the two genes is strongly correlated with the phenomenon of the sample being taken from one class or the other. This may indicate an underlying molecular mechanism relating the two genes and the phenomenon (e.g., a specific cancer). We developed and implemented novel efficient algorithmic tools for finding all pairs of genes that induce a linear separation of the two sample classes. These tools are based on computational geometric properties and were applied to 10 publicly available cancer data sets. For each data set, we computed the number of actual separating pairs and compared it to an upper bound on the number expected by chance and to the numbers obtained empirically by randomly shuffling the data labels. Seven out of these 10 data sets are highly separable. Statistically, this phenomenon is highly significant, very unlikely to occur at random. It is therefore reasonable to expect that it manifests a functional association between separating genes and the underlying phenotypic classes.


Subject(s)
Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Linear Models , Oligonucleotide Array Sequence Analysis/methods , Algorithms , High-Throughput Screening Assays , Humans , Neoplasms/genetics
17.
Article in English | MEDLINE | ID: mdl-20150680

ABSTRACT

We explore the maximum parsimony (MP) and ancestral maximum likelihood (AML) criteria in phylogenetic tree reconstruction. Both problems are NP-hard, so we seek approximate solutions. We formulate the two problems as Steiner tree problems under appropriate distances. The gist of our approach is the succinct characterization of Steiner trees for a small number of leaves for the two distances. This enables the use of known Steiner tree approximation algorithms. The approach leads to a 16/9 approximation ratio for AML and asymptotically to a 1.55 approximation ratio for MP.


Subject(s)
Algorithms , DNA Mutational Analysis/methods , Evolution, Molecular , Models, Genetic , Sequence Analysis, DNA/methods , Base Sequence , Computer Simulation , Data Interpretation, Statistical , Likelihood Functions , Models, Statistical , Molecular Sequence Data
18.
Genome Biol ; 10(10): R108, 2009.
Article in English | MEDLINE | ID: mdl-19814784

ABSTRACT

BACKGROUND: The empirical frequencies of DNA k-mers in whole genome sequences provide an interesting perspective on genomic complexity, and the availability of large segments of genomic sequence from many organisms means that analysis of k-mers with non-trivial lengths is now possible. RESULTS: We have studied the k-mer spectra of more than 100 species from Archaea, Bacteria, and Eukaryota, particularly looking at the modalities of the distributions. As expected, most species have a unimodal k-mer spectrum. However, a few species, including all mammals, have multimodal spectra. These species coincide with the tetrapods. Genomic sequences are clearly very complex, and cannot be fully explained by any simple probabilistic model. Yet we sought such an explanation for the observed modalities, and discovered that low-order Markov models capture this property (and some others) fairly well. CONCLUSIONS: Multimodal spectra are characterized by specific ranges of values of C+G content and of CpG dinucleotide suppression, a range that encompasses all tetrapods analyzed. Other genomes, like that of the protozoan Entamoeba histolytica, which also exhibits CpG suppression, do not have multimodal k-mer spectra. Groupings of functional elements of the human genome also have a clear modality, and exhibit either a unimodal or multimodal behaviour, depending on the two above-mentioned values.
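As a minimal sketch, a k-mer spectrum can be computed as the histogram of k-mer multiplicities: for each occurrence count n, the number of distinct k-mers that appear exactly n times. This reading of "spectrum" is an assumption based on common usage, and the function name is hypothetical; the abstract itself does not spell out the definition.

```python
from collections import Counter


def kmer_spectrum(seq: str, k: int) -> Counter:
    """Map each multiplicity n to the number of distinct k-mers occurring n times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(counts.values())
```

In this representation, a unimodal or multimodal spectrum corresponds to one or several peaks when the histogram is plotted against multiplicity.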


Subject(s)
DNA/genetics , Genome/genetics , Models, Genetic , Animals , Base Composition/genetics , Chickens/genetics , Computer Simulation , CpG Islands/genetics , Humans , Markov Chains , Zebrafish/genetics
19.
BMC Syst Biol ; 3: 86, 2009 Sep 03.
Article in English | MEDLINE | ID: mdl-19728874

ABSTRACT

BACKGROUND: Analysis of gene expression data from microarray experiments has become a central tool for identifying co-regulated, functional gene modules. A crucial aspect of such analysis is the integration of data from different experiments and different laboratories. How to weigh the contribution of different experiments is an important point influencing the final outcomes. We have developed a novel method for this integration, and applied it to genome-wide data from multiple Arabidopsis microarray experiments performed under a variety of experimental conditions. The goal of this study is to identify functional globally co-regulated gene modules in the Arabidopsis genome. RESULTS: Following the analysis of 21,000 Arabidopsis genes in 43 datasets and about 2 × 10^8 gene pairs, we identified a globally co-expressed gene network. We found clusters of globally co-expressed Arabidopsis genes that are enriched for known Gene Ontology annotations. Two types of modules were identified in the regulatory network that differed in their sensitivity to the node-scoring parameter; we further showed these two pertain to general and specialized modules. Some of these modules were further investigated using the Genevestigator compendium of microarray experiments. Analyses of smaller subsets of data led to the identification of condition-specific modules. CONCLUSION: Our method for identification of gene clusters allows the integration of diverse microarray experiments from many sources. The analysis reveals that part of the Arabidopsis transcriptome is globally co-expressed, and can be further divided into known as well as novel functional gene modules. Our methodology is general enough to apply to any set of microarray experiments, using any scoring function.


Subject(s)
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Gene Expression Profiling/methods , Gene Expression Regulation, Plant/physiology , Models, Biological , Multigene Family/physiology , Signal Transduction/physiology , Transcription, Genetic/physiology , Algorithms , Computer Simulation
20.
Protein Sci ; 16(10): 2251-9, 2007 Oct.
Article in English | MEDLINE | ID: mdl-17893362

ABSTRACT

There are 3,200,000 amino acid sequences of length 5 (penta-peptides). Statistically, we expect to see a distribution of penta-peptides that is determined by the frequency of the participating amino acids. We show, however, that not only are there thousands of such penta-peptides that are absent from all known proteomes, but many of them are coded for multiple times in the non-coding genomic regions. This suggests a strong selection process that prevents these peptides from being expressed. We also show that the characteristics of these forbidden penta-peptides vary among different phylogenetic groups (e.g., eukaryotes, prokaryotes, and archaea). Our analysis provides the first steps toward understanding the "grammar" of the forbidden penta-peptides.
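The count of 3,200,000 follows from 20 standard amino acids taken five at a time with repetition (20^5). The sketch below reproduces that arithmetic and counts how many penta-peptides are absent from a given collection of protein sequences. It assumes the 20-letter standard alphabet and skips windows containing any other symbol; the function names and the toy input are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids (one-letter codes)


def observed_peptides(proteins, k: int = 5) -> set:
    """Collect every length-k window made only of standard amino acids."""
    alphabet = set(AMINO_ACIDS)
    seen = set()
    for protein in proteins:
        for i in range(len(protein) - k + 1):
            window = protein[i:i + k]
            if set(window) <= alphabet:
                seen.add(window)
    return seen


def count_absent_peptides(proteins, k: int = 5) -> int:
    """Number of possible length-k peptides that never occur in the input."""
    return len(AMINO_ACIDS) ** k - len(observed_peptides(proteins, k))
```

Storing only the observed set keeps memory proportional to the input, since the full space of 3.2 million penta-peptides never needs to be enumerated just to count the absent ones.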


Subject(s)
Oligopeptides/chemistry , Animals , Archaea/genetics , Bacteria/genetics , DNA, Intergenic/chemistry , Oligopeptides/genetics , Phylogeny , Proteomics , Sequence Analysis, Protein