Results 1 - 12 of 12
1.
Bioinformatics ; 16(5): 412-24, 2000 May.
Article in English | MEDLINE | ID: mdl-10871264

ABSTRACT

We provide a unified overview of methods that are currently widely used to assess the accuracy of prediction algorithms, ranging from raw percentages, quadratic error measures and other distances, and correlation coefficients to information-theoretic measures such as relative entropy and mutual information. We briefly discuss the advantages and disadvantages of each approach. For classification tasks, we derive new learning algorithms for the design of prediction systems by directly optimising the correlation coefficient. We observe and prove several results relating sensitivity and specificity of optimal systems. While the principles are general, we illustrate the applicability on specific problems such as protein secondary structure and signal peptide prediction.
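
As a rough illustration of the measures named in this abstract, the sketch below computes the raw percentage correct, sensitivity, specificity and the correlation coefficient for a two-class predictor from its confusion counts. The function name and the choice of specificity as tn / (tn + fp) are assumptions made for this example; the article itself may define specificity differently.

```python
import math

def confusion_measures(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity, correlation) for a
    two-class predictor, given its confusion counts.

    Assumption: specificity is taken as tn / (tn + fp); some of the
    bioinformatics literature uses tp / (tp + fp) instead.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Correlation coefficient between predicted and observed classes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, correlation

acc, sens, spec, corr = confusion_measures(tp=40, tn=40, fp=10, fn=10)
```

Unlike the raw percentage, the correlation coefficient uses all four counts, so it stays informative when one class is much rarer than the other.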


Subject(s)
Algorithms , Classification/methods , Computational Biology , Learning , Models, Statistical , Neural Networks, Computer
2.
Comput Chem ; 23(3-4): 191-207, 1999 Jun 15.
Article in English | MEDLINE | ID: mdl-10404615

ABSTRACT

Computational prediction of eukaryotic promoters from the nucleotide sequence is one of the most attractive problems in sequence analysis today, but it is also a very difficult one. Thus, current methods predict in the order of one promoter per kilobase in human DNA, while the average distance between functional promoters has been estimated to be in the range of 30-40 kilobases. Although it is conceivable that some of these predicted promoters correspond to cryptic initiation sites that are used in vivo, it is likely that most are false positives. This suggests that it is important to carefully reconsider the biological data that forms the basis of current algorithms, and we here present a review of data that may be useful in this regard. The review covers the following topics: (1) basal transcription and core promoters, (2) activated transcription and transcription factor binding sites, (3) CpG islands and DNA methylation, (4) chromosomal structure and nucleosome modification, and (5) chromosomal domains and domain boundaries. We discuss the possible lessons that may be learned, especially with respect to the wealth of information about epigenetic regulation of transcription that has been appearing in recent years.
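
The mismatch between the two densities quoted in this abstract can be made explicit with a little arithmetic. The sketch below assumes the best case in which every functional promoter is among the predictions; the function name is hypothetical.

```python
def min_false_positive_fraction(predictions_per_kb, kb_per_true_promoter):
    """Lower bound on the fraction of predictions that are false positives,
    assuming (optimistically) that every true promoter is predicted."""
    true_per_kb = 1.0 / kb_per_true_promoter
    return max(0.0, 1.0 - true_per_kb / predictions_per_kb)

# One prediction per kilobase versus one functional promoter per 30-40 kb.
low = min_false_positive_fraction(1.0, 30.0)
high = min_false_positive_fraction(1.0, 40.0)
```

Under these figures, at least roughly 97% of the predicted promoters would be false positives, which is consistent with the abstract's statement that most are.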


Subject(s)
Promoter Regions, Genetic , Chromosomes , CpG Islands , DNA/genetics , DNA Methylation , Eukaryotic Cells , Humans , Regulatory Sequences, Nucleic Acid , Transcription, Genetic
3.
Bioinformatics ; 15(11): 918-29, 1999 Nov.
Article in English | MEDLINE | ID: mdl-10743558

ABSTRACT

MOTIVATION: Over a dozen major degenerative disorders, including myotonic dystrophy, Huntington's disease and fragile X syndrome, result from unstable expansions of particular trinucleotides. Remarkably, only some of all the possible triplets, namely CAG/CTG, CGG/CCG and GAA/TTC, have been associated with the known pathological expansions. This raises some basic questions at the DNA level. Why do particular triplets seem to be singled out? What is the mechanism for their expansion and how does it depend on the triplet itself? Could other triplets or longer repeats be involved in other diseases? RESULTS: Using several different computational models of DNA structure, we show that the triplets involved in the pathological repeats generally fall into extreme classes. Thus, CAG/CTG repeats are particularly flexible, whereas GCC, CGG and GAA repeats appear to display both flexible and rigid (but curved) characteristics depending on the method of analysis. The fact that (1) trinucleotide repeats often become increasingly unstable when they exceed a length of approximately 50 repeats, and (2) repeated 12-mers display a similar increase in instability above 13 repeats, together suggest that approximately 150 bp is a general threshold length for repeat instability. Since this is about the length of DNA wrapped up in a single nucleosome core particle, we speculate that chromatin structure may play an important role in the expansion mechanism. We furthermore suggest that expansion of a dodecamer repeat, which we predict to have very high flexibility, may play a role in the pathogenesis of the neurodegenerative disorder multiple system atrophy (MSA). CONTACT: pfbaldi@ics.uci.edu, yves@netid.com, brunak@cbs.dtu.dk, gorm@cbs.dtu.dk.
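
The threshold argument in this abstract rests on simple arithmetic: 50 copies of a 3 bp unit span 150 bp, and 13 copies of a 12 bp unit span 156 bp. The sketch below, which is only an illustration and not the authors' method, measures the longest uninterrupted tandem run of a repeat unit in a sequence and compares it with that approximate 150 bp figure. The function name and the example sequence are made up.

```python
import re

THRESHOLD_BP = 150  # approximate threshold length quoted in the abstract

def longest_tract_bp(sequence, unit):
    """Length in bp of the longest uninterrupted tandem run of `unit`."""
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    pattern = re.compile("(?:%s)+" % re.escape(unit))
    return max((len(m.group(0)) for m in pattern.finditer(sequence)), default=0)

# Hypothetical sequence: 52 CAG copies flanked by a few other bases.
example = "AT" + "CAG" * 52 + "GG"
tract = longest_tract_bp(example, "CAG")
exceeds_threshold = tract >= THRESHOLD_BP
```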


Subject(s)
Computer Simulation , Genetic Diseases, Inborn/genetics , Models, Genetic , Multiple System Atrophy/genetics , Trinucleotide Repeat Expansion/genetics , Anticipation, Genetic , Carrier Proteins/genetics , DNA/chemistry , DNA/genetics , DNA/metabolism , Deoxyribonuclease I/metabolism , Humans , Nerve Tissue Proteins/genetics , Nucleic Acid Conformation , Nucleosomes/genetics , Nucleosomes/metabolism , Pliability , Protein Binding/genetics , Trinucleotide Repeats/genetics
4.
Article in English | MEDLINE | ID: mdl-9783207

ABSTRACT

We study from a computational standpoint several different physical scales associated with structural features of DNA sequences, including dinucleotide scales such as base stacking energy and propeller twist, and trinucleotide scales such as bendability and nucleosome positioning. We show that these scales provide an alternative or complementary compact representation of DNA sequences. As an example we construct a strand invariant representation of DNA sequences. The scales can also be used to analyze and discover new DNA structural patterns, especially in combinations with hidden Markov models (HMMs). The scales are applied to HMMs of human promoter sequences revealing a number of significant differences between regions upstream and downstream of the transcriptional start point. Finally we show, with some qualifications, that such scales are by and large independent, and therefore complement each other.
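
One way to picture a strand-invariant representation is sketched below, under stated assumptions: a dinucleotide scale is made symmetric by averaging each value with that of its reverse complement, after which the profile of a sequence equals the reversed profile of its reverse complement. This construction and the placeholder scale values are illustrative only and are not taken from the article.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def symmetrize(scale):
    """Average each k-mer value with the value of its reverse complement."""
    return {k: (scale[k] + scale[reverse_complement(k)]) / 2 for k in scale}

def profile(seq, scale, k=2):
    """Scale value for every overlapping k-mer along the sequence."""
    return [scale[seq[i:i + k]] for i in range(len(seq) - k + 1)]

# Placeholder values (NOT measured data): one arbitrary number per dinucleotide.
raw_scale = {"".join(p): float(i) for i, p in enumerate(product("ACGT", repeat=2))}
sym_scale = symmetrize(raw_scale)

seq = "ACGTTGCAAG"
forward = profile(seq, sym_scale)
backward = profile(reverse_complement(seq), sym_scale)[::-1]
same_on_both_strands = forward == backward
```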


Subject(s)
DNA/chemistry , Artificial Intelligence , Base Sequence , DNA/genetics , Humans , Markov Chains , Molecular Structure , Oligodeoxyribonucleotides/chemistry , Oligodeoxyribonucleotides/genetics , Pattern Recognition, Automated , Promoter Regions, Genetic , TATA Box , Thermodynamics
5.
J Mol Biol ; 281(4): 663-73, 1998 Aug 28.
Article in English | MEDLINE | ID: mdl-9710538

ABSTRACT

The fact that DNA three-dimensional structure is important for transcriptional regulation begs the question of whether eukaryotic promoters contain general structural features independently of what genes they control. We present an analysis of a large set of human RNA polymerase II promoters with a very low level of sequence similarity. The sequences, which include both TATA-containing and TATA-less promoters, are aligned by hidden Markov models. Using three different models of sequence-derived DNA bendability, the aligned promoters display a common structural profile with bendability being low in a region upstream of the transcriptional start point and significantly higher downstream. Investigation of the sequence composition in the two regions shows that the bendability profile originates from the sequential structure of the DNA, rather than the general nucleotide composition. Several trinucleotides known to have high propensity for major groove compression are found much more frequently in the regions downstream of the transcriptional start point, while the upstream regions contain more low-bendability triplets. Within the region downstream of the start point, we observe a periodic pattern in sequence and bendability, which is in phase with the DNA helical pitch. The periodic bendability profile shows bending peaks roughly at every 10 bp with stronger bending at 20 bp intervals. These observations suggest that DNA in the region downstream of the transcriptional start point is able to wrap around protein in a manner reminiscent of DNA in a nucleosome. This notion is further supported by the finding that the periodic bendability is caused mainly by the complementary triplet pairs CAG/CTG and GGC/GCC, which previously have been found to correlate with nucleosome positioning. 
We present models where the high-bendability regions position nucleosomes at the downstream end of the transcriptional start point, and consider the possibility of interaction between histone-like TAFs and this area. We also propose the use of this structural signature in computational promoter-finding algorithms.


Subject(s)
DNA/chemistry , Promoter Regions, Genetic/genetics , RNA Polymerase II/genetics , Algorithms , Humans , Markov Chains , Nucleic Acid Conformation , Nucleosomes/chemistry , Nucleotides/chemistry , Sequence Analysis, DNA
6.
J Mol Biol ; 263(4): 503-10, 1996 Nov 08.
Article in English | MEDLINE | ID: mdl-8918932

ABSTRACT

We describe the structural implications of a periodic pattern found in human exons and introns by hidden Markov models. We show that exons (besides the reading frame) have a specific sequential structure in the form of a pattern with triplet consensus non-T(A/T)G, and a minimal periodicity of roughly ten nucleotides. The periodic pattern is also present in intron sequences, although the strength per nucleotide is weaker. Using two independent profile methods based on triplet bendability parameters from DNase I experiments and nucleosome positioning data, we show that the pattern in multiple alignments of internal exon and intron sequences corresponds to a periodic "in phase" bending potential towards the major groove of the DNA. The nucleosome positioning data show that the consensus triplets (and their complements) have a preference for locations on a bent double helix where the major groove faces inward and is compressed. The in-phase triplets are located adjacent to GCC/GGC triplets known to have the strongest bias in their positioning on the nucleosome. Analysis of mRNA sequences encoding proteins with known tertiary structure excludes the possibility that the pattern is a consequence of the previously well-known periodicity caused by the encoding of alpha-helices in proteins. Finally, we discuss the relation between the bending potential of coding and non-coding regions and its impact on the translational positioning of nucleosomes and the recognition of genes by the transcriptional machinery.


Subject(s)
DNA/chemistry , Exons , Introns , Models, Theoretical , Nucleosomes/genetics , Base Sequence , Conserved Sequence , DNA/metabolism , Deoxyribonuclease I/metabolism , Humans , Models, Molecular , Nucleic Acid Conformation , Nucleosomes/chemistry , Nucleosomes/metabolism , RNA, Messenger/chemistry , Repetitive Sequences, Nucleic Acid , Sequence Alignment , Software
7.
Neural Comput ; 8(7): 1541-65, 1996 Oct 01.
Article in English | MEDLINE | ID: mdl-8823946

ABSTRACT

We describe a hybrid modeling approach where the parameters of a model are calculated and modulated by another model, typically a neural network (NN), to avoid both overfitting and underfitting. We develop the approach for the case of Hidden Markov Models (HMMs), by deriving a class of hybrid HMM/NN architectures. These architectures can be trained with unified algorithms that blend HMM dynamic programming with NN backpropagation. In the case of complex data, mixtures of HMMs or modulated HMMs must be used. NNs can then be applied both to the parameters of each single HMM, and to the switching or modulation of the models, as a function of input or context. Hybrid HMM/NN architectures provide a flexible NN parameterization for the control of model structure and complexity. At the same time, they can capture distributions that, in practice, are inaccessible to single HMMs. The HMM/NN hybrid approach is tested, in its simplest form, by constructing a model of the immunoglobulin protein family. A hybrid model is trained, and a multiple alignment derived, with less than a fourth of the number of parameters used with previous single HMMs.


Subject(s)
Immunoglobulins/genetics , Models, Genetic , Models, Statistical , Neural Networks, Computer , Algorithms , Amino Acid Sequence , Computer Simulation , Humans , Molecular Sequence Data
8.
Article in English | MEDLINE | ID: mdl-8877518

ABSTRACT

In this paper we utilize hidden Markov models (HMMs) and information theory to analyze prokaryotic and eukaryotic promoters. We perform this analysis with special emphasis on the fact that promoters are divided into a number of different classes, depending on which polymerase-associated factors bind to them. We find that HMMs trained on such subclasses of Escherichia coli promoters (specifically, the so-called sigma 70 and sigma 54 classes) give an excellent classification of unknown promoters with respect to sigma-class. HMMs trained on eukaryotic sequences from human genes also model nicely all the essential well-known signals, in addition to a potentially new signal upstream of the TATA-box. We furthermore employ a novel technique for automatically discovering different classes in the input data (the promoters) using a system of self-organizing parallel HMMs. These self-organizing HMMs have at the same time the ability to find clusters and the ability to model the sequential structure in the input data. This is highly relevant in situations where the variance in the data is high, as is the case for the subclass structure in, for example, promoter sequences.


Subject(s)
Markov Chains , Promoter Regions, Genetic , Escherichia coli , Genomic Library , Humans
9.
Article in English | MEDLINE | ID: mdl-7584451

ABSTRACT

We analyse the sequential structure of human exons and their flanking introns by hidden Markov models. Together, models of donor site regions, acceptor site regions and flanked internal exons, show that exons--besides the reading frame--hold a specific periodic pattern. The pattern, which has the consensus: non-T(A/T)G and a minimal periodicity of roughly 10 nucleotides, is not a consequence of the nucleotide statistics in the three codon positions, nor of the well known nucleosome positioning signal. We discuss the relation between the pattern and other known sequence elements responsible for the intrinsic bending or curvature of DNA.
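
The consensus non-T(A/T)G can be read as a three-position pattern: any base except T, then A or T, then G. The sketch below locates overlapping matches with a regular expression and lists the spacing between consecutive matches. It is a simple scan for illustration, not the hidden Markov model analysis used in the article, and the function names and toy sequence are made up.

```python
import re

# non-T, then A or T, then G; the lookahead keeps overlapping matches.
CONSENSUS = re.compile(r"(?=[ACG][AT]G)")

def consensus_positions(sequence):
    """0-based start positions of every non-T(A/T)G triplet."""
    return [m.start() for m in CONSENSUS.finditer(sequence.upper())]

def spacings(positions):
    """Distances between consecutive matches, e.g. to look for ~10 nt periodicity."""
    return [b - a for a, b in zip(positions, positions[1:])]

hits = consensus_positions("CAGTTTGAGG")  # toy sequence
gaps = spacings(hits)
```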


Subject(s)
Base Sequence , Exons , Consensus Sequence , DNA/chemistry , Humans , Introns , Markov Chains , Models, Statistical , Pattern Recognition, Automated , RNA/chemistry
10.
Article in English | MEDLINE | ID: mdl-7584463

ABSTRACT

Hidden Markov Models (HMMs) are useful in a number of tasks in computational molecular biology, and in particular to model and align protein families. We argue that HMMs are somewhat optimal within a certain modeling hierarchy. Single first order HMMs, however, have two potential limitations: a large number of unstructured parameters, and a built-in inability to deal with long-range dependencies. Hybrid HMM/Neural Network (NN) architectures attempt to overcome these limitations. In hybrid HMM/NN, the HMM parameters are computed by a NN. This provides a reparametrization that allows for flexible control of model complexity, and incorporation of constraints. The approach is tested on the immunoglobulin family. A hybrid model is trained, and a multiple alignment derived, with less than a fourth of the number of parameters used with previous single HMMs. To capture dependencies, however, one must resort to a larger hybrid model class, where the data is modeled by multiple HMMs. The parameters of the HMMs, and their modulation as a function of input or context, are again calculated by a NN.
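
To make the idea of HMM parameters computed by a NN concrete, the sketch below derives the emission distribution of one state from a small network: a numeric code for the state passes through a tanh hidden layer and a softmax output. The layer sizes, the tanh units, the state encoding and all names are assumptions chosen for illustration; the article's actual architecture may differ.

```python
import math

def softmax(logits):
    """Turn arbitrary real scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def emissions_from_network(state_code, w_hidden, w_out):
    """Emission probabilities of one HMM state, computed by a tiny network.

    state_code: numeric encoding of the state (hypothetical input)
    w_hidden:   one weight row per hidden unit
    w_out:      one weight row per alphabet symbol
    """
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state_code)))
              for row in w_hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)

# Two hidden units shared by all states, four output symbols.
w_hidden = [[0.5, -0.2], [0.1, 0.3]]
w_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.2, 0.2]]
probs = emissions_from_network([1.0, 0.0], w_hidden, w_out)
```

Because the hidden-layer weights are shared across states, the number of free parameters can be smaller than one unconstrained emission table per state, which is the kind of saving the abstract refers to.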


Subject(s)
Immunoglobulins/chemistry , Markov Chains , Models, Theoretical , Neural Networks, Computer , Proteins/chemistry , Amino Acid Sequence , Animals , Computer Simulation , Mathematics , Molecular Sequence Data , Proteins/classification
11.
Proc Natl Acad Sci U S A ; 91(3): 1059-63, 1994 Feb 01.
Article in English | MEDLINE | ID: mdl-8302831

ABSTRACT

Hidden Markov model (HMM) techniques are used to model families of biological sequences. A smooth and convergent algorithm is introduced to iteratively adapt the transition and emission parameters of the models from the examples in a given family. The HMM approach is applied to three protein families: globins, immunoglobulins, and kinases. In all cases, the models derived capture the important statistical characteristics of the family and can be used for a number of tasks, including multiple alignments, motif detection, and classification. For K sequences of average length N, this approach yields an effective multiple-alignment algorithm which requires O(KN^2) operations, linear in the number of sequences.
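
The operation count quoted in this abstract can be motivated with a generic dynamic-programming sketch. The code below is the standard forward recursion over a sparse transition table, not the article's training algorithm, and it assumes every state emits a symbol. Each sequence position visits every transition once, so a model with a number of transitions proportional to N, run on a sequence of length about N, costs on the order of N^2 steps, and K sequences cost on the order of KN^2. The toy two-state model is made up.

```python
def forward_likelihood(seq, init, trans, emit):
    """Probability of `seq` under an HMM, via the forward recursion.

    init:  {state: start probability}
    trans: {state: {next_state: probability}}  (sparse)
    emit:  {state: {symbol: probability}}
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    alpha = {s: p * emit[s].get(seq[0], 0.0) for s, p in init.items()}
    for symbol in seq[1:]:
        incoming = {}
        # One pass over all stored transitions per sequence position.
        for s, a in alpha.items():
            for t, p in trans.get(s, {}).items():
                incoming[t] = incoming.get(t, 0.0) + a * p
        alpha = {t: v * emit[t].get(symbol, 0.0) for t, v in incoming.items()}
    return sum(alpha.values())

# Toy model with made-up parameters.
init = {"x": 0.5, "y": 0.5}
trans = {"x": {"x": 0.9, "y": 0.1}, "y": {"x": 0.1, "y": 0.9}}
emit = {"x": {"A": 0.8, "B": 0.2}, "y": {"A": 0.2, "B": 0.8}}
likelihood = forward_likelihood("AB", init, trans, emit)
```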


Subject(s)
Markov Chains , Models, Genetic , Proteins/genetics , Sequence Alignment/methods , Algorithms , Amino Acid Sequence , Animals , Globins/genetics , Humans , Immunoglobulins/genetics , Molecular Sequence Data , Protein Kinases/genetics , Sequence Alignment/statistics & numerical data , Sequence Homology, Amino Acid
12.
J Comput Biol ; 1(4): 311-36, 1994.
Article in English | MEDLINE | ID: mdl-8790474

ABSTRACT

Hidden Markov Model techniques are used to derive a new model of the G-protein-coupled receptor family. The transition and emission parameters of the model are adjusted using a training set comprising 142 sequences. The resulting model is shown to perform well on a number of tasks, including multiple alignments, discrimination, large data base searches, classification, and fragment detection. General analytical results on the expectation and standard deviation of the likelihood of random sequences are also presented.


Subject(s)
GTP-Binding Proteins/metabolism , Markov Chains , Models, Theoretical , Receptors, Cell Surface/chemistry , Sequence Alignment/methods , Algorithms , Amino Acid Sequence , Animals , Databases, Factual , Evolution, Molecular , GTP-Binding Proteins/chemistry , Humans , Likelihood Functions , Molecular Sequence Data , Receptors, Cell Surface/classification , Receptors, Cell Surface/metabolism , Sequence Alignment/trends