1.
Nat Genet ; 26(2): 225-8, 2000 Oct.
Article in English | MEDLINE | ID: mdl-11017083

ABSTRACT

Elucidating the human transcriptional regulatory network is a challenge of the post-genomic era. Technical progress so far is impressive, including detailed understanding of regulatory mechanisms for at least a few genes in multicellular organisms, rapid and precise localization of regulatory regions within extensive regions of DNA by means of cross-species comparison, and de novo determination of transcription-factor binding specificities from large-scale yeast expression data. Here we address two problems involved in extending these results to the human genome: first, it has been unclear how many model organism genomes will be needed to delineate most regulatory regions; and second, the discovery of transcription-factor binding sites (response elements) from expression data has not yet been generalized from single-celled organisms to multicellular organisms. We found that 98% (74/75) of experimentally defined sequence-specific binding sites of skeletal-muscle-specific transcription factors are confined to the 19% of human sequences that are most conserved in the orthologous rodent sequences. Also we found that in using this restriction, the binding specificities of all three major muscle-specific transcription factors (MYF, SRF and MEF2) can be computationally identified.


Subject(s)
Genome, Human , Mice/genetics , Regulatory Sequences, Nucleic Acid , Algorithms , Animals , Base Sequence , Consensus Sequence , Gene Expression Regulation , Humans , Models, Genetic , Sequence Alignment , Transcription, Genetic
2.
Genome Res ; 10(10): 1631-42, 2000 Oct.
Article in English | MEDLINE | ID: mdl-11042160

ABSTRACT

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200 kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX, was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments.
Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.


Subject(s)
Computational Biology/methods , DNA/chemistry , DNA/genetics , Genes/genetics , Base Composition , Chromosomes, Artificial/chemistry , Chromosomes, Artificial/genetics , Humans , Reproducibility of Results , Software
3.
Curr Opin Biotechnol ; 11(1): 19-24, 2000 Feb.
Article in English | MEDLINE | ID: mdl-10679343

ABSTRACT

A complex network of regulatory controls governs the patterns of gene expression. Enabled by the tools of molecular cloning, initial experimental queries into the gene regulatory network elucidated a wide array of transcription factors and their cognate binding sites from hundreds of genes. The recent fusion of genome-scale experimental tools, a more comprehensive gene catalog, and concomitant advances in computational methodology, has extended the range of questions being posed. The potential to further our understanding of the biochemical mechanisms of transcriptional regulation and to accelerate the delineation of regulatory control regions in the human genome is enormous.


Subject(s)
Computational Biology , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/metabolism , Transcription, Genetic/genetics , Animals , Base Sequence , Binding Sites , DNA Footprinting , DNA-Binding Proteins/metabolism , Humans , Phylogeny , Promoter Regions, Genetic/genetics
4.
Genome Res ; 9(12): 1288-93, 1999 Dec.
Article in English | MEDLINE | ID: mdl-10613851

ABSTRACT

Alternative splicing can produce variant proteins and expression patterns as different as the products of different genes, yet the prevalence of alternative splicing has not been quantified. Here the spliced alignment algorithm was used to make a first inventory of exon-intron structures of known human genes using EST contigs from the TIGR Human Gene Index. The results on any one gene may be incomplete and will require verification, yet the overall trends are significant. Evidence of alternative splicing was shown in 35% of genes and the majority of splicing events occurred in 5' untranslated regions, suggesting wide occurrence of alternative regulation. Most of the alternative splices of coding regions generated additional protein domains rather than alternating domains.


Subject(s)
Alternative Splicing , 5' Untranslated Regions/genetics , Base Sequence/genetics , Contig Mapping/methods , Databases, Factual , Exons/genetics , Expressed Sequence Tags , Humans , Introns/genetics , Molecular Sequence Data , Protein Isoforms/genetics , Sequence Alignment
5.
Nucleic Acids Res ; 27(17): 3577-82, 1999 Sep 01.
Article in English | MEDLINE | ID: mdl-10446249

ABSTRACT

With the growing number of completely sequenced bacterial genes, accurate gene prediction in bacterial genomes remains an important problem. Although the existing tools predict genes in bacterial genomes with high overall accuracy, their ability to pinpoint the translation start site remains unsatisfactory. In this paper, we present a novel approach to bacterial start site prediction that takes into account multiple features of a potential start site, viz., ribosome binding site (RBS) binding energy, distance of the RBS from the start codon, distance from the beginning of the maximal ORF to the start codon, the start codon itself and the coding/non-coding potential around the start site. Mixed integer programming was used to optimize the discriminatory system. The accuracy of this approach is up to 90%, compared to 70%, using the most common tools in fully automated mode (that is, without expert human post-processing of results). The approach is evaluated using Bacillus subtilis, Escherichia coli and Pyrococcus furiosus. These three genomes cover a broad spectrum of bacterial genomes, since B. subtilis is a Gram-positive bacterium, E. coli is a Gram-negative bacterium and P. furiosus is an archaebacterium. A significant problem is generating a set of 'true' start sites for algorithm training, in the absence of experimental work. We found that sequence conservation between P. furiosus and the related Pyrococcus horikoshii clearly delimited the gene start in many cases, providing a sufficient training set.


Subject(s)
Codon, Initiator , Genome, Bacterial , Protein Biosynthesis , Algorithms , Amino Acid Sequence , Bacillus subtilis/genetics , Conserved Sequence , Escherichia coli/genetics , Molecular Sequence Data , Pyrococcus furiosus/genetics , Sequence Homology, Amino Acid
7.
J Mol Biol ; 278(1): 167-81, 1998 Apr 24.
Article in English | MEDLINE | ID: mdl-9571041

ABSTRACT

For many newly sequenced genes, sequence analysis of the putative protein yields no clue on function. It would be beneficial to be able to identify in the genome the regulatory regions that confer temporal and spatial expression patterns for the uncharacterized genes. Additionally, it would be advantageous to identify regulatory regions within genes of known expression pattern without performing the costly and time consuming laboratory studies now required. To achieve these goals, the wealth of case studies performed over the past 15 years will have to be collected into predictive models of expression. Extensive studies of genes expressed in skeletal muscle have identified specific transcription factors which bind to regulatory elements to control gene expression. However, potential binding sites for these factors occur with sufficient frequency that it is rare for a gene to be found without one. Analysis of experimentally determined muscle regulatory sequences indicates that muscle expression requires multiple elements in close proximity. A model is generated with predictive capability for identifying these muscle-specific regulatory modules. Phylogenetic footprinting, the identification of sequences conserved between distantly related species, complements the statistical predictions. Through the use of logistic regression analysis, the model promises to be easily modified to take advantage of the elucidation of additional factors, cooperation rules, and spacing constraints.


Subject(s)
Gene Expression Regulation , Muscle, Skeletal/metabolism , Regulatory Sequences, Nucleic Acid , Transcription Factors/metabolism , Binding Sites , DNA Footprinting , Genetic Complementation Test , Genome , Mathematical Computing , Models, Molecular , Phylogeny , Transcription Factors/genetics
10.
Trends Genet ; 12(8): 316-20, 1996 Aug.
Article in English | MEDLINE | ID: mdl-8783942

ABSTRACT

Discovering new genes, and their functions, can be aided not only by special purpose gene (and coding region) finding software, but also by searches in key databases, and by programs for finding particular sites relevant to gene expression, such as promoters and splice sites. No one software package includes all the necessary tools. I describe here the main kinds of tools; their working principles, strengths and limitations; and how combined evidence from multiple tools can aid in optimum gene identification.


Subject(s)
Computational Biology , Databases, Factual , Genes , Amino Acid Sequence , Animals , Base Sequence , Codon , DNA/chemistry , Exons , Humans , Molecular Sequence Data , Repetitive Sequences, Nucleic Acid , Software
11.
Gene ; 172(1): GC19-32, 1996 Jun 12.
Article in English | MEDLINE | ID: mdl-8654964

ABSTRACT

The MEF2 and MyoD families of transcriptional regulatory factors both play central roles in the terminal differentiation of skeletal muscle. Further, binding sites for the two families often occur nearby, and there have been a number of indications that members of the two families may bind coordinately. The present study provides evidence that known binding sites for the two occur with precise geometric restrictions related to the DNA helical repeat unit, that pairs of putative sites following these restrictions are indicative of skeletal muscle-specific transcriptional regulatory regions, and that the geometric relationship can help provide a consistent interpretation for data that has until now been difficult to explain.


Subject(s)
DNA-Binding Proteins/metabolism , Myogenin/metabolism , Transcription Factors/metabolism , Animals , Base Sequence , Binding Sites , Biological Evolution , Conserved Sequence , DNA-Binding Proteins/genetics , Enhancer Elements, Genetic , Humans , MEF2 Transcription Factors , Molecular Sequence Data , Myogenic Regulatory Factors , Myogenin/genetics , Oligodeoxyribonucleotides , Transcription Factors/genetics , Transcription, Genetic
12.
Comput Chem ; 20(1): 103-18, 1996 Mar.
Article in English | MEDLINE | ID: mdl-16749184

ABSTRACT

The gene identification problem is the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes. This problem is of self-evident importance, and is far from being fully solved, particularly for higher eukaryotes. Thus it is not surprising that the number of algorithm and software developers working in the area is rapidly increasing. The present paper is an overview of the field, with an emphasis on eukaryotes, for such developers.


Subject(s)
Genes/genetics , Base Sequence/genetics , Codon/genetics , Exons/genetics , Gene Expression/genetics , Models, Genetic , Sequence Homology
13.
Mol Cell Biol ; 16(1): 437-41, 1996 Jan.
Article in English | MEDLINE | ID: mdl-8524326

ABSTRACT

Myocyte-specific enhancer factor 2 (MEF2) is a family of closely related transcription factors that play a key role in the differentiation of muscle tissues and are important in the muscle-specific expression of a number of genes. Given the centrality of MEF2 in muscle differentiation, regulatory regions newly determined to be muscle specific are often studied for potential MEF2 binding sites. Possible sites are often located by comparison to a homologous gene or by matching to the consensus MEF2 sequence. Enough data have accumulated that a richer description of the MEF2 binding site, a position weight matrix, can be reliably constructed and its usefulness can be assessed. It was shown that scores from such a matrix approximate MEF2 binding energy and enable recognition of naturally occurring MEF2 sites with high sensitivity and specificity. Regulation of genes via MEF2-like sites is complicated by the fact that a number of transcription factors are involved. Not only is MEF2 itself a family of proteins, but several other, nonhomologous, transcription factors overlap MEF2 in DNA-binding specificity. Thus, more quantitative methods for recognizing potential sites may help with the lengthy process of disentangling the complex regulatory circuits of muscle-specific expression.


Subject(s)
DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism , Amino Acid Sequence , Animals , Binding Sites/genetics , Biometry , DNA/metabolism , Humans , MEF2 Transcription Factors , Molecular Sequence Data , Muscles/metabolism , Mutagenesis, Site-Directed , Myogenic Regulatory Factors
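
Note on entry 13: the abstract describes a position weight matrix built from known MEF2 sites and used to score candidate sites. The sketch below shows one common way such a matrix can be built and applied (log-odds scores with pseudocounts against a uniform background). It is not taken from the paper: the aligned sites are made-up toy strings rather than real MEF2 data, and the function names, pseudocount, and background frequency are assumptions chosen for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        # log2 of (smoothed column frequency / background frequency)
        pwm.append({b: math.log2(((counts[b] + pseudocount) / total) / background)
                    for b in BASES})
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds for one window of the matrix's length."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    w = len(pwm)
    return max((score(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))

# Hypothetical toy alignment, NOT experimentally determined binding sites.
toy_sites = ["CTAAAAATAG", "CTATTTATAG", "TTAAAAATAA", "CTATAAATAG"]
toy_pwm = build_pwm(toy_sites)
print(best_hit(toy_pwm, "GGGCC" + "CTAAAAATAG" + "GCGC"))
```

Scanning both strands and choosing a score threshold are left out; in practice the threshold would be calibrated against known sites and background sequence.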
14.
J Mol Biol ; 253(1): 51-60, 1995 Oct 13.
Article in English | MEDLINE | ID: mdl-7473716

ABSTRACT

We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in behavior of a number of such statistics are also observed, however, when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C+G content, we have observed that a number of them are strongly dependent on C+G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C+G content. A+T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while C+G richer non-coding genic DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C+G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.


Subject(s)
Base Sequence/genetics , DNA/genetics , Genes/genetics , Algorithms , Base Composition , Databases, Factual , Discriminant Analysis , Humans , Open Reading Frames/genetics , Proteins/genetics
15.
J Comput Biol ; 2(1): 117-23, 1995.
Article in English | MEDLINE | ID: mdl-7497114

ABSTRACT

The length of an open reading frame (ORF) is one important piece of evidence often used in locating new genes, particularly in organisms where splicing is rare. However, there have been no systematic studies quantifying the degree of correlation between length of ORF, on the one hand, and likelihood of gene function, on the other. In this paper, techniques are derived to estimate the conditional probability of gene function, given ORF length, based on evidence both from the databases and from simulation. Several complete chromosomes of Saccharomyces cerevisiae have now been sequenced, and considerable effort is being expended on locating and characterizing the genes in these sequences. Thus, we illustrate the techniques for this organism.


Subject(s)
Chromosomes, Fungal , Databases, Factual , Genes , Open Reading Frames , Saccharomyces cerevisiae/genetics , Amino Acid Sequence , Base Sequence , Fungal Proteins/chemistry , Fungal Proteins/genetics , Protein Biosynthesis , RNA Splicing
16.
Comput Chem ; 18(3): 203-5, 1994 Sep.
Article in English | MEDLINE | ID: mdl-7952890

ABSTRACT

One expects that in DNA without protein coding function, stop codons (which constitute three of the 64 possible codons) should occur frequently in all reading frames, and that a long open reading frame (ORF) can be interpreted as a sign for the existence of a gene. We make a beginning on introducing quantitative measures of confidence into this inference--taking Saccharomyces cerevisiae as a sample case--and show that some common assumptions can reasonably be questioned. In particular we show that statistical support for the biological function of shorter ORFs listed as putative genes in recent papers is in fact very weak. This is an issue of practical as well as theoretical interest, since researching the function of a putative gene is difficult and expensive.


Subject(s)
Genes , Open Reading Frames , Base Composition , Chromosomes, Artificial, Yeast , DNA, Fungal/genetics , Genes, Fungal , Models, Genetic , Saccharomyces cerevisiae/genetics
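
Note on entry 16: the abstract starts from the observation that three of the 64 codons are stops, so long stop-free stretches should be rare in non-coding DNA. A minimal sketch of that baseline calculation follows. It assumes independent, equally likely codons, which is a simplification made here for illustration; the paper's own confidence measures are not given in the abstract and are likely to differ (for example, by accounting for base composition).

```python
def prob_open_by_chance(n_codons: int, p_stop: float = 3 / 64) -> float:
    """Probability that n_codons consecutive random codons contain no stop codon.

    Assumes codons are independent and each is a stop with probability p_stop
    (3/64 under a uniform codon model; an assumption, not the paper's model).
    """
    if n_codons < 0:
        raise ValueError("n_codons must be non-negative")
    return (1.0 - p_stop) ** n_codons

# Illustrative ORF lengths in codons (arbitrary values).
for n in (50, 100, 300):
    print(n, prob_open_by_chance(n))
```

Because the probability decays geometrically with length, short ORFs are far more likely than long ones to be open by chance, which is consistent with the abstract's caution about shorter ORFs.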
17.
Nucleic Acids Res ; 21(12): 2837-44, 1993 Jun 25.
Article in English | MEDLINE | ID: mdl-8332493

ABSTRACT

A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a 'coding statistic' is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C. elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.


Subject(s)
Base Composition , Codon , DNA/chemistry , Proteins/genetics , Animals , Caenorhabditis elegans/genetics , Cosmids , DNA/analysis , Genes, Fungal , Humans , Sequence Analysis, DNA , Statistics as Topic
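
Note on entry 17: the abstract describes decomposing the distribution of a per-window coding statistic into two normal distributions, one for coding and one for non-coding windows. The abstract does not say how the decomposition is fitted; the sketch below swaps in a standard expectation-maximization (EM) fit of a two-component Gaussian mixture as one plausible approach. The function names, initialization, and iteration count are assumptions.

```python
import math
import statistics

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def fit_two_normals(values, iters=100):
    """EM fit of a two-component 1-D Gaussian mixture.

    Returns (weights, means, sigmas), each a list of two floats.
    """
    values = sorted(values)
    n = len(values)
    half = n // 2
    # Initialize the means from the lower and upper halves of the data.
    mu = [sum(values[:half]) / half, sum(values[half:]) / (n - half)]
    spread = statistics.pstdev(values) or 1.0
    sigma = [spread, spread]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [w[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            s = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weight, mean, and std dev of each component.
        for k in (0, 1):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mu[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, values)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)
            w[k] = nk / n
    return w, mu, sigma
```

Under this sketch, the fitted weight of whichever component is taken to represent coding windows serves as the estimate of the coding fraction of windows. Which component that is depends on the direction of the chosen coding statistic.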
18.
Nucleic Acids Res ; 20(24): 6441-50, 1992 Dec 25.
Article in English | MEDLINE | ID: mdl-1480466

ABSTRACT

A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures--functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures. Different measures contain different information. However there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.


Subject(s)
Base Sequence , DNA/genetics , Genes , Genetic Techniques , Proteins/genetics , Algorithms , Base Composition , Codon/genetics , Exons , Fourier Analysis , Humans
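
Note on entry 18: the abstract's main conclusion is that simply counting oligomers is the most effective coding measure. The sketch below shows one way an oligomer-counting measure can be turned into a window score, as a log-likelihood ratio of k-mer frequencies learned from coding and non-coding training sequences. The exact formulation, the choice of k, the pseudocount, and the toy training strings are all assumptions made for illustration, not the benchmarked measure from the paper.

```python
import math
from collections import Counter

def kmer_counts(seqs, k):
    """Count all overlapping k-mers across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def make_scorer(coding_seqs, noncoding_seqs, k=3, pseudocount=1.0):
    """Return a function scoring a window by summed log(P_coding / P_noncoding)."""
    c = kmer_counts(coding_seqs, k)
    n = kmer_counts(noncoding_seqs, k)
    c_total = sum(c.values()) + pseudocount * 4 ** k
    n_total = sum(n.values()) + pseudocount * 4 ** k

    def score(window):
        total = 0.0
        for i in range(len(window) - k + 1):
            kmer = window[i:i + k]
            total += math.log((c[kmer] + pseudocount) / c_total)
            total -= math.log((n[kmer] + pseudocount) / n_total)
        return total

    return score

# Hypothetical toy training data, far too small for real use.
scorer = make_scorer(["ATGGCTGCTGCTGCTGCT"], ["ATATATATATATATATAT"])
print(scorer("GCTGCTGCT"), scorer("TATATATAT"))
```

A positive score means the window's oligomers look more like the coding training set. Frame-specific counting is omitted here for brevity.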
19.
Genomics ; 13(4): 1056-64, 1992 Aug.
Article in English | MEDLINE | ID: mdl-1505943

ABSTRACT

We model the base compositional structure of the human and Escherichia coli genomes. Three particular properties are first quantified: (1) There is a significant tendency for any region of either genome to have a strand-symmetric base composition. (2) The variation in base composition from region to region, within each genome, is very much larger than expected from common homogeneous stochastic models. (3) A given local base composition tends to persist over a scale of at least kilobases (E. coli) or tens of kilobases (human). Multidomain stochastic models from the literature are reviewed and sharpened. In particular, quantitative measurements of the third property lead us to suggest a significant shift in the style of domain models, in which the variation of A+T content with position is modeled by a random walk with frequent small steps rather than with large quantum jumps. As an application, we suggest a way to reduce the amount of computation in the assembly of large sequences from sequences of randomly chosen fragments.


Subject(s)
Escherichia coli/genetics , Genome, Bacterial , Genome, Human , Humans
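
Note on entry 19: the abstract proposes modeling the variation of A+T content along a genome as a random walk with frequent small steps rather than occasional large jumps. The toy simulation below illustrates that style of model only. The step size, the per-window update, the clamping to the range 0 to 1, and the function name are assumptions; the abstract does not give the model's parameters.

```python
import random

def simulate_at_content(n_windows, start=0.5, step=0.01, seed=0):
    """Simulate A+T fraction along consecutive windows as a small-step random walk."""
    rng = random.Random(seed)
    at = start
    track = []
    for _ in range(n_windows):
        # Move up or down by one small step with equal probability.
        at += rng.choice((-step, step))
        # Keep the fraction within its valid range.
        at = min(max(at, 0.0), 1.0)
        track.append(at)
    return track

track = simulate_at_content(1000)
print(min(track), max(track))
```

Because each step is small, nearby windows have similar composition, which mirrors the persistence of local base composition noted in the abstract.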
20.
Biotechniques ; 10(6): 764-7, 1991 Jun.
Article in English | MEDLINE | ID: mdl-1878210

ABSTRACT

SCORE, a program for computer-assisted scoring of Southern blots of clone DNA, retains the use of expert human judgment while taking over much of the drudgery of the scoring task. The primary functions of the program are to help make an aligned overlay of the fluorescence gel image and the autoradiogram blot image, to keep track of band and lane locations and to store the resulting data directly into a database. Use of SCORE has resulted in greatly increased efficiency and accuracy.


Subject(s)
Blotting, Southern , Software , Autoradiography , Chromosome Mapping/methods , DNA Fingerprinting/methods , Electrophoresis, Agar Gel , Humans , Image Processing, Computer-Assisted/methods