Results 1 - 20 of 24
2.
Mol Biol Evol ; 29(3): 929-37, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22009060

ABSTRACT

In phylogenetic inference, an evolutionary model describes the substitution processes along each edge of a phylogenetic tree. Misspecification of the model has important implications for the analysis of phylogenetic data. Conventionally, however, the selection of a suitable evolutionary model is based on heuristics or relies on the choice of an approximate input tree. We introduce a method for model Selection in Phylogenetics based on linear INvariants (SPIn), which uses recent insights on linear invariants to characterize a model of nucleotide evolution for phylogenetic mixtures on any number of components. Linear invariants are constraints among the joint probabilities of the bases in the operational taxonomic units that hold irrespective of the tree topologies appearing in the mixtures. SPIn therefore requires no input tree and is designed to deal with nonhomogeneous phylogenetic data consisting of multiple sequence alignments showing different patterns of evolution, for example, concatenated genes, exons, and/or introns. Here, we report on the results of the proposed method evaluated on multiple sequence alignments simulated under a variety of single-tree and mixture settings for both continuous- and discrete-time models. In the simulations, SPIn successfully recovers the underlying evolutionary model and is shown to perform better than existing approaches.


Subject(s)
Evolution, Molecular , Models, Genetic , Phylogeny , Base Sequence , Computer Simulation , Markov Chains , Sequence Alignment
3.
Bioinformatics ; 26(21): 2656-63, 2010 Nov 01.
Article in English | MEDLINE | ID: mdl-20861026

ABSTRACT

MOTIVATION: Selenoproteins are a group of proteins that contain selenocysteine (Sec), a rare amino acid inserted co-translationally into the protein chain. The Sec codon is UGA, which is normally a stop codon. In selenoproteins, UGA is recoded to Sec in the presence of specific features on selenoprotein gene transcripts. Due to the dual role of the UGA codon, selenoprotein prediction and annotation are difficult tasks, and even known selenoproteins are often misannotated in genome databases. RESULTS: We present a homology-based in silico method to scan genomes for members of the known eukaryotic selenoprotein families: selenoprofiles. The core of the method is a set of manually curated highly reliable multiple sequence alignments of selenoprotein families, which are used as queries to scan genomic sequences. Results of the scan are processed through a number of steps, to produce highly accurate predictions of selenoprotein genes with little or no human intervention. Selenoprofiles is a valuable tool for bioinformatic characterization of eukaryotic selenoproteomes, and can complement genome annotation pipelines. AVAILABILITY AND IMPLEMENTATION: Selenoprofiles is a Python-built pipeline that internally runs psitblastn, exonerate, genewise, SECISearch and a number of custom-made scripts and programs. The program is available at http://big.crg.cat/services/selenoprofiles. The predictions presented in this article are available through DAS at http://genome.crg.cat:9000/das/Selenoprofiles_ensembl.


Subject(s)
Genome , Selenoproteins/genetics , Codon, Terminator , Databases, Genetic , Regulatory Sequences, Nucleic Acid , Selenocysteine/chemistry , Selenoproteins/chemistry , Sequence Alignment
4.
Article in English | MEDLINE | ID: mdl-21502407

ABSTRACT

The human genome contains thousands of long noncoding RNAs (ncRNAs) transcribed from diverse genomic locations. A large set of long ncRNAs is transcribed independent of protein-coding genes. We have used the GENCODE annotation of the human genome to identify 3019 long ncRNAs expressed in various human cell lines and tissue. This set of long ncRNAs responds to differentiation signals in primary human keratinocytes and is coexpressed with important regulators of keratinocyte development. Depletion of a number of these long ncRNAs leads to the repression of specific genes in their surrounding locus, supportive of an activating function for ncRNAs. Using reporter assays, we confirmed such activating function and show that such transcriptional enhancement is mediated through the long ncRNA transcripts. Our studies show that long ncRNAs exhibit functions similar to classically defined enhancers, through an RNA-dependent mechanism.


Subject(s)
Enhancer Elements, Genetic/genetics , Gene Expression Regulation , RNA, Untranslated/genetics , Cell Differentiation/genetics , Conserved Sequence/genetics , Genome, Human/genetics , Humans , Keratinocytes/metabolism , Molecular Sequence Annotation , Open Reading Frames/genetics , Snail Family Transcription Factors , Software , Transcription Factors/metabolism
5.
EMBO J ; 20(19): 5354-60, 2001 Oct 01.
Article in English | MEDLINE | ID: mdl-11574467

ABSTRACT

The evolutionary significance of introns remains a mystery. The current availability of several complete eukaryotic genomes permits new studies to probe the possible function of these peculiar genomic features. Here we investigate the degree to which gene structure (intron position, phase and length) is conserved between homologous protein domains. We find that for certain extracellular-signalling and nuclear domains, gene structures are similar even when protein sequence similarity is low or not significant and sequences can only be aligned with a knowledge of protein tertiary structure. In contrast, other domains, including most intracellular signalling modules, show little gene structure conservation. Intriguingly, many domains with conserved gene structures, such as cytokines, are involved in similar biological processes, such as the immune response. This suggests that gene structure conservation may be a record of key events in evolution, such as the origin of the vertebrate immune system or the duplication of nuclear receptors in nematodes. The results suggest ways to detect new and potentially very remote homologues, and to construct phylogenies for proteins with limited sequence similarity.


Subject(s)
Evolution, Molecular , Exons/genetics , Introns/genetics , Amino Acid Sequence , Cytokines/genetics , GTPase-Activating Proteins/genetics , Genes , Models, Molecular , Protein Structure, Secondary , Protein Structure, Tertiary
6.
Genome Res ; 11(9): 1574-83, 2001 Sep.
Article in English | MEDLINE | ID: mdl-11544202

ABSTRACT

Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors.


Subject(s)
Algorithms , Genes/genetics , Sequence Alignment/methods , Sequence Homology, Nucleic Acid , Animals , Brassica/genetics , Codon/genetics , Databases, Factual , Evolution, Molecular , Humans , Mice , RNA Splice Sites/genetics , Rats
7.
EMBO Rep ; 2(8): 697-702, 2001 Aug.
Article in English | MEDLINE | ID: mdl-11493597

ABSTRACT

In selenoproteins, incorporation of the amino acid selenocysteine is specified by the UGA codon, usually a stop signal. The alternative decoding of UGA is conferred by an mRNA structure, the SECIS element, located in the 3'-untranslated region of the selenoprotein mRNA. Because of the non-standard use of the UGA codon, current computational gene prediction methods are unable to identify selenoproteins in the sequence of the eukaryotic genomes. Here we describe a method to predict selenoproteins in genomic sequences, which relies on the prediction of SECIS elements in coordination with the prediction of genes in which the strong codon bias characteristic of protein coding regions extends beyond a TGA codon interrupting the open reading frame. We applied the method to the Drosophila melanogaster genome, and predicted four potential selenoprotein genes. One of them belongs to a known family of selenoproteins, and we have tested experimentally two other predictions with positive results. Finally, we have characterized the expression pattern of these two novel selenoprotein genes.


Subject(s)
Codon, Terminator/genetics , Drosophila melanogaster/genetics , Genome , Insect Proteins/genetics , Proteins/genetics , Selenocysteine/metabolism , Amino Acid Sequence , Animals , Cell Line , Drosophila melanogaster/embryology , Gene Expression Profiling , Humans , In Situ Hybridization , Insect Proteins/chemistry , Molecular Sequence Data , Nucleic Acid Conformation , Proteins/chemistry , Regulatory Sequences, Nucleic Acid/genetics , Selenium Radioisotopes/metabolism , Selenoproteins , Sequence Alignment
8.
Bioinformatics ; 16(8): 743-4, 2000 Aug.
Article in English | MEDLINE | ID: mdl-11099262

ABSTRACT

gff2ps is a program for visualizing annotations of genomic sequences. The program takes the annotated features on a genomic sequence in GFF format as input, and produces a visual output in PostScript. While it can be used in a very simple way, it also allows for a great degree of customization through a number of options and/or customization files.


Subject(s)
Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods , Software , Computational Biology , Genome
9.
Genome Res ; 10(11): 1743-56, 2000 Nov.
Article in English | MEDLINE | ID: mdl-11076860

ABSTRACT

UEV proteins are enzymatically inactive variants of the E2 ubiquitin-conjugating enzymes that regulate noncanonical elongation of ubiquitin chains. In Saccharomyces cerevisiae, UEV is part of the RAD6-mediated error-free DNA repair pathway. In mammalian cells, UEV proteins can modulate c-FOS transcription and the G2-M transition of the cell cycle. Here we show that the UEV genes from phylogenetically distant organisms present a remarkable conservation in their exon-intron structure. We also show that the human UEV1 gene is fused with the previously unknown gene Kua. In Caenorhabditis elegans and Drosophila melanogaster, Kua and UEV are in separated loci, and are expressed as independent transcripts and proteins. In humans, Kua and UEV1 are adjacent genes, expressed either as separate transcripts encoding independent Kua and UEV1 proteins, or as a hybrid Kua-UEV transcript, encoding a two-domain protein. Kua proteins represent a novel class of conserved proteins with juxtamembrane histidine-rich motifs. Experiments with epitope-tagged proteins show that UEV1A is a nuclear protein, whereas both Kua and Kua-UEV localize to cytoplasmic structures, indicating that the Kua domain determines the cytoplasmic localization of Kua-UEV. Therefore, the addition of a Kua domain to UEV in the fused Kua-UEV protein confers new biological properties to this regulator of variant polyubiquitination.


Subject(s)
Biopolymers/metabolism , Ligases/genetics , Recombination, Genetic , Saccharomyces cerevisiae Proteins , Transcription Factors , Ubiquitins/metabolism , Amino Acid Sequence , Animals , Base Sequence , Caenorhabditis elegans/genetics , Conserved Sequence/genetics , Gene Expression Profiling , HeLa Cells , Humans , Introns/genetics , Jurkat Cells , Ligases/isolation & purification , Mice , Molecular Sequence Data , Multigene Family/genetics , Polyubiquitin , Tumor Cells, Cultured , Ubiquitin-Conjugating Enzymes
10.
Genome Res ; 10(10): 1631-42, 2000 Oct.
Article in English | MEDLINE | ID: mdl-11042160

ABSTRACT

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.


Subject(s)
Computational Biology/methods , DNA/chemistry , DNA/genetics , Genes/genetics , Base Composition , Chromosomes, Artificial/chemistry , Chromosomes, Artificial/genetics , Humans , Reproducibility of Results , Software
11.
Genome Res ; 10(4): 511-5, 2000 Apr.
Article in English | MEDLINE | ID: mdl-10779490

ABSTRACT

GeneID is a program, designed with a hierarchical structure, to predict genes in anonymous genomic sequences. In the first step, splice sites, and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons. In this paper we describe how the PWMs for sites and the Markov model of coding DNA were obtained for Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools, but that GeneID is likely to be more efficient in terms of speed and memory usage.
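The first two steps described in this abstract (scoring sites with a PWM, then scoring an exon as the sum of its site scores plus a coding log-likelihood ratio) can be illustrated with a minimal Python sketch. This is not GeneID's code: the matrix values, the function names and the threshold are assumptions chosen for illustration, and the coding log-likelihood ratio is simply passed in as a number rather than computed from a Markov model.

```python
# Hypothetical toy position weight matrix for a 3-base site.
# Each column maps a base to a log-odds score; the values are illustrative only.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 0.5},
]


def score_site(pwm, site):
    """Sum the per-position log-odds scores of one candidate site."""
    if len(site) != len(pwm):
        raise ValueError("site length must match the PWM width")
    return sum(column[base] for column, base in zip(pwm, site))


def scan_sites(pwm, seq, threshold=0.0):
    """Return (position, score) for every window scoring above the threshold."""
    width = len(pwm)
    hits = []
    for pos in range(len(seq) - width + 1):
        score = score_site(pwm, seq[pos:pos + width])
        if score > threshold:
            hits.append((pos, score))
    return hits


def exon_score(start_site_score, end_site_score, coding_llr):
    """Exon score: the two defining site scores plus the coding log-likelihood ratio."""
    return start_site_score + end_site_score + coding_llr
```

For example, `scan_sites(TOY_PWM, "CCAGTCC")` reports the single window `AGT` at position 2, the only one whose summed score exceeds the threshold.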


Subject(s)
Databases, Factual , Drosophila melanogaster/genetics , Genes, Insect/genetics , Software , Alcohol Dehydrogenase/genetics , Algorithms , Animals , Computational Biology , Drosophila melanogaster/enzymology
12.
Brief Bioinform ; 1(4): 381-8, 2000 Nov.
Article in English | MEDLINE | ID: mdl-11465055

ABSTRACT

An important computational technique for extracting the wealth of information hidden in human genomic sequence data is to compare the sequence with that from the corresponding region of the mouse genome, looking for segments that are conserved over evolutionary time. Moreover, the approach generalises to comparison of sequences from any two related species. The underlying rationale (which is abundantly confirmed by observation) is that a random mutation in a functional region is usually deleterious to the organism, and hence unlikely to become fixed in the population, whereas mutations in a non-functional region are free to accumulate over time. The potential value of this approach is so attractive that the public and private projects to sequence the human genome are now turning to sequencing the mouse, and you will soon be able to compare the human and mouse sequences of your favourite genomic region. We are currently witnessing an explosion of computer tools for comparative analysis of two genomic sequences. Here the capabilities of two new network servers for comparing genomic sequences from any pair of closely related species are sketched. The Syntenic Gene Prediction Program SGP-1 utilises sequence comparisons to enhance the ability to locate protein coding segments in genomic data. PipMaker attempts to determine all conserved genomic regions, regardless of their function.


Subject(s)
Computational Biology , Genome , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Animals , Conserved Sequence , Evolution, Molecular , Genome, Human , Humans , Interleukin-13/genetics , Interleukin-4/genetics , Mice , Software
13.
J Comput Biol ; 5(4): 681-702, 1998.
Article in English | MEDLINE | ID: mdl-10072084

ABSTRACT

In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such a highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream of which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
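The core idea in this abstract, scanning exons by increasing start (acceptor) and end (donor) position at the same time while keeping the best score seen so far, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published algorithm: exons are plain `(start, end, score)` tuples, frame compatibility and the Gene Model are ignored, and the function name `best_gene_score` is hypothetical. The scan itself is linear; the two sorts add O(n log n) unless the exons already arrive sorted.

```python
def best_gene_score(exons):
    """Highest total score of a chain of non-overlapping exons.

    exons: list of (start, end, score) tuples with start <= end.
    Simplification: any exon may follow any exon that ends before it starts.
    """
    by_start = sorted(range(len(exons)), key=lambda i: exons[i][0])
    by_end = sorted(range(len(exons)), key=lambda i: exons[i][1])

    best_ending = [0.0] * len(exons)  # best chain score ending at each exon
    best_so_far = 0.0                 # best chain among exons already "closed"
    overall = 0.0
    j = 0                             # cursor into the end-sorted list

    for i in by_start:
        start, _end, score = exons[i]
        # Close every exon that ends before this one starts. Such an exon
        # also starts earlier, so its best_ending value is already final.
        while j < len(by_end) and exons[by_end[j]][1] < start:
            best_so_far = max(best_so_far, best_ending[by_end[j]])
            j += 1
        best_ending[i] = best_so_far + score
        overall = max(overall, best_ending[i])
    return overall
```

Because `best_so_far` is stored and updated rather than recomputed, each exon is visited once in each ordering, which is where the linear scan comes from.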


Subject(s)
Algorithms , Exons , Genes , Models, Genetic , Linear Models , Software
14.
J Comput Aided Mol Des ; 11(4): 395-408, 1997 Jul.
Article in English | MEDLINE | ID: mdl-9334905

ABSTRACT

The three-dimensional modelling of proteins is a useful tool to fill the gap between the number of sequenced proteins and the number of experimentally known 3D structures. However, when the degree of homology between the protein and the available 3D templates is low, model building becomes a difficult task and the reliability of the results depends critically on the correctness of the sequence alignment. For this reason, we have undertaken the modelling of human cytochrome P450 1A2 starting with a careful analysis of several sequence alignment strategies (multiple sequence alignments and the TOPITS threading technique). The best results were obtained using TOPITS followed by a manual refinement to avoid unlikely gaps. Because TOPITS uses secondary structure predictions, several methods that are available for this purpose (Levin, Gibrat, DPM, NnPredict, PHD, SOPM and NNSP) have also been evaluated on cytochromes P450 with known 3D structures. More reliable predictions on alpha-helices have been obtained with PHD, which is the method implemented in TOPITS. Thus, a 3D model for human cytochrome P450 1A2 has been built using the known crystal coordinates of P450 BM3 as the template. The model was refined using molecular mechanics computations. The model obtained shows a consistent location of the substrate recognition segments previously postulated for the CYP2 family members. The interaction of caffeine and a carcinogenic aromatic amine (MeIQ), which are characteristic P450 1A2 substrates, has been investigated. The substrates were solvated taking into account their molecular electrostatic potential distributions. The docking of the solvated substrates in the active site of the model was explored with the AUTODOCK programme, followed by molecular mechanics optimisation of the most interesting complexes. Stable complexes were obtained that could explain the oxidation of the considered substrates by cytochrome P450 1A2 and could offer an insight into the role played by water molecules.


Subject(s)
Bacterial Proteins , Caffeine/metabolism , Computer Simulation , Cytochrome P-450 CYP1A2/chemistry , Cytochrome P-450 CYP1A2/metabolism , Cytochrome P-450 Enzyme System/chemistry , Mixed Function Oxygenases/chemistry , Models, Molecular , Protein Conformation , Quinolines/metabolism , Amino Acid Sequence , Binding Sites , Caffeine/chemistry , Conserved Sequence , Humans , Molecular Sequence Data , NADPH-Ferrihemoprotein Reductase , Protein Structure, Secondary , Quinolines/chemistry , Sequence Alignment , Sequence Homology, Amino Acid , Software
15.
J Mol Med (Berl) ; 75(6): 389-93, 1997 Jun.
Article in English | MEDLINE | ID: mdl-9231878
16.
Comput Chem ; 21(4): 215-22, 1997.
Article in English | MEDLINE | ID: mdl-9415986

ABSTRACT

As the Human Genome Project enters the large-scale sequencing phase, computational gene identification methods are becoming essential for the automatic analysis and annotation of large uncharacterized genomic sequences. Currently available computer programs relying mainly on sequence coding statistics are of great use in pin-pointing regions in genomic sequences containing exons. Such programs perform rather poorly, however, when the problem is to fully elucidate gene structure. For this problem, the DNA sequence signals involved in the specification of the genes--start sites and splice sites--carry a lot of information, and simple methods relying on such information can predict gene structure with an accuracy to some extent comparable to that of other more sophisticated computational methods.


Subject(s)
Genes , Genetic Techniques , Human Genome Project , Base Sequence , DNA/chemistry , DNA/genetics , Exons , Humans , RNA Splicing , Software
17.
Mol Phylogenet Evol ; 6(2): 189-213, 1996 Oct.
Article in English | MEDLINE | ID: mdl-8899723

ABSTRACT

Support for contradictory phylogenies is often obtained when molecular sequence data from different genes is used to reconstruct phylogenetic histories. Contradictory phylogenies can result from many data anomalies including unrecognized paralogy. Paralogy, defined as the reconstruction of a phylogenetic tree from a mixture of genes generated by duplications, has generally not been formally included in phylogenetic reconstructions. Here we undertake the task of reconstructing a single most likely evolutionary relationship among a range of taxa from a large set of apparently inconsistent gene trees. Under the assumption that differences among gene trees can be explained by gene duplications, and consequent losses, we have developed a method to obtain the global phylogeny minimizing the total number of postulated duplications and losses and to trace back such individual gene duplications to global genome duplications. We have used this method to infer the most likely phylogenetic relationship among 16 major higher eukaryotic taxa from the sequences of 53 different genes. Only five independent genome duplication events need to be postulated in order to explain the inconsistencies among these trees.
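The quantity being minimized in this abstract can be illustrated for a single gene tree and a fixed candidate species tree. The Python sketch below counts postulated duplications with the standard least-common-ancestor mapping: a gene-tree node is counted as a duplication when it maps to the same species-tree node as one of its children. The nested-tuple tree encoding and all function names are assumptions made for this example; counting losses and searching over species trees, both part of the method described, are left out.

```python
def leaf_set(tree):
    """Set of species names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def species_clades(species_tree):
    """Leaf sets of every node of the species tree, smallest first."""
    clades = []

    def walk(node):
        clades.append(leaf_set(node))
        if not isinstance(node, str):
            walk(node[0])
            walk(node[1])

    walk(species_tree)
    return sorted(clades, key=len)


def lca_clade(clades, species):
    """Smallest species-tree clade containing all the given species."""
    for clade in clades:
        if species <= clade:
            return clade
    raise ValueError("species missing from the species tree")


def count_duplications(gene_tree, species_tree):
    """Duplications needed to reconcile a gene tree with a species tree.

    Gene-tree leaves are labelled with the species each gene comes from.
    """
    clades = species_clades(species_tree)

    def walk(node):
        if isinstance(node, str):
            return lca_clade(clades, frozenset([node])), 0
        (map_l, dup_l), (map_r, dup_r) = walk(node[0]), walk(node[1])
        here = lca_clade(clades, map_l | map_r)
        is_dup = 1 if here in (map_l, map_r) else 0
        return here, dup_l + dup_r + is_dup

    return walk(gene_tree)[1]
```

A gene tree that matches the species tree needs no duplications, while one that groups human with chicken against a species tree grouping human with mouse needs one.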


Subject(s)
Phylogeny , Algorithms , Animals , Biological Evolution , Genes , Models, Biological , Multigene Family , Species Specificity
18.
Genomics ; 34(3): 353-67, 1996 Jun 15.
Article in English | MEDLINE | ID: mdl-8786136

ABSTRACT

We evaluate a number of computer programs designed to predict the structure of protein coding genes in genomic DNA sequences. Computational gene identification is set to play an increasingly important role in the development of the genome projects, as emphasis turns from mapping to large-scale sequencing. The evaluation presented here serves both to assess the current status of the problem and to identify the most promising approaches to ensure further progress. The programs analyzed were uniformly tested on a large set of vertebrate sequences with simple gene structure, and several measures of predictive accuracy were computed at the nucleotide, exon, and protein product levels. The results indicated that the predictive accuracy of the programs analyzed was lower than originally found. The accuracy was even lower when considering only those sequences that had recently been entered and that did not show any similarity to previously entered sequences. This indicates that the programs are overly dependent on the particularities of the examples they learn from. For most of the programs, accuracy in this test set ranged from 0.60 to 0.70 as measured by the Correlation Coefficient (where 1.0 corresponds to a perfect prediction and 0.0 is the value expected for a random prediction), and the average percentage of exons exactly identified was less than 50%. Only those programs including protein sequence database searches showed substantially greater accuracy. The accuracy of the programs was severely affected by relatively high rates of sequence errors. Since the set on which the programs were tested included only relatively short sequences with simple gene structure, the accuracy of the programs is likely to be even lower when used for large uncharacterized genomic sequences with complex structure. While in such cases, programs currently available may still be of great use in pinpointing the regions likely to contain exons, they are far from being powerful enough to elucidate their genomic structure completely.
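The nucleotide-level Correlation Coefficient mentioned in this abstract can be computed from per-base coding/noncoding labels. The Python sketch below assumes the usual definition (the Matthews correlation between actual and predicted labels) and the gene-finding convention in which specificity is the fraction of predicted coding bases that are truly coding; the function name and the list-of-0/1 input format are choices made for this example, not taken from the paper.

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Return (sensitivity, specificity, correlation coefficient).

    actual, predicted: equal-length sequences of 0/1 flags, one per base,
    where 1 means the base is coding. Undefined ratios are returned as NaN.
    """
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # coding, predicted coding
    tn = sum(1 for a, p in pairs if not a and not p)  # noncoding, predicted noncoding
    fp = sum(1 for a, p in pairs if not a and p)      # noncoding, predicted coding
    fn = sum(1 for a, p in pairs if a and not p)      # coding, predicted noncoding

    nan = float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else nan
    specificity = tp / (tp + fp) if tp + fp else nan
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else nan
    return sensitivity, specificity, cc
```

A perfect prediction gives a coefficient of 1.0; a prediction that gets three of four coding bases and three of four noncoding bases right gives 0.5.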


Subject(s)
DNA/chemistry , Genes , Models, Genetic , Proteins/genetics , Software , Alternative Splicing , Animals , DNA/genetics , Exons , Humans , Information Systems , Mathematics , Probability , Protein Biosynthesis , Proteins/chemistry , Pseudogenes , Reproducibility of Results , Sensitivity and Specificity , Vertebrates
19.
J Mol Biol ; 253(1): 51-60, 1995 Oct 13.
Article in English | MEDLINE | ID: mdl-7473716

ABSTRACT

We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in behavior of a number of such statistics are also observed, however, when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C+G content, we have observed that a number of them are strongly dependent on C+G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C+G content. A+T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while C+G richer non-coding genic DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C+G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.


Subject(s)
Base Sequence/genetics , DNA/genetics , Genes/genetics , Algorithms , Base Composition , Databases, Factual , Discriminant Analysis , Humans , Open Reading Frames/genetics , Proteins/genetics
20.
Nucleic Acids Res ; 21(12): 2837-44, 1993 Jun 25.
Article in English | MEDLINE | ID: mdl-8332493

ABSTRACT

A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a 'coding statistic' is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C. elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.
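The decomposition step described in this abstract, splitting the distribution of a window statistic into two normal components, can be sketched with a small expectation-maximization (EM) fit. EM is used here as a convenient stand-in; the abstract does not say which fitting procedure the authors used. The function names, the initialization and the fixed iteration count are all assumptions, and the sketch does not guard against degenerate input such as identical values.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normals(values, iterations=100):
    """Fit w*N(m0, s0) + (1 - w)*N(m1, s1) to the values by EM.

    Returns (w, m0, s0, m1, s1), where component 0 starts at the low end
    of the data and component 1 at the high end.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    w, m0, s0, m1, s1 = 0.5, min(values), sd, max(values), sd

    for _ in range(iterations):
        # E-step: probability that each window belongs to component 0.
        resp = []
        for v in values:
            p0 = w * normal_pdf(v, m0, s0)
            p1 = (1.0 - w) * normal_pdf(v, m1, s1)
            resp.append(p0 / (p0 + p1))

        # M-step: re-estimate weight, means and standard deviations.
        r0 = sum(resp)
        r1 = n - r0
        m0 = sum(r * v for r, v in zip(resp, values)) / r0
        m1 = sum((1.0 - r) * v for r, v in zip(resp, values)) / r1
        s0 = math.sqrt(sum(r * (v - m0) ** 2 for r, v in zip(resp, values)) / r0)
        s1 = math.sqrt(sum((1.0 - r) * (v - m1) ** 2 for r, v in zip(resp, values)) / r1)
        w = r0 / n

    return w, m0, s0, m1, s1
```

If the coding statistic is higher in coding windows, `1 - w` is the estimated fraction of coding windows, which under this reading serves as the coding density estimate.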


Subject(s)
Base Composition , Codon , DNA/chemistry , Proteins/genetics , Animals , Caenorhabditis elegans/genetics , Cosmids , DNA/analysis , Genes, Fungal , Humans , Sequence Analysis, DNA , Statistics as Topic