Results 1 - 20 of 41
1.
Reproduction ; 138(2): 289-99, 2009 Aug.
Article in English | MEDLINE | ID: mdl-19465487

ABSTRACT

Genome reprogramming is the ability of a nucleus to modify its epigenetic characteristics and gene expression pattern when placed in a new environment. Low efficiency of mammalian cloning is attributed to the incomplete and aberrant nature of genome reprogramming after somatic cell nuclear transfer (SCNT) in oocytes. To date, the aspects of genome reprogramming critical for full-term development after SCNT remain poorly understood. To identify the key elements of this process, changes in gene expression during maternal-to-embryonic transition in normal bovine embryos and changes in gene expression between donor cells and SCNT embryos were compared using a new cDNA array dedicated to embryonic genome transcriptional activation in the bovine. Three groups of transcripts were most affected during somatic reprogramming: endogenous long terminal repeat (LTR) retrotransposons and mitochondrial transcripts were up-regulated, while genes encoding ribosomal proteins were down-regulated. These unexpected data reveal specific categories of transcripts most sensitive to somatic reprogramming and likely affecting viability of SCNT embryos. Importantly, massive transcriptional activation of LTR retrotransposons resulted in similar levels of their transcripts in SCNT and fertilized embryos. Taken together, these results open a new avenue in the quest to understand nuclear reprogramming driven by oocyte cytoplasm.


Subject(s)
Cellular Reprogramming , Embryo, Mammalian/physiology , Gene Expression Regulation, Developmental , Genome , Retroelements/genetics , Animals , Cattle , Cloning, Organism , Embryonic Development/genetics , Epigenesis, Genetic , Fertilization , Gene Expression , Gene Expression Profiling/methods , Nuclear Transfer Techniques , Oligonucleotide Array Sequence Analysis , Reverse Transcriptase Polymerase Chain Reaction
2.
Comput Chem ; 26(5): 511-9, 2002 Jul.
Article in English | MEDLINE | ID: mdl-12144179

ABSTRACT

In the framework of genome annotation, the scientific literature is the major source of biological knowledge. The aim of the work described in this paper is to exploit this source of data for the model plant Arabidopsis thaliana. The first step consisted of assembling a relevant dataset of bibliographic references for plant genomic research. Gene co-citations were then systematically annotated in this reference dataset, starting from the simple idea that if genes are cited in the same publication, they probably share some related functional properties. To deal with the problem of synonymous gene names, a gene name reference list was compiled from A. thaliana SwissProt entries. This list was used to build clusters of co-cited genes by a single linkage procedure such that any gene in a given cluster possesses at least one co-cited partner in the same cluster. Analysis of the clusters demonstrates the biological consistency of this approach, with only very few fortuitous links. As an example, a cluster including genes related to flowering time is described in more detail in the paper. Finally, a graphical representation of each cluster was produced, which provides a convenient way to retrieve the genes (the nodes of the graphs) and the references in which they were co-cited (the edges of the graphs). All the results can be accessed at the URL http://chlora.Igi.infobiogen.fr:1234/bib_arath/.
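
The single linkage procedure described above amounts to finding the connected components of a graph whose nodes are genes and whose edges are co-citations. The following is a minimal sketch of that idea, not the authors' implementation; the function name `cocitation_clusters`, the reference IDs, and the toy citation data are all hypothetical, and gene names are assumed to have been mapped to canonical names beforehand.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_clusters(papers):
    """Group genes into single-linkage clusters of co-cited genes.

    `papers` maps a (hypothetical) reference ID to the set of canonical
    gene names cited in it. Two genes are linked when at least one
    reference cites both; clusters are the connected components.
    """
    parent = {}

    def find(gene):
        # Walk up to the cluster representative, halving the path as we go.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for genes in papers.values():
        for a, b in combinations(sorted(genes), 2):
            union(a, b)  # every pair cited together becomes one edge

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    # Genes cited alone never enter `parent`, so every cluster returned
    # here contains at least two genes, each with a co-cited partner.
    return sorted(sorted(members) for members in clusters.values())


# Toy data: the reference-to-gene assignments are invented for illustration.
toy_papers = {
    "ref1": {"FT", "CO"},
    "ref2": {"CO", "FLC"},
    "ref3": {"TUB1", "TUB2"},
    "ref4": {"PHYA"},
}
print(cocitation_clusters(toy_papers))
```

In this toy input, FT and FLC are never cited together but end up in the same cluster because each is co-cited with CO, which is the defining property of single linkage.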


Subject(s)
Arabidopsis/genetics , Computational Biology/methods , Databases, Bibliographic , Genome, Plant , Physical Chromosome Mapping/methods , Arabidopsis Proteins/genetics , Cluster Analysis , Databases, Protein , Genes, Plant/genetics , Internet , Knowledge , Molecular Sequence Data , Research , Species Specificity , Terminology as Topic
3.
Bioinformatics ; 18(3): 490-1, 2002 Mar.
Article in English | MEDLINE | ID: mdl-11934752

ABSTRACT

SUMMARY: GeneANOVA is ANOVA-based software for the analysis of gene expression data. AVAILABILITY: GeneANOVA is freely available on request for non-commercial use.


Subject(s)
Analysis of Variance , Gene Expression/genetics , Genetic Variation/genetics , Software , User-Computer Interface , Databases, Genetic , Oligonucleotide Array Sequence Analysis
4.
J Comput Biol ; 8(4): 381-99, 2001.
Article in English | MEDLINE | ID: mdl-11571074

ABSTRACT

We propose and study a new approach for the analysis of families of protein sequences. This method is related to the LogDet distances used in phylogenetic reconstructions; it can be viewed as an attempt to embed these distances into a multidimensional framework. The proposed method starts by associating a Markov matrix to each pairwise alignment deduced from a given multiple alignment. The central objects under consideration here are matrix-valued logarithms L of these Markov matrices, which exist under conditions that are compatible with fairly large divergence between the sequences. These logarithms allow us to compare data from a family of aligned proteins with simple models (in particular, continuous reversible Markov models) and to test the adequacy of such models. If one neglects fluctuations arising from the finite length of sequences, any continuous reversible Markov model with a single rate matrix Q over an arbitrary tree predicts that all the observed matrices L are multiples of Q. Our method exploits this fact, without relying on any tree estimation. We test this prediction on a family of proteins encoded by the mitochondrial genome of 26 multicellular animals, which include vertebrates, arthropods, echinoderms, molluscs, and nematodes. A principal component analysis of the observed matrices L shows that a single rate model can be used as a rough approximation to the data, but that systematic deviations from any such model are unmistakable and related to the evolutionary history of the species under consideration.
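
The prediction that every observed matrix L is a multiple of Q can be sketched in a few lines. The notation below is an assumption introduced for illustration: t_i and t_j denote the branch lengths from the common ancestor to sequences i and j, P_ij is the Markov matrix for the pair, and finite-length fluctuations are ignored. Reversibility is what allows the two branches to be composed into a single process governed by the same Q.

```latex
\begin{align*}
P_{ij} &= e^{t_i Q}\, e^{t_j Q} = e^{t_{ij} Q}, \qquad t_{ij} = t_i + t_j, \\
L_{ij} &= \log P_{ij} = t_{ij}\, Q, \\
-\ln \det P_{ij} &= -\operatorname{tr} L_{ij} = -t_{ij} \operatorname{tr} Q .
\end{align*}
```

The last line uses the identity det(exp A) = exp(tr A). It suggests why the method can be read as a multidimensional extension of LogDet-type distances: such a distance keeps only the trace of L, whereas the matrix L retains the full direction in matrix space. The normalizations used in actual LogDet distances are omitted here.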


Subject(s)
Computational Biology , Proteins/genetics , Sequence Alignment/statistics & numerical data , Computer Simulation , DNA, Mitochondrial/genetics , Evolution, Molecular , Markov Chains , Phylogeny , Sequence Analysis, Protein/statistics & numerical data , Stochastic Processes
5.
Genome Biol ; 2(6): RESEARCH0019, 2001.
Article in English | MEDLINE | ID: mdl-11423008

ABSTRACT

BACKGROUND: In global gene expression profiling experiments, variation in the expression of genes of interest can often be hidden by general noise. To determine how biologically significant variation can be distinguished under such conditions we have analyzed the differences in gene expression when Bacillus subtilis is grown either on methionine or on methylthioribose as sulfur source. RESULTS: An unexpected link between arginine metabolism and sulfur metabolism was discovered, enabling us to identify a high-affinity arginine transport system encoded by the yqiXYZ genes. In addition, we tentatively identified a methionine/methionine sulfoxide transport system which is encoded by the operon ytmIJKLMhisP and is presumably used in the degradation of methionine sulfoxide to methane sulfonate for sulfur recycling. Experimental parameters resulting in systematic biases in gene expression were also uncovered. In particular, we found that the late competence operons comE, comF and comG were associated with subtle variations in growth conditions. CONCLUSIONS: Using variance analysis it is possible to distinguish between systematic biases and relevant gene-expression variation in transcriptome experiments. Co-variation of metabolic gene expression pathways was thus uncovered linking nitrogen and sulfur metabolism in B. subtilis.


Subject(s)
Arginine/metabolism , Bacillus subtilis/genetics , Bacillus subtilis/metabolism , Gene Expression Regulation, Bacterial , Methionine/analogs & derivatives , Methionine/metabolism , Bacillus subtilis/growth & development , Gene Expression Profiling , Genes, Bacterial , Genetic Variation , Membrane Transport Proteins/genetics , Mutation , Oligonucleotide Array Sequence Analysis , Operon , RNA, Bacterial/biosynthesis , Sulfur/metabolism , Thioglycosides/metabolism
6.
Mol Biol Evol ; 18(7): 1231-45, 2001 Jul.
Article in English | MEDLINE | ID: mdl-11420363

ABSTRACT

Previous analyses of retroviral nucleotide sequences suggest a so-called "scrambled duplicative stepwise molecular evolution" (many sectors with successive duplications/deletions of short and longer motifs) that could have stemmed from one or several starter tandemly repeated short sequence(s). In the present report, we tested this hypothesis by focusing on the long terminal repeats (LTRs) (and flanking sequences) of 24 human and 3 simian immunodeficiency viruses. By using a calculation strategy applicable to short sequences, we found consensus overrepresented motifs (often containing CTG or CAG) that were congruent with the previously defined "retroviral signature." We also show many local repetition patterns that are significant when compared with simply shuffled sequences. First- and second-order Markov chain analyses demonstrate that a major portion of the overrepresented oligonucleotides can be predicted from the dinucleotide compositions of the sequences, but by no means can biological mechanisms be deduced from these results: some of the listed local repetitions remain significant against dinucleotide-conserving shuffled sequences; together with previous results, this suggests that interspersed and/or local mononucleotide and oligonucleotide repetitions could have biased the dinucleotide compositions of the sequences. We searched for suggestive evolutionary patterns by scrutinizing a reliable multiple alignment of the 27 sequences. A manually constructed alignment based on homology blocks was in good agreement with the polypeptide alignment in the coding sectors and has been exhaustively assessed by using a multiplied alphabet obtained by the promising mathematical strategy called the N-block presentation (taking into account the environment of each nucleotide in a sequence). Sector by sector, we hypothesize many successive duplication/deletion scenarios that fit our previous evolutionary hypotheses. This suggests an important duplication/deletion role for the reverse transcriptase, particularly in inducing stuttering cryptic simplicity patterns.


Subject(s)
Evolution, Molecular , HIV Long Terminal Repeat , HIV-1/genetics , HIV-2/genetics , Algorithms , Animals , Base Sequence , Consensus Sequence , DNA, Viral/genetics , Humans , Models, Genetic , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Deletion , Simian Immunodeficiency Virus/genetics
7.
Comput Chem ; 23(3-4): 317-31, 1999 Jun 15.
Article in English | MEDLINE | ID: mdl-10627144

ABSTRACT

The Z-value is an attempt to estimate the statistical significance of a Smith-Waterman dynamic alignment score (SW-score) through the use of a Monte-Carlo process. It partly reduces the bias induced by the composition and length of the sequences. This paper is not a theoretical study on the distribution of SW-scores and Z-values. Rather, it presents a statistical analysis of Z-values on large datasets of protein sequences, leading to a law of probability that the experimental Z-values follow. First, we determine the relationships between the computed Z-value, an estimation of its variance and the number of randomizations in the Monte-Carlo process. Then, we illustrate that Z-values are less correlated to sequence lengths than SW-scores. Then we show that pairwise alignments, performed on 'quasi-real' sequences (i.e., randomly shuffled sequences of the same length and amino acid composition as the real ones) lead to Z-value distributions that statistically fit the extreme value distribution, more precisely the Gumbel distribution (global EVD, Extreme Value Distribution). However, for real protein sequences, we observe an over-representation of high Z-values. We determine first a cutoff value which separates these overestimated Z-values from those which follow the global EVD. We then show that the interesting part of the tail of distribution of Z-values can be approximated by another EVD (i.e., an EVD which differs from the global EVD) or by a Pareto law. This has been confirmed for all proteins analysed so far, whether extracted from individual genomes, or from the ensemble of five complete microbial genomes comprising altogether 16956 protein sequences.
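
The Monte-Carlo procedure behind the Z-value can be sketched as follows: score the real pair, score the first sequence against many shuffled copies of the second, and express the real score in standard deviations above the shuffled mean. This is a minimal illustration, not the paper's code. The match/mismatch/gap scores stand in for a real substitution matrix, shuffling only the second sequence is one possible convention, and the names `sw_score` and `z_value` are hypothetical.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, seed=0):
    """Z-value of the SW-score of (a, b) against shuffled versions of b.

    Shuffling preserves the length and composition of b, as in the
    'quasi-real' sequences mentioned in the abstract.
    """
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(sw_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean) / sd
```

The `n_shuffles` parameter corresponds to the number of randomizations whose effect on the variance of the estimate the abstract discusses; a larger value gives a more stable Z-value at proportionally higher cost.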


Subject(s)
Genome, Bacterial , Genome, Fungal , Sequence Alignment , Computing Methodologies , Escherichia coli/genetics , Mathematics , Monte Carlo Method , Saccharomyces cerevisiae/genetics
8.
FEMS Microbiol Rev ; 22(4): 207-27, 1998 Oct.
Article in English | MEDLINE | ID: mdl-9862121

ABSTRACT

The present article describes a genome database reviewing gene-related knowledge of two model bacteria, Bacillus subtilis and Escherichia coli. The database, Indigo, is open through the World-Wide Web (http://indigo.genetique.uvsq.fr). The concept used for organising the data, the concept of neighbourhood, allows one to explore the database content in an efficient although somewhat unusual way. Here, genes are related to each other by a variety of neighbourhoods, including proximity in the chromosome, phylogenetic kinship, participation in a common metabolic pathway, common presence in an article of the literature, or similar use of the genetic code. Several examples illustrate how this concept of neighbourhood permits one to review the available knowledge about a given gene or gene family, and elaborate unexpected, but revealing, analyses about gene functions.


Subject(s)
Bacillus subtilis/genetics , Databases as Topic , Escherichia coli/genetics , Genome, Bacterial , Bacillus subtilis/classification , Escherichia coli/classification , Genes, Bacterial/genetics , Ligases/genetics , RNA, Transfer/classification
9.
Gene ; 209(1-2): GC1-GC38, 1998 Mar 16.
Article in English | MEDLINE | ID: mdl-9583944

ABSTRACT

In this paper, the relationship between codon usage and the physiological pattern of expression of a gene is investigated using a dataset of 815 nuclear genes of Arabidopsis thaliana. Factorial Correspondence Analysis, a multivariate statistical approach commonly used in codon usage analysis, was used to analyse codon usage bias gene by gene. The analysis reveals a single major trend in codon usage among genes in Arabidopsis. At one end of the trend lie genes with a highly G/C-biased codon usage. This group contains mainly photosynthetic and housekeeping genes, which are known to encode the most abundant proteins of the plant cell. At the other extreme lie genes with a weaker A/T-biased codon usage. This group contains genes with various functions which most of the time exhibit a strong tissue-specific pattern of expression in relation, for example, to stress conditions. These observations were confirmed by the detailed analysis of codon usage in the multigene family of tubulins and appear to be general in plant species, even as distant from Arabidopsis thaliana as a monocotyledonous plant such as maize.


Subject(s)
Arabidopsis/genetics , Codon/genetics , Databases, Factual , Genes, Plant , Base Composition , Base Sequence , Genome, Plant , Plant Proteins/genetics
10.
Electrophoresis ; 19(4): 515-27, 1998 Apr.
Article in English | MEDLINE | ID: mdl-9588797

ABSTRACT

The present availability of the genomic text of bacteria allows assignment of known biological functions to many genes (typically, half of the genome's gene content). It is now time to try to predict new, unexpected functions, using inductive procedures that allow correlating the content of the genomic text to possible biological functions. We show here that analysis of the genomes of Escherichia coli and Bacillus subtilis for the distribution of AGCT motifs predicts that genes exist for which the mRNA molecule can be translated as several different proteins synthesized after ribosomal frameshifting or hopping. Among these genes we found that several coded for the same function in E. coli and B. subtilis. We analyzed in depth the situation of the infB gene (experimentally known to specify synthesis of several proteins differing in their translation starts), the aceF/pdhC gene, the eno gene, and the rplI gene. In addition, genes specific to E. coli were also studied: ompA, ompF and tolA (predicting epigenetic variation that could help escape infection by phages or colicins).


Subject(s)
Bacillus subtilis/genetics , Escherichia coli Proteins , Escherichia coli/genetics , Frameshifting, Ribosomal , Genome, Bacterial , Microsatellite Repeats , Acetyltransferases/genetics , Amino Acid Sequence , Bacterial Outer Membrane Proteins/genetics , Bacterial Proteins/genetics , Consensus Sequence , Dihydrolipoyllysine-Residue Acetyltransferase , Glyceraldehyde-3-Phosphate Dehydrogenases/genetics , Mathematical Computing , Molecular Sequence Data , Peptide Initiation Factors/genetics , Phosphopyruvate Hydratase/genetics , Porins/genetics , Prokaryotic Initiation Factor-2 , Pyruvate Dehydrogenase Complex/genetics , RNA, Messenger , Ribosomal Proteins/genetics , Sequence Homology, Amino Acid
11.
DNA Res ; 4(4): 257-65, 1997 Aug 31.
Article in English | MEDLINE | ID: mdl-9405933

ABSTRACT

Analysis of the codon usage of genes coding for the structural components of the outer membrane in Escherichia coli is consistent with the requirement for high expression of these genes. Because porins (which constitute the major protein component of the outer membrane) and LPS (which constitutes the major outermost constituent of the outer membrane) are synthesized from genes displaying widely different codon usage, it is possible to investigate the origin of the outer membrane. The analysis predicts that the outer membrane might originate from a genome other than the genome coding for the major part of the cell. Such a special origin would explain, in structural terms, the likely lethality of porins if they were inadvertently inserted within the inner membrane, giving rise to the Gram-negative bacterial type, having an envelope comprising two membranes, instead of a single cytoplasmic membrane and a murein sacculus.


Subject(s)
Bacterial Outer Membrane Proteins/genetics , Codon , Escherichia coli/genetics , Genome, Bacterial , RNA, Transfer/genetics
12.
Comput Appl Biosci ; 13(2): 131-6, 1997 Apr.
Article in English | MEDLINE | ID: mdl-9146959

ABSTRACT

MOTIVATION: Compression algorithms can be used to analyse genetic sequences. A compression algorithm tests a given property on the sequence and uses it to encode the sequence: if the property is true, it reveals some structure of the sequence which can be described briefly; this yields a description of the sequence which is shorter than the sequence of nucleotides given in extenso. The more a sequence is compressed by the algorithm, the more significant is the property for that sequence. RESULTS: We present a compression algorithm that tests the presence of a particular type of dosDNA (defined ordered sequence-DNA): approximate tandem repeats of small motifs (i.e. of lengths < 4). This algorithm has been tested on four yeast chromosomes. The presence of approximate tandem repeats seems to be a uniform structural property of yeast chromosomes.
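
The principle of compression as a significance test can be illustrated with a deliberately simplified sketch. Unlike the algorithm in the abstract, the code below handles only exact tandem repeats, and its cost model is an assumption: 2 bits per literal nucleotide, and a fixed `RUN_HEADER_BITS` overhead plus 2 bits per motif base for each encoded run. All names are hypothetical.

```python
RUN_HEADER_BITS = 16  # assumed cost of flagging a run and storing its copy count


def find_run(seq, start, max_motif=3):
    """Return (motif_len, copies) for the longest exact tandem run at `start`.

    Only motifs of length 1..max_motif are tried, matching the abstract's
    restriction to motifs of length < 4. Returns (1, 1) when nothing repeats.
    """
    best = (1, 1)
    for m in range(1, max_motif + 1):
        motif = seq[start:start + m]
        if len(motif) < m:
            break
        copies = 1
        while seq[start + copies * m:start + (copies + 1) * m] == motif:
            copies += 1
        if copies > 1 and m * copies > best[0] * best[1]:
            best = (m, copies)
    return best


def compressed_bits(seq, max_motif=3):
    """Estimate the encoded size of `seq` in bits under the assumed cost model."""
    bits = 0
    i = 0
    while i < len(seq):
        m, copies = find_run(seq, i, max_motif)
        run_cost = RUN_HEADER_BITS + 2 * m
        literal_cost = 2 * m * copies
        if copies > 1 and run_cost < literal_cost:
            bits += run_cost      # encode the whole run as (motif, count)
            i += m * copies
        else:
            bits += 2             # encode a single nucleotide literally
            i += 1
    return bits


def compression_gain(seq):
    """Bits saved relative to the plain 2-bits-per-nucleotide encoding."""
    return 2 * len(seq) - compressed_bits(seq)
```

Under this model a sequence without tandem structure yields a gain of zero, while a long run of a short motif yields a large gain, which is the sense in which stronger compression indicates a more significant property.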


Subject(s)
Algorithms , DNA/genetics , Repetitive Sequences, Nucleic Acid , Base Sequence , Chromosomes, Fungal/genetics , DNA, Fungal/genetics , Evaluation Studies as Topic , Molecular Sequence Data , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data , Software
13.
J Mol Evol ; 44(2): 214-25, 1997 Feb.
Article in English | MEDLINE | ID: mdl-9069182

ABSTRACT

A computer-assisted analysis was made of 24 complete nucleotide sequences selected from the vertebrate retroviruses to represent the ten viral groups. The conclusions of this analysis extend and strengthen the previously made hypothesis on the Moloney murine leukemia virus: The evolution of the nucleotide sequence appears to have occurred mainly through at least three overlapping levels of duplication: (1) The distributions of overrepresented (3-6)-mers are consistent with the universal rule of a trend toward TG/CT excess and with the persistence of a certain degree of symmetry between the two strands of DNA. This suggests one or several original tandemly repeated sequences and some inverted duplications. (2) The existence of two general core consensuses at the level of these (3-6)-mers supports the hypothesis of a common evolutionary origin of vertebrate retroviruses. Consensuses more specific to certain sequences are compatible with phylogenetic trees established independently. The consensuses could correspond to intermediary evolutionary stages. (3) Most of the (3-6)-mers with a significantly higher than average frequency appear to be internally repeated (with monomeric or oligomeric internal iterations) and seem to be at least partly the cause of the bias observed by other researchers at the level of retroviral nucleotide composition. They suggest a third evolutionary stage by slippage-like stepwise local duplications.


Subject(s)
Base Composition , Evolution, Molecular , Repetitive Sequences, Nucleic Acid/genetics , Retroviridae/genetics , Animals , Consensus Sequence/genetics , DNA, Viral/chemistry , DNA, Viral/genetics , Molecular Sequence Data , Oligodeoxyribonucleotides/genetics , Phylogeny , Vertebrates
14.
Genetica ; 100(1-3): 271-9, 1997.
Article in English | MEDLINE | ID: mdl-9440280

ABSTRACT

We investigate the nucleotide sequences of 23 retroelements (4 mammalian retroviruses, 1 human, 3 yeast, 2 plant, and 13 invertebrate retrotransposons) in terms of their oligonucleotide composition in order to address the problem of the relationship between retrotransposons and retroviruses, and the coadaptation of these retroelements to their host genomes. We have identified by computer analysis over-represented 3- through 6-mers in each sequence. Our results indicate that retrotransposons are heterogeneous in contrast to retroviruses, suggesting different modes of evolution by slippage-like mechanisms. Moreover, we have calculated the Observed/Expected number ratio for each of the 256 tetramers and analysed the data using a multivariate approach. The tetramer composition of retroelement sequences appears to be influenced by host genomic factors like methylase activity.
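
An Observed/Expected ratio for each of the 256 tetramers can be computed as sketched below. The abstract does not say how the expected counts were obtained; this sketch assumes the simplest choice, independent bases drawn with the sequence's own mononucleotide frequencies. The function name is hypothetical, and the input is assumed to contain only A, C, G and T.

```python
from collections import Counter
from itertools import product


def tetramer_oe_ratios(seq):
    """Map each of the 256 tetramers to its Observed/Expected ratio.

    Expected counts assume independent bases with the sequence's own
    mononucleotide frequencies (an assumption, not the paper's stated model).
    The ratio is None when a tetramer contains a base absent from `seq`.
    """
    seq = seq.upper()
    if len(seq) < 4:
        raise ValueError("sequence must contain at least 4 nucleotides")
    n = len(seq)
    base_freq = {base: seq.count(base) / n for base in "ACGT"}
    windows = n - 3  # number of overlapping tetramer positions
    observed = Counter(seq[i:i + 4] for i in range(windows))

    ratios = {}
    for word in ("".join(p) for p in product("ACGT", repeat=4)):
        expected = windows
        for base in word:
            expected *= base_freq[base]
        ratios[word] = observed[word] / expected if expected > 0 else None
    return ratios
```

The resulting 256-value vector per sequence is the kind of table on which a multivariate method could then be run to compare retroelements.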


Subject(s)
Genome , Retroelements/genetics , Retroviridae/genetics , Animals , Base Sequence , Humans , Molecular Sequence Data , Phylogeny
16.
J Mol Biol ; 257(3): 574-85, 1996 Apr 05.
Article in English | MEDLINE | ID: mdl-8648625

ABSTRACT

This work reconsiders the GATC motif distribution in a 1.6 Mb segment of the Escherichia coli genome, compared to its distribution in phages and plasmids. At first sight the distribution of GATC words looks random. But when a realistic model of the chromosome (made of average genes having the same codon usage as in the real chromosome) is used as a theoretical reference, strong biases are observed. GATC pairs such as GATCNNGATC are under-represented while there is a strong positive selection for motifs separated by 10, 19, 70 and 1100 bp. The last class is the only one present in E. coli parasites. It can be ascribed to the triggering sequences of the long-patch mismatch repair system. The 6 bp class overlaps with the consensus of CAP (catabolite activator protein) and FNR (fumarate/nitrate regulator) binding sites, thus accounting for counter-selection. The other classes, which could be targets for a nucleic acid-binding protein, are almost always present inside protein coding sequences, and are members of clusters of GATC motifs. Analysis of the genes containing these motifs suggests that they correspond to a regulatory process monitoring the shift from anaerobic to aerobic growth conditions. In particular this regulation, closing down transcription of a large number of genes involved in intermediary metabolism, would be well suited for the cold and oxygen shift from the mammal's gut to the standard environmental conditions. In this process the methylation status of GATC clusters would be very important for tuning transcription, and a DNA binding protein, probably a member of the cold-shock protein family, would be needed for alleviating the effects mediated by slackening of the pace of methylation during the shift.
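
The raw observation underlying such an analysis is the distribution of distances between neighbouring GATC motifs. A minimal sketch is given below, assuming distances are measured from the start of one motif to the start of the next, so that GATCNNGATC counts as a spacing of 6. The function name is hypothetical, and the comparison against a codon-usage-based reference model described in the abstract is not implemented.

```python
from collections import Counter


def motif_spacings(seq, motif="GATC"):
    """Count start-to-start distances (bp) between consecutive motif hits."""
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one base later so overlapping occurrences are not skipped.
        start = seq.find(motif, start + 1)
    return Counter(b - a for a, b in zip(positions, positions[1:]))


print(motif_spacings("GATCAAGATCGATC"))
```

Over- or under-representation of a spacing class would then be judged by comparing these counts with the counts obtained from sequences generated under the reference model.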


Subject(s)
Bacteriophages/genetics , DNA, Bacterial/genetics , Escherichia coli/genetics , Oligonucleotides/genetics , Plasmids/genetics , Sequence Analysis, DNA , Base Sequence , Molecular Sequence Data
17.
J Mol Biol ; 250(2): 123-7, 1995 Jul 07.
Article in English | MEDLINE | ID: mdl-7608964

ABSTRACT

The availability of specialized sequence databanks for Escherichia coli, Saccharomyces cerevisiae and Bacillus subtilis made it possible to build a set of 105 protein-coding genes that are homologous in these three species. An analysis of the triplets at both the nucleotide and amino acid level revealed that the codon bias of some amino acids is significantly higher at conserved than at non-conserved positions. Comparisons of homologous genes in E. coli and Salmonella typhimurium, and in S. cerevisiae and Drosophila melanogaster, led to the same conclusion. A special case was made for serine in E. coli, whose major codon is AGC for non-conserved and TCC for conserved residues. We interpret this observation as evidence that the primordial codons for serine were TCN, while codons AGY appeared later. This conclusion is substantiated by an analysis of the codon usage of catalytic serine residues in ancient, ubiquitous and essential proteins (ATP synthases and topoisomerases). It is shown that in these proteins the proportion of the catalytic serine residues coded by TCN is significantly higher than that expected from the overall codon usage of serine residues.
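
The two serine codon families discussed above, TCN (TCT, TCC, TCA, TCG) and AGY (AGT, AGC), can be tallied in a coding sequence as sketched below. This is an illustration only: the function name and the toy sequence are hypothetical, the input is assumed to be an in-frame DNA coding sequence, and separating conserved from non-conserved positions would additionally require an alignment that is not modelled here.

```python
SERINE_TCN = {"TCT", "TCC", "TCA", "TCG"}
SERINE_AGY = {"AGT", "AGC"}


def serine_codon_usage(cds):
    """Count serine codons of the TCN and AGY families in an in-frame CDS."""
    cds = cds.upper()
    counts = {"TCN": 0, "AGY": 0}
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in SERINE_TCN:
            counts["TCN"] += 1
        elif codon in SERINE_AGY:
            counts["AGY"] += 1
    return counts


# Toy CDS (invented): ATG TCC AGC TCG TAA
print(serine_codon_usage("ATGTCCAGCTCGTAA"))
```

Applied separately to conserved and non-conserved serine positions, such counts give the proportions that the abstract compares.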


Subject(s)
Biological Evolution , Codon/genetics , Conserved Sequence/genetics , Genetic Code/genetics , Serine/genetics , Amino Acid Sequence , Bacillus subtilis/genetics , Base Sequence , Escherichia coli/genetics , Saccharomyces cerevisiae/genetics
18.
C R Acad Sci III ; 318(5): 599-608, 1995 May.
Article in English | MEDLINE | ID: mdl-7671006

ABSTRACT

Complex genomes contain numerous simple sequence repeats, the biological significance of which remains obscure. Recently it has been shown that several human diseases are the result of changes in such sequences. Thus it has become urgent to undertake a systematic study of their properties. We have set the task of describing as completely as possible the set of sequences which contain bases organized according to symmetrical elements, the dosDNA: defined ordered sequence. Examination of local anomalies in dinucleotide composition serves to identify dosDNA zones in the genome. The study of chromosomes II, III, VIII and XI of Saccharomyces cerevisiae reveals these dosDNA zones comprise about 2% of the genome. They are regularly distributed along the chromosomes, regardless of the functional significance of the sequence. A more detailed analysis of dosDNA segments seems to indicate that simple repeats are the consequence of local properties of the chromosome, and not due to any motif in particular.


Subject(s)
Chromosomes, Fungal/genetics , DNA, Fungal/chemistry , Repetitive Sequences, Nucleic Acid , Saccharomyces cerevisiae/genetics , Base Sequence , Genetic Variation , Molecular Sequence Data
19.
Comput Appl Biosci ; 10(4): 401-8, 1994 Jul.
Article in English | MEDLINE | ID: mdl-7804872

ABSTRACT

A program for assembling sequences by using a global approach has been developed. By successive steps, a more and more precise classification of DNA fragments permits the positioning of the sequences on the contig; after having detected the pairs of overlapping sequences, groups are formed such that all sequences in a group overlap. Sequences common to several groups enable the groups to be ordered in a series. Ambiguities in the order of groups can arise at this stage, due to the presence of repeated fragments; different solutions are then proposed. Putting the groups into order leads to a preclassification of sequences. The fragments are then aligned by group, by searching for words common to all sequences in the group, and using an algorithm of dynamic programming. A detailed example on a set of nine sequences accompanies the description of the method.


Subject(s)
Sequence Analysis, DNA/methods , Software , Algorithms , Base Sequence , DNA/genetics , DNA, Complementary/genetics , Genetic Techniques , Molecular Sequence Data , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data
20.
Microbiol Rev ; 57(3): 623-54, 1993 Sep.
Article in English | MEDLINE | ID: mdl-8246843

ABSTRACT

Several data libraries have been created to organize all the data obtained worldwide about the Escherichia coli genome. Because the known data now amount to more than 40% of the whole genome sequence, it has become necessary to organize the data in such a way that appropriate procedures can associate knowledge produced by experiments about each gene to its position on the chromosome and its relation to other relevant genes, for example. In addition, global properties of genes, affected by the introduction of new entries, should be present as appropriate description fields. A data base, implemented on Macintosh by using the data base management system 4th Dimension, is described. It is constructed around a core constituted by known contigs of E. coli sequences and links data collected in general libraries (unmodified) to data associated with evolving knowledge (with modifiable fields). Biologically significant results obtained through the coupling of appropriate procedures (learning or statistical data analysis) are presented. The data base is available through a 4th Dimension runtime and through FTP on Internet. It has been regularly updated and will be systematically linked to other E. coli data bases (M. Kroger, R. Wahl, G. Schachtel, and P. Rice, Nucleic Acids Res. 20(Suppl.):2119-2144, 1992; K. E. Rudd, W. Miller, C. Werner, J. Ostell, C. Tolstoshev, and S. G. Satterfield, Nucleic Acids Res. 19:637-647, 1991) in the near future.


Subject(s)
Databases, Factual , Escherichia coli/genetics , Genome, Bacterial , Bacterial Proteins/genetics , Base Sequence , Chromosome Mapping , Chromosomes, Bacterial , DNA Replication , Data Display , Database Management Systems , Genes, Bacterial , Models, Theoretical , Molecular Sequence Data , Transcription, Genetic