Search | VHL Regional Portal

1.

A relational schema for both array-based and SAGE gene expression experiments.

Stoeckert, C; Pizarro, A; Manduchi, E; Gibson, M; Brunk, B; Crabtree, J; Schug, J; Shen-Orr, S; Overton, G C.

Bioinformatics ; 17(4): 300-8, 2001 Apr.

Article in English | MEDLINE | ID: mdl-11301298

ABSTRACT

MOTIVATION AND RESULTS: A relational schema is described for capturing highly parallel gene expression experiments using different technologies. This schema grew out of efforts to build a database for collaborators working on different biological systems and using different types of platforms in their gene expression experiments as well as different types of image quantification software. The tables are conceptually organized into three categories of information: Platform, Experiment (which includes image scanning and quantification), and Data. The strengths of the schema are: (i) integrating information on array elements using a gene index; (ii) describing samples using ontologies; (iii) reducing an experiment to a single RNA source for precise descriptions yet not losing the relationships between experiments done at the same time or for the same project; and (iv) maintaining both raw and processed (e.g. cleansed and normalized) data and recording how the data is processed. The result is a novel schema, which can hold both array and non-array data, is extensible for detailed experimental descriptions that are precise and consistent, and allows for meaningful comparisons of genes between experiments.
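
The three-category organization described above (Platform, Experiment, Data) can be illustrated with a minimal sketch using Python's built-in sqlite3 module. All table and column names here are hypothetical; the abstract names only the categories and the design goals (a shared gene index, one RNA source per experiment, raw and processed values kept side by side), not the published tables.

```python
import sqlite3

# Hypothetical table and column names; the abstract gives only the three
# categories (Platform, Experiment, Data), not the actual schema.
SCHEMA = """
CREATE TABLE platform_element (
    element_id INTEGER PRIMARY KEY,
    platform   TEXT NOT NULL,        -- e.g. 'microarray' or 'SAGE'
    gene_index TEXT NOT NULL         -- shared gene identifier across platforms
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    project       TEXT NOT NULL,     -- keeps related experiments linked
    rna_source    TEXT NOT NULL      -- a single RNA source per experiment
);
CREATE TABLE data_point (
    experiment_id   INTEGER NOT NULL REFERENCES experiment(experiment_id),
    element_id      INTEGER NOT NULL REFERENCES platform_element(element_id),
    raw_value       REAL NOT NULL,
    processed_value REAL,            -- e.g. a normalized value
    processing_note TEXT             -- how the processed value was derived
);
"""


def build_demo():
    """Create an in-memory database with two made-up experiments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO platform_element VALUES (?, ?, ?)",
        [(1, "microarray", "GENE_A"), (2, "SAGE", "GENE_A")],
    )
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?)",
        [(10, "demo_project", "sample_1"), (11, "demo_project", "sample_2")],
    )
    conn.executemany(
        "INSERT INTO data_point VALUES (?, ?, ?, ?, ?)",
        [
            (10, 1, 1200.0, 1.5, "scaled to array median"),
            (11, 2, 30.0, 1.2, "tags per 10,000"),
        ],
    )
    return conn


def values_for_gene(conn, gene_index):
    """Return measurements for one gene across platforms and experiments."""
    return conn.execute(
        """
        SELECT p.platform, e.rna_source, d.raw_value, d.processed_value
        FROM data_point d
        JOIN platform_element p ON p.element_id = d.element_id
        JOIN experiment e ON e.experiment_id = d.experiment_id
        WHERE p.gene_index = ?
        ORDER BY e.experiment_id
        """,
        (gene_index,),
    ).fetchall()
```

Calling `values_for_gene(build_demo(), "GENE_A")` returns one array row and one SAGE row for the same gene. The join through a common gene index is what makes that kind of cross-platform comparison possible.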

Subject(s)

Databases, Factual , Gene Expression , Oligonucleotide Array Sequence Analysis

2.

Generation of patterns from gene expression data by assigning confidence to differentially expressed genes.

Manduchi, E; Grant, G R; McKenzie, S E; Overton, G C; Surrey, S; Stoeckert, C J.

Bioinformatics ; 16(8): 685-98, 2000 Aug.

Article in English | MEDLINE | ID: mdl-11099255

ABSTRACT

MOTIVATION: A protocol is described to attach expression patterns to genes represented in a collection of hybridization array experiments. Discrete values are used to provide an easily interpretable description of differential expression. Binning cutoffs for each sample type are chosen automatically, depending on the desired false-positive rate for the predictions of differential expression. Confidence levels are derived for the statement that changes in observed levels represent true changes in expression. We have a novel method for calculating this confidence, which gives better results than the standard methods. Our method reflects the broader change of focus in the field from studying a few genes with many replicates to studying many (possibly thousands of) genes simultaneously, but with relatively few replicates. Our approach differs from standard methods in that it exploits the fact that there are many genes on the arrays. These are used to estimate for each sample type an appropriate distribution that is employed to control the false-positive rate of the predictions made. Satisfactory results can be obtained using this method with as few as two replicates. RESULTS: The method is illustrated through applications to macroarray and microarray datasets. The first is an erythroid development dataset that we have generated using nylon filter arrays. Clones for genes whose expression is known in these cells were assigned expression patterns which are in accordance with what was expected and which are not picked up by the standard methods. Moreover, genes differentially expressed between normal and leukemic cells were identified. These included genes whose expression was altered upon induction of the leukemic cells to differentiate. The second application is to the microarray data by Alizadeh et al. (2000). Our results are in accordance with their major findings and offer confidence measures for the predictions made. They also provide new insights for further analysis.
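
The idea of choosing a binning cutoff from a desired false-positive rate can be sketched in Python as follows. This is not the authors' procedure, which the abstract does not spell out. It assumes a hypothetical list `null_diffs` of differences that reflect no true change (for example, differences between replicates of the same sample type) and takes an empirical quantile of their magnitudes as the cutoff.

```python
def cutoff_for_fpr(null_diffs, fpr):
    """Return a cutoff c such that the fraction of |null_diffs| > c is <= fpr.

    null_diffs: hypothetical differences assumed to reflect no true change.
    fpr: desired false-positive rate, 0 <= fpr < 1.
    """
    if not 0 <= fpr < 1:
        raise ValueError("fpr must satisfy 0 <= fpr < 1")
    mags = sorted(abs(d) for d in null_diffs)
    n = len(mags)
    allowed = int(fpr * n)        # null values permitted above the cutoff
    return mags[n - allowed - 1]


def discretize(diff, cutoff):
    """Map an observed difference to +1 (up), -1 (down), or 0 (no call)."""
    if diff > cutoff:
        return 1
    if diff < -cutoff:
        return -1
    return 0


def expression_pattern(diffs, cutoff):
    """Discrete pattern for one gene across a series of comparisons."""
    return [discretize(d, cutoff) for d in diffs]
```

With many genes on an array, the pool of null differences is large even when each gene has only two replicates. This is the property the abstract points to when it says the approach exploits the number of genes on the arrays.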

Subject(s)

Databases, Factual , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis , Algorithms , Humans , Leukemia, Erythroblastic, Acute/genetics , Nylons , Tumor Cells, Cultured

3.

The genetic program of hematopoietic stem cells.

Phillips, R L; Ernst, R E; Brunk, B; Ivanova, N; Mahan, M A; Deanehan, J K; Moore, K A; Overton, G C; Lemischka, I R.

Science ; 288(5471): 1635-40, 2000 Jun 02.

Article in English | MEDLINE | ID: mdl-10834841

ABSTRACT

Blood cell production originates from a rare population of multipotent, self-renewing stem cells. A genome-wide gene expression analysis was performed in order to define regulatory pathways in stem cells as well as their global genetic program. Subtracted complementary DNA libraries from highly purified murine fetal liver stem cells were analyzed with bioinformatic and array hybridization strategies. A large percentage of the several thousand gene products that have been characterized correspond to previously undescribed molecules with properties suggestive of regulatory functions. The complete data, available in a biological process-oriented database, represent the molecular phenotype of the hematopoietic stem cell.

Subject(s)

Gene Expression Profiling , Genes , Hematopoietic Stem Cells/physiology , Proteins/genetics , Proteins/physiology , Amino Acid Sequence , Animals , Computational Biology , Databases, Factual , Expressed Sequence Tags , Gene Library , Hematopoietic Stem Cells/chemistry , Hematopoietic Stem Cells/cytology , Liver/cytology , Liver/embryology , Membrane Proteins/chemistry , Membrane Proteins/genetics , Membrane Proteins/physiology , Mice , Molecular Sequence Data , Polymerase Chain Reaction , Proteins/chemistry , Signal Transduction , Transcription Factors/chemistry , Transcription Factors/genetics , Transcription Factors/physiology

4.

Transcription regulatory regions database (TRRD): its status in 2000.

Kolchanov, N A; Podkolodnaya, O A; Ananko, E A; Ignatieva, E V; Stepanenko, I L; Kel-Margoulis, O V; Kel, A E; Merkulova, T I; Goryachkovskaya, T N; Busygina, T V; Kolpakov, F A; Podkolodny, N L; Naumochkin, A N; Korostishevskaya, I M; Romashchenko, A G; Overton, G C.

Nucleic Acids Res ; 28(1): 298-301, 2000 Jan 01.

Article in English | MEDLINE | ID: mdl-10592253

ABSTRACT

Transcription Regulatory Regions Database (TRRD) has been developed for accumulation of experimental information on the structure-function features of regulatory regions of eukaryotic genes. Each entry in TRRD corresponds to a particular gene and contains a description of structure-function features of its regulatory regions (transcription factor binding sites, promoters, enhancers, silencers, etc.) and gene expression regulation patterns. The current release, TRRD 4.2.5, comprises the description of 760 genes, 3403 expression patterns, and >4600 regulatory elements including 3604 transcription factor binding sites, 600 promoters and 152 enhancers. This information was obtained through annotation of 2537 scientific publications. TRRD 4.2.5 is available through the WWW at http://wwwmgs.bionet.nsc.ru/mgs/dbases/trrd4/

Subject(s)

Databases, Factual , Transcription, Genetic , Enhancer Elements, Genetic , Internet , Promoter Regions, Genetic , Regulatory Sequences, Nucleic Acid

5.

[Averaging results of site recognition can increase the accuracy of annotating the human genome]. / Usrednenie rezul'tatov raspoznavaniia saitov mozhet uvelichit' tochnost' annotatsii genoma cheloveka.

Ponomarenko, M P; Ponomarenko, Iu V; Podkolodnaia, O A; Frolov, A S; Vorob'ev, D V; Kolchanov, N A; Overton, G C.

Biofizika ; 44(4): 649-54, 1999.

Article in Russian | MEDLINE | ID: mdl-10544815

ABSTRACT

A systemic approach is proposed, which makes it possible to increase the accuracy of recognition of functional sites in arbitrary DNA sequences. The approach is based on the Central limit theorem and consists in the averaging of a large number of recognitions of a particular site. To obtain a rather large number of recognitions within the framework of conventional methods of recognition, consensus, and frequency matrix, 20 novel oligonucleotide alphabets were used. The approach was used to study the binding sites of GATA-1 and C/EBP transcription factors. It was found that the averaged recognition of these sites is more precise than each of specific recognitions, which just follows from the Central limit theorem.
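
The averaging argument can be sketched with a standard variance calculation. This assumes, as an idealization not stated in the abstract, that the N recognition scores s_i obtained for a site are independent with common mean \mu and variance \sigma^2; the symbols are introduced here for illustration only.

```latex
\bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_i, \qquad
\mathbb{E}[\bar{s}] = \mu, \qquad
\operatorname{Var}(\bar{s}) = \frac{\sigma^{2}}{N}
```

Under these assumptions the averaged score has a standard deviation smaller by a factor of \sqrt{N} than any single recognition, and by the Central limit theorem it is approximately normally distributed for large N. Scores computed on the same sequence with different alphabets are likely to be correlated, in which case the gain would be smaller than this idealized bound.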

Subject(s)

DNA/metabolism , Genome, Human , Base Sequence , Binding Sites , CCAAT-Enhancer-Binding Proteins , DNA/genetics , DNA-Binding Proteins/metabolism , Erythroid-Specific DNA-Binding Factors , GATA1 Transcription Factor , Humans , Nuclear Proteins/metabolism , Transcription Factors/metabolism

6.

Oligonucleotide frequency matrices addressed to recognizing functional DNA sites.

Ponomarenko, M P; Ponomarenko, J V; Frolov, A S; Podkolodnaya, O A; Vorobyev, D G; Kolchanov, N A; Overton, G C.

Bioinformatics ; 15(7-8): 631-43, 1999.

Article in English | MEDLINE | ID: mdl-10487871

ABSTRACT

MOTIVATION: Recognition of functional sites remains a key event in the course of genomic DNA annotation. It is well known that a number of sites have their own specific oligonucleotide content. This pinpoints the fact that the preference of the site-specific nucleotide combinations at adjacent positions within an analyzed functional site could be informative for this site recognition. Hence, Web-available resources describing the site-specific oligonucleotide content of the functional DNA sites and applying the above approach for site recognition are needed. However, they have been poorly developed up to now. RESULTS: To describe the specific oligonucleotide content of the functional DNA sites, we introduce the oligonucleotide alphabets, out of which the frequency matrix for a given site could be constructed in addition to a traditional nucleotide frequency matrix. Thus, site recognition accuracy increases. This approach was implemented in the activated MATRIX database accumulating oligonucleotide frequency matrices of the functional DNA sites. We have demonstrated that the false-positive error of the functional site recognition decreases if the oligonucleotide frequency matrixes are added to the nucleotide frequency matrixes commonly used. AVAILABILITY: The MATRIX database is available on the Web, http://wwwmgs.bionet.nsc.ru/Dbases/MATRIX/ and the mirror site, http://www.cbil.upenn.edu/mgs/systems/consfreq/.
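
As a rough illustration of a frequency matrix built over oligonucleotides instead of single nucleotides, the Python sketch below counts plain overlapping k-mers at each position of a set of aligned sites. This is an assumption made for illustration: the paper's oligonucleotide alphabets are not defined in the abstract and may differ. The function names are hypothetical.

```python
from collections import Counter


def oligo_frequency_matrix(sites, k=2):
    """Position-specific k-mer frequencies across aligned, equal-length sites.

    Returns one dict per start position, mapping k-mer -> relative frequency.
    With k=1 this reduces to an ordinary nucleotide frequency matrix.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned to the same length")
    matrix = []
    for pos in range(length - k + 1):
        counts = Counter(s[pos:pos + k] for s in sites)
        total = sum(counts.values())
        matrix.append({oligo: c / total for oligo, c in counts.items()})
    return matrix


def score(matrix, seq, k=2):
    """Sum the matrix frequencies of the k-mers observed in seq.

    k-mers never seen at a position contribute 0.
    """
    return sum(col.get(seq[pos:pos + k], 0.0) for pos, col in enumerate(matrix))
```

For example, with the made-up sites `["TATA", "TATA", "TACA", "GATA"]` and `k=2`, the candidate `"TATA"` scores 2.25, while a sequence sharing no dinucleotide with the sites scores 0.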

Subject(s)

DNA/genetics , DNA/metabolism , Databases, Factual , Algorithms , Base Sequence , Binding Sites/genetics , DNA-Binding Proteins/metabolism , Genome , Molecular Sequence Data , NFI Transcription Factors , Oligodeoxyribonucleotides/genetics , Transcription Factors/metabolism

7.

Integrated databases and computer systems for studying eukaryotic gene expression.

Kolchanov, N A; Ponomarenko, M P; Frolov, A S; Ananko, E A; Kolpakov, F A; Ignatieva, E V; Podkolodnaya, O A; Goryachkovskaya, T N; Stepanenko, I L; Merkulova, T I; Babenko, V V; Ponomarenko, Y V; Kochetov, A V; Podkolodny, N L; Vorobiev, D V; Lavryushev, S V; Grigorovich, D A; Kondrakhin, Y V; Milanesi, L; Wingender, E; Solovyev, V; Overton, G C.

Bioinformatics ; 15(7-8): 669-86, 1999.

Article in English | MEDLINE | ID: mdl-10487874

ABSTRACT

MOTIVATION: The goal of the work was to develop a WWW-oriented computer system providing a maximal integration of informational and software resources on the regulation of gene expression and navigation through them. Rapid growth of the variety and volume of information accumulated in the databases on regulation of gene expression necessarily requires the development of computer systems for automated discovery of the knowledge that can be further used for analysis of regulatory genomic sequences. RESULTS: The GeneExpress system developed includes the following major informational and software modules: (1) Transcription Regulation (TRRD) module, which contains the databases on transcription regulatory regions of eukaryotic genes and TRRD Viewer for data visualization; (2) Site Activity Prediction (ACTIVITY), the module for analysis of functional site activity and its prediction; (3) Site Recognition module, which comprises (a) B-DNA-VIDEO system for detecting the conformational and physicochemical properties of DNA sites significant for their recognition, (b) Consensus and Weight Matrices (ConsFrec) and (c) Transcription Factor Binding Sites Recognition (TFBSR) systems for detecting conservative contextual regions of functional sites and their recognition; (4) Gene Networks (GeneNet), which contains an object-oriented database accumulating the data on gene networks and signal transduction pathways, and the Java-based Viewer for exploration and visualization of the GeneNet information; (5) mRNA Translation (Leader mRNA), designed to analyze structural and contextual properties of mRNA 5'-untranslated regions (5'-UTRs) and predict their translation efficiency; (6) other program modules designed to study the structure-function organization of regulatory genomic sequences and regulatory proteins. AVAILABILITY: GeneExpress is available at http://wwwmgs.bionet.nsc.ru/systems/GeneExpress/ and the links to the mirror site(s) can be found at http://wwwmgs.bionet.nsc.ru/mgs/links/mirrors.html.

Subject(s)

Computer Systems , Databases, Factual , Gene Expression , Algorithms , Artificial Intelligence , Base Sequence , Binding Sites/genetics , Chemical Phenomena , Chemistry, Physical , DNA/chemistry , DNA/genetics , DNA/metabolism , Eukaryotic Cells , Internet , Nucleic Acid Conformation , Promoter Regions, Genetic , Protein Biosynthesis , RNA, Messenger/genetics , Software , TATA Box , Transcription Factors/metabolism

8.

Conformational and physicochemical DNA features specific for transcription factor binding sites.

Ponomarenko, J V; Ponomarenko, M P; Frolov, A S; Vorobyev, D G; Overton, G C; Kolchanov, N A.

Bioinformatics ; 15(7-8): 654-68, 1999.

Article in English | MEDLINE | ID: mdl-10487873

ABSTRACT

MOTIVATION: A reliable recognition of transcription factor binding sites is essential for analysis of regulatory genomic sequences. The experimental data make evident an important role of DNA conformational features for site functioning. However, Internet-available tools for revealing conformational and physicochemical DNA features significant for the site functioning and subsequent use of these features for site recognition have not been developed up to now. RESULTS: We suggest an approach for revealing significant conformational and physicochemical properties of functional sites implemented in the database B-DNA-VIDEO. This database is designed to study the sets of various transcription factor binding sites, providing evidence that transcription factor binding sites are characterized by specific sets of significant conformational and physicochemical DNA properties. For a fixed site, by using the B-DNA features selected for this site recognition, the C-program recognizing this site may be generated, control tested and stored in the database B-DNA-VIDEO. Each B-DNA-VIDEO entry links to the Web-applet recognizing the site, whose significant B-DNA features are stored in this entry as the 'site recognition programs'. The pairwise linked entry-applet pairs are compiled within the B-DNA-VIDEO system, which is simultaneously the database and the program tools package applicable immediately for recognizing the sites stored in the database. Indeed, this is the novelty. Hence, B-DNA-VIDEO is the Web resource of both 'searching for static data' and 'active computation' type, that is why it was called an 'activated database'. AVAILABILITY: B-DNA-VIDEO is available at http://wwwmgs.bionet.nsc.ru/systems/BDNAVideo/ and the mirror site at http://www.cbil.upenn.edu/mgs/systems/consfreq/.

Subject(s)

DNA/chemistry , DNA/genetics , Databases, Factual , Transcription Factors/metabolism , Base Sequence , Binding Sites/genetics , Chemical Phenomena , Chemistry, Physical , DNA/metabolism , Internet , Molecular Sequence Data , Nucleic Acid Conformation , Software , TATA Box

9.

Identification of sequence-dependent DNA features correlating to activity of DNA sites interacting with proteins.

Ponomarenko, M P; Ponomarenko, J V; Frolov, A S; Podkolodny, N L; Savinkova, L K; Kolchanov, N A; Overton, G C.

Bioinformatics ; 15(7-8): 687-703, 1999.

Article in English | MEDLINE | ID: mdl-10487875

ABSTRACT

MOTIVATION: The commonly accepted statistical mechanical theory is now multiply confirmed by using the weight matrix methods successfully recognizing DNA sites binding regulatory proteins in prokaryotes. Nevertheless, the recent evaluation of weight matrix methods application for transcription factor binding site recognition in eukaryotes has unexpectedly revealed that the matrix scores correlate better to each other than to the activity of DNA sites interacting with proteins. This observation points out that molecular mechanisms of DNA/protein recognition are more complicated in eukaryotes than in prokaryotes. As the extra events in eukaryotes, the following processes may be considered: (i) competition between the proteins and nucleosome core particle for DNA sites binding these proteins and (ii) interaction between two synergetic/antagonist proteins recognizing a composed element compiled from two DNA sites binding these proteins. That is why identification of the sequence-dependent DNA features correlating with affinity magnitudes of DNA sites interacting with a protein can pinpoint the molecular event limiting this protein/DNA recognition machinery. RESULTS: An approach for predicting site activity based on its primary nucleotide sequence has been developed. The approach is realized in the computer system ACTIVITY, containing the databases on site activity and on conformational and physicochemical DNA/RNA parameters. By using the system ACTIVITY, an analysis of some sites was provided and the methods for predicting site activity were constructed. The methods developed are in good agreement with the experimental data. AVAILABILITY: The database ACTIVITY is available at http://wwwmgs.bionet.nsc.ru/systems/Activity/ and the mirror site, http://www.cbil.upenn.edu/mgs/systems/activity/.

Subject(s)

Computer Systems , DNA/genetics , DNA/metabolism , Proteins/metabolism , Algorithms , Animals , Base Sequence , Binding Sites/genetics , Chemical Phenomena , Chemistry, Physical , DNA/chemistry , Databases, Factual , Humans , MADS Domain Proteins , MEF2 Transcription Factors , Molecular Sequence Data , Mutation , Myogenic Regulatory Factors/genetics , Myogenic Regulatory Factors/metabolism , Nucleic Acid Conformation , TATA Box

10.

EpoDB: a prototype database for the analysis of genes expressed during vertebrate erythropoiesis.

Stoeckert, C J; Salas, F; Brunk, B; Overton, G C.

Nucleic Acids Res ; 27(1): 200-3, 1999 Jan 01.

Article in English | MEDLINE | ID: mdl-9847180

ABSTRACT

EpoDB is a database of genes expressed in vertebrate red blood cells. It is also a prototype for the creation of cell and tissue-specific databases from multiple external sources. The information in EpoDB obtained from GenBank, SWISS-PROT, Transfac, TRRD and GERD is curated to provide high quality data for sequence analysis aimed at understanding gene regulation during erythropoiesis. New protocols have been developed for data integration and updating entries. Using a BLAST-based algorithm, we have grouped GenBank entries representing the same gene together. This sequence similarity protocol was also used to identify new entries to be included in EpoDB. We have recently implemented our database in Sybase (relational tables) in addition to SICStus Prolog to provide us with greater flexibility in asking complex queries that utilize information from multiple sources. New additions to the public web site (http://www.cbil.upenn.edu/epodb) for accessing EpoDB are the ability to retrieve groups of entries representing different variants of the same gene and to retrieve gene expression data. The BLAST query has been enhanced by incorporating BLASTView, an interactive and graphical display of BLAST results. We have also enhanced the queries for retrieving sequence from specified genes by the addition of MEME, a motif discovery tool, to the integrated analysis tools which include CLUSTALW and TESS.

Subject(s)

Databases, Factual , Erythrocytes/metabolism , Erythropoiesis/genetics , Gene Expression , Animals , Base Sequence , Information Storage and Retrieval , Internet , Sequence Homology , Software , Vertebrates

11.

bioWidgets: data interaction components for genomics.

Fischer, S; Crabtree, J; Brunk, B; Gibson, M; Overton, G C.

Bioinformatics ; 15(10): 837-46, 1999 Oct.

Article in English | MEDLINE | ID: mdl-10705436

ABSTRACT

MOTIVATION: The presentation of genomics data in a perspicuous visual format is critical for its rapid interpretation and validation. Relatively few public database developers have the resources to implement sophisticated front-end user interfaces themselves. Accordingly, these developers would benefit from a reusable toolkit of user interface and data visualization components. RESULTS: We have designed the bioWidget toolkit as a set of JavaBean components. It includes a wide array of user interface components and defines an architecture for assembling applications. The toolkit is founded on established software engineering design patterns and principles, including componentry, Model-View-Controller, factored models and schema neutrality. As a proof of concept, we have used the bioWidget toolkit to create three extendible applications: AnnotView, BlastView and AlignView.

Subject(s)

Databases, Factual , Genome , User-Computer Interface , Amino Acid Sequence , Base Sequence , Computational Biology , Computer Graphics , Computer Simulation , DNA/genetics , Molecular Sequence Data , Proteins/genetics , Sequence Alignment

12.

The GAIA software framework for genome annotation.

Overton, G C; Bailey, C; Crabtree, J; Gibson, M; Fischer, S; Schug, J.

Pac Symp Biocomput ; : 291-302, 1998.

Article in English | MEDLINE | ID: mdl-9697190

ABSTRACT

We describe a software framework, GAIA, that supports semi-automated annotation of uncharacterized sequence data. The annotation framework incorporates annotation by data source integration, data analysis, and manual data entry. Components of the system include a configurable, open data analysis pipeline, a relational information storage manager, and Java-based graphical user interfaces. We discuss design decisions and tradeoffs in building such a system, and policies and strategies for producing consistent, uniform, high quality annotation.

Subject(s)

Base Sequence , Chromosomes, Human, Pair 22 , Computational Biology/methods , Genome, Human , Genome , Models, Genetic , Software , Computer Graphics , Expressed Sequence Tags , Humans , Physical Chromosome Mapping/methods , Templates, Genetic

13.

Analysis of EST-driven gene annotation in human genomic sequence.

Bailey, L C; Searls, D B; Overton, G C.

Genome Res ; 8(4): 362-76, 1998 Apr.

Article in English | MEDLINE | ID: mdl-9548972

ABSTRACT

We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization.

Subject(s)

Gene Expression/genetics , Genome, Human , Base Sequence/genetics , Cloning, Molecular , Computational Biology/methods , DNA, Complementary/analysis , Databases, Factual , Exons , False Positive Reactions , Humans , Introns

14.

GAIA: framework annotation of genomic sequence.

Bailey, L C; Fischer, S; Schug, J; Crabtree, J; Gibson, M; Overton, G C.

Genome Res ; 8(3): 234-50, 1998 Mar.

Article in English | MEDLINE | ID: mdl-9521927

ABSTRACT

As increasing amounts of genomic sequence from many organisms become available, and as DNA sequences become a primary reagent in biologic investigations, the role of annotation as a prospective guide for laboratory experiments will expand rapidly. Here we describe a process of high-throughput, reliable annotation, called framework annotation, which is designed to provide a foundation for initial biologic characterization of previously unexamined sequence. To examine this concept in practice, we have constructed Genome Annotation and Information Analysis (GAIA), a prototype software architecture that implements several elements important for framework annotation. The center of GAIA consists of an annotation database and the associated data management subsystem that forms the software bus along which other components communicate. The schema for this database defines three principal concepts: (1) Entries, consisting of sequence and associated historical data; (2) Features, comprising information of biologic interest; and (3) Experiments, describing the evidence that supports Features. The database permits tracking of annotation results over time, as well as assessment of the reliability of particular results. New framework annotation is produced by CARTA, a set of autonomous sensors that perform automatic analyses and assert results into the annotation database. These results are available via a Web-based query interface that uses graphical Java applets as well as text-based HTML pages to display data at different levels of resolution and permit interactive exploration of annotation. We present results for initial application of framework annotation to a set of test sequences, demonstrating its effectiveness in providing a starting point for biologic investigation, and discuss ways in which the current prototype can be improved. The prototype is available for public use and comment at http://www.cbil.upenn.edu/gaia.
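
The three schema concepts named above (Entries, Features, Experiments) can be pictured with a small Python sketch. The field names and helper methods are hypothetical, chosen only to show how evidence might hang off a feature; the abstract does not describe the actual GAIA tables.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Evidence supporting a feature (hypothetical fields)."""
    method: str       # analysis that produced the result
    run_date: str     # ISO date, so results can be tracked over time
    score: float      # reliability measure reported by the analysis


@dataclass
class Feature:
    """Information of biological interest located on a sequence."""
    kind: str
    start: int
    end: int
    evidence: list = field(default_factory=list)

    def best_score(self):
        """Highest score among supporting experiments, or None if unsupported."""
        return max((e.score for e in self.evidence), default=None)


@dataclass
class Entry:
    """A sequence together with its annotated features."""
    accession: str
    sequence: str
    features: list = field(default_factory=list)

    def features_of(self, kind):
        return [f for f in self.features if f.kind == kind]
```

Keeping the evidence separate from the feature it supports is one way to allow the same feature to be re-asserted by later analyses while retaining the history of earlier results.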

Subject(s)

Base Sequence , Computational Biology/methods , Human Genome Project , Online Systems , Software , Amino Acid Sequence , Animals , Databases, Factual , Humans , Molecular Sequence Data , Sequence Analysis, DNA/methods

15.

Gene discovery by EST sequencing in Toxoplasma gondii reveals sequences restricted to the Apicomplexa.

Ajioka, J W; Boothroyd, J C; Brunk, B P; Hehl, A; Hillier, L; Manger, I D; Marra, M; Overton, G C; Roos, D S; Wan, K L; Waterston, R; Sibley, L D.

Genome Res ; 8(1): 18-28, 1998 Jan.

Article in English | MEDLINE | ID: mdl-9445484

ABSTRACT

To accelerate gene discovery and facilitate genetic mapping in the protozoan parasite Toxoplasma gondii, we have generated >7000 new ESTs from the 5' ends of randomly selected tachyzoite cDNAs. Comparison of the ESTs with the existing gene databases identified possible functions for more than 500 new T. gondii genes by virtue of sequence motifs shared with conserved protein families, including factors involved in transcription, translation, protein secretion, signal transduction, cytoskeleton organization, and metabolism. Despite this success in identifying new genes, more than 50% of the ESTs correspond to genes of unknown function, reflecting the divergent evolutionary status of this parasite. A newly recognized class of genes was identified based on its similarity to sequences known only from other members of the same phylum, therefore identifying sequences that are apparently restricted to the Apicomplexa. Such genes may underlie pathways common to this group of medically important parasites, therefore identifying potential targets for intervention.

Subject(s)

Apicomplexa/genetics , Gene Expression , Genes, Protozoan , Multigene Family , Toxoplasma/genetics , Animals , Computational Biology/methods , Conserved Sequence , DNA, Complementary/analysis , Humans , Protozoan Proteins/classification , Protozoan Proteins/genetics , Sequence Homology, Nucleic Acid

16.

EpoDB: a database of genes expressed during vertebrate erythropoiesis.

Salas, F; Haas, J; Brunk, B; Stoeckert, C J; Overton, G C.

Nucleic Acids Res ; 26(1): 288-9, 1998 Jan 01.

Article in English | MEDLINE | ID: mdl-9399855

ABSTRACT

EpoDB is a database designed for the study of gene regulation during differentiation and development of vertebrate red blood cells. In building EpoDB, we have taken the in advance approach to the data integration problem: we have extracted data relevant to red blood cells from GenBank, SWISS-PROT, TRRD (transcriptional regulation data) and GERD (expression levels data) to create a single integrated, highly curated view. Tools have been developed to automate data extraction from online resources, cleanse data of errors, enter information manually from the primary literature, generate a uniform, canonical representation of information and maintain data currency. The database is organized around biological features, e.g., genes, rather than sequences, which are supported by a controlled and consistent vocabulary for gene names and gene family names. Beyond the standard database queries, the functionality of EpoDB includes the ability to extract features and subsequences, display sequences and features graphically using bioWidget viewers and integrated analysis tools. EpoDB may be accessed at: http://cbil.humgen.upenn.edu/epodb/

Subject(s)

Databases, Factual , Erythropoiesis/genetics , Gene Expression Regulation, Developmental , Animals , Computer Communication Networks , Software , Vertebrates/genetics

17.

Modeling transcription factor binding sites with Gibbs Sampling and Minimum Description Length encoding.

Schug, J; Overton, G C.

Proc Int Conf Intell Syst Mol Biol ; 5: 268-71, 1997.

Article in English | MEDLINE | ID: mdl-9322048

ABSTRACT

Transcription factors, proteins required for the regulation of gene expression, recognize and bind short stretches of DNA on the order of 4 to 10 bases in length. In general, each factor recognizes a family of "similar" sequences rather than a single unique sequence. Ultimately, the transcriptional state of a gene is determined by the cooperative interaction of several bound factors. We have developed a method using Gibbs Sampling and the Minimum Description Length principle for automatically and reliably creating weight matrix models of binding sites from a database (TRANSFAC) of known binding site sequences. Determining the relationship between sequence and binding affinity for a particular factor is an important first step in predicting whether a given uncharacterized sequence is part of a promoter site or other control region. Here we describe the foundation for the methods we will use to develop weight matrix models for transcription factor binding sites.
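
To make the notion of a weight matrix model concrete, the Python sketch below builds a log-odds matrix from already aligned binding sites and scans a sequence for its best match. It deliberately omits the Gibbs Sampling and Minimum Description Length steps that the paper uses to find and select the alignment. The pseudocount, the uniform background, and the function names are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix from aligned, equal-length binding sites.

    Assumes a uniform background and adds a pseudocount to every base.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned to the same length")
    denom = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        matrix.append({
            base: math.log2((column.count(base) + pseudocount) / denom / background)
            for base in ALPHABET
        })
    return matrix


def best_hit(matrix, seq):
    """Return (score, offset) of the highest-scoring window in seq.

    seq must contain only A, C, G, T and be at least as long as the matrix.
    """
    width = len(matrix)
    scored = [
        (sum(col[seq[i + j]] for j, col in enumerate(matrix)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scored)
```

For the made-up sites `["GATA", "GATA", "GATA", "GATT"]`, scanning `"CCGATACC"` places the best window at offset 2, where the sequence reads GATA.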

Subject(s)

Algorithms , Models, Biological , Transcription Factors/metabolism , Base Sequence , Binding Sites/genetics , DNA/genetics , DNA/metabolism , Databases, Factual , Markov Chains , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Stochastic Processes

18.

Functional characterization of the human factor VII 5'-flanking region.

Pollak, E S; Hung, H L; Godin, W; Overton, G C; High, K A.

J Biol Chem ; 271(3): 1738-47, 1996 Jan 19.

Article in English | MEDLINE | ID: mdl-8576177

ABSTRACT

Factor VII is a vitamin K-dependent coagulation protein essential for proper hemostasis. The human Factor VII gene spans 13 kilobase pairs and is located on chromosome 13 just 2.8 kilobase pairs 5' to the Factor X gene. In this report, we show that Factor VII transcripts are restricted to the liver and that steady state levels of mRNA are much lower than those of Factor X. The major transcription start site is mapped at -51 by RNase protection assay and primer extension experiments. The first 185 base pairs 5' of the translation start site are sufficient to confer maximal promoter activity in HepG2 cells. Protein binding sites are identified at nucleotides -51 to -32, -63 to -58, -108 to -84, and -233 to -215 by DNase I footprint analysis and gel mobility shift assays. A liver-enriched transcription factor, hepatocyte nuclear factor-4 (HNF-4), and a ubiquitous transcription factor, Sp1, are shown to bind within the first 108 base pairs of the promoter region at nucleotide sequences ACTTTG and CCCCTCCCCC, respectively. The importance of these binding sites in promoter activity is demonstrated through independent functional mutagenesis experiments, which show dramatically reduced promoter activity. Transactivation studies with an HNF-4 expression plasmid in HeLa cells also demonstrate the importance of HNF-4 in promoting transcription in non-hepatocyte derived cells. Additionally, the sequence of a naturally occurring allele containing a previously described decanucleotide insert polymorphism at -323 is shown to reduce promoter activity by 33% compared with the more common allelic sequence.

Subject(s)

Chromosomes, Human, Pair 13 , Factor VII/genetics , Promoter Regions, Genetic , Regulatory Sequences, Nucleic Acid , Amino Acid Sequence , Base Sequence , Basic Helix-Loop-Helix Leucine Zipper Transcription Factors , Cell Nucleus/metabolism , Chromosome Mapping , Consensus Sequence , DNA/chemistry , DNA/metabolism , DNA Footprinting , DNA Primers , DNA-Binding Proteins/metabolism , Deoxyribonuclease I , Factor VII/biosynthesis , Factor X/genetics , Gene Expression , HeLa Cells , Hepatocyte Nuclear Factor 4 , Humans , Liver/metabolism , Molecular Sequence Data , Phosphoproteins/metabolism , RNA, Messenger/analysis , RNA, Messenger/biosynthesis , Sp1 Transcription Factor/metabolism , Transcription Factors/metabolism

19.

SORTEZ: a relational translator for NCBI's ASN.1 database.

Hart, K W; Searls, D B; Overton, G C.

Comput Appl Biosci ; 10(4): 369-78, 1994 Jul.

Article in English | MEDLINE | ID: mdl-7804870

ABSTRACT

The National Center for Biotechnology Information (NCBI) has created a database collection that includes several protein and nucleic acid sequence databases, a biosequence-specific subset of MEDLINE, as well as value-added information such as links between similar sequences. Information in the NCBI database is modeled in Abstract Syntax Notation 1 (ASN.1), an Open Systems Interconnection protocol designed for the purpose of exchanging structured data between software applications rather than as a data model for database systems. While the NCBI database is distributed with an easy-to-use information retrieval system, ENTREZ, the ASN.1 data model currently lacks an ad hoc query language for general-purpose data access. For that reason, we have developed a software package, SORTEZ, that transforms the ASN.1 database (or other databases with nested data structures) to a relational data model and subsequently to a relational database management system (Sybase) where information can be accessed through the relational query language, SQL. Because the need to transform data from one data model and schema to another arises naturally in several important contexts, including efficient execution of specific applications, access to multiple databases and adaptation to database evolution, this work also serves as a practical study of the issues involved in the various stages of database transformation. We show that transformation from the ASN.1 data model to a relational data model can be largely automated, but that schema transformation and data conversion require considerable domain expertise and would greatly benefit from additional support tools.
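The general idea of mapping nested records onto relational tables so that they can be queried with SQL can be sketched as follows. This is only an illustration under stated assumptions: the nested Python dictionaries stand in for parsed ASN.1 entries, SQLite (via the standard `sqlite3` module) stands in for Sybase, and the table names, column names, and sample records are hypothetical. It is not the SORTEZ implementation.

```python
import sqlite3

# Hypothetical nested records standing in for parsed ASN.1 entries.
records = [
    {"id": 1, "title": "entry one", "features": [
        {"kind": "exon", "start": 10, "end": 50},
        {"kind": "CDS", "start": 20, "end": 45},
    ]},
    {"id": 2, "title": "entry two", "features": [
        {"kind": "exon", "start": 5, "end": 30},
    ]},
]

def load(nested_records):
    """Flatten nested records into a parent table and a child table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE feature ("
        "entry_id INTEGER REFERENCES entry(id), "
        "kind TEXT, feat_start INTEGER, feat_end INTEGER)"
    )
    for rec in nested_records:
        conn.execute("INSERT INTO entry VALUES (?, ?)", (rec["id"], rec["title"]))
        # Each nested list element becomes one row keyed by its parent id.
        conn.executemany(
            "INSERT INTO feature VALUES (?, ?, ?, ?)",
            [(rec["id"], f["kind"], f["start"], f["end"]) for f in rec["features"]],
        )
    return conn

def entries_with_feature(conn, kind):
    """Ad hoc SQL query: titles of entries that have a feature of this kind."""
    rows = conn.execute(
        "SELECT e.title FROM entry e "
        "JOIN feature f ON f.entry_id = e.id "
        "WHERE f.kind = ? ORDER BY e.id",
        (kind,),
    )
    return [row[0] for row in rows]

conn = load(records)
```

Once the nested list has been split into its own table, a question such as "which entries have a CDS feature" becomes an ordinary join, for example `entries_with_feature(conn, "CDS")`.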

Subject(s)

Databases, Factual , Software , Algorithms , Amino Acid Sequence , Base Sequence , Database Management Systems , Humans , National Library of Medicine (U.S.) , Software Design , United States

20.

QGB: a system for querying sequence database fields and features.

Overton, G C; Aaronson, J S; Haas, J; Adams, J.

J Comput Biol ; 1(1): 3-14, 1994.

Article in English | MEDLINE | ID: mdl-8790449

ABSTRACT

We have developed a general system, QGB, for performing complex queries on the information in the DDBJ/EMBL/GenBank databases, including queries over the structural features of sequences implied in the FEATURE TABLE. Queries are formed in a Structured Query Language (SQL)-like syntax with language extensions to support complex types (e.g., sets, ordered sets, and records) appropriate for representing and querying sequence data. A novel aspect of QGB is its ability to deduce missing features and infer relationships among features as a consequence of constructing a parse tree of sequence structure from information described in the FEATURE TABLE. The grammar for the parse tree is implemented in a customized form of the Definite Clause Grammar syntax of the logic programming language Prolog. The logic grammar formalism was chosen because it provides a perspicuous representation for features and constraints, and Prolog provides an execution model for the grammar rules. Construction of the parse tree also identifies inconsistencies and errors in the FEATURE TABLE that can in some cases be corrected automatically and used to generate an augmented version of the table.

Subject(s)

Base Sequence , Database Management Systems , Databases, Factual , Information Storage and Retrieval , Hemoglobins/genetics , Humans , Karyotyping , Molecular Sequence Data , Programming Languages
