Results 1 - 20 of 22
1.
PLoS One ; 6(6): e20410, 2011.
Article in English | MEDLINE | ID: mdl-21738574

ABSTRACT

BACKGROUND: Estrogen is a known growth promoter for estrogen receptor (ER)-positive breast cancer cells. Paradoxically, in breast cancer cells that have been chronically deprived of estrogen stimulation, re-introduction of the hormone can induce apoptosis. METHODOLOGY/PRINCIPAL FINDINGS: Here, we sought to identify signaling networks that are triggered by estradiol (E2) in isogenic MCF-7 breast cancer cells that undergo apoptosis (MCF-7:5C) versus cells that proliferate upon exposure to E2 (MCF-7). The nuclear receptor co-activator AIB1 (Amplified in Breast Cancer-1) is known to be rate-limiting for E2-induced cell survival responses in MCF-7 cells and was found here to also be required for the induction of apoptosis by E2 in the MCF-7:5C cells. Proteins that interact with AIB1 as well as complexes that contain tyrosine phosphorylated proteins were isolated by immunoprecipitation and identified by mass spectrometry (MS) at baseline and after a brief exposure to E2 for two hours. Bioinformatic network analyses of the identified protein interactions were then used to analyze E2 signaling pathways that trigger apoptosis versus survival. Comparison of MS data with a computationally-predicted AIB1 interaction network showed that 26 proteins identified in this study are within this network, and are involved in signal transduction, transcription, cell cycle regulation and protein degradation. CONCLUSIONS: G-protein-coupled receptors, PI3 kinase, Wnt and Notch signaling pathways were most strongly associated with E2-induced proliferation or apoptosis and are integrated here into a global AIB1 signaling network that controls qualitatively distinct responses to estrogen.


Subject(s)
Apoptosis/drug effects , Breast Neoplasms/metabolism , Estradiol/pharmacology , Proteomics/methods , Apoptosis/genetics , Female , Humans , Immunoprecipitation , Mass Spectrometry , Nuclear Receptor Coactivator 3/genetics , Nuclear Receptor Coactivator 3/metabolism , Phosphatidylinositol 3-Kinases/genetics , Phosphatidylinositol 3-Kinases/metabolism , Protein Binding , Receptors, G-Protein-Coupled/genetics , Receptors, G-Protein-Coupled/metabolism , Receptors, Notch/genetics , Receptors, Notch/metabolism , Signal Transduction , Tumor Cells, Cultured , Wnt Proteins/genetics , Wnt Proteins/metabolism
2.
BMC Bioinformatics ; 12: 91, 2011 Apr 06.
Article in English | MEDLINE | ID: mdl-21466708

ABSTRACT

BACKGROUND: Protein O-GlcNAcylation (or O-GlcNAc-ylation) is an O-linked glycosylation involving the transfer of β-N-acetylglucosamine to the hydroxyl group of serine or threonine residues of proteins. Growing evidence suggests that protein O-GlcNAcylation is common and is analogous to phosphorylation in modulating broad ranges of biological processes. However, compared to phosphorylation, the amount of protein O-GlcNAcylation data is relatively limited and its annotation in databases is scarce. Furthermore, a bioinformatics resource for O-GlcNAcylation is lacking, and an O-GlcNAcylation site prediction tool is much needed. DESCRIPTION: We developed a database of O-GlcNAcylated proteins and sites, dbOGAP, primarily based on literature published since O-GlcNAcylation was first described in 1984. The database currently contains ~800 proteins with experimental O-GlcNAcylation information, of which ~61% are human, and 172 proteins have a total of ~400 O-GlcNAcylation sites identified. The O-GlcNAcylated proteins are primarily nucleocytoplasmic, including membrane- and non-membrane-bounded organelle-associated proteins. The known O-GlcNAcylated proteins exert a broad range of functions including transcriptional regulation, macromolecular complex assembly, intracellular transport, translation, and regulation of cell growth or death. The database also contains ~365 potential O-GlcNAcylated proteins inferred from known O-GlcNAcylated orthologs. Additional annotations, including other protein posttranslational modifications, biological pathways and disease information are integrated into the database. We developed an O-GlcNAcylation site prediction system, OGlcNAcScan, based on Support Vector Machine and trained using protein sequences with known O-GlcNAcylation sites from dbOGAP. The site prediction system achieved an area under ROC curve of 74.3% in five-fold cross-validation. 
The dbOGAP website was developed to allow for performing search and query on O-GlcNAcylated proteins and associated literature, as well as for browsing by gene names, organisms or pathways, and downloading of the database. Also available from the website, the OGlcNAcScan tool presents a list of predicted O-GlcNAcylation sites for given protein sequences. CONCLUSIONS: dbOGAP is the first public bioinformatics resource to allow systematic access to the O-GlcNAcylated proteins, and related functional information and bibliography, as well as to an O-GlcNAcylation site prediction tool. The resource will facilitate research on O-GlcNAcylation and its proteomic identification.


Subject(s)
Computational Biology/methods , Acetylglucosamine/metabolism , Glycosylation , Humans , Phosphorylation , Protein Processing, Post-Translational , Proteins/metabolism , Proteomics
3.
Methods Mol Biol ; 719: 547-71, 2011.
Article in English | MEDLINE | ID: mdl-21370102

ABSTRACT

Genomic, proteomic, and other omic-based approaches are now broadly used in biomedical research to facilitate the understanding of disease mechanisms and identification of molecular targets and biomarkers for therapeutic and diagnostic development. While the Omics technologies and bioinformatics tools for analyzing Omics data are rapidly advancing, the functional analysis and interpretation of the data remain challenging due to the inherent nature of the generally long workflows of Omics experiments. We adopt a strategy that emphasizes the use of curated knowledge resources coupled with expert-guided examination and interpretation of Omics data for the selection of potential molecular targets. We describe a downstream workflow and procedures for functional analysis that focus on biological pathways, from which molecular targets can be derived and proposed for experimental validation.


Subject(s)
Computational Biology/methods , Animals , Biomarkers/metabolism , Data Interpretation, Statistical , Data Mining , Humans , Information Management , Literature, Modern , Mice , Molecular Sequence Annotation , Phenotype , Protein Interaction Mapping , Rats , Software
4.
AMIA Annu Symp Proc ; 2009: 640-4, 2009 Nov 14.
Article in English | MEDLINE | ID: mdl-20351933

ABSTRACT

Glycosylation is a common and complex protein post-translational modification (PTM). In particular, mucin-type O-linked glycosylation is abundant and has important biological functions. The number of determined glycosylation sites is still small, and there remains a need for accurate computational prediction for annotation and functional understanding of proteins. PTM site prediction can be formulated as a machine learning task. An important step in applying machine learning to this task is encoding protein fragments as feature vectors. Here we assess existing encoding methods as well as an enhanced encoding method named composition of monomer spectrum (CMS) using support vector machines (SVMs). SVMs employing the existing encoding methods achieved AUC (area under ROC curve) of 90.3-91.3%, and ones employing CMS achieved AUC of 92.4%. Analysis of different encoding methods suggests potential for further improving the prediction.
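The encoding step mentioned in this abstract can be illustrated with a minimal sketch. It is not the CMS method, which the abstract does not specify. It shows two generic encodings (one-hot per position, and amino acid composition) of a fixed-length fragment centred on a serine or threonine. The window size, the padding symbol, and all function names are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical padding symbol for positions beyond the sequence ends


def candidate_sites(sequence: str) -> list[int]:
    """Serine and threonine positions are the candidate O-glycosylation sites."""
    return [i for i, residue in enumerate(sequence) if residue in "ST"]


def window(sequence: str, site: int, flank: int) -> str:
    """Return the fragment of length 2*flank+1 centred on 0-based index `site`."""
    left = max(0, site - flank)
    right = min(len(sequence), site + flank + 1)
    pad_left = PAD * (flank - (site - left))
    pad_right = PAD * (flank - (right - 1 - site))
    return pad_left + sequence[left:right] + pad_right


def one_hot(fragment: str) -> list[int]:
    """Position-specific encoding: 20 indicators per position, padding is all zeros."""
    vector = []
    for residue in fragment:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def composition(fragment: str) -> list[float]:
    """Position-independent encoding: fraction of each amino acid in the fragment."""
    residues = [r for r in fragment if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero for an all-padding fragment
    return [residues.count(aa) / total for aa in AMINO_ACIDS]
```

Each candidate site yields one feature vector, for example `one_hot(window(seq, site, 2))`, which could then be passed to any classifier such as an SVM. Which encoding works best is an empirical question of the kind the abstract reports on.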


Subject(s)
Artificial Intelligence , Computational Biology/methods , Glycosylation , Mucins/metabolism , Algorithms , Area Under Curve , Binding Sites , Protein Processing, Post-Translational
5.
J Proteomics Bioinform ; 1(2): 47-60, 2008 May.
Article in English | MEDLINE | ID: mdl-19088860

ABSTRACT

Functional analysis and interpretation of large-scale proteomics and gene expression data require effective use of bioinformatics tools and public knowledge resources coupled with expert-guided examination. An integrated bioinformatics approach was used to analyze cellular pathways in response to ionizing radiation. ATM, or ataxia-telangiectasia mutated, a serine-threonine protein kinase, plays critical roles in radiation responses, including cell cycle arrest and DNA repair. We analyzed radiation responsive pathways based on 2D-gel/MS proteomics and microarray gene expression data from fibroblasts expressing wild type or mutant ATM gene. The analysis showed that metabolism was significantly affected by radiation in an ATM dependent manner. In particular, purine metabolic pathways were differentially changed in the two cell lines. The expression of ribonucleoside-diphosphate reductase subunit M2 (RRM2) was increased in ATM-wild type cells at both mRNA and protein levels, but no changes were detected in ATM-mutated cells. Increased expression of p53 was observed 30 min after irradiation of the ATM-wild type cells. These results suggest that RRM2 is a downstream target of the ATM-p53 pathway that mediates radiation-induced DNA repair. We demonstrated that the integrated bioinformatics approach facilitated pathway analysis, hypothesis generation and target gene/protein identification.

6.
BMC Bioinformatics ; 8 Suppl 9: S5, 2007 Nov 27.
Article in English | MEDLINE | ID: mdl-18047706

ABSTRACT

MOTIVATION: With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: i) how well a system performs in detecting LFs from novel text, ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases. METHOD: We evaluated the following three publicly available detection systems in detecting LFs for SFs: i) a handcrafted pattern/rule based system by Ao and Takagi, ALICE, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (the Unified Medical Language System), and ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. RESULTS: We found that detection systems agree with each other in most cases, and the existing terminological knowledge bases have good coverage of synonymous relationships for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. AVAILABILITY: The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
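To make the SF/LF detection task concrete, here is a simplified sketch of the alignment idea: match the short form's characters right to left against the text preceding a parenthesis, requiring the first character to start a word. It is written in the spirit of alignment-based detection and is not the code of any of the three evaluated systems. The regular expression, the candidate-window limit, and the function names are assumptions.

```python
import re

# Hypothetical pattern: a parenthetical of 2 to 10 characters is a short-form candidate.
PAREN = re.compile(r"\(([^()]{2,10})\)")


def best_long_form(short_form: str, candidate: str):
    """Align short-form characters right-to-left against the candidate text.

    Returns the shortest suffix of `candidate` that contains the characters of
    `short_form` in order, with the first character at a word start, or None.
    """
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            s_index -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the beginning of a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def find_definitions(text: str):
    """Return (short form, long form) pairs for 'long form (SF)' patterns in text."""
    pairs = []
    for match in PAREN.finditer(text):
        short_form = match.group(1).strip()
        if " " in short_form or not any(c.isalpha() for c in short_form):
            continue
        words = text[: match.start()].split()
        # Assumed search window: a few more words than the short form has characters.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        long_form = best_long_form(short_form, " ".join(words[-max_words:]))
        if long_form:
            pairs.append((short_form, long_form))
    return pairs
```

Applied to a sentence from this abstract, `find_definitions("We denote acronyms as short forms (SFs) and their definitions as long forms (LFs).")` pairs `SFs` with `short forms` and `LFs` with `long forms`. The reverse pattern, such as "the UMLS (the Unified Medical Language System)", is not handled by this sketch.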


Subject(s)
Algorithms , Artificial Intelligence , Database Management Systems , MEDLINE , Natural Language Processing , Periodicals as Topic , Cluster Analysis , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Semantics , User-Computer Interface
7.
Front Biosci ; 12: 5071-88, 2007 Sep 01.
Article in English | MEDLINE | ID: mdl-17569631

ABSTRACT

In the post-genome era, researchers are systematically tackling gene functions and complex regulatory processes by studying organisms on a global scale; however, a major challenge lies in the voluminous, complex, and dynamic data being maintained in heterogeneous sources, especially from proteomics experiments. Advanced computational methods are needed for integration, mining, comparative analysis, and functional interpretation of high-throughput proteomic data. In the first part of this review, we discuss aspects of data integration important for capturing all data relevant to functional analysis. We provide a list of databases commonly used in genomics and proteomics and explain strategies to connect the source data, with special emphasis on our ID mapping service. Next, we describe iProClass, a central data infrastructure that supports both data integration and functional annotation of proteins, and give a brief introduction to the data search/retrieval and analysis tools currently available at our website (http://pir.georgetown.edu) that researchers can use for large-scale functional analysis. In the last part, we introduce iProXpress (integrated Protein eXpression), an integrated research and discovery platform for large-scale expression data analysis, and we show a prototype that has been useful for organelle proteome analysis.


Subject(s)
Computational Biology/methods , Gene Expression Profiling/methods , Proteomics , Databases, Genetic , Genomics , Humans , Internet , Melanosomes/metabolism , Organelles/metabolism , Peptide Mapping , Proteome , Software
8.
Virus Genes ; 35(2): 175-86, 2007 Oct.
Article in English | MEDLINE | ID: mdl-17508277

ABSTRACT

We have identified 72 completely conserved amino acid residues in the E protein of major groups of the Flavivirus genus by computational analyses. In the dengue species we have identified 12 highly conserved sequence regions, 186 negatively selected sites, and many dengue serotype-specific negatively selected sites. The flavivirus-conserved sites included residues involved in forming six disulfide bonds crucial for the structural integrity of the protein, the fusion motif involved in viral infectivity, and the interface residues of the oligomers. The structural analysis of the E protein showed 19 surface-exposed non-conserved residues, 128 dimer or trimer interface residues, and regions that undergo major conformational change during trimerization. Eleven consensus T(h)-cell epitopes common to all four dengue serotypes were predicted. Most of these corresponded to dengue-conserved regions or negatively selected sites. Of special interest are six singular sites (N(37), Q(211), D(215), P(217), H(244), K(246)) in dengue E protein that are conserved, are part of the predicted consensus T(h)-cell epitopes and are exposed in the dimer or trimer. We propose these sites and corresponding epitopic regions as potential candidates for prioritization by experimental biologists for development of diagnostics and vaccines that may be difficult to circumvent by natural or man-made alteration of dengue virus.


Subject(s)
Amino Acids/genetics , Computational Biology , Dengue Vaccines/immunology , Dengue Virus/genetics , Dengue/diagnosis , Dengue/virology , Sequence Analysis, Protein , Viral Envelope Proteins/genetics , Amino Acid Sequence , Amino Acids/physiology , Conserved Sequence , Dengue/prevention & control , Dengue Vaccines/administration & dosage , Dengue Vaccines/genetics , Dengue Virus/immunology , Dengue Virus/physiology , Gene Targeting , Molecular Sequence Data , Sequence Alignment , Viral Envelope Proteins/administration & dosage , Viral Envelope Proteins/chemistry
9.
Int J Mass Spectrom ; 259(1-3): 147-160, 2007 Jan 01.
Article in English | MEDLINE | ID: mdl-17375895

ABSTRACT

Complete and accurate profiling of cellular organelle proteomes, while challenging, is important for the understanding of detailed cellular processes at the organelle level. Mass spectrometry technologies coupled with bioinformatics analysis provide an effective approach for protein identification and functional interpretation of organelle proteomes. In this study, we have compiled human organelle reference datasets from large-scale proteomic studies and protein databases for 7 lysosome-related organelles (LROs), as well as the endoplasmic reticulum and mitochondria, for comparative organelle proteome analysis. Heterogeneous sources of human organelle proteins and rodent homologs are mapped to human UniProtKB protein entries based on ID and/or peptide mappings, followed by functional annotation and categorization using the iProXpress proteomic expression analysis system. Cataloging organelle proteomes allows close examination of both shared and unique proteins among various LROs and reveals their functional relevance. The proteomic comparisons show that LROs are a closely related family of organelles. The shared proteins indicate the dynamic and hybrid nature of LROs, while the unique transmembrane proteins may represent additional candidate marker proteins for LROs. This comparative analysis, therefore, provides a basis for hypothesis formulation and experimental validation of organelle proteins and their functional roles.

10.
Bioinformatics ; 23(2): 198-206, 2007 Jan 15.
Article in English | MEDLINE | ID: mdl-17077095

ABSTRACT

MOTIVATION: Our purpose is to develop a statistical modeling approach for cancer biomarker discovery and provide new insights into early cancer detection. We propose the concept of a dependence network, apply it to identify cancer biomarkers, and study the difference between the protein or gene samples from cancer and non-cancer subjects based on mass-spectrometry (MS) and microarray data. RESULTS: Three MS and two gene microarray datasets are studied. Clear differences are observed in the dependence networks for cancer and non-cancer samples. Protein/gene features are examined three at a time through an exhaustive search. Dependence networks are constructed by binding triples identified by the eigenvalue pattern of the dependence model, and are further compared to identify cancer biomarkers. Such dependence-network-based biomarkers show much greater consistency under 10-fold cross-validation than the classification-performance-based biomarkers. Furthermore, the biological relevance of the dependence-network-based biomarkers using microarray data is discussed. The proposed scheme is shown to be promising for cancer diagnosis and prediction. AVAILABILITY: See supplements: http://dsplab.eng.umd.edu/~genomics/dependencenetwork/


Subject(s)
Biomarkers, Tumor/analysis , Diagnosis, Computer-Assisted/methods , Mass Spectrometry/methods , Neoplasm Proteins/analysis , Neoplasms/diagnosis , Neoplasms/metabolism , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Computer Simulation , Gene Expression Profiling/methods , Humans , Models, Biological , Signal Transduction
11.
J Proteome Res ; 5(11): 3135-44, 2006 Nov.
Article in English | MEDLINE | ID: mdl-17081065

ABSTRACT

Melanin, which is responsible for virtually all visible skin, hair, and eye pigmentation in humans, is synthesized, deposited, and distributed in subcellular organelles termed melanosomes. A comprehensive determination of the protein composition of this organelle has been obstructed by the melanin present. Here, we report a novel method of removing melanin that includes in-solution digestion and immobilized metal affinity chromatography (IMAC). Together with in-gel digestion, this method has allowed us to characterize melanosome proteomes at various developmental stages by tandem mass spectrometry. Comparative profiling and functional characterization of the melanosome proteomes identified approximately 1500 proteins in melanosomes of all stages, with approximately 600 in any given stage. These proteins include 16 homologous to mouse coat color genes and many associated with human pigmentary diseases. Approximately 100 proteins shared by melanosomes from pigmented and nonpigmented melanocytes define the essential melanosome proteome. Proteins validated by confirming their intracellular localization include PEDF (pigment-epithelium derived factor) and SLC24A5 (sodium/potassium/calcium exchanger 5, NCKX5). The sharing of proteins between melanosomes and other lysosome-related organelles suggests a common evolutionary origin. This work represents a model for the study of the biogenesis of lysosome-related organelles.


Subject(s)
Melanosomes/physiology , Proteomics/methods , Animals , Cell Line, Tumor , Chromatography, Affinity , Computational Biology/methods , Eye Color , Gene Expression Profiling , Hair Color , Humans , Melanoma , Melanosomes/chemistry , Mice , Neoplasm Proteins/chemistry , Neoplasm Proteins/genetics , Neoplasm Proteins/isolation & purification , Organelle Biogenesis , Peptide Fragments/chemistry , Peptide Fragments/isolation & purification , Reproducibility of Results , Trypsin
12.
Bioinformatics ; 22(17): 2136-42, 2006 Sep 01.
Article in English | MEDLINE | ID: mdl-16837530

ABSTRACT

MOTIVATION: Attribute selection is a critical step in development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Owing to high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and could also remove informative stems. This can lead to accuracy reduction, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes. RESULTS: The approach was tested on five annotated sets of abstracts from iProLINK that report on the experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better [with area under the ROC curve (AUC) accuracy in range 0.92-0.97] when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in 0.86-0.93 range). The proposed approach is particularly useful when labeled datasets are small.
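The idea of replacing stems with discriminative substrings can be sketched as follows. The abstract does not state how discriminative power is measured, so the score used here (absolute difference in document frequency between the two classes) is an assumption, as are the substring length limits and the function names.

```python
from collections import Counter


def substrings(document: str, min_len: int = 3, max_len: int = 6) -> set:
    """All distinct character substrings of each whitespace-separated token."""
    found = set()
    for token in document.lower().split():
        for n in range(min_len, max_len + 1):
            for i in range(len(token) - n + 1):
                found.add(token[i : i + n])
    return found


def top_substrings(positive_docs, negative_docs, k: int = 5) -> list:
    """Rank substrings by how differently often they occur in the two classes.

    Score (assumed for illustration): |fraction of positive documents containing
    the substring - fraction of negative documents containing it|.
    Ties are broken by preferring longer substrings, then alphabetically.
    """
    pos, neg = Counter(), Counter()
    for doc in positive_docs:
        pos.update(substrings(doc))
    for doc in negative_docs:
        neg.update(substrings(doc))

    def score(s: str) -> float:
        return abs(pos[s] / len(positive_docs) - neg[s] / len(negative_docs))

    candidates = set(pos) | set(neg)
    return sorted(candidates, key=lambda s: (-score(s), -len(s), s))[:k]
```

With two positive toy abstracts mentioning "phosphorylated" and "phosphorylation", substrings of the shared prefix rank highest even though the two words differ in their endings. The selected substrings would then serve as binary attributes for a Naive Bayes or SVM classifier.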


Subject(s)
Abstracting and Indexing/methods , Database Management Systems , Documentation/methods , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Periodicals as Topic , Algorithms , Artificial Intelligence , Vocabulary, Controlled
13.
J Am Med Inform Assoc ; 13(5): 497-507, 2006.
Article in English | MEDLINE | ID: mdl-16799122

ABSTRACT

OBJECTIVE: Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step for biological literature mining is biological named entity tagging (BNET) that identifies names mentioned in text and normalizes them with entries in biological databases. The aim of this study was to provide quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene/protein names for UniProt knowledgebase (UniProtKB) entries that was acquired using online resources. METHODS: We evaluated the complexity through several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text included in the thesaurus). We also normalized names in BioThesaurus and measures were obtained twice, once before normalization and once after. RESULTS: The current version of BioThesaurus has over 2.6 million names or 2.1 million normalized names covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), ambiguity is 2.31 before normalization and 2.32 after, while the coverage is 94.0% based on the BioCreAtive data set comprising MEDLINE abstracts containing genes/proteins. CONCLUSION: The study indicated that names for genes/proteins are highly ambiguous and there are usually multiple names for the same gene or protein. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.
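The three measures defined in the METHODS can be computed from a list of (name, entry) pairs, as in the sketch below. It assumes that "average synonymy" is the mean number of names per entry and "average ambiguity" is the mean number of entries per name; the abstract does not spell out the averaging. The identifiers in the usage note are toy values, not real UniProtKB accessions.

```python
from collections import defaultdict


def thesaurus_statistics(pairs, names_in_text):
    """Return (synonymy, ambiguity, coverage %) for a name-to-entry thesaurus.

    pairs          -- iterable of (name, entry_id) tuples
    names_in_text  -- list of gene/protein names found in a text corpus
    """
    names_by_entry = defaultdict(set)
    entries_by_name = defaultdict(set)
    for name, entry in pairs:
        names_by_entry[entry].add(name)
        entries_by_name[name].add(entry)

    # Synonymy: average number of names associated with the same entry.
    synonymy = sum(len(names) for names in names_by_entry.values()) / len(names_by_entry)
    # Ambiguity: average number of entries represented by one name.
    ambiguity = sum(len(entries) for entries in entries_by_name.values()) / len(entries_by_name)
    # Coverage: percentage of names in text that are included in the thesaurus.
    covered = sum(1 for name in names_in_text if name in entries_by_name)
    coverage = 100.0 * covered / len(names_in_text)
    return synonymy, ambiguity, coverage
```

For example, with the toy pairs `[("p53", "E1"), ("TP53", "E1"), ("tumor protein p53", "E1"), ("p53", "E2"), ("CAT", "E3")]` and the text names `["p53", "CAT", "XYZ1", "TP53"]`, the function returns a synonymy of 5/3, an ambiguity of 1.25, and a coverage of 75.0. Running it once on raw names and once on normalized names would mirror the before/after comparison described in the abstract.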


Subject(s)
Natural Language Processing , Proteins , Vocabulary, Controlled , Dictionaries as Topic , Genes , Names
14.
Beijing Da Xue Xue Bao Yi Xue Ban ; 38(2): 218-21, 2006 Apr 18.
Article in Chinese | MEDLINE | ID: mdl-16617371

ABSTRACT

A critical factor in the advancement of biomedical research is the ease with which data can be integrated, redistributed and analyzed both within and across domains. This paper summarizes the biomedical information core infrastructure built by the U.S. National Cancer Institute Center for Bioinformatics (NCICB). The main product of the core infrastructure is caCORE (cancer Common Ontologic Reference Environment), the infrastructure backbone supporting data management and application development at NCICB. The paper explains the structure and function of caCORE: (1) Enterprise Vocabulary Services (EVS), which provide controlled vocabulary, dictionary and thesaurus services and produce the NCI Thesaurus and the NCI Metathesaurus; (2) the Cancer Data Standards Repository (caDSR), which provides a metadata registry for common data elements; and (3) Cancer Bioinformatics Infrastructure Objects (caBIO), which provide Java, Simple Object Access Protocol and HTTP-XML application programming interfaces. The vision for caCORE is to provide a common data management framework that will support the consistency, clarity, and comparability of biomedical research data and information. In addition to providing facilities for data management and redistribution, caCORE helps solve problems of data integration. All NCICB-developed caCORE components are distributed under open-source licenses that support unrestricted usage by both non-profit and commercial entities, and caCORE has laid the foundation for a number of scientific and clinical applications. The paper then briefly describes caCORE-based applications in several NCI projects, including CMAP (Cancer Molecular Analysis Project) and caBIG (Cancer Biomedical Informatics Grid). Finally, the paper discusses the prospects of caCORE: although caCORE was born out of the needs of the cancer research community, it is intended to serve as a general resource. 
Cancer research has historically contributed to many areas beyond tumor biology. The paper also makes some suggestions for current research on biomedical informatics in China.


Subject(s)
Computational Biology , Database Management Systems , Information Storage and Retrieval , Medical Informatics , National Cancer Institute (U.S.) , Software , United States
15.
Bioinformatics ; 22(1): 103-5, 2006 Jan 01.
Article in English | MEDLINE | ID: mdl-16267085

ABSTRACT

UNLABELLED: BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to protein entries in the UniProt Knowledgebase. Currently covering more than two million proteins, BioThesaurus consists of over 2.8 million names extracted from multiple molecular biological databases according to the database cross-references in iProClass. The BioThesaurus web site allows the retrieval of synonymous names of given protein entries and the identification of protein entries sharing the same names. AVAILABILITY: BioThesaurus is accessible for online searching at http://pir.georgetown.edu/iprolink/biothesaurus


Subject(s)
Computational Biology/methods , Vocabulary, Controlled , Animals , Databases, Factual , Databases, Genetic , Databases, Protein , Genome , Humans , Information Storage and Retrieval , Internet , Models, Genetic , Names , Proteins , Terminology as Topic
16.
Beijing Da Xue Xue Bao Yi Xue Ban ; 37(4): 445-7, 2005 Aug 18.
Article in Chinese | MEDLINE | ID: mdl-16086073

ABSTRACT

The National Institutes of Health (NIH) released the NIH Roadmap Initiatives, a biomedical research project comprising three themes: new pathways to discovery, research teams of the future, and re-engineering the clinical research enterprise. The purpose of the project is to catalyze the transformation of new scientific knowledge into tangible benefits for people. Most of the project has now begun to be put into practice.


Subject(s)
Biomedical Research/trends , Health Promotion/methods , Forecasting , Humans , National Institutes of Health (U.S.) , Organizational Objectives , Research Support as Topic , United States
17.
BMC Bioinformatics ; 6: 201, 2005 Aug 09.
Article in English | MEDLINE | ID: mdl-16091147

ABSTRACT

BACKGROUND: A large volume of data and information about genes and gene products has been stored in various molecular biology databases. A major challenge for knowledge discovery using these databases is to identify related genes and gene products in disparate databases. The development of Gene Ontology (GO) as a common vocabulary for annotation allows integrated queries across multiple databases and identification of semantically related genes and gene products (i.e., genes and gene products that have similar GO annotations). Meanwhile, dozens of tools have been developed for browsing, mining or editing GO terms, their hierarchical relationships, or their "associated" genes and gene products (i.e., genes and gene products annotated with GO terms). Tools that allow users to directly search and inspect relations among all GO terms and their associated genes and gene products from multiple databases are needed. RESULTS: We present a standalone package called DynGO, which provides several advanced functionalities in addition to the standard browsing capability of the official GO browsing tool (AmiGO). DynGO allows users to conduct batch retrieval of GO annotations for a list of genes and gene products, and semantic retrieval of genes and gene products sharing similar GO annotations. The results are shown in an association tree organized according to GO hierarchies and supported with many dynamic display options such as sorting tree nodes or changing orientation of the tree. For GO curators and frequent GO users, DynGO provides fast and convenient access to GO annotation data. DynGO is generally applicable to any data set where the records are annotated with GO terms, as illustrated by two examples. CONCLUSION: We have presented a standalone package DynGO that provides functionalities to search and browse GO and its association databases as well as several additional functions such as batch retrieval and semantic retrieval. 
The complete documentation and software are freely available for download from the website http://biocreative.ifsm.umbc.edu/dyngo.


Subject(s)
Data Display , Databases, Genetic , Information Storage and Retrieval/methods , Software , User-Computer Interface , Computer Graphics , Documentation/methods , Genes , Semantics , Vocabulary
18.
Comput Biol Chem ; 28(5-6): 409-16, 2004 Dec.
Article in English | MEDLINE | ID: mdl-15556482

ABSTRACT

The exponential growth of large-scale molecular sequence data and of the PubMed scientific literature has prompted active research in biological literature mining and information extraction to facilitate genome/proteome annotation and improve the quality of biological databases. Motivated by the promise of text mining methodologies, but at the same time, the lack of adequate curated data for training and benchmarking, the Protein Information Resource (PIR) has developed a resource for protein literature mining--iProLINK (integrated Protein Literature INformation and Knowledge). As PIR focuses its effort on the curation of the UniProt protein sequence database, the goal of iProLINK is to provide curated data sources that can be utilized for text mining research in the areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. The data sources for bibliography mapping and annotation extraction include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter includes several hundred abstracts and full-text articles tagged with experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database. The data sources for entity recognition and ontology development include a protein name dictionary, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, as well as a protein ontology based on PIRSF protein family names. iProLINK is freely accessible at http://pir.georgetown.edu/iprolink, with hypertext links for all downloadable files.


Subject(s)
Databases, Protein , Information Services , Proteins/chemistry , Computational Biology , Databases, Bibliographic , Internet , Proteins/classification , Proteins/genetics , PubMed , Systems Integration
19.
Nucleic Acids Res ; 32(Database issue): D112-4, 2004 Jan 01.
Article in English | MEDLINE | ID: mdl-14681371

ABSTRACT

The Protein Information Resource (PIR) is an integrated public resource of protein informatics. To facilitate the sensible propagation and standardization of protein annotation and the systematic detection of annotation errors, PIR has extended its superfamily concept and developed the SuperFamily (PIRSF) classification system. Based on the evolutionary relationships of whole proteins, this classification system allows annotation of both specific biological and generic biochemical functions. The system adopts a network structure for protein classification from superfamily to subfamily levels. Protein family members are homologous (sharing common ancestry) and homeomorphic (sharing full-length sequence similarity with common domain architecture). The PIRSF database consists of two data sets, preliminary clusters and curated families. The curated families include family name, protein membership, parent-child relationship, domain architecture, and optional description and bibliography. PIRSF is accessible from the website at http://pir.georgetown.edu/pirsf/ for report retrieval and sequence classification. The report presents family annotation, membership statistics, cross-references to other databases, graphical display of domain architecture, and links to multiple sequence alignments and phylogenetic trees for curated families. PIRSF can be utilized to analyze phylogenetic profiles, to reveal functional convergence and divergence, and to identify interesting relationships between homeomorphic families, domains and structural classes.


Subject(s)
Computational Biology , Databases, Protein , Proteins/chemistry , Proteins/classification , Amino Acid Motifs , Animals , Evolution, Molecular , Humans , Information Storage and Retrieval , Internet , Protein Structure, Tertiary
20.
J Steroid Biochem Mol Biol ; 82(2-3): 263-8, 2002 Oct.
Article in English | MEDLINE | ID: mdl-12477494

ABSTRACT

Human prolactin receptor (hPRLR) expression is regulated by estradiol-17beta (E(2)) in vivo in animal tissues, and in vitro in normal human endometrial cells and in MCF7 human breast cancer cells. The objective of this study was to determine the effect of E(2) on the expression of two recently described hPRLR isoforms with distinct exons-1, hE1(3) and hE1(N1), that are transcribed from the generic hPIII promoter, also present in the rat and mouse, and the human-specific promoter hP(N1), respectively. A second objective was to determine the effect of estradiol on hPIII promoter activity in cancer cells. T47D breast cancer cells were examined using quantitative competitive RT-PCR for the level of expression of two alternative non-coding exon-1 transcripts, hE1(3) and hE1(N1), following incubation with E(2) in the presence or absence of the E(2) receptor antagonist ICI 182,780. The effects of estradiol were also evaluated in cells transiently transfected with constructs of the hPIII promoter luciferase reporter gene. E(2) significantly increased the expression of both hPRLR mRNA transcripts, hE1(3) and hE1(N1). In transfection studies, E(2) activated the hPIII promoter. This effect of estradiol was markedly inhibited by coincubation with the E(2) receptor antagonist. Our results demonstrate a stimulatory effect of estradiol on the expression of hPRLR mRNA species with alternative exons-1, hE1(3) and hE1(N1), possibly through activation of their corresponding promoters. The lack of a formal ERE in these promoters suggested that the effect of estradiol is mediated through association of the activated ER with relevant DNA binding transfactor(s). These findings support the role of E(2) in the regulation of hPRLR expression in human breast cancer cell lines.


Subject(s)
Alternative Splicing , Estradiol/metabolism , Exons , Protein Isoforms/metabolism , Receptors, Prolactin/metabolism , Animals , Breast Neoplasms , Female , Gene Expression Regulation, Neoplastic , Genes, Reporter , Humans , Promoter Regions, Genetic , Protein Isoforms/genetics , Receptors, Estrogen/genetics , Receptors, Estrogen/metabolism , Receptors, Prolactin/genetics , Tumor Cells, Cultured