1.
J Adv Res ; 2023 Dec 29.
Article in English | MEDLINE | ID: mdl-38159844

ABSTRACT

INTRODUCTION: The population of Taiwan has a long history of ethno-cultural evolution. The Taiwanese population was isolated from other large populations such as the European, Han Chinese, and Japanese populations. The Taiwan Biobank (TWB) project has built a nationwide database, particularly of personal whole-genome sequences (WGS), to facilitate basic and clinical collaboration nationally and internationally, making it one of the most valuable public datasets of the East Asian population. OBJECTIVES: This study provides comprehensive medical genomic findings from TWB WGS data, for better characterization of disease susceptibility and the choice of ideal treatment regimens in the Taiwanese population. METHODS: We reanalyzed 1496 WGS using Sentieon DNAscope, a PrecisionFDA Truth Challenge winning method. Single nucleotide variants (SNV) and small insertions/deletions (INDEL) were benchmarked. We also analyzed pharmacogenomic (PGx) drug-associated alleles and copy number variants (CNV). Multiple practicing clinicians reviewed and curated the clinically significant variants. Variant annotations can be browsed at TaiwanGenomes (https://genomes.tw). RESULTS: We found that each participant had an average of 6,870.7 globally novel variants and 75.3% (831/1103) of the participants harbored at least one PharmGKB-selected high evidence level human leukocyte antigen (HLA) risk allele. There were 54 PharmGKB-reported Cytochrome P450 variant-drug pairs with high-level evidence, with a population frequency of over 13.2%. We also identified 23 variants in the ACMG secondary finding V3 gene list from 25 participants, suggesting that 1.67% (25/1496) of the population harbors at least one medically actionable variant. Our carrier status analyses suggest that one in 25 couples (3.94%) would risk having offspring with at least one pathogenic variant, which is in line with rates found in Japan and Singapore. For pathogenic CNV, we detected 6.88% and 2.02% carrier rates for alpha thalassemia and spinal muscular atrophy, respectively. CONCLUSION: Our study highlights the overall medical insights of a complete Taiwanese genomic profile.
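As a rough illustration of how a carrier rate relates to couple-level risk, the Python sketch below computes the fraction of couples in which both partners carry a variant in the same gene. It assumes random mating and independent carrier status, which is an assumption of this sketch and not a description of the study's carrier analysis. The function name is hypothetical; the two rates are the CNV carrier rates quoted in the abstract.

```python
def both_carrier_fraction(carrier_rate: float) -> float:
    """Fraction of couples in which both partners are carriers.

    Assumes random mating, so the partners' carrier statuses are
    independent and the joint probability is the product of the rates.
    """
    if not 0.0 <= carrier_rate <= 1.0:
        raise ValueError("carrier_rate must be between 0 and 1")
    return carrier_rate * carrier_rate


# Carrier rates quoted in the abstract for pathogenic CNVs.
alpha_thalassemia = both_carrier_fraction(0.0688)
sma = both_carrier_fraction(0.0202)
print(f"alpha thalassemia: {alpha_thalassemia:.4%} of couples")
print(f"spinal muscular atrophy: {sma:.4%} of couples")
```

This single-gene product is a simplification; the 3.94% figure in the abstract aggregates many genes and is not reproduced by this calculation.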

2.
BMC Bioinformatics ; 20(Suppl 24): 677, 2019 Dec 20.
Article in English | MEDLINE | ID: mdl-31861981

ABSTRACT

BACKGROUND: Signal peptides play an important role in protein sorting, which is the mechanism whereby proteins are transported to their destination. Recognition of signal peptides is an important first step in determining the active locations and functions of proteins. Many computational methods have been proposed to facilitate signal peptide recognition. In recent years, deep learning methods have advanced significantly in many research fields. However, most existing models for signal peptide recognition use one-hidden-layer neural networks or hidden Markov models, which are relatively simple in comparison with the deep neural networks that are used in other fields. RESULTS: This study proposes a convolutional neural network without fully connected layers, a design that has been an important network improvement in computer vision. The proposed network is more complex than current signal peptide predictors. The experimental results show that the proposed network outperforms current signal peptide predictors on eukaryotic data. This study also demonstrates how model reduction and data augmentation help the proposed network to predict bacterial data. CONCLUSIONS: The study makes three contributions to this subject: (a) an accurate signal peptide recognizer is developed, (b) the potential to leverage advanced networks from other fields is demonstrated and (c) important modifications are proposed for adopting complex networks in signal peptide recognition.


Subject(s)
Semantics , Deep Learning , Neural Networks, Computer , Protein Sorting Signals , Software
3.
BMC Bioinformatics ; 19(1): 169, 2018 05 09.
Article in English | MEDLINE | ID: mdl-29743010

ABSTRACT

BACKGROUND: Zebrafish is a widely used model organism for studying heart development and cardiac-related pathogenesis. With its ability to survive without a functional circulation at larval stages, strong genetic similarity to mammals, prolific reproduction and optically transparent embryos, zebrafish is powerful in modeling mammalian cardiac physiology and pathology as well as in large-scale high throughput screening. However, an economical and convenient tool for rapid evaluation of fish cardiac function is still needed. Several image analysis methods exist to assess cardiac functions in zebrafish embryos/larvae, but they still require manual intervention that could be reduced. This work developed a fully automatic method to calculate heart rate, an important parameter to analyze cardiac function, from videos. It contains several filters to identify the heart region, to reduce video noise and to calculate heart rates. RESULTS: The proposed method was evaluated with 32 zebrafish larval cardiac videos that were recorded at three days post-fertilization. The heart rate measured by the proposed method was comparable to that determined by manual counting. The experimental results show that the proposed method does not lose accuracy while largely reducing the labor cost and uncertainty of manual counting. CONCLUSIONS: With the proposed method, researchers do not have to manually select a region of interest before analyzing videos. Moreover, filters designed to reduce video noise can alleviate background fluctuations during the video recording stage (e.g. shifting), which makes it easier to produce usable videos and therefore reduces manual effort while recording.
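To make the last step of such a pipeline concrete, here is a minimal Python sketch that turns a per-frame intensity trace of a heart region into a heart rate by counting upward crossings of the mean. It is an illustration under stated assumptions, not the filters described in the abstract: heart-region detection and noise reduction are assumed to have already happened, and the function name, the synthetic trace and the frame rate are all hypothetical.

```python
import math


def heart_rate_bpm(intensity: list[float], fps: float) -> float:
    """Estimate beats per minute from per-frame mean intensities."""
    if len(intensity) < 2 or fps <= 0:
        raise ValueError("need at least two frames and a positive frame rate")
    mean = sum(intensity) / len(intensity)
    centered = [value - mean for value in intensity]
    # Count upward crossings of the mean: one per cardiac cycle.
    beats = sum(
        1 for prev, cur in zip(centered, centered[1:]) if prev < 0 <= cur
    )
    duration_s = len(intensity) / fps
    return beats / duration_s * 60.0


# Synthetic 10 s trace at 30 frames/s with a 2 Hz oscillation (hypothetical data).
fps = 30.0
trace = [
    100.0 + 5.0 * math.sin(2 * math.pi * 2.0 * (i / fps) + 0.5)
    for i in range(300)
]
print(heart_rate_bpm(trace, fps))
```

Crossing counts are sensitive to noise, so a real recording would need smoothing or a frequency-domain estimate before this step.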


Subject(s)
Heart Rate/physiology , Larva/physiology , Videotape Recording/methods , Zebrafish/physiology , Animals
4.
Article in English | MEDLINE | ID: mdl-27242036

ABSTRACT

In eukaryotic cells, transcriptional regulation of gene expression is usually accomplished by cooperative Transcription Factors (TFs). Therefore, knowing cooperative TFs is helpful for uncovering the mechanisms of transcriptional regulation. In yeast, many cooperative TF pairs have been predicted by various algorithms in the literature. However, there is still no database that collects the predicted yeast cooperative TFs from existing algorithms. This prompted us to construct the Cooperative Transcription Factors Database (CoopTFD), which has a comprehensive collection of 2622 predicted cooperative TF pairs (PCTFPs) in yeast from 17 existing algorithms. For each PCTFP, our database also provides five types of validation information: (i) the algorithms which predict this PCTFP, (ii) the publications which experimentally show that this PCTFP has physical or genetic interactions, (iii) the publications which experimentally study the biological roles of both TFs of this PCTFP, (iv) the common Gene Ontology (GO) terms of this PCTFP and (v) the common target genes of this PCTFP. Based on the provided validation information, users can judge the biological plausibility of a PCTFP of interest. We believe that CoopTFD will be a valuable resource for yeast biologists to study the combinatorial regulation of gene expression controlled by cooperative TFs. Database URL: http://cosbi.ee.ncku.edu.tw/CoopTFD/ or http://cosbi2.ee.ncku.edu.tw/CoopTFD/.


Subject(s)
Computational Biology/methods , Databases, Genetic , Gene Expression Regulation, Fungal/genetics , Saccharomyces cerevisiae Proteins/genetics , Transcription Factors/genetics , Algorithms
5.
Article in English | MEDLINE | ID: mdl-27016699

ABSTRACT

In many biological processes, proteins have important interactions with various molecules such as proteins, ions or ligands. Many proteins undergo conformational changes upon these interactions, where regions with large conformational changes are critical to the interactions. This work presents the CCProf platform, which provides conformational changes of entire proteins, referred to here as the conformational change profile (CCP). CCProf aims to be a platform where users can study potential causes of novel conformational changes. It provides 10 biological features, including conformational change, potential binding target site, secondary structure, conservation, disorder propensity, hydropathy propensity, sequence domain, structural domain, phosphorylation site and catalytic site. All this information is integrated into a well-aligned view, so that researchers can visually capture important relationships between different biological features. CCProf contains 986,187 protein structure pairs for 3123 proteins. In addition, CCProf provides a 3D view in which users can see the protein structures before and after conformational changes as well as binding targets that induce conformational changes. All information (e.g. CCP, binding targets and protein structures) shown in CCProf, including intermediate data, is available for download to expedite further analyses. Database URL: http://zoro.ee.ncku.edu.tw/ccprof/.


Subject(s)
Databases, Protein , Proteins/chemistry , Binding Sites , Conserved Sequence , Ligands , Protein Conformation , Search Engine , User-Computer Interface
6.
BMC Bioinformatics ; 16 Suppl 18: S11, 2015.
Article in English | MEDLINE | ID: mdl-26680734

ABSTRACT

BACKGROUND: Next-generation sequencing (NGS) technologies have brought an unprecedented amount of genomic data for analysis. Unlike array-based profiling technologies, NGS can reveal the expression profile across a transcript at the base level. Such base-level read coverage provides further insights for alternative mRNA splicing, single-nucleotide polymorphism (SNP), novel transcript discovery, etc. However, to the best of our knowledge, none of the existing NGS viewers can visualize genome-wide base-level read coverage in a timely manner in an interactive environment. RESULTS: This study proposes an efficient visualization pipeline and implements a lightweight read coverage viewer, Light-RCV, with the proposed pipeline. Light-RCV consists of four featured designs on the path from raw NGS data to the final visualized read coverage: i) read coverage construction algorithm, ii) multi-resolution profiles, iii) two-stage architecture and iv) storage format. With these designs, Light-RCV achieves a < 0.5s response time on any scale of genomic ranges, including whole chromosomes. Finally, a case study was performed to demonstrate the importance of visualizing base-level read coverage and the value of Light-RCV. CONCLUSIONS: Compared with multi-functional genome viewers such as Artemis, Savant, Tablet and Integrative Genomics Viewer (IGV), Light-RCV is designed only for visualization. Therefore, it does not provide advanced analyses. However, its backend technology provides an efficient kernel of base-level visualization that can be easily embedded in other viewers. This viewer is the first to provide timely visualization of genome-wide read coverage at the base level in an interactive environment. The software is available for free at http://lightrcv.ee.ncku.edu.tw.
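The ideas of read coverage construction and multi-resolution profiles can be sketched in a few lines of Python. This is a generic illustration, not Light-RCV's actual algorithm or storage format: it builds base-level depth with a difference array and then derives a coarser profile by averaging fixed-size bins. The function names, the half-open interval convention and the toy reads are assumptions of this sketch.

```python
def base_coverage(reads: list[tuple[int, int]], length: int) -> list[int]:
    """Per-base read depth from half-open [start, end) read intervals."""
    diff = [0] * (length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, length)
        if start < end:
            diff[start] += 1  # depth rises where a read begins
            diff[end] -= 1    # and falls just past where it ends
    coverage, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


def downsample(coverage: list[int], bin_size: int) -> list[float]:
    """Mean depth per bin; the last bin may be shorter than bin_size."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    bins = (coverage[i:i + bin_size] for i in range(0, len(coverage), bin_size))
    return [sum(chunk) / len(chunk) for chunk in bins]


# Two overlapping toy reads on an 8-base region (hypothetical data).
fine = base_coverage([(0, 4), (2, 6)], 8)
coarse = downsample(fine, 4)
print(fine, coarse)
```

Precomputing several such coarse profiles lets a viewer pick the resolution that matches the zoom level instead of scanning every base for a whole-chromosome view.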


Subject(s)
Algorithms , Genomics , Genome, Fungal , High-Throughput Nucleotide Sequencing , Internet , Polymorphism, Single Nucleotide , RNA Splicing , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA , User-Computer Interface
7.
BMC Syst Biol ; 8 Suppl 5: S9, 2014.
Article in English | MEDLINE | ID: mdl-25560196

ABSTRACT

BACKGROUND: Defining a measure for regulatory similarity (RS) of two genes is an important step toward identifying co-regulated genes. To date, transcription factor binding sites (TFBSs) have been widely used to measure the RS of two genes because transcription factors (TFs) binding to TFBSs in promoters is the most crucial and well understood step in gene regulation. However, existing TFBS-based RS measures consider the relation of a TFBS to a gene as a Boolean (either 'presence' or 'absence') without utilizing the information of TFBS locations in promoters. RESULTS: Functional TFBSs of many TFs in yeast are known to have a strong positional preference to occur in a small region in the promoters. This biological knowledge prompts us to develop a novel RS measure that exploits the TFBS location information. The performances of different RS measures are evaluated by the fraction of gene pairs that are co-regulated (validated by literature evidence) by at least one common TF under different RS scores. The experimental results show that the proposed RS measure is the best co-regulation indicator among the six compared RS measures. In addition, the co-regulated genes identified by the proposed RS measure are also shown to be able to benefit three co-regulation-based applications: detecting gene co-function, gene co-expression and protein-protein interactions. CONCLUSIONS: The proposed RS measure provides a good indicator for gene co-regulation. Besides, its good performance reveals the importance of the location information in TFBS-based RS measures.


Subject(s)
Algorithms , DNA/genetics , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA/methods , Base Sequence , Binding Sites , Molecular Sequence Data , Promoter Regions, Genetic/genetics , Protein Binding , Protein Interaction Mapping/methods , Transcription Factors
8.
PLoS One ; 8(9): e75940, 2013.
Article in English | MEDLINE | ID: mdl-24069454

ABSTRACT

Annotating protein functions and linking proteins with similar functions are important in systems biology. The rapid growth rate of newly sequenced genomes calls for the development of computational methods to help experimental techniques. Phylogenetic profiling (PP) is a method that exploits the evolutionary co-occurrence pattern to identify functionally related proteins. However, PP-based methods delivered satisfactory performance only on prokaryotes but not on eukaryotes. This study proposed a two-stage framework to predict protein functional linkages, which successfully enhances a PP-based method with machine learning. The experimental results show that the proposed two-stage framework achieved the best overall performance in comparison with three PP-based methods.
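For readers unfamiliar with phylogenetic profiling, the following Python sketch shows the basic idea: each protein is represented by a presence/absence vector across a set of genomes, and proteins with similar vectors are candidates for functional linkage. The Jaccard similarity used here is one common choice and is an assumption of this sketch; it is not the two-stage framework of the study, and the protein names and profiles are made up.

```python
def jaccard(profile_a: tuple[int, ...], profile_b: tuple[int, ...]) -> float:
    """Jaccard similarity of two presence (1) / absence (0) profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles over six reference genomes.
profiles = {
    "protA": (1, 1, 0, 1, 0, 1),
    "protB": (1, 1, 0, 1, 0, 0),
    "protC": (0, 0, 1, 0, 1, 0),
}
print(jaccard(profiles["protA"], profiles["protB"]))
print(jaccard(profiles["protA"], profiles["protC"]))
```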


Subject(s)
Artificial Intelligence , Phylogeny , Proteins/genetics , Proteins/metabolism , Algorithms , Area Under Curve , Computational Biology/methods , Escherichia coli/genetics , Escherichia coli/metabolism , Eukaryota/genetics , Eukaryota/metabolism , Molecular Sequence Annotation , Proteins/chemistry , Reproducibility of Results , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism
9.
Gene ; 518(1): 78-83, 2013 Apr 10.
Article in English | MEDLINE | ID: mdl-23276706

ABSTRACT

This work presents the Protein Association Analyzer (PRASA) (http://zoro.ee.ncku.edu.tw/prasa/) that predicts protein interactions as well as interaction types. Protein interactions are essential to most biological functions. The existence of diverse interaction types, such as physically contacted or functionally related interactions, makes protein interactions complex. Different interaction types are distinct and should not be confused. However, most existing tools focus on a specific interaction type or mix different interaction types. This work collected 7234058 associations with experimentally verified interaction types from five databases and compiled individual probabilistic models for different interaction types. The PRASA result page shows predicted associations and their related references by interaction type. Experimental results demonstrate the performance difference when distinguishing between different interaction types. The PRASA provides a centralized and organized platform for easy browsing, downloading and comparing of interaction types, which helps reveal insights into the complex roles that proteins play in organisms.


Subject(s)
Computational Biology/methods , Protein Interaction Mapping/methods , Artificial Intelligence , Humans , Internet , Metabolic Networks and Pathways , Models, Statistical , Proteins/genetics , Proteins/metabolism , Receptors, Notch/genetics , Receptors, Notch/metabolism , Smad Proteins/genetics , Smad Proteins/metabolism , User-Computer Interface , Yeasts/metabolism
11.
Gene ; 518(1): 26-34, 2013 Apr 10.
Article in English | MEDLINE | ID: mdl-23266802

ABSTRACT

The advance of high-throughput experimental technologies generates many gene sets with different biological meanings, where many important insights can only be extracted by identifying the biological (regulatory/functional) features that are distinct between different gene sets (e.g. essential vs. non-essential genes, TATA box-containing vs. TATA box-less genes, induced vs. repressed genes under certain biological conditions). Although many servers have been developed to identify enriched features in a gene set, most of them were designed to analyze one gene set at a time but cannot compare two gene sets. Moreover, the features used in existing servers were mainly focused on functional annotations (GO terms), pathways, transcription factor binding sites (TFBSs) and/or protein-protein interactions (PPIs). In yeast, various important regulatory features, including promoter bendability, nucleosome occupancy, 5'-UTR length, and TF-gene regulation evidence, are available but have not been used in any enrichment analysis servers. This motivates us to develop the Yeast Genes Analyzer (YGA), a web server that simultaneously analyzes various biological (regulatory/functional) features of two gene sets and performs statistical tests to identify the distinct features between them. Many well-studied gene sets such as essential, stress-response, TATA box-containing and cell cycle genes were pre-compiled in YGA for users, if they have only one gene set, to compare with. In comparison with the existing enrichment analysis servers, YGA tests more comprehensive regulatory features (e.g. promoter bendability, nucleosome occupancy, 5'-UTR length, experimental evidence of TF-gene binding and TF-gene regulation) and functional features (e.g. PPI, GO terms, pathways and functional groups of genes, including essential/non-essential genes, stress-induced/-repressed genes, TATA box-containing/-less genes, occupied/depleted proximal-nucleosome genes and cell cycle genes). 
Furthermore, YGA uses various statistical tests to provide objective comparison measures. The two major contributions of YGA, comprehensive features and statistical comparison, help to mine important information that cannot be obtained from other servers. The sophisticated analysis tools of YGA can identify distinct biological features between two gene sets, which help biologists to form new hypotheses about the underlying biological mechanisms responsible for the observed difference between these two gene sets. YGA can be accessed from the following web pages: http://cosbi.ee.ncku.edu.tw/yga/ and http://yga.ee.ncku.edu.tw/.


Subject(s)
Gene Expression Profiling/methods , Genes, Fungal , Software , Transcription Factors/genetics , Yeasts/genetics , 5' Untranslated Regions , Data Interpretation, Statistical , Genes, Essential , Nucleosomes/genetics , Promoter Regions, Genetic , TATA Box
12.
Bioinformatics ; 28(16): 2162-8, 2012 Aug 15.
Article in English | MEDLINE | ID: mdl-22753780

ABSTRACT

MOTIVATION: Determination of the binding affinity of a protein-ligand complex is important to quantitatively specify whether a particular small molecule will bind to the target protein. Besides, collection of comprehensive datasets for protein-ligand complexes and their corresponding binding affinities is crucial in developing accurate scoring functions for the prediction of the binding affinities of previously unknown protein-ligand complexes. In the past decades, several databases of protein-ligand-binding affinities have been created via visual extraction from literature. However, such approaches are time-consuming and most of these databases are updated only a few times per year. Hence, there is an immediate demand for an automatic extraction method with high precision for binding affinity collection. RESULT: We have created a new database of protein-ligand-binding affinity data, AutoBind, based on automatic information retrieval. We first compiled a collection of 1586 articles where the binding affinities have been marked manually. Based on this annotated collection, we designed four sentence patterns that are used to scan full-text articles as well as a scoring function to rank the sentences that match our patterns. The proposed sentence patterns can effectively identify the binding affinities in full-text articles. Our assessment shows that AutoBind achieved 84.22% precision and 79.07% recall on the testing corpus. Currently, 13 616 protein-ligand complexes and the corresponding binding affinities have been deposited in AutoBind from 17 221 articles. AVAILABILITY: AutoBind is automatically updated on a monthly basis, and it is freely available at http://autobind.csie.ncku.edu.tw/ and http://autobind.mc.ntu.edu.tw/. All of the deposited binding affinities have been refined and approved manually before being released.
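The precision and recall figures above follow the usual definitions, sketched here in Python. The counts in the example are hypothetical and only illustrate the formulas; the F1 score is derived in this sketch from the two reported percentages and is not a number stated in the abstract.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of extracted items that are correct."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of correct items that were extracted."""
    return true_pos / (true_pos + false_neg)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration of the definitions only.
print(precision(true_pos=80, false_pos=20), recall(true_pos=80, false_neg=20))

# F1 derived from the precision and recall reported in the abstract.
print(round(f1_score(0.8422, 0.7907), 4))
```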


Subject(s)
Databases, Factual , Information Storage and Retrieval/methods , Ligands , Protein Binding , Software , Algorithms , Computational Biology/methods
13.
Nucleic Acids Res ; 40(Web Server issue): W173-9, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22693214

ABSTRACT

By binding to short and highly conserved DNA sequences in genomes, DNA-binding proteins initiate, enhance or repress biological processes. Accurately identifying such binding sites, often represented by position weight matrices (PWMs), is an important step in understanding the control mechanisms of cells. When given coordinates of a DNA-binding domain (DBD) bound with DNA, a potential function can be used to estimate the change of binding affinity after base substitutions, where the changes can be summarized as a PWM. This technique provides an effective alternative when the chromatin immunoprecipitation data are unavailable for PWM inference. To facilitate the procedure of predicting PWMs based on protein-DNA complexes or even structures of the unbound state, the web server, DBD2BS, is presented in this study. The DBD2BS uses an atom-level knowledge-based potential function to predict PWMs characterizing the sequences to which the query DBD structure can bind. For unbound queries, a list of 1066 DBD-DNA complexes (including 1813 protein chains) is compiled for use as templates for synthesizing bound structures. The DBD2BS provides users with an easy-to-use interface for visualizing the PWMs predicted based on different templates and the spatial relationships of the query protein, the DBDs and the DNAs. The DBD2BS is the first attempt to predict PWMs of DBDs from unbound structures rather than from bound ones. This approach increases the number of existing protein structures that can be exploited when analyzing protein-DNA interactions. In a recent study, the authors showed that the kernel adopted by the DBD2BS can generate PWMs consistent with those obtained from the experimental data. The use of DBD2BS to predict PWMs can be incorporated with sequence-based methods to discover binding sites in genome-wide studies. Available at: http://dbd2bs.csie.ntu.edu.tw/, http://dbd2bs.csbb.ntu.edu.tw/, and http://dbd2bs.ee.ncku.edu.tw.
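One way to picture how per-base binding energy estimates can be summarized as a PWM is a Boltzmann-style normalization, sketched below in Python. This is an assumption made for illustration and is not claimed to be the DBD2BS procedure: the energies are invented, the temperature factor `kt` is arbitrary, and the function name is hypothetical.

```python
import math

BASES = "ACGT"


def energies_to_pwm(
    energies: list[dict[str, float]], kt: float = 1.0
) -> list[dict[str, float]]:
    """Convert per-position base energies into per-position probabilities.

    Lower energy means more favorable binding, so each base is weighted
    by exp(-E / kt) and the weights are normalized within each position.
    """
    pwm = []
    for column in energies:
        weights = {base: math.exp(-column[base] / kt) for base in BASES}
        total = sum(weights.values())
        pwm.append({base: weights[base] / total for base in BASES})
    return pwm


# Hypothetical energies for a two-position site.
example = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0},   # no base preference
    {"A": -1.0, "C": 0.5, "G": 0.5, "T": 0.5},  # A is favored
]
for column in energies_to_pwm(example):
    print({base: round(prob, 3) for base, prob in column.items()})
```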


Subject(s)
DNA-Binding Proteins/chemistry , Software , Binding Sites , Cyclic AMP Receptor Protein/chemistry , Cyclic AMP Receptor Protein/metabolism , DNA/chemistry , DNA/metabolism , DNA-Binding Proteins/metabolism , Internet , Position-Specific Scoring Matrices , Protein Structure, Tertiary , User-Computer Interface
14.
BMC Genomics ; 13 Suppl 1: S11, 2012.
Article in English | MEDLINE | ID: mdl-22369481

ABSTRACT

BACKGROUND: Head-to-head (h2h) genes tend to be associated in expression and function and have been shown to be conserved in evolution. There are currently many studies on such h2h gene pairs. We found that previous studies focused heavily on the human genome. Furthermore, they were limited to analyses that require only gene or protein sequences and did not systematically investigate other promoter features such as the binding evidence of specific transcription factors (TFs). This is mainly because the resources for higher organisms, though these organisms are of greater interest, are less complete than those for model organisms such as Saccharomyces cerevisiae. The authors of this study recently integrated nine promoter features of 6603 genes of S. cerevisiae from six databases and five papers. These resources are suitable for a comprehensive analysis of h2h genes in S. cerevisiae. RESULTS: This study analyzed various promoter features, including transcription boundaries (TSS, 5'UTR and 3'UTR), TATA box, TF binding evidence, TF regulation evidence, DNA bendability and nucleosome occupancy. The expression profiles and gene ontology (GO) annotations were used to measure whether two genes are associated. Based on these promoter features, we found that i) the frequency of h2h genes was close to the expectation, namely they were not unusually frequent in the genome; ii) the distance between the TSSs of most h2h genes fell into the range of 0-600 bp and was more concentrated in 0-200 bp for the highly associated ones; iii) the number of TFs that regulate both h2h genes influenced the co-expression and co-function of the genes, while the number of TFs that bind both h2h genes influenced only the co-expression of the genes; iv) the association of two h2h genes was influenced by the existence of specific TFs such as STP2; v) the association of h2h genes whose bidirectional promoters have no TATA box was slightly higher than that of those whose promoters have TATA boxes; vi) the association of two h2h genes was not influenced by the DNA bendability and nucleosome occupancy. CONCLUSIONS: This study analyzed h2h genes with various promoter features that have not been used in analyzing h2h genes. The results can be applied to other genomes to confirm whether the observations of this study are limited to S. cerevisiae or universal in most organisms.
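The notion of a head-to-head pair and its TSS distance can be sketched as follows in Python. The sketch assumes all genes lie on one chromosome, treats the given coordinates as transcript boundaries, and calls two adjacent non-overlapping genes head-to-head when the left one is on the minus strand and the right one on the plus strand. The `Gene` type, the function name and the toy coordinates are hypothetical and do not come from the study.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    strand: str  # "+" or "-"
    start: int   # leftmost transcript coordinate
    end: int     # rightmost transcript coordinate

    @property
    def tss(self) -> int:
        """Transcription start site: left end on "+", right end on "-"."""
        return self.start if self.strand == "+" else self.end


def head_to_head_pairs(genes: list[Gene]) -> list[tuple[str, str, int]]:
    """Adjacent divergent gene pairs with the distance between their TSSs."""
    ordered = sorted(genes, key=lambda gene: gene.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        divergent = left.strand == "-" and right.strand == "+"
        if divergent and left.end < right.start:
            pairs.append((left.name, right.name, right.tss - left.tss))
    return pairs


# Toy genes on a single chromosome (hypothetical coordinates).
genes = [
    Gene("g1", "-", 100, 500),
    Gene("g2", "+", 650, 1200),
    Gene("g3", "+", 1500, 2000),
    Gene("g4", "-", 2100, 2600),
]
print(head_to_head_pairs(genes))
```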


Subject(s)
Promoter Regions, Genetic/genetics , Saccharomyces cerevisiae Proteins/genetics , Genome, Fungal/genetics , Transcription Factors/genetics
15.
PLoS One ; 7(2): e30446, 2012.
Article in English | MEDLINE | ID: mdl-22312425

ABSTRACT

DNA-binding proteins such as transcription factors use DNA-binding domains (DBDs) to bind to specific sequences in the genome to initiate many important biological functions. Accurate prediction of such target sequences, often represented by position weight matrices (PWMs), is an important step to understand many biological processes. Recent studies have shown that knowledge-based potential functions can be applied on protein-DNA co-crystallized structures to generate PWMs that are considerably consistent with experimental data. However, this success has not been extended to DNA-binding proteins lacking co-crystallized structures. This study aims at investigating the possibility of predicting the DNA sequences bound by DNA-binding proteins from the proteins' unbound structures (structures of the unbound state). Given an unbound query protein and a template complex, the proposed method first employs structure alignment to generate synthetic protein-DNA complexes for the query protein. Once a complex is available, an atomic-level knowledge-based potential function is employed to predict PWMs characterizing the sequences to which the query protein can bind. The evaluation of the proposed method is based on seven DNA-binding proteins, which have structures of both DNA-bound and unbound forms for prediction as well as annotated PWMs for validation. Since this work is the first attempt to predict target sequences of DNA-binding proteins from their unbound structures, three types of structural variations that presumably influence the prediction accuracy were examined and discussed. Based on the analyses conducted in this study, the conformational change of proteins upon binding DNA was shown to be the key factor. 
This study sheds light on the challenge of predicting the target DNA sequences of a protein lacking co-crystallized structures, which encourages more efforts on the structure alignment-based approaches in addition to docking- and homology modeling-based approaches for generating synthetic complexes.


Subject(s)
Computational Biology/methods , DNA-Binding Proteins/metabolism , DNA/genetics , DNA/metabolism , Animals , Base Sequence , DNA/chemistry , DNA-Binding Proteins/chemistry , Databases, Protein , Humans , Internet , Reproducibility of Results
16.
Nucleic Acids Res ; 40(Database issue): D472-8, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22084200

ABSTRACT

This work presents the Apo-Holo DataBase (AH-DB, http://ahdb.ee.ncku.edu.tw/ and http://ahdb.csbb.ntu.edu.tw/), which provides corresponding pairs of protein structures before and after binding. Conformational transitions are commonly observed in various protein interactions that are involved in important biological functions. For example, copper-zinc superoxide dismutase (SOD1), which destroys free superoxide radicals in the body, undergoes a large conformational transition from an 'open' state (apo structure) to a 'closed' state (holo structure). Many studies have utilized collections of apo-holo structure pairs to investigate the conformational transitions and critical residues. However, the collection process is usually complicated, varies from study to study and produces a small-scale data set. AH-DB is designed to provide an easy and unified way to prepare such data, which is generated by identifying/mapping molecules in different Protein Data Bank (PDB) entries. Conformational transitions are identified based on a refined alignment scheme to overcome the challenge that many structures in the PDB database are only protein fragments and not complete proteins. There are 746,314 apo-holo pairs in AH-DB, which is about 30 times those in the second largest collection of similar data. AH-DB provides sophisticated interfaces for searching apo-holo structure pairs and exploring conformational transitions from apo structures to the corresponding holo structures.


Subject(s)
Databases, Protein , Protein Conformation , Models, Molecular , Protein Binding , Superoxide Dismutase/chemistry , Superoxide Dismutase-1 , User-Computer Interface
17.
BMC Bioinformatics ; 12 Suppl 1: S32, 2011 Feb 15.
Article in English | MEDLINE | ID: mdl-21342563

ABSTRACT

BACKGROUND: A common assumption about enzyme active sites is that their structures are highly conserved to specifically distinguish between closely similar compounds. However, with the discovery of distinct enzymes with similar reaction chemistries, more and more studies discussing the structural flexibility of the active site have been conducted. RESULTS: Most existing work on the flexibility of active sites focuses on a set of pre-selected active sites that were already known to be flexible. This study, on the other hand, proposes an analysis framework composed of a new data collecting strategy, a local structure alignment tool and several physicochemical measures derived from the alignments. The method proposed to identify flexible active sites is highly automated and robust so that more extensive studies will be feasible in the future. The experimental results show the proposed method is (a) consistent with previous works based on manually identified flexible active sites and (b) capable of identifying potentially new flexible active sites. CONCLUSIONS: The proposed analysis framework and the former analyses of flexibility each have their own advantages and disadvantages, depending on the cause of the flexibility. In this regard, this study proposes an alternative that complements previous studies and helps to construct a more comprehensive view of the flexibility of enzyme active sites.


Subject(s)
Catalytic Domain , Enzymes/chemistry , Algorithms , Binding Sites , Computational Biology/methods , Protein Conformation , Sequence Alignment , Sequence Analysis, Protein , Structure-Activity Relationship
18.
Nucleic Acids Res ; 39(Database issue): D647-52, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21045055

ABSTRACT

This study presents the Yeast Promoter Atlas (YPA, http://ypa.ee.ncku.edu.tw/ or http://ypa.csbb.ntu.edu.tw/) database, which aims to collect comprehensive promoter features in Saccharomyces cerevisiae. YPA integrates nine kinds of promoter features: promoter sequences, genes' transcription boundaries [transcription start sites (TSSs), five prime untranslated regions (5'-UTRs) and three prime untranslated regions (3'-UTRs)], TATA boxes, transcription factor binding sites (TFBSs), nucleosome occupancy, DNA bendability, transcription factor (TF) binding, TF knockout expression and TF-TF physical interaction. YPA is designed to present data in a unified manner, as many important observations are revealed only when these promoter features are considered together. For example, DNA rigidity can prevent nucleosome packaging, thereby making TFBSs in the rigid DNA regions more accessible to TFs. Integrating nucleosome occupancy, DNA bendability, TF binding, TF knockout expression and TFBS data helps to identify which TFBSs are actually functional. In YPA, various promoter features can be accessed on a centralized and organized platform. Researchers can easily see whether the TFBSs in a promoter of interest are occupied by nucleosomes or located in a rigid DNA segment, and whether the expression of the downstream gene responds to the knockout of the corresponding TFs. Compared to other established yeast promoter databases, YPA collects not only TFBSs but also many other promoter features to help biologists study transcriptional regulation.


Subject(s)
Databases, Nucleic Acid , Promoter Regions, Genetic , Saccharomyces cerevisiae/genetics , Binding Sites , Systems Integration , Transcription Factors/metabolism , User-Computer Interface
19.
BMC Bioinformatics ; 11: 167, 2010 Apr 02.
Article in English | MEDLINE | ID: mdl-20361868

ABSTRACT

BACKGROUND: Elucidating protein-protein interactions (PPIs) is essential to constructing protein interaction networks and facilitating our understanding of the general principles of biological systems. Previous studies have revealed that interacting protein pairs can be predicted from their primary structure. Most of these approaches have achieved satisfactory performance on datasets comprising equal numbers of interacting and non-interacting protein pairs. However, this ratio is highly unbalanced in nature, and these techniques have not been comprehensively evaluated with respect to the effect of the large number of non-interacting pairs in realistic datasets. Moreover, since highly unbalanced distributions usually lead to large datasets, more efficient predictors are desired for handling such challenging tasks. RESULTS: This study presents a method for PPI prediction based only on sequence information, and makes three contributions. First, we propose a probability-based mechanism for transforming protein sequences into feature vectors. Second, the proposed predictor is designed with an efficient classification algorithm, where efficiency is essential for handling highly unbalanced datasets. Third, the proposed PPI predictor is assessed on several unbalanced datasets with different positive-to-negative ratios (from 1:1 to 1:15). This analysis provides solid evidence that the degree of dataset imbalance is important to PPI predictors. CONCLUSIONS: Dealing with data imbalance is a key issue in PPI prediction since there are far fewer interacting protein pairs than non-interacting ones. This article provides a comprehensive study of this issue and develops a practical tool that achieves both good prediction performance and efficiency using only protein sequence information.


Subject(s)
Protein Interaction Mapping/methods , Proteins/chemistry , Proteomics/methods , Amino Acid Sequence , Binding Sites , Databases, Protein , Proteins/metabolism , Sequence Analysis, Protein
20.
BMC Bioinformatics ; 11 Suppl 1: S3, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122202

ABSTRACT

BACKGROUND: Many biological functions involve various protein-protein interactions (PPIs). Elucidating such interactions is crucial for understanding general principles of cellular systems. Previous studies have shown the potential of predicting PPIs based only on sequence information. Compared to approaches that require other auxiliary information, these sequence-based approaches can be applied to a broader range of applications. RESULTS: This study presents a novel sequence-based method built on the assumption that protein-protein interactions are more related to amino acids at the surface than to those at the core. The present method considers surface information and maintains the advantage of relying only on sequence data by including an accessible surface area (ASA) predictor recently proposed by the authors. This study also reports the experiments conducted to evaluate (a) the performance of PPI prediction achieved by including the predicted surface and (b) the quality of the predicted surface in comparison with the surface obtained from structures. The experimental results show that surface information helps to predict interacting protein pairs. Furthermore, the prediction performance achieved by using the surface estimated with the ASA predictor is close to that achieved by using the surface obtained from protein structures. CONCLUSION: This work presents a sequence-based method that takes into account surface information for predicting PPIs. The proposed procedure of surface identification improves the prediction performance, with an F-measure improvement of 5.1%. The extracted surfaces are also valuable in other biomedical applications that require similar information.


Subject(s)
Amino Acid Sequence , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Binding Sites , Databases, Protein , Models, Molecular , Proteomics/methods , Sequence Analysis, Protein/methods , Structure-Activity Relationship