Results 1 - 20 of 32
1.
Protein Pept Lett ; 27(3): 178-186, 2020.
Article in English | MEDLINE | ID: mdl-31577193

ABSTRACT

BACKGROUND: N-Glycosylation is one of the most important post-translational mechanisms in eukaryotes. N-glycosylation predominantly occurs in the N-X-[S/T] sequon, where X is any amino acid other than proline. However, not all N-X-[S/T] sequons in proteins are glycosylated. Therefore, accurate prediction of N-glycosylation sites is essential to understand the N-glycosylation mechanism. OBJECTIVE: In this article, our motivation is to develop a computational method to predict N-glycosylation sites in eukaryotic protein sequences. METHODS: We report a random forest method, Nglyc, to predict N-glycosylation sites from protein sequence using 315 sequence features. The method was trained using a dataset of 600 N-glycosylation sites and 600 non-glycosylation sites and tested on a dataset containing 295 N-glycosylation sites and 253 non-glycosylation sites. Nglyc prediction was compared with the NetNGlyc, EnsembleGly and GPP methods. Further, the performance of Nglyc was evaluated using human and mouse N-glycosylation sites. RESULT: Nglyc achieved an overall training accuracy of 0.8033 with all 315 features. Performance comparison with NetNGlyc, EnsembleGly and GPP shows that Nglyc performs better than the other methods, with high sensitivity and specificity. CONCLUSION: Our method achieved an overall accuracy of 0.8248 with 0.8305 sensitivity and 0.8182 specificity. The comparison study shows that our method performs better than the other methods. The applicability and success of our method were further evaluated using human and mouse N-glycosylation sites. Nglyc is freely available at https://github.com/bioinformaticsML/ Ngly.


Subject(s)
Computational Biology/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Animals , Databases, Protein , Glycosylation , Humans , Mice , Software
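The N-X-[S/T] sequon rule stated in the abstract above can be illustrated with a short scan. This is a minimal sketch, not the Nglyc method itself: it only lists candidate sites, which a predictor would then classify. The function name `find_sequons` and the example sequence are hypothetical.

```python
import re

# N-X-[S/T] where X is any residue except proline.
# The lookahead keeps the match one residue wide, so overlapping
# sequons (e.g. "NNST") are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of Asn residues that start an N-X-[S/T] sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]

if __name__ == "__main__":
    example = "MNGTANPSQNNSTK"  # hypothetical sequence, for illustration only
    print(find_sequons(example))  # [2, 10, 11]; the N at position 6 is followed by P
```

As the abstract notes, not every sequon found this way is actually glycosylated, so the output is a candidate list rather than a prediction.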
2.
Oncotarget ; 7(33): 52832-52848, 2016 08 16.
Article in English | MEDLINE | ID: mdl-27391159

ABSTRACT

The actin-binding protein gelsolin is a well-known regulator of cancer cell invasion. However, the mechanisms by which gelsolin promotes invasion are not well established. As reactive oxygen species (ROS) have been shown to promote cancer cell invasion, we investigated the hypothesis that gelsolin-induced changes in ROS levels may mediate the invasive capacity of colon cancer cells. Herein, we show that increased gelsolin enhances the invasive capacity of colon cancer cells, and that this is mediated via gelsolin's effects in elevating intracellular superoxide (O2.-) levels. We also provide evidence for a novel physical interaction between gelsolin and Cu/ZnSOD that inhibits the enzymatic activity of Cu/ZnSOD, thereby resulting in a sustained elevation of intracellular O2.-. Using microarray data of human colorectal cancer tissues from Gene Omnibus, we found that gelsolin gene expression positively correlates with urokinase plasminogen activator (uPA), an important matrix-degrading protease involved in cancer invasion. Consistent with the in vivo evidence, we show that increased levels of O2.- induced by gelsolin overexpression trigger the secretion of uPA. We further observed a reduction in invasion and intracellular O2.- levels in colon cancer cells as a consequence of gelsolin knockdown using two different siRNAs. In these cells, concurrent repression of Cu/ZnSOD restored intracellular O2.- levels and rescued invasive capacity. Our study therefore identified gelsolin as a novel regulator of intracellular O2.- in cancer cells via interacting with Cu/ZnSOD and inhibiting its enzymatic activity. Taken together, these findings provide insight into a novel function of gelsolin in promoting tumor invasion by directly impacting the cellular redox milieu.


Subject(s)
Gelsolin/metabolism , Reactive Oxygen Species/metabolism , Superoxide Dismutase-1/metabolism , Urokinase-Type Plasminogen Activator/metabolism , Caco-2 Cells , Cell Line, Tumor , Colonic Neoplasms/genetics , Colonic Neoplasms/metabolism , Colonic Neoplasms/pathology , Gelsolin/chemistry , Gelsolin/genetics , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , HCT116 Cells , HeLa Cells , Hep G2 Cells , Humans , Models, Molecular , Neoplasm Invasiveness , Protein Binding , Protein Domains , RNA Interference , Superoxide Dismutase-1/chemistry , Superoxide Dismutase-1/genetics , Urokinase-Type Plasminogen Activator/genetics
3.
FEBS Open Bio ; 4: 533-41, 2014.
Article in English | MEDLINE | ID: mdl-25009768

ABSTRACT

Sugarcane is an important tropical cash crop meeting 75% of world sugar demand, and it is fast becoming an energy crop for the production of bio-fuel ethanol. A considerable area under sugarcane is prone to waterlogging, which adversely affects both cane productivity and quality. In an effort to elucidate the genes underlying plant responses to waterlogging, a subtractive cDNA library was prepared from leaf tissue. cDNA clones were sequenced and annotated for their putative functions. Major groups of ESTs were related to stress (15%), catalytic activity (13%), cell growth (10%) and transport-related proteins (6%). A few stress-related genes were identified, including senescence-associated protein, dehydration-responsive family protein, and heat shock cognate 70 kDa protein. A bioinformatics search was carried out to discover novel microRNAs (miRNAs) that can be regulated in sugarcane plants subjected to waterlogging stress. Taking advantage of the presence of miRNA precursors in the related sorghum genome, seven candidate mature miRNAs were identified in sugarcane. The application of subtraction technology allowed the identification of differentially expressed sequences and novel miRNAs in sugarcane under waterlogging stress. The comparative global transcript profiling in sugarcane plants undertaken in this study suggests that proteins associated with stress response, signal transduction, metabolic activity and ion transport play an important role in conferring waterlogging tolerance in sugarcane.

4.
Int J Immunogenet ; 41(1): 74-80, 2014 Feb.
Article in English | MEDLINE | ID: mdl-23800159

ABSTRACT

Granulocyte-macrophage colony-stimulating factor (GM-CSF) is a cytokine that is essential for growth and development of progenitors of granulocytes and monocytes/macrophages. In this study, we report molecular cloning, sequencing and characterization of GM-CSF from the Indian water buffalo, Bubalus bubalis. In addition, we performed sequence and structural analysis of buffalo GM-CSF. Buffalo GM-CSF was compared with 17 mammalian GM-CSFs using multiple sequence alignment and a phylogenetic tree. A three-dimensional model of the buffalo GM-CSF and human receptor complex was built using homology modelling to study cross-reactivity between the two species. Detailed analysis was performed to study the GM-CSF interface and the various interactions at the interface.


Subject(s)
Buffaloes/genetics , Granulocyte-Macrophage Colony-Stimulating Factor/genetics , Animals , Cloning, Molecular , Granulocyte-Macrophage Colony-Stimulating Factor/chemistry , Sequence Analysis, DNA
5.
PLoS One ; 8(4): e60774, 2013.
Article in English | MEDLINE | ID: mdl-23593307

ABSTRACT

Although RNA silencing has been studied primarily in model plants, advances in high-throughput sequencing technologies have enabled profiling of the small RNA components of many more plant species, providing insights into the ubiquity and conservatism of some miRNA-based regulatory mechanisms. Small RNAs of 20 to 24 nucleotides (nt) are important regulators of gene transcript levels by either transcriptional or posttranscriptional gene silencing, contributing to genome maintenance and controlling a variety of developmental and physiological processes. Here, we used deep sequencing and molecular methods to create an inventory of the small RNAs in the mangrove species Avicennia marina. We identified 26 novel mangrove miRNAs and 193 conserved miRNAs belonging to 36 families. We determined that 2 of the novel miRNAs were produced from known miRNA precursors and 4 were likely to be species-specific, by the criterion that we found no homologs in other plant species. We used qRT-PCR to analyze the expression of miRNAs and their target genes in different tissue sets, and some demonstrated tissue-specific expression. Furthermore, we predicted potential targets of these putative miRNAs based on sequence homology and experimentally validated them through endonucleolytic cleavage assays. Our results suggested that expression profiles of miRNAs and their predicted targets could be useful in exploring the significance of the conservation patterns of plants, particularly in response to abiotic stress. Because of their well-developed abilities in this regard, mangroves and other extremophiles are excellent models for such exploration.


Subject(s)
Avicennia/genetics , Avicennia/physiology , High-Throughput Nucleotide Sequencing , MicroRNAs/genetics , RNA, Plant/genetics , Sequence Analysis, RNA , Stress, Physiological/genetics , Base Sequence , Conserved Sequence , Molecular Sequence Annotation , Molecular Sequence Data , RNA Cleavage , RNA, Small Interfering/genetics , Transcriptome
6.
J Theor Biol ; 317: 377-83, 2013 Jan 21.
Article in English | MEDLINE | ID: mdl-23123454

ABSTRACT

The extracellular matrix (ECM) is a major component of tissues of multicellular organisms. It consists of secreted macromolecules, mainly polysaccharides and glycoproteins. Malfunctions of ECM proteins lead to severe disorders such as Marfan syndrome, osteogenesis imperfecta, numerous chondrodysplasias, and skin diseases. In this work, we report a random forest approach, EcmPred, for the prediction of ECM proteins from protein sequences. EcmPred was trained on a dataset containing 300 ECM and 300 non-ECM proteins and tested on a dataset containing 145 ECM and 4187 non-ECM proteins. EcmPred achieved 83% accuracy on the training dataset and 77% on the test dataset. EcmPred predicted 15 out of 20 experimentally verified ECM proteins. By scanning the entire human proteome, we predicted novel ECM proteins validated with gene ontology and InterPro. The dataset and standalone version of the EcmPred software are available at http://www.inb.uni-luebeck.de/tools-demos/Extracellular_matrix_proteins/EcmPred.


Subject(s)
Algorithms , Computational Biology/methods , Extracellular Matrix Proteins/metabolism , Artificial Intelligence , Databases, Protein , Humans , Proteome/metabolism , ROC Curve
7.
Protein Pept Lett ; 19(1): 50-6, 2012 Jan.
Article in English | MEDLINE | ID: mdl-21919860

ABSTRACT

Prediction of protein structure from its amino acid sequence is still a challenging problem. A complete physicochemical understanding of protein folding is essential for accurate structure prediction. Knowledge of residue solvent accessibility gives useful insights into protein structure prediction and function prediction. In this work, we propose a random forest method, RSARF, to predict residue accessible surface area from protein sequence information. Training and testing were performed using 120 proteins containing 22006 residues. For each residue, the buried or exposed state was computed using five thresholds (0%, 5%, 10%, 25%, and 50%). The prediction accuracies for the 0%, 5%, 10%, 25%, and 50% thresholds are 72.9%, 78.25%, 78.12%, 77.57% and 72.07%, respectively. Further, comparison of RSARF with other methods using a benchmark dataset containing 20 proteins shows that our approach is useful for prediction of residue solvent accessibility from protein sequence without using structural information. The RSARF program, datasets and supplementary data are available at http://caps.ncbs.res.in/download/pugal/RSARF/.


Subject(s)
Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Algorithms , Amino Acid Sequence , Computational Biology , Computer Simulation , Crystallography, X-Ray , Databases, Protein , Hydrophobic and Hydrophilic Interactions , Molecular Sequence Data , Predictive Value of Tests , Protein Conformation , Protein Folding , Solvents/chemistry
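The two-state labelling at five thresholds described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the RSARF code: it assumes relative solvent accessibility is given as a percentage and that a residue counts as exposed when its value is strictly above the threshold. The function name and the input values are hypothetical.

```python
# Thresholds (in percent) taken from the abstract.
THRESHOLDS = (0, 5, 10, 25, 50)

def two_state_labels(rsa_percent, threshold):
    """Label each residue 'E' (exposed) if its RSA exceeds the threshold, else 'B' (buried).

    Assumption: strict inequality, so at the 0% threshold only fully
    inaccessible residues are labelled buried.
    """
    return ["E" if rsa > threshold else "B" for rsa in rsa_percent]

if __name__ == "__main__":
    rsa = [0.0, 3.2, 12.5, 40.0, 75.0]  # hypothetical per-residue RSA values
    for t in THRESHOLDS:
        print(f"{t:>2}%  {''.join(two_state_labels(rsa, t))}")
```

Each threshold yields a separate binary classification problem, which is consistent with the abstract reporting one accuracy per threshold.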
8.
Database (Oxford) ; 2011: bar042, 2011.
Article in English | MEDLINE | ID: mdl-21959866

ABSTRACT

Three-dimensional domain swapping is a unique protein structural phenomenon where two or more protein chains in a protein oligomer share a common structural segment between individual chains. This phenomenon is observed in an array of protein structures in oligomeric conformation. Protein structures in swapped conformations perform diverse functional roles and are also associated with deposition diseases in humans. We have performed in-depth literature curation and structural bioinformatics analyses to develop an integrated knowledgebase of proteins involved in 3D domain swapping. The hallmark of 3D domain swapping is the presence of distinct structural segments such as the hinge and swapped regions. We have curated the literature to delineate the boundaries of these regions. In addition, we have defined several new concepts, such as the 'secondary major interface' to represent the interface properties arising as a result of 3D domain swapping, and a new quantitative measure for the 'extent of swapping' in structures. The catalog of proteins reported in the 3DSwap knowledgebase has been generated using an integrated structural bioinformatics workflow of database searches, literature curation, structure visualization and sequence-structure-function analyses. The current version of the 3DSwap knowledgebase reports 293 protein structures; the analysis of such a compendium of protein structures will further the understanding of the molecular factors driving 3D domain swapping.


Subject(s)
Computational Biology/methods , Database Management Systems , Databases, Protein , Protein Structure, Tertiary , Proteins/chemistry , Animals , Cattle , Humans , Models, Molecular , Molecular Sequence Annotation , Protein Conformation , User-Computer Interface
9.
BMC Bioinformatics ; 12: 345, 2011 Aug 17.
Article in English | MEDLINE | ID: mdl-21849049

ABSTRACT

BACKGROUND: Bioluminescence is a process in which light is emitted by a living organism. Most creatures that emit light are sea creatures, but some insects, plants, fungi, etc., also emit light. The biotechnological application of bioluminescence has become routine and is considered essential for many medical and general technological advances. Identification of bioluminescent proteins is challenging due to their poor similarity in sequence. So far, no specific method has been reported to identify bioluminescent proteins from primary sequence. RESULTS: In this paper, we propose a novel predictive method, BLProt, that uses a Support Vector Machine (SVM) and physicochemical properties to predict bioluminescent proteins. BLProt was trained using a dataset consisting of 300 bioluminescent proteins and 300 non-bioluminescent proteins, and evaluated on an independent set of 141 bioluminescent proteins and 18202 non-bioluminescent proteins. To identify the most prominent features, we carried out feature selection with three different filter approaches: ReliefF, infogain, and mRMR. We selected five different feature subsets by decreasing the number of features, and the performance of each feature subset was evaluated. CONCLUSION: BLProt achieves 80% accuracy from training (5-fold cross-validation) and 80.06% accuracy from testing. The performance of BLProt was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggest that BLProt can be a useful approach to identify bioluminescent proteins from sequence information, irrespective of their sequence similarity. The BLProt software is available at http://www.inb.uni-luebeck.de/tools-demos/bioluminescent%20protein/BLProt.


Subject(s)
Luminescent Proteins/chemistry , Software , Support Vector Machine , Animals , Humans , Markov Chains
10.
Protein Pept Lett ; 18(10): 1010-20, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21592079

ABSTRACT

3D domain swapping is a protein structural phenomenon that mediates the formation of higher order oligomers in a variety of proteins with different structural and functional properties. 3D domain swapping is associated with a variety of biological functions ranging from oligomerization to pathological conformational diseases. 3D domain swapping is recognised only after structure determination, when the protein is observed in the swapped conformation in the oligomeric state. This is a limiting step in understanding this important structural phenomenon on a large scale from the growing sequence data. A new machine learning approach, 3dswap-pred, has been developed for the prediction of 3D domain swapping in protein structures from sequence data alone using the Random Forest approach. 3dswap-pred is implemented using a positive sequence dataset derived from literature-based structural curation of 297 structures. A negative sequence dataset is obtained from 462 SCOP domains using a new sequence data mining approach and a set of 126 sequence-derived features. Statistical validation using an independent dataset of 68 positive sequences and 313 negative sequences revealed that 3dswap-pred achieved an accuracy of 63.8%. A webserver is also implemented using the 3dswap-pred Random Forest model. The server is available from the URL: http://caps.ncbs.res.in/3dswap-pred.


Subject(s)
Proteins/chemistry , Algorithms , Protein Structure, Secondary , Protein Structure, Tertiary
11.
J Theor Biol ; 270(1): 56-62, 2011 Feb 07.
Article in English | MEDLINE | ID: mdl-21056045

ABSTRACT

Some creatures living in extremely low temperatures can produce special materials called "antifreeze proteins" (AFPs), which can prevent the cell and body fluids from freezing. AFPs are present in vertebrates, invertebrates, plants, bacteria, fungi, etc. Although AFPs have a common function, they show a high degree of diversity in sequences and structures. Therefore, sequence similarity based search methods often fail to predict AFPs from sequence databases. In this work, we report a random forest approach, "AFP-Pred", for the prediction of antifreeze proteins from protein sequence. AFP-Pred was trained on a dataset containing 300 AFPs and 300 non-AFPs and tested on a dataset containing 181 AFPs and 9193 non-AFPs. AFP-Pred achieved 81.33% accuracy from training and 83.38% from testing. The performance of AFP-Pred was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggest that AFP-Pred can be a useful approach to identify antifreeze proteins from sequence information, irrespective of their sequence similarity.


Subject(s)
Algorithms , Amino Acid Sequence/genetics , Antifreeze Proteins/analysis , Computational Biology/methods , Proteins/classification , Amino Acids/chemistry , Antifreeze Proteins/genetics , Artificial Intelligence , Chemical Phenomena , Protein Structure, Secondary/genetics , Protein Structure, Tertiary/genetics , Proteins/genetics , ROC Curve
12.
J Biomol Struct Dyn ; 28(3): 405-14, 2010 Dec.
Article in English | MEDLINE | ID: mdl-20919755

ABSTRACT

Knowledge of three-dimensional structure is essential to understand the function of a protein. Although the overall fold is determined by the full details of its sequence, a small group of residues, often called structural motifs, plays a crucial role in determining the protein fold and its stability. Identification of such structural motifs requires a sufficient number of sequence and structural homologs to define conservation and evolutionary information. Unfortunately, many structures in the protein structure databases have no homologous structures or sequences. In this work, we report an SVM method, SMpred, to identify structural motifs from a single protein structure without using sequence and structural homologs. SMpred was trained and tested using 132 protein domains containing 581 motifs. SMpred achieved 78.79% accuracy with 79.06% sensitivity and 78.53% specificity. The performance of SMpred was evaluated with MegaMotifBase using 188 proteins containing 1161 motifs. Out of 1161 motifs, SMpred correctly identified 1503 structural motifs reported in MegaMotifBase. Further, we showed that SMpred is a useful approach for length-deviant superfamilies and single-member superfamilies. This result suggests the usefulness of our approach for facilitating the identification of structural motifs in protein structures in the absence of sequence and structural homologs. The dataset and executable for the SMpred algorithm are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/SMpred.htm.


Subject(s)
Amino Acid Motifs , Databases, Protein , Evolution, Molecular , Protein Conformation , Proteins/chemistry , Software , Amino Acid Sequence , Models, Molecular , Proteins/classification , Proteins/genetics , Sequence Alignment/methods
13.
Protein Pept Lett ; 17(12): 1473-9, 2010 Dec.
Article in English | MEDLINE | ID: mdl-20666727

ABSTRACT

Apoptosis is an essential process for controlling tissue homeostasis by regulating a physiological balance between cell proliferation and cell death. The subcellular locations of proteins performing the cell death are determined by mostly independent cellular mechanisms. The regular bioinformatics tools for predicting the subcellular locations of such apoptotic proteins often fail. This work proposes a model for the sorting of proteins that are involved in apoptosis, allowing us to predict both their subcellular locations and the molecular properties that contribute to them. We report a novel hybrid Genetic Algorithm (GA)/Support Vector Machine (SVM) approach to predict apoptotic protein sequences using 119 sequence-derived properties such as frequency of amino acid groups, secondary structure, and physicochemical properties. GA is used for selecting a near-optimal subset of informative features that is most relevant for the classification. Jackknife cross-validation is applied to test the predictive capability of the proposed method on 317 apoptosis proteins. Our method achieved 85.80% accuracy using all 119 features and 89.91% accuracy for 25 features selected by GA. Our models were examined on a test dataset of 98 apoptosis proteins and obtained an overall accuracy of 90.34%. The results show that the proposed approach is promising; it is able to select small subsets of features and still improve the classification accuracy. Our model can contribute to the understanding of programmed cell death and drug discovery. The software and dataset are available at http://www.inb.uni-luebeck.de/tools-demos/apoptosis/GASVM.


Subject(s)
Apoptosis Regulatory Proteins/chemistry , Algorithms , Artificial Intelligence , Protein Transport
14.
Bioinform Biol Insights ; 4: 33-42, 2010 Jun 17.
Article in English | MEDLINE | ID: mdl-20634983

ABSTRACT

3-dimensional domain swapping is a mechanism where two or more protein molecules form higher order oligomers by exchanging identical or similar subunits. Recently, this phenomenon has received much attention in the context of prions and neurodegenerative diseases, due to its role in functional regulation, formation of higher oligomers, protein misfolding, aggregation, etc. While the 3-dimensional domain swap mechanism can be detected from three-dimensional structures, it remains a formidable challenge to derive common sequence or structural patterns from proteins involved in swapping. We have developed an SVM-based classifier to predict domain swapping events using a set of features derived from sequence and structural data. The SVM classifier was trained on features derived from 150 proteins reported to be involved in 3D domain swapping and 150 proteins not known to be involved in swapped conformation or related to proteins involved in the swapping phenomenon. The testing was performed using 63 proteins from the positive dataset and 63 proteins from the negative dataset. We obtained 76.33% accuracy from training and 73.81% accuracy from testing. Given the high diversity in the sequence, structure and functions of proteins involved in domain swapping, the availability of such an algorithm to predict swapping events from sequence and structure-derived features will be an initial step towards identification of more putative proteins that may be involved in swapping or in deposition disease. Further, the top features emerging from our feature selection method may be analysed further to understand their roles in the mechanism of domain swapping.

15.
Amino Acids ; 39(5): 1385-91, 2010 Nov.
Article in English | MEDLINE | ID: mdl-20411285

ABSTRACT

Real-world datasets commonly have issues with data imbalance. There are several approaches such as weighting, sub-sampling, and data modeling for handling these data. Learning in the presence of data imbalances presents a great challenge to machine learning. Techniques such as support-vector machines have excellent performance for balanced data, but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated in the context of several real biological imbalanced data. The ratios of negative to positive samples vary from ~9:1 to ~100:1. Useful classifiers have high sensitivity and specificity. Our results demonstrate that the proposed selection technique improves the sensitivity compared to weighted support-vector machine and available results in the literature for the same datasets.


Subject(s)
Algorithms , Amino Acids/chemistry , Catalytic Domain , Chemistry, Physical , Databases, Factual , Molecular Structure , Molecular Weight
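For orientation, the general idea of undersampling the majority class can be sketched as below. This is plain random undersampling, a common baseline, and explicitly not the new selection technique proposed in the abstract above, whose details are not given here. The function name, the `ratio` parameter, and the toy data are hypothetical.

```python
import random

def random_undersample(majority, minority, ratio=1.0, seed=0):
    """Randomly keep about ratio * len(minority) majority-class instances.

    Baseline illustration only; the abstract's method selects
    instances by a different (unspecified here) criterion.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    return rng.sample(list(majority), n_keep)

if __name__ == "__main__":
    negatives = [f"neg{i}" for i in range(900)]  # toy majority class (~9:1 imbalance)
    positives = [f"pos{i}" for i in range(100)]  # toy minority class
    kept = random_undersample(negatives, positives)
    print(len(kept), len(positives))  # 100 100
```

A balanced training set built this way discards most negatives at random, which is the weakness an informed selection strategy aims to address.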
16.
Amino Acids ; 39(3): 777-83, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20186553

ABSTRACT

Lipocalins are functionally diverse proteins that are composed of 120-180 amino acid residues. Members of this family have several important biological functions including ligand transport, cryptic coloration, sensory transduction, endonuclease activity, stress response activity in plants, odorant binding, prostaglandin biosynthesis, cellular homeostasis regulation, immunity, immunotherapy and so on. Identification of lipocalins from protein sequence is challenging due to the poor sequence identity, which often falls below the twilight zone. So far, no specific method has been reported to identify lipocalins from primary sequence. In this paper, we report a support vector machine (SVM) approach, LipoPred, to predict lipocalins from protein sequence using sequence-derived properties. LipoPred was trained using a dataset consisting of 325 lipocalin proteins and 325 non-lipocalin proteins, and evaluated on an independent set of 140 lipocalin proteins and 21,447 non-lipocalin proteins. LipoPred achieved 88.61% accuracy with 89.26% sensitivity, 85.27% specificity and a Matthews correlation coefficient (MCC) of 0.74. When applied to the test dataset, LipoPred achieved 84.25% accuracy with 88.57% sensitivity, 84.22% specificity and an MCC of 0.16. LipoPred achieved better performance when compared with the PSI-BLAST, HMM and SVM-Prot methods. Out of 218 lipocalins, LipoPred correctly predicted 194 proteins, including 39 lipocalins that are non-homologous to any protein in the SWISSPROT database. This result shows that LipoPred is potentially useful for predicting lipocalin proteins that have no sequence homologs in the sequence databases. Further, the successful prediction of nine hypothetical lipocalin proteins and five new members of the lipocalin family shows that LipoPred can be efficiently used to identify and annotate new lipocalin proteins from sequence databases. The LipoPred software and dataset are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/lipopred.htm.


Subject(s)
Lipocalins/chemistry , Sequence Alignment/methods , Databases, Protein , Humans , Protein Structure, Tertiary , Sequence Alignment/instrumentation , Sequence Homology, Amino Acid
17.
Protein Pept Lett ; 17(4): 423-30, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20044918

ABSTRACT

X-ray crystallography is the most widely used method for protein 3-dimensional structure determination. Selection of target proteins that can yield high-quality crystals for X-ray crystallography is a challenging task. Prediction of protein crystallization propensity from sequence information is useful for the selection of target proteins for crystallization. Recently, support vector machines have been widely used to solve various biological problems. In this work, we present SVMCRYS, a method which uses a support vector machine to classify protein sequences as 'amenable to crystallization' or 'resistant to crystallization'. SVMCRYS was trained on a dataset containing 728 sequences that gave diffraction-quality crystals and 728 sequences where work had been stopped before obtaining crystals. The performance of SVMCRYS was compared with other sequence-based crystallization prediction methods such as SECRET, CRYSTALP, OB-Score, ParCrys and XtalPred using three different datasets. SVMCRYS achieved a better prediction rate with higher sensitivity and specificity. Our analysis suggests that SVMCRYS can be used to predict which proteins are amenable to crystallization and which are difficult to crystallize. The SVMCRYS software, dataset and feature set can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/svmcrys.htm.


Subject(s)
Algorithms , Amino Acid Sequence , Artificial Intelligence , Crystallography, X-Ray/methods , Proteins/chemistry , Databases, Protein , Nuclear Magnetic Resonance, Biomolecular , Proteins/metabolism , ROC Curve , Reproducibility of Results , Structure-Activity Relationship
18.
Biochem Biophys Res Commun ; 391(3): 1306-11, 2010 Jan 15.
Article in English | MEDLINE | ID: mdl-19995554

ABSTRACT

Eukaryotic protein secretion generally occurs via the classical secretory pathway that traverses the ER and Golgi apparatus. Secreted proteins usually contain a signal sequence with all the essential information required to target them for secretion. However, some proteins, such as fibroblast growth factors (FGF-1, FGF-2), interleukins (IL-1 alpha, IL-1 beta), galectins and thioredoxin, are exported by an alternative pathway. This is known as leaderless or non-classical secretion and works without a signal sequence. Most computational methods for the identification of secretory proteins use the signal peptide as an indicator and are therefore not able to identify substrates of non-classical secretion. In this work, we report a random forest method, SPRED, to identify secretory proteins from protein sequences irrespective of N-terminal signal peptides, thus also allowing correct classification of non-classical secretory proteins. Training was performed on a dataset containing 600 extracellular proteins and 600 cytoplasmic and/or nuclear proteins. The algorithm was tested on 180 extracellular proteins and 1380 cytoplasmic and/or nuclear proteins. We obtained 85.92% accuracy from training and 82.18% accuracy from testing. Since SPRED does not use N-terminal signals, it can detect non-classical secreted proteins by filtering out those secreted proteins with an N-terminal signal using SignalP. SPRED predicted 15 out of 19 experimentally verified non-classical secretory proteins. By scanning the entire human proteome, we identified 566 protein sequences potentially undergoing non-classical secretion. The dataset and standalone version of the SPRED software are available at http://www.inb.uni-luebeck.de/tools-demos/spred/spred.


Subject(s)
Artificial Intelligence , Genome, Human , Proteins/metabolism , Proteome , Sequence Analysis, Protein/methods , Animals , Humans , Proteins/chemistry , Proteins/genetics
19.
J Biomol Struct Dyn ; 26(6): 679-86, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19385697

ABSTRACT

DNA-binding proteins (DNABPs) are important for various cellular processes, such as transcriptional regulation, recombination, replication, repair, and DNA modification. So far, various bioinformatics and machine learning techniques have been applied to the identification of DNA-binding proteins from protein structure. Only a few methods are available for the identification of DNA-binding proteins from protein sequence. In this work, we report a random forest method, DNA-Prot, to identify DNA-binding proteins from protein sequence. Training was performed on a dataset containing 146 DNA-binding proteins and 250 non DNA-binding proteins. The algorithm was tested on a dataset containing 92 DNA-binding proteins and 100 non DNA-binding proteins. We obtained 80.31% accuracy from training and 84.37% accuracy from testing. Benchmarking analysis on an independent dataset of 823 DNA-binding proteins and 823 non DNA-binding proteins shows that our approach can distinguish DNA-binding proteins from non DNA-binding proteins with more than 80% accuracy. We also compared our method with the DNAbinder method on the test dataset and two independent datasets. Comparable performance was observed from both methods on the test dataset. On the benchmark dataset containing 823 DNA-binding proteins and 823 non DNA-binding proteins, we obtained significantly better performance from DNA-Prot, with 81.83% accuracy, whereas DNAbinder achieved only 61.42% accuracy using amino acid composition and 63.5% using the PSSM profile. Similarly, DNA-Prot achieved better performance on the benchmark dataset containing 88 DNA-binding proteins and 233 non DNA-binding proteins. This result shows that DNA-Prot can be efficiently used to identify DNA-binding proteins from sequence information. The dataset and standalone version of the DNA-Prot software can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/dnaprot.htm.


Subject(s)
Algorithms , DNA-Binding Proteins/analysis , Databases, Protein , Amino Acids/metabolism , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , Hydrophobic and Hydrophilic Interactions , Reproducibility of Results
20.
Biochem Biophys Res Commun ; 384(2): 155-9, 2009 Jun 26.
Article in English | MEDLINE | ID: mdl-19394310

ABSTRACT

Identification of functionally important sites (FIS) in proteins is a critical problem and can be of profound importance where protein structural information is limited. Machine learning techniques have been very useful in the successful classification of many important biological problems. In this paper, we adopt the sparse kernel least squares classifiers (SKLSC) approach for classification and/or prediction of FIS using protein sequence derived features. The SKLSC algorithm was applied to 5435 FIS that were extracted from 312 reliable alignments for a wide range of protein families. We obtained 68.28% sensitivity and 68.66% specificity for the training dataset and 65.34% sensitivity and 66.88% specificity for the testing dataset. Further, a large-scale benchmarking study using alignments of 101 protein families containing 1899 FIS showed that our method achieved an average sensitivity of approximately 70% in predicting different types of FIS, such as active sites and metal, ligand or protein binding sites. Our findings also indicate that active sites and metal binding sites are comparatively easier to predict than ligand and protein binding sites. Despite moderate success, our results suggest the usefulness and potential of the SKLSC approach in prediction of FIS using only protein sequence derived information.


Subject(s)
Binding Sites , Proteins/chemistry , Sequence Analysis, Protein/methods , Amino Acid Sequence , Catalytic Domain , Least-Squares Analysis , Proteins/classification