1.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36617463

ABSTRACT

DNA and RNA sequencing technologies have revolutionized biology and biomedical sciences, sequencing full genomes and transcriptomes at very high speeds and reasonably low costs. RNA sequencing (RNA-Seq) enables transcript identification and quantification, but once sequencing has concluded researchers can be easily overwhelmed with questions such as how to go from raw data to differential expression (DE), pathway analysis and interpretation. Several pipelines and procedures have been developed to this effect. Even though there is no unique way to perform RNA-Seq analysis, it usually follows these steps: 1) raw reads quality check, 2) alignment of reads to a reference genome, 3) aligned reads' summarization according to an annotation file, 4) DE analysis and 5) gene set analysis and/or functional enrichment analysis. Each step requires researchers to make decisions, and the wide variety of options and resulting large volumes of data often lead to interpretation challenges. There also seems to be insufficient guidance on how best to obtain relevant information and derive actionable knowledge from transcription experiments. In this paper, we explain RNA-Seq steps in detail and outline differences and similarities of different popular options, as well as advantages and disadvantages. We also discuss non-coding RNA analysis, multi-omics, meta-transcriptomics and the use of artificial intelligence methods complementing the arsenal of tools available to researchers. Lastly, we perform a complete analysis from raw reads to DE and functional enrichment analysis, visually illustrating how results are not absolute truths and how algorithmic decisions can greatly impact results and interpretation.
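
As a rough illustration of what step 4 (DE analysis) operates on, the sketch below normalizes raw gene counts to counts per million and computes a log2 fold change between two samples. It is a deliberately naive sketch: the function names, the pseudocount and the toy counts are assumptions made for illustration, and dedicated DE tools fit statistical models across replicates rather than comparing two samples directly.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Naive per-gene log2 fold change (treated vs control) on CPM values.

    The pseudocount (an assumed value) avoids division by zero for genes
    with no reads in one of the samples.
    """
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    return {
        gene: math.log2((cpm_treated[gene] + pseudocount)
                        / (cpm_control[gene] + pseudocount))
        for gene in cpm_control
    }


# Toy counts, invented for illustration only.
control = {"geneA": 100, "geneB": 100}
treated = {"geneA": 300, "geneB": 100}
lfc = log2_fold_change(control, treated)
```

Note that geneB has the same raw count in both samples yet receives a negative fold change after library-size normalization, a small example of how an algorithmic choice shapes the result.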


Subject(s)
Artificial Intelligence , Gene Expression Profiling , Gene Expression Profiling/methods , Transcriptome , Sequence Analysis, RNA/methods , Genome , High-Throughput Nucleotide Sequencing/methods , RNA/genetics
2.
Sci Rep ; 12(1): 11791, 2022 07 11.
Article in English | MEDLINE | ID: mdl-35821038

ABSTRACT

Progesterone receptor (PR) transcriptional activity is a key factor in the differentiation of the uterine endometrium. By consequence, progestin has been identified as an important treatment modality for endometrial cancer. PR transcriptional activity is controlled by extracellular-signal-regulated kinase (ERK) mediated phosphorylation, downstream of growth factor receptors such as EGFR. However, phosphorylation of PR also targets it for ubiquitination and destruction in the proteasome. Quantitative studies of these opposing roles are much needed toward validation of potential new progestin-based therapeutics. In this work, we propose a spatial stochastic model to study the effects of the opposing roles for PR phosphorylation on the levels of active transcription factor. Our numerical simulations confirm earlier in vitro experiments in endometrial cancer cell lines, identifying clustering as a mechanism that amplifies the ability of progesterone receptors to influence gene transcription. We additionally show the usefulness of a statistical method we developed to quantify and control variations in stochastic simulations in general biochemical systems, assisting modelers in defining minimal but meaningful numbers of simulations while guaranteeing outputs remain within a pre-defined confidence level.
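
The abstract does not describe the statistical method itself. As a generic illustration of the underlying question (how many stochastic runs are needed to pin down a mean output within a chosen confidence level), the sketch below uses the textbook normal-approximation sample-size formula. This is an assumption standing in for the authors' method, not a reconstruction of it; the function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def required_runs(sample_std, margin, confidence=0.95):
    """Number of independent runs so that the mean of an output lies within
    +/- margin of the true mean at the given confidence level.

    Uses n = (z * sigma / margin)^2 with sigma estimated from pilot runs;
    this relies on the normal approximation and is only a rough guide.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((z * sample_std / margin) ** 2)


# Pilot standard deviation of 2.0, target half-width of 0.5 (toy values).
n_runs = required_runs(sample_std=2.0, margin=0.5, confidence=0.95)
```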


Subject(s)
Endometrial Neoplasms , Receptors, Progesterone , Cluster Analysis , Endometrial Neoplasms/genetics , Extracellular Signal-Regulated MAP Kinases/metabolism , Female , Humans , Progesterone/pharmacology , Progesterone Congeners , Progestins/pharmacology , Receptors, Progesterone/genetics , Receptors, Progesterone/metabolism , Signal Transduction , Translocation, Genetic
3.
Methods Mol Biol ; 2499: 205-219, 2022.
Article in English | MEDLINE | ID: mdl-35696083

ABSTRACT

Among various types of protein post-translational modifications (PTMs), lysine PTMs play an important role in regulating a wide range of functions and biological processes. Due to the generation and accumulation of an enormous amount of protein sequence data by ongoing whole-genome sequencing projects, systematic identification of different types of lysine PTM substrates and their specific PTM sites in the entire proteome is increasingly important and has therefore received much attention. Accordingly, a variety of computational methods for lysine PTM identification have been developed based on the combination of various handcrafted sequence features and machine-learning techniques. In this chapter, we first briefly review existing computational methods for lysine PTM identification and then introduce a recently developed deep learning-based method, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs). Specifically, MUscADEL employs bidirectional long short-term memory (BiLSTM) recurrent neural networks and is capable of predicting eight major types of lysine PTMs in both the human and mouse proteomes. The web server of MUscADEL is publicly available at http://muscadel.erc.monash.edu/ for the research community to use.
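
Sequence-based PTM site predictors generally start by cutting a fixed-length window around each candidate residue. The sketch below shows that preprocessing step for lysines only. The flank size and the padding character are assumptions chosen for illustration, and the function is hypothetical rather than part of MUscADEL.

```python
def lysine_windows(sequence, flank=3, pad="X"):
    """Return (position, window) pairs for every lysine (K) in a sequence.

    position is 1-based. Windows near the termini are padded with `pad`
    so that every window has length 2 * flank + 1 with K at the centre.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Index i in the raw sequence is index i + flank in the padded
            # one, so the slice below is centred on this lysine.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


sites = lysine_windows("MKAAK", flank=2)
```

Each window would then be numerically encoded before being passed to a model such as a recurrent network.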


Subject(s)
Lysine , Protein Processing, Post-Translational , Amino Acid Sequence , Animals , Lysine/metabolism , Machine Learning , Mice , Proteome/metabolism
4.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33774670

ABSTRACT

Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to antimicrobial resistance, which is becoming an emerging global concern. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given its significance, more than 30 computational methods have been developed for accurate prediction of AMPs. These approaches show high diversity in their data set size, data quality, core algorithms, feature extraction, feature selection techniques and evaluation strategies. Here, we provide a comprehensive survey on a variety of current approaches for AMP identification and point at the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools based on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare different computational methods based on these data sets. The results indicate that amPEPpy achieves the best predictive performance and outperforms the other compared methods. As the predictive performances are affected by the different data sets used by different methods, we additionally perform the 5-fold cross-validation test to benchmark different traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performances than other machine learning methods and are often the algorithms of choice of multiple AMP prediction tools.
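
The 5-fold cross-validation mentioned above can be sketched as follows. This is a minimal, library-free illustration of how samples are partitioned; the function name and the fixed seed are assumptions, and the survey's actual benchmarking setup is not described in the abstract.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves
    as the held-out test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
```

Every sample appears in exactly one test fold, so each method is scored on data it was not trained on.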


Subject(s)
Algorithms , Computational Biology/methods , Machine Learning , Pore Forming Cytotoxic Proteins/pharmacology , Bacteria/classification , Bacteria/drug effects , Biofilms/drug effects , Biofilms/growth & development , Databases, Factual , Fungi/classification , Fungi/drug effects , Pore Forming Cytotoxic Proteins/classification , Pore Forming Cytotoxic Proteins/metabolism , Support Vector Machine , Viruses/drug effects
5.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-32599617

ABSTRACT

Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is increasingly difficult to identify novel VFs using existing tools that were developed on outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performance, as the majority of tools focus on extracting only a few types of features. By addressing the aforementioned issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome-wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that uses the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks, are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy.
Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user's viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.


Subject(s)
Bacteria , Bacterial Proteins/genetics , Databases, Protein , Deep Learning , Genome, Bacterial , Virulence Factors/genetics , Bacteria/genetics , Bacteria/pathogenicity
6.
Front Big Data ; 4: 727216, 2021.
Article in English | MEDLINE | ID: mdl-35118375

ABSTRACT

BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no complete genome available. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated using NGS. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms on the market (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. This has posed a big data analysis challenge for traditional stand-alone tools to merge short read pairs and identify SSRs from large-scale data. RESULTS: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problems of merging short read pairs and mining SSRs, respectively, in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pair merging and SSR mining from very large-scale DNA sequence data. CONCLUSIONS: The excellent performance of BigFiRSt is mainly attributable to the use of Big Data Hadoop technology to merge read pairs and mine SSRs in a parallel and distributed manner on computing clusters.
We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
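
To make the SSR mining task concrete, the sketch below finds tandem repeats in a single DNA string with a back-referencing regular expression. It is a small stand-alone illustration only: the function name, the motif length range and the minimum repeat count are assumptions, and it does not reproduce the algorithms used by PERF or BigPERF.

```python
import re


def find_ssrs(sequence, min_motif=1, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for tandem repeats in a DNA string.

    start is 0-based. The lazy quantifier makes the regex prefer the
    shortest motif that satisfies the repeat threshold at each position.
    Expects an upper-case sequence over A, C, G, T.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits


ssrs = find_ssrs("TTGACACACACGG")
```

On millions of reads this per-sequence scan is the unit of work that a distributed framework would run in parallel across a cluster.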

7.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33212503

ABSTRACT

Beta-lactamases (BLs) are enzymes localized in the periplasmic space of bacterial pathogens, where they confer resistance to beta-lactam antibiotics. Experimental identification of BLs is costly yet crucial to understand beta-lactam resistance mechanisms. To address this issue, we present DeepBL, a deep learning-based approach by incorporating sequence-derived features to enable high-throughput prediction of BLs. Specifically, DeepBL is implemented based on the Small VGGNet architecture and the TensorFlow deep learning library. Furthermore, the performance of DeepBL models is investigated in relation to the sequence redundancy level and negative sample selection in the benchmark dataset. The models are trained on datasets of varying sequence redundancy thresholds, and the model performance is evaluated by extensive benchmarking tests. Using the optimized DeepBL model, we perform proteome-wide screening for all reviewed bacterium protein sequences available from the UniProt database. These results are freely accessible at the DeepBL webserver at http://deepbl.erc.monash.edu.au/.


Subject(s)
Computational Biology , Databases, Protein , Deep Learning , Proteome , Software , beta-Lactamases/genetics
8.
Nucleic Acids Res ; 49(D1): D651-D659, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33084862

ABSTRACT

Gram-negative bacteria utilize secretion systems to export substrates into their surrounding environment or directly into neighboring cells. These substrates are proteins that function to promote bacterial survival: by facilitating nutrient collection, disabling competitor species or, for pathogens, to disable host defenses. Following a rapid development of computational techniques, a growing number of substrates have been discovered and subsequently validated by wet lab experiments. To date, several online databases have been developed to catalogue these substrates but they have limited user options for in-depth analysis, and typically focus on a single type of secreted substrate. We therefore developed a universal platform, BastionHub, that incorporates extensive functional modules to facilitate substrate analysis and integrates the five major Gram-negative secreted substrate types (i.e. from types I-IV and VI secretion systems). To our knowledge, BastionHub is not only the most comprehensive online database available, it is also the first to incorporate substrates secreted by type I or type II secretion systems. By providing the most up-to-date details of secreted substrates and state-of-the-art prediction and visualized relationship analysis tools, BastionHub will be an important platform that can assist biologists in uncovering novel substrates and formulating new hypotheses. BastionHub is freely available at http://bastionhub.erc.monash.edu/.


Subject(s)
Databases as Topic , Gram-Negative Bacteria/metabolism , Data Curation , Molecular Sequence Annotation , Substrate Specificity
9.
J Bioinform Comput Biol ; 18(4): 2050018, 2020 08.
Article in English | MEDLINE | ID: mdl-32501138

ABSTRACT

Background: Phosphorylation of histidine residues plays crucial roles in signaling pathways and cell metabolism in prokaryotes such as bacteria. While evidence has emerged that protein histidine phosphorylation also occurs in more complex organisms, its role in mammalian cells has remained largely uncharted. Thus, it is highly desirable to develop computational tools that are able to identify histidine phosphorylation sites. Results: Here, we introduce PROSPECT, which enables fast and accurate prediction of proteome-wide histidine phosphorylation substrates and sites. Our tool is based on a hybrid method that integrates the outputs of two convolutional neural network (CNN)-based classifiers and a random forest-based classifier. Three feature encodings, namely one-of-K coding, enhanced grouped amino acids content (EGAAC) and composition of k-spaced amino acid group pairs (CKSAAGP), were taken as the inputs to the three classifiers, respectively. Our results show that it is able to accurately predict histidine phosphorylation sites from sequence information. Our PROSPECT web server is user-friendly and publicly available at http://PROSPECT.erc.monash.edu/. Conclusions: PROSPECT is superior to other pHis predictors in both running speed and prediction accuracy, and we anticipate that the PROSPECT webserver will become a popular tool for identifying pHis sites in bacteria.
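
Of the three encodings named above, one-of-K coding is the simplest to show. The sketch below turns a peptide window into a binary matrix with one row per residue. The alphabet ordering and the all-zero treatment of unknown residues are assumptions made for illustration; the abstract does not specify how PROSPECT handles them.

```python
# The 20 standard amino acids in alphabetical one-letter order (assumed order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_of_k(window):
    """Encode a peptide window as a list of 20-dimensional binary vectors.

    Each residue sets exactly one position to 1. Characters outside the
    standard alphabet (for example a padding symbol) become all-zero rows.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in window:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1
        encoded.append(row)
    return encoded


matrix = one_of_k("AHX")
```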


Subject(s)
Histidine/metabolism , Proteome/metabolism , Software , Computational Biology/methods , Escherichia coli Proteins/metabolism , Neural Networks, Computer , Phosphorylation
10.
Genomics Proteomics Bioinformatics ; 18(1): 52-64, 2020 02.
Article in English | MEDLINE | ID: mdl-32413515

ABSTRACT

Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.


Subject(s)
Computational Biology/methods , Peptide Hydrolases/metabolism , Software , Algorithms , Benchmarking , Catalytic Domain , Humans , Peptide Hydrolases/chemistry , Protein Conformation , Proteolysis , Proteome/metabolism , Structure-Activity Relationship , Substrate Specificity
11.
Mol Ther Nucleic Acids ; 20: 739-753, 2020 Jun 05.
Article in English | MEDLINE | ID: mdl-32408052

ABSTRACT

Significant advances in biotechnology have led to the development of a number of different mutation-directed therapies. Some of these techniques have matured to a level that has allowed testing in clinical trials, but few have made it to approval by drug-regulatory bodies for the treatment of specific diseases. While there are still various hurdles to be overcome, recent success stories have proven the potential power of mutation-directed therapies and have fueled the hope of finding therapeutics for other genetic disorders. In this review, we summarize the state-of-the-art of various therapeutic approaches and assess their applicability to the genetic disorder neurofibromatosis type I (NF1). NF1 is caused by the loss of function of neurofibromin, a tumor suppressor and downregulator of the Ras signaling pathway. The condition is characterized by a variety of phenotypes and includes symptoms such as skin spots, nervous system tumors, skeletal dysplasia, and others. Hence, depending on the patient, therapeutics may need to target different tissues and cell types. While we also discuss the delivery of therapeutics, in particular via viral vectors and nanoparticles, our main focus is on therapeutic techniques that reconstitute functional neurofibromin, most notably cDNA replacement, CRISPR-based DNA repair, RNA repair, antisense oligonucleotide therapeutics including exon skipping, and nonsense suppression.

12.
AMIA Annu Symp Proc ; 2020: 1325-1334, 2020.
Article in English | MEDLINE | ID: mdl-33936509

ABSTRACT

Recent research in predicting protein secondary structure populations (SSP) based on Nuclear Magnetic Resonance (NMR) chemical shifts has helped quantitatively characterise the structural conformational properties of intrinsically disordered proteins and regions (IDP/IDR). Unlike protein secondary structure (SS) prediction, SSP prediction assumes a dynamic assignment of secondary structures that seem to correlate with disordered states. In this study, we designed single-task deep learning frameworks to predict IDP/IDR and SSP separately, and multitask deep learning frameworks to allow quantitative predictions of IDP/IDR evidenced by the simultaneously predicted SSP. According to independent test results, single-task deep learning models improve on the prediction performance of shallow models for SSP and IDP/IDR. The performance of IDP/IDR prediction was further improved when SSP was simultaneously predicted in multitask models. With p53 as a use case, we demonstrate how predicted SSP is used to explain the IDP/IDR predictions for each functional region.


Subject(s)
Deep Learning , Intrinsically Disordered Proteins/chemistry , Protein Structure, Secondary
13.
J Immunother Cancer ; 8(2)2020 12.
Article in English | MEDLINE | ID: mdl-33427691

ABSTRACT

BACKGROUND: Obesity is a major risk factor for renal cancer, yet our understanding of its effects on antitumor immunity and immunotherapy outcomes remains incomplete. Deciphering these associations is critical, given the growing clinical use of immune checkpoint inhibitors for metastatic disease and mounting evidence for an obesity paradox in the context of cancer immunotherapies, wherein obese patients with cancer have improved outcomes. METHODS: We investigated associations between host obesity and anti-programmed cell death (PD-1)-based outcomes in both renal cell carcinoma (RCC) subjects and orthotopic murine renal tumors. Overall survival (OS) and progression-free survival (PFS) were determined for advanced RCC subjects receiving standard of care anti-PD-1 who had ≥6 months of follow-up from treatment initiation (n=73). Renal tumor tissues were collected from treatment-naive subjects categorized as obese (body mass index, 'BMI' ≥30 kg/m2) or non-obese (BMI <30 kg/m2) undergoing partial or full nephrectomy (n=19) then used to evaluate the frequency and phenotype of intratumoral CD8+ T cells, including PD-1 status, by flow cytometry. In mice, antitumor immunity and excised renal tumor weights were evaluated ±administration of a combinatorial anti-PD-1 therapy. For a subset of murine renal tumors, immunophenotyping was performed by flow cytometry and immunogenetic profiles were evaluated via nanoString. RESULTS: With obesity, RCC patients receiving anti-PD-1 administration exhibited shorter PFS (p=0.0448) and OS (p=0.0288). Treatment-naive renal cancer subjects had decreased frequencies of tumor-infiltrating PD-1highCD8+ T cells, a finding recapitulated in our murine model. Following anti-PD-1-based immunotherapy, both lean and obese mice possessed distinct populations of treatment responders versus non-responders; however, obesity reduced the frequency of treatment responders (73% lean vs 44% obese). 
Tumors from lean and obese treatment responders displayed similar immunogenetic profiles, robust infiltration by PD-1int interferon (IFN)γ+CD8+ T cells and reduced myeloid-derived suppressor cells (MDSC), yielding favorable CD44+CD8+ T cell to MDSC ratios. Neutralizing interleukin (IL)-1β in obese mice improved treatment response rates to 58% and reduced MDSC accumulation in tumors. CONCLUSIONS: We find that obesity is associated with diminished efficacy of anti-PD-1-based therapies in renal cancer, due in part to increased inflammatory IL-1β levels, highlighting the need for continued study of this critical issue.


Subject(s)
Immunotherapy/methods , Kidney Neoplasms/drug therapy , Obesity/complications , Animals , Female , Humans , Kidney Neoplasms/immunology , Male , Mice , Prospective Studies , Retrospective Studies
14.
Brief Bioinform ; 21(4): 1119-1135, 2020 07 15.
Article in English | MEDLINE | ID: mdl-31204427

ABSTRACT

Human leukocyte antigen class I (HLA-I) molecules are encoded by major histocompatibility complex (MHC) class I loci in humans. The binding and interaction between HLA-I molecules and intracellular peptides derived from a variety of proteolytic mechanisms play a crucial role in subsequent T-cell recognition of target cells and the specificity of the immune response. In this context, tools that predict the likelihood for a peptide to bind to specific HLA class I allotypes are important for selecting the most promising antigenic targets for immunotherapy. In this article, we comprehensively review a variety of currently available tools for predicting the binding of peptides to a selection of HLA-I allomorphs. Specifically, we compare their calculation methods for the prediction score, employed algorithms, evaluation strategies and software functionalities. In addition, we have evaluated the prediction performance of the reviewed tools based on an independent validation data set, containing 21 101 experimentally verified ligands across 19 HLA-I allotypes. The benchmarking results show that MixMHCpred 2.0.1 achieves the best performance for predicting peptides binding to most of the HLA-I allomorphs studied, while NetMHCpan 4.0 and NetMHCcons 1.1 outperform the other machine learning-based and consensus-based tools, respectively. Importantly, it should be noted that a peptide predicted with a higher binding score for a specific HLA allotype does not necessarily imply it will be immunogenic. That said, peptide-binding predictors are still very useful in that they can help to significantly reduce the large number of epitope candidates that need to be experimentally verified. Several other factors, including susceptibility to proteasome cleavage, peptide transport into the endoplasmic reticulum and T-cell receptor repertoire, also contribute to the immunogenicity of peptide antigens, and some of them can be considered by some predictors. 
Therefore, integrating features derived from these additional factors together with HLA-binding properties by using machine-learning algorithms may increase the prediction accuracy of immunogenic peptides. As such, we anticipate that this review and benchmarking survey will assist researchers in selecting appropriate prediction tools that best suit their purposes and provide useful guidelines for the development of improved antigen predictors in the future.


Subject(s)
Computational Biology/methods , Histocompatibility Antigens Class I/metabolism , Algorithms , Datasets as Topic , Histocompatibility Antigens Class I/chemistry , Humans , Machine Learning , Reproducibility of Results
15.
Brief Bioinform ; 21(3): 1069-1079, 2020 05 21.
Article in English | MEDLINE | ID: mdl-31161204

ABSTRACT

Post-translational modifications (PTMs) play very important roles in various cell signaling pathways and biological processes. Due to PTMs' extremely important roles, many major PTMs have been studied, while the functional and mechanical characterization of major PTMs is well documented in several databases. However, most currently available databases mainly focus on protein sequences, while the real 3D structures of PTMs have been largely ignored. Therefore, studies of the 3D structural signatures of PTMs have been severely limited by the deficiency of the data. Here, we develop PRISMOID, a novel publicly available and free 3D structure database for a wide range of PTMs. PRISMOID represents an up-to-date and interactive online knowledge base with a specific focus on the 3D structural contexts of PTM sites and on functionally relevant mutations that occur at PTM sites or in their close proximity. The first version of PRISMOID encompasses 17 145 non-redundant modification sites on 3919 related protein 3D structure entries pertaining to 37 different types of PTMs. Our entry web page is organized in a comprehensive manner, including detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure features and per-residue solvent accessibility features of PTM sites, domain context, predicted natively disordered regions and sequence alignments. In addition, high-definition JavaScript packages are employed to enhance information visualization in PRISMOID. PRISMOID provides a variety of interactive and customizable search options and data browsing functions; these capabilities allow users to access data via keyword, ID and advanced combined-option searches in an efficient and user-friendly way. A download page is also provided to enable users to download the SQL file, computational structural features and PTM site data.
We anticipate PRISMOID will swiftly become an invaluable online resource, assisting both biologists and bioinformaticians to conduct experiments and develop applications supporting discovery efforts in the sequence-structural-functional relationship of PTMs and providing important insight into mutations and PTM sites interaction mechanisms. The PRISMOID database is freely accessible at http://prismoid.erc.monash.edu/. The database and web interface are implemented in MySQL, JSP, JavaScript and HTML with all major browsers supported.


Subject(s)
Databases, Protein , Mutation , Protein Processing, Post-Translational , Proteins/chemistry , Protein Conformation
16.
Bioinformatics ; 36(3): 704-712, 2020 02 01.
Article in English | MEDLINE | ID: mdl-31393553

ABSTRACT

MOTIVATION: Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Currently, several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data. RESULTS: In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as a machine learning approach and the necessary parameter optimization was performed by a particle swarm optimization strategy. 
All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. Consequently, the final ensemble model achieved superior performance, with an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors. AVAILABILITY AND IMPLEMENTATION: http://pengaroo.erc.monash.edu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
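
For readers unfamiliar with the reported measures, the sketch below shows how accuracy, the F-value (taken here to be the F1 score, which is an assumption) and the Matthews correlation coefficient are conventionally computed from a binary confusion matrix. The function name and the counts in the usage note are hypothetical and are not taken from the PeNGaRoo study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F-value, MCC) from binary confusion-matrix counts.

    Illustrative helper only; not part of PeNGaRoo.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Precision and recall, guarding against empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F-value as the harmonic mean of precision and recall (F1).
    pr_sum = precision + recall
    f_value = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return accuracy, f_value, mcc
```

With made-up counts such as `binary_metrics(45, 5, 45, 5)`, all three measures can be checked by hand, which makes the differences between them easy to see on imbalanced examples.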


Subject(s)
Algorithms , Machine Learning , Computational Biology , Peptides , Proteins
17.
Brief Bioinform ; 21(3): 1047-1057, 2020 05 21.
Article in English | MEDLINE | ID: mdl-31067315

ABSTRACT

With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users who only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.
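
To make "feature extraction" concrete, here is a minimal sketch of one of the simplest sequence descriptors, amino acid composition, written in plain Python. It is an illustration of the general idea only: the function name is hypothetical, the abstract does not list the descriptor set, and this is not iLearn's code or API.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list of residue frequencies for a protein sequence.

    Non-standard characters (e.g. 'X') are ignored when computing frequencies.
    """
    counts = Counter(sequence.upper())
    length = sum(counts[aa] for aa in AMINO_ACIDS)
    if length == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / length for aa in AMINO_ACIDS]
```

A fixed-length vector of this kind can be passed to any standard classifier regardless of the original sequence length, which is the reason composition-style descriptors are a common starting point.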


Subject(s)
DNA/chemistry , Machine Learning , Proteins/chemistry , RNA/chemistry , Sequence Analysis/methods , Algorithms , Internet
18.
Bioinformatics ; 36(4): 1057-1065, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31566664

ABSTRACT

MOTIVATION: Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the 'life and death' cellular processes, many of the corresponding substrates and their cleavage sites have not yet been found. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases' functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events. RESULTS: We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. High predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and the application of transfer learning, multiple kernels and an attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites. AVAILABILITY AND IMPLEMENTATION: The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Caspases , Metalloproteases , Software , Substrate Specificity
19.
Bioinformatics ; 35(17): 2957-2965, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30649179

ABSTRACT

MOTIVATION: Promoters are short DNA consensus sequences that are localized proximal to the transcription start sites of genes, allowing transcription initiation of particular genes. However, the precise prediction of promoters remains a challenging task because individual promoters often differ from the consensus at one or more positions. RESULTS: In this study, we present a new multi-layer computational approach, called MULTiPly, for recognizing promoters and their specific types. MULTiPly took into account the sequences themselves, including both local information, such as k-tuple nucleotide composition and dinucleotide-based auto covariance, and global information of the entire samples based on bi-profile Bayes and k-nearest neighbour feature encodings. Specifically, the F-score feature selection method was applied to identify the best unique type of feature prediction results, in combination with other types of features that were subsequently added to further improve the prediction performance of MULTiPly. Benchmarking experiments on the benchmark dataset and comparisons with five state-of-the-art tools show that MULTiPly can achieve a better prediction performance on 5-fold cross-validation and jackknife tests. Moreover, the superiority of MULTiPly was also validated on a newly constructed independent test dataset. MULTiPly is expected to be used as a useful tool that will facilitate the discovery of both general and specific types of promoters in the post-genomic era. AVAILABILITY AND IMPLEMENTATION: The MULTiPly webserver and curated datasets are freely available at http://flagshipnt.erc.monash.edu/MULTiPly/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
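
As an illustration of the k-tuple nucleotide composition mentioned above, the sketch below counts overlapping k-mers in a DNA sequence and normalizes them to frequencies. The function name, the default k = 2 and the handling of ambiguous bases are assumptions made for this example; it is not MULTiPly's implementation.

```python
from itertools import product


def kmer_composition(sequence, k=2):
    """Return a dict mapping each of the 4**k DNA k-mers to its frequency.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    valid_windows = 0
    # Slide a window of width k along the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            valid_windows += 1

    if valid_windows == 0:
        raise ValueError("sequence has no valid k-mers of length %d" % k)
    return {kmer: counts[kmer] / valid_windows for kmer in kmers}
```

For k = 2 this yields a 16-dimensional feature vector per sequence; larger k captures longer local motifs at the cost of a sparser, higher-dimensional representation.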


Subject(s)
Genomics , Promoter Regions, Genetic , Software , Bayes Theorem , Transcription Initiation Site
20.
Immunity ; 50(1): 225-240.e4, 2019 01 15.
Article in English | MEDLINE | ID: mdl-30635238

ABSTRACT

Infants have a higher risk of developing allergic asthma than adults. However, the underlying mechanism remains unknown. We show here that sensitization of mice with house-dust mites (HDMs) in the presence of low-dose lipopolysaccharide (LPS) prevented T helper 2 (Th2) cell allergic responses in adult, but not infant, mice. Mechanistically, adult CD11b+ migratory dendritic cells (mDCs) upregulated the transcription factor T-bet in response to tumor necrosis factor-α (TNF-α), which was rapidly induced after HDM + LPS sensitization. Consequently, adult CD11b+ mDCs produced interleukin-12 (IL-12), which prevented Th2 cell development by promoting T-bet upregulation in responding T cells. Conversely, infants failed to induce TNF-α after HDM + LPS sensitization. Therefore, CD11b+ mDCs failed to upregulate T-bet and did not secrete IL-12, and Th2 cell responses developed normally in infant mice. Thus, the availability of TNF-α dictates the ability of CD11b+ mDCs to suppress allergic Th2-cell responses upon dose-dependent endotoxin sensitization and is a key mediator governing susceptibility to allergic airway inflammation in infant mice.


Subject(s)
Dendritic Cells/physiology , Hypersensitivity/immunology , Inflammation/immunology , Th2 Cells/immunology , Tumor Necrosis Factor-alpha/metabolism , Adult , Animals , Animals, Newborn , Antigens, Dermatophagoides , Cell Differentiation , Humans , Immunization , Infant , Lipopolysaccharides/immunology , Mice , Mice, Inbred C57BL , Mice, Knockout , Pyroglyphidae/immunology , T-Box Domain Proteins/metabolism