Results 1 - 20 of 62
1.
Bioinformatics ; 40(1), 2024 01 02.
Article in English | MEDLINE | ID: mdl-38147362

ABSTRACT

MOTIVATION: Up-to-date pathway knowledge is usually presented in scientific publications for human reading, making it difficult to utilize these resources for semantic integration and computational analysis of biological pathways. We here present an approach to mining knowledge graphs by combining manual curation with automated named entity recognition and automated relation extraction. This approach allows us to study pathway-related questions in detail, which we here show using the ketamine pathway, aiming to help improve understanding of the role of gut microbiota in the antidepressant effects of ketamine. RESULTS: The thus devised ketamine pathway 'KetPath' knowledge graph comprises five parts: (i) manually curated pathway facts from images; (ii) recognized named entities in biomedical texts; (iii) identified relations between named entities; (iv) our previously constructed microbiota and pre-/probiotics knowledge bases; and (v) multiple community-accepted public databases. We first assessed the performance of automated extraction of relations between named entities using the specially designed state-of-the-art tool BioKetBERT. The query results show that we can retrieve drug actions, pathway relations, co-occurring entities, and their relations. These results uncover several biological findings, such as various gut microbes leading to increased expression of BDNF, which may contribute to the sustained antidepressant effects of ketamine. We envision that the methods and findings from this research will aid researchers who wish to integrate and query data and knowledge from multiple biomedical databases and literature simultaneously. AVAILABILITY AND IMPLEMENTATION: Data and query protocols are available in the KetPath repository at https://dx.doi.org/10.5281/zenodo.8398941 and https://github.com/tingcosmos/KetPath.


Subject(s)
Gastrointestinal Microbiome , Ketamine , Humans , Ketamine/pharmacology , Databases, Factual , Antidepressive Agents/pharmacology , Neurotransmitter Agents , Data Mining/methods
2.
PLoS Comput Biol ; 18(12): e1010669, 2022 12.
Article in English | MEDLINE | ID: mdl-36454728

ABSTRACT

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to "state-of-the-art," take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.


Subject(s)
Benchmarking , Machine Learning , Amino Acid Sequence , Chromosome Mapping , Knowledge
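
One recommendation in the abstract above, obtaining a significance estimate before concluding that one method is better than another, can be illustrated with a paired bootstrap. This is a minimal sketch and is not taken from the paper: the function name `paired_bootstrap_ci`, the per-protein score lists, and the choice of a 95% interval are assumptions made for illustration.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for mean(scores_a - scores_b).

    scores_a and scores_b hold one score per test case (e.g. per protein)
    for two methods evaluated on the same test set. Test cases are
    resampled in pairs so that the comparison stays paired.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally long, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample test cases with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high
```

If the returned interval excludes zero, the observed difference is unlikely to be an artefact of which test cases happened to be sampled; an interval that straddles zero does not support a claim that either method is better.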
3.
Sci Rep ; 12(1): 18977, 2022 11 08.
Article in English | MEDLINE | ID: mdl-36347868

ABSTRACT

Scientific publications present biological relationships but are structured for human reading, making it difficult to use this resource for semantic integration and querying. Existing databases, on the other hand, are well structured for automated analysis, but do not contain comprehensive biological knowledge. We devised an approach for constructing comprehensive knowledge graphs from these two types of resources and applied it to investigate relationships between pre-/probiotics and microbiota-gut-brain axis diseases. To this end, we created (i) a knowledge base, dubbed ppstatement, containing manually curated detailed annotations, and (ii) a knowledge base, called ppconcept, containing automatically annotated concepts. The resulting Pre-/Probiotics Knowledge Graph (PPKG) combines these two knowledge bases with three other public databases (i.e. MeSH, UMLS and SNOMED CT). To validate the performance of PPKG and to demonstrate the added value of integrating two knowledge bases, we created four biological query cases. The query cases demonstrate that we can retrieve co-occurring concepts of interest, and also that combining the two knowledge bases leads to more comprehensive query results than utilizing them separately. The PPKG enables users to pose research queries such as "which pre-/probiotics combinations may benefit depression?", potentially leading to novel biological insights.


Subject(s)
Microbiota , Probiotics , Humans , Brain-Gut Axis , Pattern Recognition, Automated , Knowledge Bases
4.
Sci Rep ; 12(1): 16047, 2022 09 26.
Article in English | MEDLINE | ID: mdl-36163232

ABSTRACT

Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated only on one or two downstream tasks, making it unclear whether the models capture generally useful properties. We introduce the ProteinGLUE benchmark for the evaluation of protein representations: a set of seven per-amino-acid tasks for evaluating learned protein representations. We also offer reference code, and we provide two baseline models with hyperparameters specifically trained for these benchmarks. Pre-training was done on two tasks, masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets from https://github.com/ibivu/protein-glue .


Subject(s)
Benchmarking , Proteins , Amino Acid Sequence , Amino Acids/chemistry , Natural Language Processing
5.
Sci Rep ; 12(1): 10487, 2022 06 21.
Article in English | MEDLINE | ID: mdl-35729253

ABSTRACT

Protein-protein interactions (PPIs) are crucial for protein functioning; nevertheless, predicting residues in PPI interfaces from the protein sequence remains a challenging problem. In addition, structure-based functional annotations, such as PPI interface annotations, are scarce: residue-based PPI interface annotations are available for only about one-third of all protein structures. If we want to use a deep learning strategy, we have to overcome the problem of limited data availability. Here we use a multi-task learning strategy that can handle missing data. We start with the multi-task model architecture and adapt it to carefully handle missing data in the cost function. As related learning tasks we include prediction of secondary structure, solvent accessibility, and buried residues. Our results show that the multi-task learning strategy significantly outperforms single-task approaches. Moreover, only the multi-task strategy is able to effectively learn over a dataset extended with structural feature data, without additional PPI annotations. The multi-task setup becomes even more important if the fraction of PPI annotations becomes very small: the multi-task learner trained on only one-eighth of the PPI annotations (with data extension) reaches the same performance as the single-task learner on all PPI annotations. Thus, we show that the multi-task learning strategy can be beneficial for a small training dataset where the protein's functional properties of interest are only partially annotated.


Subject(s)
Algorithms , Proteins , Proteins/metabolism
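
The idea of handling missing data in the cost function, as described in the abstract above, can be sketched as a masked multi-task loss: residues without an annotation for a task contribute nothing to that task's loss term. This sketch is an assumption-laden illustration and not the paper's implementation: the function name `masked_multitask_loss`, the use of `None` for a missing label, the restriction to binary tasks, and the equal default task weights are all hypothetical.

```python
import math


def masked_multitask_loss(predictions, labels, task_weights=None):
    """Sum of per-task mean binary cross-entropies, skipping missing labels.

    predictions[task] is a list of predicted probabilities per residue;
    labels[task] is a list of 0, 1, or None (None = annotation missing).
    """
    total = 0.0
    for task, probs in predictions.items():
        weight = 1.0 if task_weights is None else task_weights.get(task, 1.0)
        terms = []
        for p, y in zip(probs, labels[task]):
            if y is None:
                continue  # masked: no annotation, so no gradient signal
            p = min(max(p, 1e-7), 1 - 1e-7)  # avoid log(0)
            terms.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
        if terms:
            # Average over annotated residues only, so sparsely annotated
            # tasks are not diluted by their missing entries.
            total += weight * sum(terms) / len(terms)
    return total
```

With this kind of masking, a protein that has structural feature labels but no PPI annotation can still be used for training, because only the tasks that have labels contribute to the loss.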
6.
Bioinformatics ; 38(8): 2111-2118, 2022 04 12.
Article in English | MEDLINE | ID: mdl-35150231

ABSTRACT

MOTIVATION: The interactions between proteins and other molecules are essential to many biological and cellular processes. Experimental identification of interface residues is a time-consuming, costly and challenging task, while protein sequence data are ubiquitous. Consequently, many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence. However, the effectiveness of different Deep Learning (DL) architectures and learning strategies for protein-protein, protein-nucleotide and protein-small molecule interface prediction has not yet been investigated in great detail. Therefore, we here explore the prediction of protein interface residues using six DL architectures and various learning strategies with sequence-derived input features. RESULTS: We constructed a large dataset dubbed BioDL, comprising protein-protein interactions from the PDB, and DNA/RNA and small molecule interactions from the BioLip database. We also constructed six DL architectures, and evaluated them on the BioDL benchmarks. This shows that no single architecture performs best on all instances. An ensemble architecture, which combines all six architectures, does consistently achieve peak prediction accuracy. We confirmed these results on the published benchmark set by Zhang and Kurgan (ZK448), and on our own existing curated homo- and heteromeric protein interaction dataset. Our PIPENN sequence-based ensemble predictor outperforms current state-of-the-art sequence-based protein interface predictors on ZK448 on all interaction types, achieving an AUC-ROC of 0.718 for protein-protein, 0.823 for protein-nucleotide and 0.842 for protein-small molecule. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are available at https://github.com/ibivu/pipenn/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Machine Learning , Proteins , Proteins/chemistry , Software , Amino Acid Sequence , Nucleotides , Computational Biology/methods
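
Several abstracts in this list report AUC-ROC values for interface prediction. As a reminder of what that number measures, the following minimal sketch computes it as the probability that a randomly chosen positive residue scores higher than a randomly chosen negative one, with ties counted as one half. The function name `auc_roc` and the toy scores are illustrative assumptions, and the quadratic-time pairwise loop is chosen for clarity rather than speed.

```python
def auc_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of interface from non-interface residues.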
7.
Bioinformatics ; 37(20): 3421-3427, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33974039

ABSTRACT

MOTIVATION: Antibodies play an important role in clinical research and biotechnology, with their specificity determined by the interaction with the antigen's epitope region, a special type of protein-protein interaction (PPI) interface. The ubiquitous availability of sequence data allows us to predict epitopes from sequence in order to focus time-consuming wet-lab experiments toward the most promising epitope regions. Here, we extend our previously developed sequence-based predictors for homodimer and heterodimer PPI interfaces to predict epitope residues that have the potential to bind an antibody. RESULTS: We collected and curated a high-quality epitope dataset from the SAbDab database. Our generic PPI heterodimer predictor obtained an AUC-ROC of 0.666 when evaluated on the epitope test set. We then trained a random forest model specifically on the epitope dataset, reaching AUC 0.694. Further training on the combined heterodimer and epitope datasets improves our final predictor to AUC 0.703 on the epitope test set. This is better than the best state-of-the-art sequence-based epitope predictor BepiPred-2.0. On one solved antibody-antigen structure of the COVID-19 virus spike receptor binding domain, our predictor reaches AUC 0.778. We added the SeRenDIP-CE Conformational Epitope predictors to our webserver, which is simple to use and only requires a single antigen sequence as input; this will help make the method immediately applicable in a wide range of biomedical and biomolecular research. AVAILABILITY AND IMPLEMENTATION: Webserver, source code and datasets at www.ibi.vu.nl/programs/serendipwww/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

8.
BMC Mol Cell Biol ; 22(1): 23, 2021 Apr 23.
Article in English | MEDLINE | ID: mdl-33892639

ABSTRACT

BACKGROUND: The SARS-CoV-2 virus, the causative agent of COVID-19, consists of an assembly of proteins that determine its infectious and immunological behavior, as well as its response to therapeutics. Major structural biology efforts on these proteins have already provided essential insights into the mode of action of the virus, as well as avenues for structure-based drug design. However, not all of the SARS-CoV-2 proteins, or regions thereof, have a well-defined three-dimensional structure, and as such might exhibit ambiguous, dynamic behaviour that is not evident from static structure representations, nor from molecular dynamics simulations using these structures. MAIN: We present a website ( https://bio2byte.be/sars2/ ) that provides protein sequence-based predictions of the backbone and side-chain dynamics and conformational propensities of these proteins, as well as derived early folding, disorder, β-sheet aggregation, protein-protein interaction and epitope propensities. These predictions attempt to capture the inherent biophysical propensities encoded in the sequence, rather than context-dependent behaviour such as the final folded state. In addition, we provide the biophysical variation that is observed in homologous proteins, which gives an indication of the limits of their functionally relevant biophysical behaviour. CONCLUSION: The https://bio2byte.be/sars2/ website provides a range of protein sequence-based predictions for 27 SARS-CoV-2 proteins, enabling researchers to form hypotheses about their possible functional modes of action.


Subject(s)
SARS-CoV-2/chemistry , Viral Proteins/chemistry , Databases, Protein , Humans , Internet Access , Sequence Alignment , Sequence Analysis, Protein , Software , Viral Proteins/metabolism
9.
Health Inf Sci Syst ; 9(1): 3, 2021 Dec.
Article in English | MEDLINE | ID: mdl-33262885

ABSTRACT

Gut microbiota produce and modulate the production of neurotransmitters which have been implicated in mental disorders. Neurotransmitters may act as a 'matchmaker' between gut microbiota imbalance and mental disorders. Most of the relevant research effort goes into either the relationship between gut microbiota and neurotransmitters or that between neurotransmitters and mental disorders, while few studies collect and analyze the dispersed research results in systematic ways. We therefore gather the dispersed results of the existing studies into a structured knowledge base for identifying and predicting potential relationships between gut microbiota and mental disorders. In this study, we propose to construct a gut microbiota knowledge graph for mental disorders, named MiKG4MD. It can be extended by linking to future ontologies simply by adding new relationships between existing information and new entities. This extensibility is emphasized for the integration with existing popular ontologies/terminologies, e.g. UMLS, MeSH, and KEGG. We demonstrate the performance of MiKG4MD with three SPARQL query test cases. Results show that the MiKG4MD knowledge graph is an effective method to predict the relationships between gut microbiota and mental disorders.

10.
F1000Res ; 9, 2020.
Article in English | MEDLINE | ID: mdl-32566135

ABSTRACT

Structural bioinformatics provides the scientific methods and tools to analyse, archive, validate, and present the biomolecular structure data generated by the structural biology community. It also provides an important link with the genomics community, as structural bioinformaticians also use the extensive sequence data to predict protein structures and their functional sites. A very broad and active community of structural bioinformaticians exists across Europe, and 3D-Bioinfo will establish formal platforms to address their needs and better integrate their activities and initiatives. Our mission will be to strengthen the ties with the structural biology research communities in Europe covering life sciences, as well as chemistry and physics and to bridge the gap between these researchers in order to fully realize the potential of structural bioinformatics. Our Community will also undertake dedicated educational, training and outreach efforts to facilitate this, bringing new insights and thus facilitating the development of much needed innovative applications e.g. for human health, drug and protein design. Our combined efforts will be of critical importance to keep the European research efforts competitive in this respect. Here we highlight the major European contributions to the field of structural bioinformatics, the most pressing challenges remaining and how Europe-wide interactions, enabled by ELIXIR and its platforms, will help in addressing these challenges and in coordinating structural bioinformatics resources across Europe. In particular, we present recent activities and future plans to consolidate an ELIXIR 3D-Bioinfo Community in structural bioinformatics and propose means to develop better links across the community. These include building new consortia, organising workshops to establish data standards and seeking community agreement on benchmark data sets and strategies. We also highlight existing and planned collaborations with other ELIXIR Communities and other European infrastructures, such as the structural biology community supported by Instruct-ERIC, with whom we have synergies and overlapping common interests.


Subject(s)
Biological Science Disciplines , Computational Biology/organization & administration , Europe , Genomics , Humans , Proteins
11.
Bioinformatics ; 36(7): 2142-2149, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31845959

ABSTRACT

MOTIVATION: Genetic interaction (GI) patterns are characterized by the phenotypes of interacting single and double mutated gene pairs. Uncovering the regulatory mechanisms of GIs would provide a better understanding of their role in biological processes, diseases and drug response. Computational analyses can provide insights into the underpinning mechanisms of GIs. RESULTS: In this study, we present a framework for exhaustive modelling of GI patterns using Petri nets (PN). Four-node models were defined and generated on three levels with restrictions, to enable an exhaustive approach. Simulations suggest ∼5 million models of GIs. Generalizing these we propose putative mechanisms for the GI patterns, inversion and suppression. We demonstrate that exhaustive PN modelling enables reasoning about mechanisms of GIs when only the phenotypes of gene pairs are known. The framework can be applied to other GI or genetic regulatory datasets. AVAILABILITY AND IMPLEMENTATION: The framework is available at http://www.ibi.vu.nl/programs/ExhMod. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

12.
Oral Oncol ; 98: 8-12, 2019 11.
Article in English | MEDLINE | ID: mdl-31521885

ABSTRACT

In this era of information technology, big data analysis is entering biomedical sciences. But what are big data, where do they come from and what can we do with them? In this commentary, the main sources of big data are explained, especially in (head and neck) oncology. It also touches upon the need to integrate various sources of clinical, pathological and quality-of-life data. It discusses some initiatives in linking such datasets on a nation-wide scale in the Netherlands. Finally, it touches upon important issues regarding governance, FAIRness of data and the need to put in place the infrastructures needed to exploit the full potential of big data sets in head and neck cancer.


Subject(s)
Big Data , Medical Informatics/methods , Medical Oncology , Databases, Factual , Head and Neck Neoplasms/epidemiology , Humans , Information Dissemination , Medical Oncology/methods , Netherlands/epidemiology , Precision Medicine/methods , Quality of Health Care
13.
Bioinformatics ; 35(24): 5315-5317, 2019 12 15.
Article in English | MEDLINE | ID: mdl-31368486

ABSTRACT

SUMMARY: PRALINE 2 is a toolkit for custom multiple sequence alignment workflows. It can be used to incorporate sequence annotations, such as secondary structure or (DNA) motifs, into the alignment scoring, as well as to customize many other aspects of a progressive multiple alignment workflow. AVAILABILITY AND IMPLEMENTATION: PRALINE 2 is implemented in Python and available as open source software on GitHub: https://github.com/ibivu/PRALINE/.


Subject(s)
Software , DNA , Protein Structure, Secondary , Sequence Alignment
14.
PLoS Comput Biol ; 15(5): e1007061, 2019 05.
Article in English | MEDLINE | ID: mdl-31083661

ABSTRACT

Genetic interactions, a phenomenon whereby combinations of mutations lead to unexpected effects, reflect how cellular processes are wired and play an important role in complex genetic diseases. Understanding the molecular basis of genetic interactions is crucial for deciphering pathway organization as well as understanding the relationship between genetic variation and disease. Several hypothetical molecular mechanisms have been linked to different genetic interaction types. However, differences in genetic interaction patterns and their underlying mechanisms have not yet been compared systematically between different functional gene classes. Here, differences in the occurrence and types of genetic interactions are compared for two classes, gene-specific transcription factors (GSTFs) and signaling genes (kinases and phosphatases). Genome-wide gene expression data for 63 single and double deletion mutants in baker's yeast reveal that the two most common genetic interaction patterns are buffering and inversion. Buffering is typically associated with redundancy and is well understood. In inversion, genes show opposite behavior in the double mutant compared to the corresponding single mutants. The underlying mechanism is poorly understood. Although both classes show buffering and inversion patterns, the prevalence of inversion is much stronger in GSTFs. To decipher potential mechanisms, a Petri Net modeling approach was employed, where genes are represented as nodes and relationships between genes as edges. This allowed over 9 million possible three and four node models to be exhaustively enumerated. The models show that a quantitative difference in interaction strength is a strict requirement for obtaining inversion. In addition, this difference is frequently accompanied by a second gene that shows buffering. Taken together, these results provide a mechanistic explanation for inversion. Furthermore, the ability of transcription factors to differentially regulate expression of their targets provides a likely explanation why inversion is more prevalent for GSTFs compared to kinases and phosphatases.


Subject(s)
Gene Expression Regulation , Models, Genetic , Transcription Factors/metabolism , Chromosome Inversion , Computational Biology , Computer Simulation , Databases, Genetic , Epistasis, Genetic , Genes, Fungal , Genetic Association Studies , Mutation , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/growth & development , Saccharomyces cerevisiae/metabolism , Signal Transduction/genetics
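
For readers unfamiliar with Petri nets, the basic firing rule behind this kind of modelling can be sketched in a few lines: places hold tokens, and a transition may fire when its input places hold enough tokens, consuming and producing tokens as it does so. The sketch below is a generic illustration and not the model from the paper: the dictionary layout, the function names `enabled` and `fire`, and the toy "regulator activates target" transition are all assumptions.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= count
               for place, count in transition["consume"].items())


def fire(marking, transition):
    """Return the new marking after firing; the input marking is left unchanged."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, count in transition["consume"].items():
        new_marking[place] = new_marking.get(place, 0) - count
    for place, count in transition["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + count
    return new_marking


# Hypothetical toy edge: a regulator that is not used up (consumed and
# produced again) and that adds two tokens to its target per firing.
activation = {"consume": {"regulator": 1},
              "produce": {"regulator": 1, "target": 2}}
print(fire({"regulator": 1, "target": 0}, activation))
```

In such a sketch, the number of tokens produced per firing could stand in for interaction strength, which is the kind of quantitative difference the abstract identifies as necessary for inversion; how the paper actually encodes strength is not stated here.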
15.
Bioinformatics ; 35(22): 4794-4796, 2019 11 01.
Article in English | MEDLINE | ID: mdl-31116381

ABSTRACT

MOTIVATION: Interpretation of ubiquitous protein sequence data has become a bottleneck in biomolecular research, due to a lack of structural and other experimental annotation data for these proteins. Prediction of protein interaction sites from sequence may be a viable substitute. We therefore recently developed a sequence-based random forest method for protein-protein interface prediction, which yielded significantly higher performance than other methods on both homomeric and heteromeric protein-protein interactions. Here, we present a webserver that implements this method efficiently. RESULTS: With the aim of accelerating our previous approach, we obtained sequence conservation profiles by re-mastering the alignment of homologous sequences found by PSI-BLAST. This yielded a more than 10-fold speedup and at least the same accuracy as reported previously for our method; these results allowed us to offer the method as a webserver. The webserver interface is targeted to the non-expert user. The input is simply a sequence of the protein of interest, and the output is a table with scores indicating the likelihood of having an interaction interface at a certain position. As the method is sequence-based and not sensitive to the type of protein interaction, we expect this webserver to be of interest to many biological researchers in academia and in industry. AVAILABILITY AND IMPLEMENTATION: Webserver, source code and datasets are available at www.ibi.vu.nl/programs/serendipwww/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software , Algorithms , Amino Acid Sequence , Proteins , Sequence Analysis, Protein
16.
PLoS Comput Biol ; 14(11): e1006547, 2018 11.
Article in English | MEDLINE | ID: mdl-30383764

ABSTRACT

Protein or DNA motifs are sequence regions which possess biological importance. These regions are often highly conserved among homologous sequences. The generation of multiple sequence alignments (MSAs) with a correct alignment of the conserved sequence motifs is still difficult to achieve, due to the fact that the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm in order to address this shortcoming. This method can incorporate explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments we confirm that a good compromise can be found that improves the matching of motif regions while not significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments we find there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and how the motif-aware alignment method can be employed to alleviate these problems.


Subject(s)
Amino Acid Motifs , DNA/chemistry , Proteins/chemistry , Sequence Alignment/standards , Algorithms , Amino Acid Sequence , Conserved Sequence , HIV-1/chemistry , Sequence Homology, Amino Acid , env Gene Products, Human Immunodeficiency Virus/chemistry
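
The motif boost described in the abstract above can be illustrated with a pairwise dynamic programming sketch. It is a simplified stand-in and not the PRALINE algorithm: it aligns two sequences instead of many, uses a plain match/mismatch score instead of an amino acid substitution matrix, and assumes that the boost α is simply added when two identical residues that are both annotated as motif positions are aligned. The function name `motif_aware_score` and all default scores are hypothetical.

```python
def motif_aware_score(seq_a, seq_b, motif_a, motif_b, alpha=1.0,
                      match=2.0, mismatch=-1.0, gap=-2.0):
    """Global (Needleman-Wunsch) alignment score with a motif bonus.

    motif_a and motif_b are sets of 0-based positions annotated as motif
    in seq_a and seq_b. alpha controls the strength of the boost.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            sub = match if same else mismatch
            # Assumed boost: reward aligning matching motif residues.
            if same and (i - 1) in motif_a and (j - 1) in motif_b:
                sub += alpha
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # align the two residues
                           dp[i - 1][j] + gap,       # gap in seq_b
                           dp[i][j - 1] + gap)       # gap in seq_a
    return dp[-1][-1]
```

Setting `alpha=0.0` recovers the ordinary alignment score, so the parameter can be tuned to trade motif matching against overall alignment quality, which is the compromise the abstract describes.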
17.
Antiviral Res ; 158: 213-225, 2018 10.
Article in English | MEDLINE | ID: mdl-30121196

ABSTRACT

BACKGROUND: We aimed to identify HBc amino acid differences between subgroups of chronic hepatitis B (CHB) patients. METHODS: Deep sequencing of HBc was performed in samples of 89 CHB patients (42 HBeAg positive, 47 HBeAg negative). Amino acid types were compared using Sequence Harmony to identify subgroup-specific sites between HBeAg-positive and -negative patients, and between patients with combined response and non-response to peginterferon/adefovir combination therapy. RESULTS: We identified 54 positions in HBc where the frequency of appearing amino acids was significantly different between HBeAg-positive and -negative patients. In HBeAg-negative patients, 22 positions in HBc were identified which differed between patients with treatment response and those with non-response. The fraction of non-consensus sequence at selected positions was significantly higher in HBeAg-negative patients, and was negatively correlated with HBV DNA and HBsAg levels. CONCLUSIONS: Sequence Harmony identified a number of amino acid changes associated with HBeAg status and response to peginterferon/adefovir combination therapy.


Subject(s)
Hepatitis B virus/genetics , Hepatitis B, Chronic/virology , High-Throughput Nucleotide Sequencing/methods , Viral Core Proteins/genetics , Adenine/analogs & derivatives , Adenine/therapeutic use , Adult , Antiviral Agents/therapeutic use , DNA, Viral , Drug Therapy, Combination , Female , Hepatitis B Surface Antigens , Hepatitis B e Antigens , Hepatitis B, Chronic/drug therapy , Humans , Interferon-alpha/therapeutic use , Linear Models , Male , Middle Aged , Models, Molecular , Organophosphonates/therapeutic use , Polyethylene Glycols/therapeutic use , Protein Conformation , Recombinant Proteins/therapeutic use , Sequence Alignment , Sequence Analysis, Protein , Sequence Homology , Viral Core Proteins/chemistry
18.
Bioinformatics ; 34(13): i4-i12, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29950011

ABSTRACT

Motivation: Our society has become data-rich to the extent that research in many areas has become impossible without computational approaches. Educational programmes seem to be lagging behind this development. At the same time, there is a growing need not only for strong data science skills, but foremost for the ability to translate between tools and methods on the one hand, and applications and problems on the other. Results: Here we present our experiences with shaping and running a master's programme in bioinformatics and systems biology in Amsterdam. From this, we have developed a comprehensive philosophy on how translation in training may be achieved in a dynamic and multidisciplinary research area, which is described here. We furthermore describe two requirements that enable translation, which we have found to be crucial: sufficient depth and focus on multidisciplinary topic areas, coupled with a balanced breadth from adjacent disciplines. Finally, we present concrete suggestions on how this may be implemented in practice, which may be relevant for the effectiveness of life science and data science curricula in general, and of particular interest to those who are in the process of setting up such curricula. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/education , Curriculum , Data Science/education , Humans
19.
Sci Rep ; 8(1): 7522, 2018 05 14.
Article in English | MEDLINE | ID: mdl-29760449

ABSTRACT

Hyperactivation of Wnt and Ras-MAPK signalling are common events in development of colorectal adenomas. Further progression from adenoma-to-carcinoma is frequently associated with 20q gain and overexpression of Aurora kinase A (AURKA). Interestingly, AURKA has been shown to further enhance Wnt and Ras-MAPK signalling. However, the molecular details of these interactions in driving colorectal carcinogenesis remain poorly understood. Here we first performed differential expression analysis (DEA) of AURKA knockdown in two colorectal cancer (CRC) cell lines with 20q gain and AURKA overexpression. Next, using an exact algorithm, Heinz, we computed the largest connected protein-protein interaction (PPI) network module of significantly deregulated genes in the two CRC cell lines. The DEA and the Heinz analyses suggest 20 Wnt and Ras-MAPK signalling genes being deregulated by AURKA, whereof β-catenin and KRAS occurred in both cell lines. Finally, shortest path analysis over the PPI network revealed eight 'connecting genes' between AURKA and these Wnt and Ras-MAPK signalling genes, of which UBE2D1, DICER1, CDK6 and RACGAP1 occurred in both cell lines. This study, first, confirms that AURKA influences deregulation of Wnt and Ras-MAPK signalling genes, and second, suggests mechanisms in CRC cell lines describing these interactions.


Subject(s)
Aurora Kinase A/genetics , Aurora Kinase A/metabolism , Colorectal Neoplasms/metabolism , Gene Expression Profiling/methods , Gene Regulatory Networks , Algorithms , Caco-2 Cells , Cell Line, Tumor , Chromosomes, Human, Pair 20/genetics , Colorectal Neoplasms/genetics , Gene Expression Regulation, Neoplastic , Gene Knockdown Techniques , Humans , MAP Kinase Signaling System , Protein Interaction Maps , Wnt Signaling Pathway , ras Proteins/metabolism
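
The shortest path analysis mentioned in the abstract above can be sketched with a breadth-first search over an unweighted interaction network; the intermediate nodes on the returned path play the role of 'connecting genes'. This is a minimal illustration, not the study's pipeline: the function name `shortest_path`, the adjacency-dictionary layout, and the placeholder node names in the usage example are assumptions, and no real interaction data are implied.

```python
from collections import deque


def shortest_path(network, source, target):
    """Breadth-first search for one shortest path in an unweighted network.

    network maps each node to an iterable of its neighbours (for an
    undirected network, list every edge in both directions).
    Returns the path as a list of nodes, or None if target is unreachable.
    """
    if source == target:
        return [source]
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in network.get(node, ()):
            if neighbour in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbour] = node
            if neighbour == target:
                # Walk back through the predecessors to rebuild the path.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical toy network with placeholder node names.
toy = {
    "kinase": ["link1", "link2"],
    "link1": ["kinase", "pathway_gene"],
    "link2": ["kinase", "link3"],
    "link3": ["link2", "pathway_gene"],
    "pathway_gene": ["link1", "link3"],
}
print(shortest_path(toy, "kinase", "pathway_gene"))
```

Breadth-first search returns only one of possibly several equally short paths; an analysis that needs every connecting gene would have to enumerate all shortest paths instead.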
20.
Bioinformatics ; 33(10): 1479-1487, 2017 May 15.
Article in English | MEDLINE | ID: mdl-28073761

ABSTRACT

MOTIVATION: Genome sequencing is producing an ever-increasing amount of associated protein sequences. Few of these sequences have experimentally validated annotations, however, and computational predictions are becoming increasingly successful in producing such annotations. One key challenge remains the prediction of the amino acids in a given protein sequence that are involved in protein-protein interactions. Such predictions are typically based on machine learning methods that take advantage of the properties and sequence positions of amino acids that are known to be involved in interaction. In this paper, we evaluate the importance of various features using Random Forest (RF), and include as a novel feature backbone flexibility predicted from sequences to further optimise protein interface prediction. RESULTS: We observe that there is no single sequence feature that enables pinpointing interacting sites in our Random Forest models. However, combining different properties does increase the performance of interface prediction. Our homomeric-trained RF interface predictor is able to distinguish interface from non-interface residues with an area under the ROC curve of 0.72 in a homomeric test-set. The heteromeric-trained RF interface predictor performs better than existing predictors on an independent heteromeric test-set. We trained a more general predictor on the combined homomeric and heteromeric dataset, and show that in addition to predicting homomeric interfaces, it is also able to pinpoint interface residues in heterodimers. This suggests that our random forest model and the features included capture common properties of both homodimer and heterodimer interfaces. AVAILABILITY AND IMPLEMENTATION: The predictors and test datasets used in our analyses are freely available ( http://www.ibi.vu.nl/downloads/RF_PPI/ ). CONTACT: k.a.feenstra@vu.nl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Models, Statistical , Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Protein Multimerization , Computational Biology/methods , ROC Curve , Sequence Analysis, Protein/methods