Results 1 - 18 of 18
1.
Chem Res Toxicol ; 36(8): 1238-1247, 2023 08 21.
Article in English | MEDLINE | ID: mdl-37556769

ABSTRACT

Drug-induced liver injury (DILI) is an important safety concern and a major reason to remove a drug from the market. Recent advances in machine learning have led to a wide range of in silico DILI prediction models based on molecular chemical structures (fingerprints). Existing publicly available DILI data sets used for model building are based on the interpretation of drug labels or patient case reports, resulting in a typical binary clinical DILI annotation. We developed a novel phenotype-based annotation to process hepatotoxicity information extracted from repeated dose in vivo preclinical toxicology studies using INHAND annotation to provide a more informative and reliable data set for machine learning algorithms. This work resulted in a data set of 430 unique compounds covering diverse liver pathology findings which were utilized to develop multiple DILI prediction models trained on the publicly available data (TG-GATEs) using the compound's fingerprint. We demonstrate that the DILI labels of the TG-GATEs compounds can be predicted well and show how the differences between TG-GATEs and the external test compounds (Johnson & Johnson) impact the model generalization performance.


Subject(s)
Chemical and Drug Induced Liver Injury , Drug-Related Side Effects and Adverse Reactions , Humans , Algorithms , Machine Learning , Computer Simulation
2.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36477794

ABSTRACT

MOTIVATION: T cells use T cell receptors (TCRs) to recognize small parts of antigens, called epitopes, presented by major histocompatibility complexes. Once an epitope is recognized, an immune response is initiated and T cell activation and proliferation by clonal expansion begin. Clonal populations of T cells with identical TCRs can remain in the body for years, thus forming immunological memory and potentially mappable immunological signatures, which could have implications in clinical applications including infectious diseases, autoimmunity and tumor immunology. RESULTS: We introduce TCRconv, a deep learning model for predicting recognition between TCRs and epitopes. TCRconv uses a deep protein language model and convolutions to extract contextualized motifs and provides state-of-the-art TCR-epitope prediction accuracy. Using TCR repertoires from COVID-19 patients, we demonstrate that TCRconv can provide insight into T cell dynamics and phenotypes during the disease. AVAILABILITY AND IMPLEMENTATION: TCRconv is available at https://github.com/emmijokinen/tcrconv. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
COVID-19 , Humans , Epitopes , Receptors, Antigen, T-Cell , T-Lymphocytes , Antigens , Epitopes, T-Lymphocyte
3.
J Cheminform ; 14(1): 86, 2022 Dec 28.
Article in English | MEDLINE | ID: mdl-36578043

ABSTRACT

A de novo molecular design workflow can be used together with technologies such as reinforcement learning to navigate the chemical space. A bottleneck in the workflow that remains to be solved is how to integrate human feedback in the exploration of the chemical space to optimize molecules. A human drug designer still needs to design the goal, expressed as a scoring function for the molecules that captures the designer's implicit knowledge about the optimization task. Little support for this task exists and, consequently, a chemist usually resorts to iteratively building the objective function of multi-parameter optimization (MPO) in de novo design. We propose a principled approach to use human-in-the-loop machine learning to help the chemist to adapt the MPO scoring function to better match their goal. An advantage is that the method can learn the scoring function directly from the user's feedback while they browse the output of the molecule generator, instead of the current manual tuning of the scoring function with trial and error. The proposed method uses a probabilistic model that captures the user's idea and uncertainty about the scoring function, and it uses active learning to interact with the user. We present two case studies for this: In the first use-case, the parameters of an MPO are learned, and in the second use-case a non-parametric component of the scoring function to capture human domain knowledge is developed. The results show the effectiveness of the methods in two simulated example cases with an oracle, achieving significant improvement in less than 200 feedback queries, for the goals of a high QED score and identifying potent molecules for the DRD2 receptor, respectively. We further demonstrate the performance gains with a medicinal chemist interacting with the system.

4.
BMC Bioinformatics ; 23(1): 212, 2022 Jun 03.
Article in English | MEDLINE | ID: mdl-35659235

ABSTRACT

BACKGROUND: Transcription factors (TFs) bind regulatory DNA regions with sequence specificity, form complexes and regulate gene expression. In cooperative TF-TF binding, two transcription factors bind onto a shared DNA binding site as a pair. Previous work has demonstrated pairwise TF-TF-DNA interactions with position weight matrices (PWMs), which may however not sufficiently take into account the complexity and flexibility of pairwise binding. RESULTS: We propose two random forest (RF) methods for joint TF-TF binding site prediction: ComBind and JointRF. We train models with previously published large-scale CAP-SELEX DNA libraries, which comprise DNA sequences enriched for binding of a selected TF pair. JointRF builds a random forest with sub-sequences selected from CAP-SELEX DNA reads with previously proposed pairwise PWM. JointRF outperforms (area under receiver operating characteristics curve, AUROC, 0.75) the current state-of-the-art method i.e. orientation and spacing specific pairwise PWMs (AUROC 0.59). Thus, JointRF may be utilized to improve prediction accuracy for pre-determined binding preferences. However, pairwise TF binding is currently considered flexible; a pair may bind DNA with different orientations and amounts of dinucleotide gaps or overlap between the two motifs. Thus, we developed ComBind, which utilizes random forests by considering simultaneously multiple orientations and spacings of the two factors. Our approach outperforms (AUROC 0.78) PWMs, as well as JointRF (p<0.00195). ComBind provides an approach for predicting TF-TF binding sites without prior knowledge on pairwise binding preferences. However, more research is needed to assess ComBind eligibility for practical applications. CONCLUSIONS: Random forest is well suited for modeling pairwise TF-TF-DNA binding specificities, and ComBind provides an improvement to pairwise binding site prediction accuracy.
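
The AUROC values quoted in this abstract can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition; the function name and the scores are hypothetical illustrations and are not taken from the study.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores, for illustration only (not data from the paper).
bound = [0.9, 0.7, 0.4]          # sequences labelled as bound by the TF pair
unbound = [0.8, 0.3, 0.2, 0.1]   # background sequences
print(auroc(bound, unbound))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.59, 0.75 and 0.78 should be read.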


Subject(s)
DNA , Transcription Factors , Binding Sites/genetics , DNA/genetics , Position-Specific Scoring Matrices , Protein Binding , Transcription Factors/metabolism
5.
Vox Sang ; 117(4): 504-512, 2022 Apr.
Article in English | MEDLINE | ID: mdl-34825380

ABSTRACT

BACKGROUND AND OBJECTIVES: Deferral of blood donors due to low haemoglobin (Hb) is demotivating to donors, can be a sign of developing anaemia and incurs costs for blood establishments. The prediction of Hb deferral has been shown to be feasible in a number of studies based on demographic, Hb measurement and donation history data. The aim of this paper is to evaluate how state-of-the-art computational prediction tools can facilitate nationwide personalized donation intervals. MATERIALS AND METHODS: Using donation history data from the last 20 years in Finland, FinDonor blood donor cohort data and blood service Biobank genotyping data, we built linear and non-linear predictors of Hb deferral. Based on financial data from the Finnish Red Cross Blood Service, we then estimated the economic impacts of deploying such predictors. RESULTS: We discovered that while linear predictors generally predict Hb relatively well, they have difficulties in predicting low Hb values. Overall, we found that non-linear or linear predictors with or without genetic data performed only slightly better than a simple cutoff based on previous Hb. However, if any of our deferral prediction methods are used to assign temporary prolongations of donation intervals for females, then our calculations indicate cost savings while maintaining the blood supply. CONCLUSION: We find that even though the prediction accuracy is not very high, the actual use of any of our predictors in blood collection is still likely to bring benefits to blood donors and blood establishments alike.


Subject(s)
Anemia , Hematologic Diseases , Blood Donors , Female , Hematologic Tests , Hemoglobins/analysis , Hemoglobins/genetics , Humans
6.
PLoS Comput Biol ; 17(3): e1008814, 2021 03.
Article in English | MEDLINE | ID: mdl-33764977

ABSTRACT

The adaptive immune system uses T cell receptors (TCRs) to recognize pathogens and to consequently initiate immune responses. TCRs can be sequenced from individuals and methods analyzing the specificity of the TCRs can help us better understand individuals' immune status in different disorders. For this task, we have developed TCRGP, a novel Gaussian process method that predicts if TCRs recognize specified epitopes. TCRGP can utilize the amino acid sequences of the complementarity determining regions (CDRs) from TCRα and TCRβ chains and learn which CDRs are important in recognizing different epitopes. Our comprehensive evaluation with epitope-specific TCR sequencing data shows that TCRGP achieves on average higher prediction accuracy in terms of AUROC score than existing state-of-the-art methods in epitope-specificity predictions. We also propose a novel analysis approach for combined single-cell RNA and TCRαβ (scRNA+TCRαβ) sequencing data by quantifying epitope-specific TCRs with TCRGP and identify HBV-epitope specific T cells and their transcriptomic states in hepatocellular carcinoma patients.


Subject(s)
Computational Biology/methods , Epitopes, T-Lymphocyte , Receptors, Antigen, T-Cell , Sequence Analysis, Protein/methods , Amino Acid Sequence , Complementarity Determining Regions , Epitopes, T-Lymphocyte/chemistry , Epitopes, T-Lymphocyte/genetics , Epitopes, T-Lymphocyte/metabolism , Humans , Normal Distribution , Receptors, Antigen, T-Cell/chemistry , Receptors, Antigen, T-Cell/genetics , Receptors, Antigen, T-Cell/metabolism
7.
Appl Microbiol Biotechnol ; 104(24): 10515-10529, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33147349

ABSTRACT

In this work, deoxyribose-5-phosphate aldolase (Ec DERA, EC 4.1.2.4) from Escherichia coli was chosen as the protein engineering target for improving the substrate preference towards smaller, non-phosphorylated aldehyde donor substrates, in particular towards acetaldehyde. The initial broad set of mutations was directed to 24 amino acid positions in the active site or in the close vicinity, based on the 3D complex structure of the E. coli DERA wild-type aldolase. The specific activity of the DERA variants containing one to three amino acid mutations was characterised using three different substrates. A novel machine learning (ML) model utilising Gaussian processes and feature learning was applied for the 3rd mutagenesis round to predict new beneficial mutant combinations. This led to the most clear-cut (two- to threefold) improvement in acetaldehyde (C2) addition capability with the concomitant abolishment of the activity towards the natural donor molecule glyceraldehyde-3-phosphate (C3P) as well as the non-phosphorylated equivalent (C3). The Ec DERA variants were also tested on aldol reaction utilising formaldehyde (C1) as the donor. Ec DERA wild-type was shown to be able to carry out this reaction, and furthermore, some of the improved variants on acetaldehyde addition reaction turned out to have also improved activity on formaldehyde. KEY POINTS: • DERA aldolases are promiscuous enzymes. • Synthetic utility of DERA aldolase was improved by protein engineering approaches. • Machine learning methods aid the protein engineering of DERA.


Subject(s)
Escherichia coli , Fructose-Bisphosphate Aldolase , Aldehyde-Lyases/genetics , Aldehyde-Lyases/metabolism , Escherichia coli/genetics , Escherichia coli/metabolism , Fructose-Bisphosphate Aldolase/genetics , Machine Learning , Protein Engineering , Substrate Specificity
8.
Bioinformatics ; 35(14): i548-i557, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510676

ABSTRACT

MOTIVATION: Metabolic flux balance analysis (FBA) is a standard tool in analyzing metabolic reaction rates compatible with measurements, steady-state and the metabolic reaction network stoichiometry. Flux analysis methods commonly place model assumptions on fluxes due to the convenience of formulating the problem as a linear programming model, while many methods do not consider the inherent uncertainty in flux estimates. RESULTS: We introduce a novel paradigm of Bayesian metabolic flux analysis that models the reactions of the whole genome-scale cellular system in probabilistic terms, and can infer the full flux vector distribution of genome-scale metabolic systems based on exchange and intracellular (e.g. 13C) flux measurements, steady-state assumptions, and objective function assumptions. The Bayesian model couples all fluxes jointly in a simple truncated multivariate posterior distribution, which reveals informative flux couplings. Our model is a plug-in replacement for conventional metabolic balance methods, such as FBA. Our experiments indicate that we can characterize the genome-scale flux covariances, reveal flux couplings, and determine more intracellular unobserved fluxes in Clostridium acetobutylicum from 13C data than flux variability analysis. AVAILABILITY AND IMPLEMENTATION: The COBRA compatible software is available at github.com/markusheinonen/bamfa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
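
For orientation, the conventional FBA problem that this abstract contrasts against is usually written as a linear program. The notation below is the common textbook form and is an assumption here, not the paper's own: S is the stoichiometric matrix, v the flux vector, c the objective weights, and l, u the flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S v = 0 \qquad \text{(steady state)} \\
& l \le v \le u \qquad \text{(flux bounds)}
\end{aligned}
```

Solving this program returns a single optimal flux vector, whereas the Bayesian approach described in the abstract infers a distribution over flux vectors.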


Subject(s)
Clostridium acetobutylicum , Metabolic Flux Analysis , Bayes Theorem , Metabolic Networks and Pathways , Models, Biological
9.
PLoS One ; 13(10): e0204960, 2018.
Article in English | MEDLINE | ID: mdl-30281653

ABSTRACT

The vascular endothelium is considered a key cell compartment for the response to ionizing radiation of normal tissues and tumors, and a promising target to improve the differential effect of radiotherapy in the future. Following radiation exposure, the global endothelial cell response covers a wide range of gene, miRNA, protein and metabolite expression modifications. Changes occur at the transcriptional, translational and post-translational levels and impact cell phenotype as well as the microenvironment by the production and secretion of soluble factors such as reactive oxygen species, chemokines, cytokines and growth factors. These radiation-induced dynamic modifications of molecular networks may control the endothelial cell phenotype and govern recruitment of immune cells, stressing the importance of clearly understanding the mechanisms which underlie these temporal processes. A wide variety of time series data is commonly used in bioinformatics studies, including gene expression, protein concentrations and metabolomics data. How to cluster such data is still an open problem. Here, we introduce kernels between Gaussian processes modeling time series, and subsequently introduce a spectral clustering algorithm. We apply the methods to the study of human primary endothelial cells (HUVECs) exposed to a radiotherapy dose fraction (2 Gy). Time windows of differential expressions of 301 genes involved in key cellular processes such as angiogenesis, inflammation, apoptosis, immune response and protein kinase were determined from 12 hours to 3 weeks post-irradiation. Then, 43 temporal clusters corresponding to profiles of similar expressions, including 49 genes out of 301 initially measured, were generated according to the proposed method. Forty-seven transcription factors (TFs) responsible for the expression of clusters of genes were predicted from sequence regulatory elements using the MotifMap system. Their temporal profiles of occurrences were established and clustered. Dynamic network interactions and molecular pathways of TFs and differential genes were finally explored, revealing key node genes and putative important cellular processes involved in tissue infiltration by immune cells following exposure to a radiotherapy dose fraction.


Subject(s)
Dose Fractionation, Radiation , Endothelial Cells/metabolism , Endothelial Cells/radiation effects , Transcriptome/radiation effects , Cluster Analysis , Humans , Multigene Family , Normal Distribution , Phenotype , Time Factors , Transcription Factors/metabolism
10.
Bioinformatics ; 34(13): i509-i518, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949975

ABSTRACT

Motivation: Many inference problems in bioinformatics, including drug bioactivity prediction, can be formulated as pairwise learning problems, in which one is interested in making predictions for pairs of objects, e.g. drugs and their targets. Kernel-based approaches have emerged as powerful tools for solving problems of that kind, and especially multiple kernel learning (MKL) offers promising benefits as it enables integrating various types of complex biomedical information sources in the form of kernels, along with learning their importance for the prediction task. However, the immense size of pairwise kernel spaces remains a major bottleneck, making the existing MKL algorithms computationally infeasible even for a small number of input pairs. Results: We introduce pairwiseMKL, the first method for time- and memory-efficient learning with multiple pairwise kernels. pairwiseMKL first determines the mixture weights of the input pairwise kernels, and then learns the pairwise prediction function. Both steps are performed efficiently without explicit computation of the massive pairwise matrices, therefore making the method applicable to solving large pairwise learning problems. We demonstrate the performance of pairwiseMKL in two related tasks of quantitative drug bioactivity prediction using up to 167 995 bioactivity measurements and 3120 pairwise kernels: (i) prediction of anticancer efficacy of drug compounds across a large panel of cancer cell lines; and (ii) prediction of target profiles of anticancer compounds across their kinome-wide target spaces. We show that pairwiseMKL provides accurate predictions using sparse solutions in terms of selected kernels, and therefore it also automatically identifies data sources relevant for the prediction problem. Availability and implementation: Code is available at https://github.com/aalto-ics-kepaco. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Antineoplastic Agents/pharmacology , Computational Biology/methods , Drug Discovery/methods , Neoplasms/drug therapy , Support Vector Machine , Antineoplastic Agents/therapeutic use , Cell Line, Tumor , Humans , Neoplasms/enzymology , Neoplasms/metabolism , Protein Kinases/drug effects , Protein Kinases/metabolism , Signal Transduction , Software , Treatment Outcome
11.
Bioinformatics ; 34(13): i274-i283, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949987

ABSTRACT

Motivation: Proteins are commonly used by the biochemical industry for numerous processes. Refining these proteins' properties via mutations causes stability effects as well. An accurate computational method to predict how mutations affect protein stability is necessary to facilitate efficient protein design. However, the accuracy of predictive models is ultimately constrained by the limited availability of experimental data. Results: We have developed mGPfusion, a novel Gaussian process (GP) method for predicting a protein's stability changes upon single and multiple mutations. This method complements the limited experimental data with large amounts of molecular simulation data. We introduce a Bayesian data fusion model that re-calibrates the experimental and in silico data sources and then learns a predictive GP model from the combined data. Our protein-specific model requires experimental data only regarding the protein of interest and performs well even with few experimental measurements. mGPfusion models proteins by contact maps and infers the stability effects caused by mutations with a mixture of graph kernels. Our results show that mGPfusion outperforms state-of-the-art methods in predicting protein stability on a dataset of 15 different proteins and that incorporating molecular simulation data improves the model learning and prediction accuracy. Availability and implementation: Software implementation and datasets are available at github.com/emmijokinen/mgpfusion. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology , Protein Stability , Proteins , Software , Bayes Theorem , Computational Biology/methods , Mutation/genetics , Proteins/chemistry , Proteins/genetics
12.
J Phys Chem B ; 122(21): 5389-5399, 2018 05 31.
Article in English | MEDLINE | ID: mdl-29401388

ABSTRACT

Computationally modeling changes in binding free energies upon mutation (interface ΔΔG) allows large-scale prediction and perturbation of protein-protein interactions. Additionally, methods that consider and sample relevant conformational plasticity should be able to achieve higher prediction accuracy over methods that do not. To test this hypothesis, we developed a method within the Rosetta macromolecular modeling suite (flex ddG) that samples conformational diversity using "backrub" to generate an ensemble of models and then applies torsion minimization, side chain repacking, and averaging across this ensemble to estimate interface ΔΔG values. We tested our method on a curated benchmark set of 1240 mutants, and found the method outperformed existing methods that sampled conformational space to a lesser degree. We observed considerable improvements with flex ddG over existing methods on the subset of small side chain to large side chain mutations, as well as for multiple simultaneous non-alanine mutations, stabilizing mutations, and mutations in antibody-antigen interfaces. Finally, we applied a generalized additive model (GAM) approach to the Rosetta energy function; the resulting nonlinear reweighting model improved the agreement with experimentally determined interface ΔΔG values but also highlighted the necessity of future energy function improvements.
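
As a reading aid, the quantity estimated in this abstract can be written out using the standard definition of a change in binding free energy upon mutation. The symbols below (binding partners A and B, ensemble size N, per-model index i) are assumptions introduced for illustration; the averaging line is one plain reading of "averaging across this ensemble", and the exact protocol is defined in the paper.

```latex
\begin{aligned}
\Delta G_{\text{bind}} &= G_{\text{complex}} - G_{A} - G_{B} \\
\Delta\Delta G &= \Delta G_{\text{bind}}^{\text{mut}} - \Delta G_{\text{bind}}^{\text{wt}} \\
\Delta\Delta G_{\text{ensemble}} &\approx \frac{1}{N} \sum_{i=1}^{N}
  \left( \Delta G_{\text{bind},i}^{\text{mut}} - \Delta G_{\text{bind},i}^{\text{wt}} \right)
\end{aligned}
```

Under this sign convention, a positive value means the mutation weakens binding and a negative value means it strengthens it.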


Subject(s)
Models, Molecular , Proteins/chemistry , Antigen-Antibody Complex , Entropy , Monte Carlo Method , Mutagenesis , Protein Binding , Protein Interaction Domains and Motifs , Proteins/genetics , Proteins/metabolism , Static Electricity
13.
Biotechnol Biofuels ; 9: 132, 2016.
Article in English | MEDLINE | ID: mdl-27354857

ABSTRACT

BACKGROUND: The filamentous fungus Trichoderma reesei (teleomorph Hypocrea jecorina) is a widely used industrial host organism for protein production. In industrial cultivations, it can produce over 100 g/l of extracellular protein, mostly consisting of cellulases and hemicellulases. In order to improve protein production of T. reesei the transcriptional regulation of cellulases and secretory pathway factors has been extensively studied. However, the metabolism of T. reesei under protein production conditions has not received much attention. RESULTS: To understand the physiology and metabolism of T. reesei under protein production conditions we carried out a well-controlled bioreactor experiment with extensive analysis. We used minimal media to make the data amenable for modelling and three strain pairs to cover different protein production levels. With RNA-sequencing transcriptomics we detected the concentration of the carbon source as the most important determinant of the transcriptome. As the major transcriptional response concomitant to protein production we detected the induction of selected genes that were putatively regulated by xyr1 and were related to protein transport, amino acid metabolism and transcriptional regulation. We found novel metabolic responses such as production of glycerol and a cellotriose-like compound. We then used this cultivation data for flux balance analysis of T. reesei metabolism and demonstrate for the first time the use of genome wide stoichiometric metabolic modelling for T. reesei. We show that our model can predict protein production rate and provides novel insight into the metabolism of protein production. We also provide this unprecedented cultivation and transcriptomics data set for future modelling efforts. CONCLUSIONS: The use of stoichiometric modelling can open a novel path for the improvement of protein production in T. reesei. Based on this we propose sulphur assimilation as a major limiting factor of protein production. As an organism with exceptional protein production capabilities, modelling of T. reesei can also provide novel insight into other, less productive organisms.

14.
Bioinformatics ; 31(5): 728-35, 2015 Mar 01.
Article in English | MEDLINE | ID: mdl-25355790

ABSTRACT

MOTIVATION: Identifying the set of genes differentially expressed along time is an important task in two-sample time course experiments. Furthermore, estimating at which time periods the differential expression is present can provide additional insight into temporal gene functions. The current differential detection methods are designed to detect differences along observation time intervals or on single measurement points, warranting dense measurements along time to characterize the full temporal differential expression patterns. RESULTS: We propose a novel Bayesian likelihood ratio test to estimate the differential expression time periods. Applying the ratio test to systems of genes provides the temporal response timings and durations of gene expression to a biological condition. We introduce a novel non-stationary Gaussian process as the underlying expression model, with major improvements on model fitness on perturbation and stress experiments. The method is robust to uneven or sparse measurements along time. We assess the performance of the method on a realistically simulated dataset and compare against state-of-the-art methods. We additionally apply the method to the analysis of primary human endothelial cells under an ionizing radiation stress to study the transcriptional perturbations over 283 measured genes in an attempt to better understand the role of endothelium in both normal and cancer tissues during radiotherapy. As a result, using the cascade of differential expression periods, domain literature and gene enrichment analysis, we gain insights into the dynamic response of endothelial cells to irradiation. AVAILABILITY AND IMPLEMENTATION: R package 'nsgp' is available at www.ibisc.fr/en/logiciels_arobas.


Subject(s)
Gene Expression Profiling/methods , Gene Expression Regulation , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis/methods , Radiotherapy , Bayes Theorem , Cells, Cultured , Dose-Response Relationship, Radiation , Human Umbilical Vein Endothelial Cells/metabolism , Human Umbilical Vein Endothelial Cells/radiation effects , Humans , Neoplasms/radiotherapy , Normal Distribution , Time Factors
15.
Metabolites ; 3(2): 484-505, 2013 Jun 06.
Article in English | MEDLINE | ID: mdl-24958002

ABSTRACT

Metabolite identification is a major bottleneck in metabolomics due to the number and diversity of the molecules. To alleviate this bottleneck, computational methods and tools that reliably filter the set of candidates are needed for further analysis by human experts. Recent efforts in assembling large public mass spectral databases such as MassBank have opened the door for developing a new genre of metabolite identification methods that rely on machine learning as the primary vehicle for identification. In this paper we describe the machine learning approach used in FingerID, its application to the CASMI challenges and some results that were not part of our challenge submission. In short, FingerID learns to predict molecular fingerprints from a large collection of MS/MS spectra, and uses the predicted fingerprints to retrieve and rank candidate molecules from a given large molecular database. Furthermore, we introduce a web server for FingerID, which was applied for the first time to the CASMI challenges. The challenge results show that the new machine learning framework produces competitive results on those challenge molecules that were found within the relatively restricted KEGG compound database. Additional experiments on the PubChem database confirm the feasibility of the approach even on a much larger database, although room for improvement still remains.
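
The second stage described in this abstract, using a predicted fingerprint to retrieve and rank candidate molecules, can be sketched as follows. This is a minimal illustration under stated assumptions: Tanimoto similarity is swapped in as a common fingerprint comparison and is not necessarily the scoring FingerID uses, and the function names, candidate names and bit vectors are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two equal-length binary fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero fingerprints are treated as identical by convention.
    return both / either if either else 1.0


def rank_candidates(predicted, candidates):
    """Return candidate names sorted by decreasing similarity to the prediction."""
    scored = [(tanimoto(predicted, fp), name) for name, fp in candidates.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties broken alphabetically
    return [name for _, name in scored]


# Hypothetical fingerprint predicted from an MS/MS spectrum (illustrative bits only).
predicted = [1, 0, 1, 1, 0]
# Hypothetical database candidates with their known fingerprints.
candidates = {
    "candidate_A": [1, 0, 1, 1, 0],
    "candidate_B": [1, 1, 0, 1, 0],
    "candidate_C": [0, 1, 0, 0, 1],
}
ranking = rank_candidates(predicted, candidates)
print(ranking)
```

In a real setting the predicted bits would come from classifiers trained on reference spectra, and the candidate set would be drawn from a database such as KEGG or PubChem, typically after filtering by mass.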

16.
Bioinformatics ; 28(18): 2333-41, 2012 Sep 15.
Article in English | MEDLINE | ID: mdl-22815355

ABSTRACT

MOTIVATION: Metabolite identification from tandem mass spectra is an important problem in metabolomics, underpinning subsequent metabolic modelling and network analysis. Yet, currently this task requires matching the observed spectrum against a database of reference spectra originating from similar equipment and closely matching operating parameters, a condition that is rarely satisfied in public repositories. Furthermore, the computational support for identification of molecules not present in reference databases is lacking. Recent efforts in assembling large public mass spectral databases such as MassBank have opened the door for the development of a new genre of metabolite identification methods. RESULTS: We introduce a novel framework for prediction of molecular characteristics and identification of metabolites from tandem mass spectra using machine learning with the support vector machine. Our approach is to first predict a large set of molecular properties of the unknown metabolite from salient tandem mass spectral signals, and in the second step to use the predicted properties for matching against large molecule databases, such as PubChem. We demonstrate that several molecular properties can be predicted to high accuracy and that they are useful in de novo metabolite identification, where the reference database does not contain any spectra of the same molecule. AVAILABILITY: A Matlab/Python package of the FingerID tool is freely available on the web at http://www.sourceforge.net/p/fingerid. CONTACT: markus.heinonen@cs.helsinki.fi.


Subject(s)
Artificial Intelligence , Metabolomics/methods , Databases, Chemical , Tandem Mass Spectrometry
17.
J Comput Biol ; 18(1): 43-58, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21210731

ABSTRACT

The ability to trace the fate of individual atoms through the metabolic pathways is needed in many applications of systems biology and drug discovery. However, this information is not immediately available from the most common metabolome studies and needs to be separately acquired. Automatic discovery of correspondence of atoms in biochemical reactions is called the "atom mapping problem." We suggest an efficient approach for solving the atom mapping problem exactly--finding mappings of minimum edge edit distance. The algorithm is based on A* search equipped with sophisticated heuristics for pruning the search space. This approach has clear advantages over the commonly used heuristic approach of iterative maximum common subgraph (MCS) algorithm: we explicitly minimize an objective function, and we produce solutions that typically require less manual curation. The two methods are similar in computational resource demands. We compare the performance of the proposed algorithm against several alternatives on data obtained from the KEGG LIGAND and RPAIR databases: greedy search, bi-partite graph matching, and the MCS approach. Our experiments show that alternative approaches often fail in finding mappings with minimum edit distance.


Subject(s)
Algorithms , Computer Simulation , Models, Biological , Models, Chemical , Amino Acids/chemistry , Humans , Metabolic Networks and Pathways , Systems Biology
18.
Rapid Commun Mass Spectrom ; 22(19): 3043-52, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18763276

ABSTRACT

We present FiD (Fragment iDentificator), a software tool for the structural identification of product ions produced with tandem mass spectrometric measurement of low molecular weight organic compounds. Tandem mass spectrometry (MS/MS) has proven to be an indispensable tool in modern, cell-wide metabolomics and fluxomics studies. In such studies, the structural information of the MS(n) product ions is usually needed in the downstream analysis of the measurement data. The manual identification of the structures of MS(n) product ions is, however, a nontrivial task requiring expertise, and calls for computer assistance. Commercial software tools, such as Mass Frontier and ACD/MS Fragmenter, rely on fragmentation rule databases for the identification of MS(n) product ions. FiD, on the other hand, conducts a combinatorial search over all possible fragmentation paths and outputs a ranked list of alternative structures. This gives the user an advantage in situations where the MS/MS data of compounds with less well-known fragmentation mechanisms are processed. FiD software implements two fragmentation models, the single-step model that ignores intermediate fragmentation states and the multi-step model, which allows for complex fragmentation pathways. The software works for MS/MS data produced both in positive- and negative-ion modes. The software has an easy-to-use graphical interface with built-in visualization capabilities for structures of product ions and fragmentation pathways. In our experiments involving amino acids and sugar-phosphates, often found, e.g., in the central carbon metabolism of yeasts, FiD software correctly predicted the structures of product ions on average in 85% of the cases. The FiD software is free for academic use and is available for download from www.cs.helsinki.fi/group/sysfys/software/fragid.


Subject(s)
Algorithms , Ions/chemistry , Models, Chemical , Software , Spectrometry, Mass, Electrospray Ionization/methods , Computer Simulation , Reproducibility of Results , Sensitivity and Specificity