2.
J Proteome Res ; 23(1): 418-429, 2024 01 05.
Article in English | MEDLINE | ID: mdl-38038272

ABSTRACT

The inherent diversity of approaches in proteomics research has led to a wide range of software solutions for data analysis. These software solutions encompass multiple tools, each employing different algorithms for various tasks such as peptide-spectrum matching, protein inference, quantification, statistical analysis, and visualization. To enable an unbiased comparison of commonly used bottom-up label-free proteomics workflows, we introduce WOMBAT-P, a versatile platform designed for automated benchmarking and comparison. WOMBAT-P simplifies the processing of public data by utilizing the sample and data relationship format for proteomics (SDRF-Proteomics) as input. This feature streamlines the analysis of annotated local or public ProteomeXchange data sets, promoting efficient comparisons among diverse outputs. Through an evaluation using experimental ground truth data and a realistic biological data set, we uncover significant disparities and a limited overlap in the quantified proteins. WOMBAT-P not only enables rapid execution and seamless comparison of workflows but also provides valuable insights into the capabilities of different software solutions. These benchmarking metrics are a valuable resource for researchers in selecting the most suitable workflow for their specific data sets. The modular architecture of WOMBAT-P promotes extensibility and customization. The software is available at https://github.com/wombat-p/WOMBAT-Pipelines.
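The "limited overlap in the quantified proteins" reported above is typically summarized with set-overlap metrics. A minimal sketch of such a comparison follows; the workflow names and protein accessions are invented for illustration and are not WOMBAT-P output:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of protein accessions."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical quantified-protein sets from three workflows
results = {
    "workflow_A": {"P12345", "P67890", "Q11111", "Q22222"},
    "workflow_B": {"P12345", "Q11111", "Q33333"},
    "workflow_C": {"P12345", "P67890", "Q22222", "Q33333"},
}

# Pairwise overlap between workflow outputs
names = sorted(results)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(f"{x} vs {y}: Jaccard = {jaccard(results[x], results[y]):.2f}")

# Proteins quantified by every workflow (the "core" overlap)
core = set.intersection(*results.values())
print("quantified by all workflows:", sorted(core))
```

Low pairwise Jaccard values and a small core set are exactly the kind of disparity such a benchmark surfaces.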


Subject(s)
Benchmarking; Proteomics; Workflow; Software; Proteins; Data Analysis
3.
J Proteome Res ; 22(3): 681-696, 2023 03 03.
Article in English | MEDLINE | ID: mdl-36744821

ABSTRACT

In recent years, machine learning has made extensive progress in modeling many aspects of mass spectrometry data. We brought together proteomics data generators, repository managers, and machine learning experts in a workshop with the goal of evaluating and exploring machine learning applications for realistic modeling of data from multidimensional mass spectrometry-based proteomics analysis of any sample or organism. Following this sample-to-data roadmap helped identify knowledge gaps and define needs. Being able to generate bespoke and realistic synthetic data has legitimate and important uses in system suitability, method development, and algorithm benchmarking, while also posing critical ethical questions. The interdisciplinary nature of the workshop informed discussions of what is currently possible and of future opportunities and challenges. In the following perspective, we summarize these discussions in the hope of conveying our excitement about the potential of machine learning in proteomics and of inspiring future research.


Subject(s)
Machine Learning; Proteomics; Proteomics/methods; Algorithms; Mass Spectrometry
5.
Methods Mol Biol ; 2499: 261-273, 2022.
Article in English | MEDLINE | ID: mdl-35696085

ABSTRACT

Post-translational modifications (PTMs) of proteins play crucial roles in defining protein function. They often do not occur alone, leading to a large variety of proteoforms that correspond to different combinations of multiple PTMs simultaneously decorating a protein. Changes in these proteoforms can be quantified via middle-down and top-down mass spectrometry experiments, where the simultaneous PTM settings are obtained by measuring long peptides or entire proteins. Data from such experiments pose major challenges for identifying relevant features of biological and clinical importance. Generally, multiple data layers need to be considered, such as proteoforms, individual PTMs, and PTM types. Therein, visualization methods are a crucial part of data analysis, as they provide, if applied correctly, insights into both general and fine-grained behavior. Here, we present a workflow to visualize histone proteins and their myriad of PTMs, based on different R visualization modules applied to data from quantitative middle-down experiments. The procedure can be adapted to diverse experimental designs and is applicable to different proteins and PTMs.
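One of the data layers mentioned above, collapsing proteoform-level quantities to individual PTMs, can be sketched in a few lines. The chapter itself uses R; this is an illustrative Python sketch with invented histone marks and abundances:

```python
from collections import defaultdict

# Invented relative abundances of histone proteoforms, each keyed by the
# set of PTMs it carries (frozenset so it can serve as a dict key)
proteoforms = {
    frozenset({"K9me3", "K14ac"}): 0.30,
    frozenset({"K9me3"}): 0.25,
    frozenset({"K14ac"}): 0.20,
    frozenset(): 0.25,  # unmodified
}

# Collapse to the individual-PTM layer: sum the abundances of all
# proteoforms that carry each mark
ptm_level = defaultdict(float)
for ptms, abundance in proteoforms.items():
    for ptm in ptms:
        ptm_level[ptm] += abundance

print(dict(ptm_level))  # K9me3: 0.30 + 0.25 = 0.55; K14ac: 0.30 + 0.20 = 0.50
```

Visualizing both layers side by side is what reveals whether a change is driven by a single mark or by a specific combination of marks.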


Subject(s)
Histones; Protein Processing, Post-Translational; Cell Physiological Phenomena; Histones/metabolism; Mass Spectrometry/methods; Peptides/metabolism
6.
Bioinformatics ; 38(10): 2757-2764, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561162

ABSTRACT

MOTIVATION: In quantitative bottom-up mass spectrometry (MS)-based proteomics, the reliable estimation of protein concentration changes from peptide quantifications between different biological samples is essential. This estimation is not a single task but comprises the two processes of protein inference and protein abundance summarization. Furthermore, due to the high complexity of proteomics data and associated uncertainty about the performance of these processes, there is a demand for comprehensive visualization methods able to integrate protein with peptide quantitative data, including their post-translational modifications. However, there is still a lack of a suitable tool that provides post-identification quantitative analysis of proteins with simultaneous interactive visualization. RESULTS: In this article, we present VIQoR, a user-friendly web service that accepts peptide quantitative data from both labeled and label-free experiments and combines the crucial components of protein inference and summarization with interactive visualization modules, including the novel VIQoR plot. We implemented two different parsimonious algorithms to solve the protein inference problem, while protein summarization is facilitated by a well-established factor analysis algorithm called fast-FARMS, followed by a weighted average summarization function that minimizes the effect of missing values. In addition, summarization is optimized by the so-called Global Correlation Indicator (GCI). We test the tool on three publicly available ground truth datasets and demonstrate the ability of the protein inference algorithms to handle shared peptides. We furthermore show that GCI increases the accuracy of the quantitative analysis in datasets with replicated design. AVAILABILITY AND IMPLEMENTATION: VIQoR is accessible at: http://computproteomics.bmb.sdu.dk/Apps/VIQoR/. The source code is available at: https://bitbucket.org/veitveit/viqor/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
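The parsimony principle behind protein inference can be illustrated as a greedy set cover: report the smallest set of proteins that explains all observed peptides. The protein and peptide names below are hypothetical, and VIQoR's two algorithms differ in their details:

```python
def parsimonious_proteins(peptide_map):
    """Greedy set cover: a small protein list explaining all observed peptides.

    peptide_map: dict mapping protein -> set of peptides it can explain.
    """
    unexplained = set().union(*peptide_map.values())
    selected = []
    while unexplained:
        # pick the protein that explains the most still-unexplained peptides
        best = max(peptide_map, key=lambda p: len(peptide_map[p] & unexplained))
        gained = peptide_map[best] & unexplained
        if not gained:
            break
        selected.append(best)
        unexplained -= gained
    return selected

# Hypothetical example with a shared peptide (PEP2 maps to ProtA and ProtB)
proteins = {
    "ProtA": {"PEP1", "PEP2"},
    "ProtB": {"PEP2"},   # explained away by ProtA under parsimony
    "ProtC": {"PEP3"},
}
print(parsimonious_proteins(proteins))  # → ['ProtA', 'ProtC']
```

The shared peptide PEP2 is assigned to ProtA because ProtA is already required by PEP1, so ProtB drops out of the reported list.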


Subject(s)
Proteins; Proteomics; Algorithms; Peptides/chemistry; Proteins/chemistry; Proteomics/methods; Software
7.
Elife ; 11, 2022 03 08.
Article in English | MEDLINE | ID: mdl-35259090

ABSTRACT

Temporal molecular changes in ageing mammalian organs are of relevance to disease aetiology because many age-related diseases are linked to changes in the transcriptional and epigenetic machinery that regulates gene expression. We performed quantitative proteome analysis of chromatin-enriched protein extracts to investigate the dynamics of the chromatin proteomes of the mouse brain, heart, lung, kidney, liver, and spleen at 3, 5, 10, and 15 months of age. Each organ exhibited a distinct chromatin proteome and sets of unique proteins. The brain and spleen chromatin proteomes were the most extensive, diverse, and heterogeneous among the six organs. The spleen chromatin proteome appeared static during the lifespan, presenting a young phenotype that reflects the permanent alertness state and important role of this organ in physiological defence and immunity. We identified a total of 5928 proteins, including 2472 nuclear or chromatin-associated proteins, across the six mouse organs. Up to 3125 proteins were quantified in each organ, demonstrating distinct and organ-specific temporal protein expression timelines and regulation at the post-translational level. Bioinformatics meta-analysis of these chromatin proteomes revealed distinct physiological and ageing-related features for each organ. Our results demonstrate the efficiency of organelle-specific proteomics for in vivo studies of a model organism and consolidate the hypothesis that chromatin-associated proteins are involved in distinct and specific physiological functions in ageing organs.


Subject(s)
Chromatin; Proteome; Aging; Animals; Mammals/genetics; Mice; Protein Processing, Post-Translational; Proteome/metabolism; Proteomics/methods
8.
Gigascience ; 12, 2022 Dec 28.
Article in English | MEDLINE | ID: mdl-37983748

ABSTRACT

BACKGROUND: Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data-processing pipeline from raw data analysis to end-user predictions and rescoring. ML models need large-scale datasets for training and repurposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. RESULTS: We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variability in public MS repositories. We also examine how these factors affect ML performance and perform comprehensive transfer-learning experiments to evaluate the benefits of current best-practice methods in the field. CONCLUSIONS: Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct datasets that most closely resemble future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning did not increase model performance compared to a non-pretrained model.


Subject(s)
Machine Learning; Tandem Mass Spectrometry; Chromatography, Liquid
9.
Bioinformatics ; 38(3): 875-877, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34636883

ABSTRACT

MOTIVATION: Liquid chromatography-mass spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, there is yet to be any tool that mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (i) absence of balanced training data with large sample size; (ii) unclear definition of sufficiently information-rich data representations for, e.g., peptide identification; (iii) lack of benchmarking of ML methods on specific LC-MS problems. RESULTS: We created the MS2AI pipeline that automates the process of gathering vast quantities of MS data for large-scale ML applications. The software retrieves raw data from either in-house sources or from the proteomics identifications database PRIDE. Subsequently, the raw data are stored in a standardized format amenable to ML, encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI, and to this effect we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides. AVAILABILITY AND IMPLEMENTATION: An open-source implementation of the software can be found at https://gitlab.com/roettgerlab/ms2ai. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
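A minimal illustration of an "information-rich data representation" amenable to ML is fixed-width m/z binning of a spectrum into a constant-length vector. The bin width, m/z range, and normalization below are arbitrary choices for illustration, not MS2AI's actual encoding:

```python
def bin_spectrum(peaks, mz_min=100.0, mz_max=1100.0, n_bins=1000):
    """Bin (m/z, intensity) peaks into a fixed-length intensity vector."""
    width = (mz_max - mz_min) / n_bins
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            vec[int((mz - mz_min) / width)] += intensity
    top = max(vec)
    # scale to the base peak so spectra of different total intensity compare
    return [v / top for v in vec] if top > 0 else vec

# Toy centroided MS2 spectrum: (m/z, intensity) pairs
spectrum = [(175.119, 300.0), (276.166, 150.0), (575.310, 600.0)]
vec = bin_spectrum(spectrum)
print(len(vec), max(vec))  # → 1000 1.0
```

Every spectrum, regardless of peak count, maps to the same vector length, which is what downstream models such as convolutional networks require.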


Subject(s)
Peptides; Tandem Mass Spectrometry; Chromatography, Liquid/methods; Tandem Mass Spectrometry/methods; Peptides/analysis; Software; Proteome/chemistry
10.
F1000Res ; 10: 897, 2021.
Article in English | MEDLINE | ID: mdl-34804501

ABSTRACT

Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have brought the long-standing vision of automated workflow composition back into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the "big picture" of the scientific workflow development life cycle, before surveying and discussing current methods, technologies, and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition, and instantiation.
Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.


Subject(s)
Biological Science Disciplines; Computational Biology; Benchmarking; Software; Workflow
11.
Nat Commun ; 12(1): 5854, 2021 10 06.
Article in English | MEDLINE | ID: mdl-34615866

ABSTRACT

The amount of public proteomics data is rapidly increasing but there is no standardized format to describe the sample metadata and their relationship with the dataset files in a way that fully supports their understanding or reanalysis. Here we propose to develop the transcriptomics data format MAGE-TAB into a standard representation for proteomics sample metadata. We implement MAGE-TAB-Proteomics in a crowdsourcing project to manually curate over 200 public datasets. We also describe tools and libraries to validate and submit sample metadata-related information to the PRIDE repository. We expect that these developments will improve the reproducibility and facilitate the reanalysis and integration of public proteomics datasets.
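A minimal sketch of what such sample metadata looks like, and how a submission tool might check it, follows. The column names follow the SDRF convention, but the table contents are invented and this toy validator is not the official PRIDE tooling:

```python
import csv, io

# A two-sample SDRF-style table (tab-separated); values are invented
SDRF_TSV = """source name\tcharacteristics[organism]\tcharacteristics[disease]\tcomment[data file]
sample 1\tHomo sapiens\tnormal\trun01.raw
sample 2\tHomo sapiens\tmelanoma\trun02.raw
"""

# A toy set of required columns for the check below
REQUIRED = {"source name", "characteristics[organism]", "comment[data file]"}

rows = list(csv.DictReader(io.StringIO(SDRF_TSV), delimiter="\t"))
missing = REQUIRED - set(rows[0])
assert not missing, f"missing required columns: {missing}"
print(f"{len(rows)} samples annotated, required columns present")
```

The key design point is that each row ties one sample's biological characteristics directly to the data file it produced, which is what makes automated reanalysis possible.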


Subject(s)
Data Analysis; Databases, Protein; Metadata; Proteomics; Big Data; Humans; Reproducibility of Results; Software; Transcriptome
14.
Methods Mol Biol ; 2361: 61-73, 2021.
Article in English | MEDLINE | ID: mdl-34236655

ABSTRACT

Isobaric labeling has become an essential method for quantitative mass spectrometry-based experiments. This technique allows high-throughput proteomics while providing reasonable coverage of protein measurements across multiple samples. Here, we discuss the analysis of isobarically labeled mass spectrometry data with a special focus on quality control and potential pitfalls. The protocol is based on our fully integrated IsoProt workflow. The concepts discussed are nevertheless applicable to the analysis of any isobarically labeled experiment using alternative computational tools and algorithms.
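One standard quality-control step in such analyses is equalizing reporter-channel medians to correct for unequal sample loading. A minimal sketch with invented intensities follows (IsoProt's workflow includes many more steps, such as isotope impurity correction and statistical testing):

```python
import statistics

# rows = PSMs (or peptides), columns = reporter channels; values invented
intensities = [
    [100.0, 210.0, 95.0],
    [400.0, 820.0, 410.0],
    [250.0, 500.0, 240.0],
]

n_channels = len(intensities[0])
medians = [statistics.median(row[c] for row in intensities)
           for c in range(n_channels)]
target = statistics.median(medians)
factors = [target / m for m in medians]  # per-channel scaling factors

normalized = [[v * f for v, f in zip(row, factors)] for row in intensities]
# after scaling, every channel has the same median intensity (the target)
print(factors)
```

A channel whose factor deviates strongly from 1.0 (here the second channel, at roughly 0.5) is a typical QC flag for a loading or labeling problem.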


Subject(s)
Proteomics; Algorithms; Proteome; Tandem Mass Spectrometry; Workflow
15.
Methods Mol Biol ; 2228: 433-451, 2021.
Article in English | MEDLINE | ID: mdl-33950508

ABSTRACT

Data clustering facilitates the identification of biologically relevant molecular features in quantitative proteomics experiments with thousands of measurements over multiple conditions. It finds groups of proteins or peptides with similar quantitative behavior across multiple experimental conditions. This co-regulatory behavior suggests that the proteins of such a group share their functional behavior and thus can often be mapped to the same biological processes and molecular subnetworks. While common clustering approaches disregard the variance of the measured proteins, VSClust combines statistical testing with pattern recognition in a common algorithm. Here, we show how to use the VSClust web service on a large proteomics data set and present further tools to assess the quantitative behavior of protein complexes.
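The core idea of variance-sensitive clustering, namely that noisier profiles should receive fuzzier cluster memberships, can be sketched as follows. This is a simplified illustration with invented numbers, not VSClust's actual algorithm:

```python
import math

def fuzzy_memberships(profile, variance, centroids):
    """Soft cluster memberships; higher variance gives a flatter assignment."""
    dists = [sum((p - c) ** 2 for p, c in zip(profile, cen))
             for cen in centroids]
    weights = [math.exp(-d / (2.0 * variance)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

centroids = [(1.0, 0.0), (0.0, 1.0)]  # two toy cluster centers
precise = fuzzy_memberships((0.9, 0.1), variance=0.01, centroids=centroids)
noisy = fuzzy_memberships((0.9, 0.1), variance=1.0, centroids=centroids)
print([round(m, 3) for m in precise])  # near-crisp assignment to cluster 0
print([round(m, 3) for m in noisy])    # memberships pulled toward uniform
```

The same profile is assigned almost crisply when its measurement variance is low, but spreads across clusters when the variance is high, which is what prevents noisy proteins from dominating cluster patterns.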


Subject(s)
Breast Neoplasms/metabolism; Neoplasm Proteins/analysis; Proteome; Proteomics; Cluster Analysis; Data Interpretation, Statistical; Databases, Protein; Female; Humans; Multiprotein Complexes; Protein Binding; Proteomics/statistics & numerical data; Research Design; Software
16.
J Med Internet Res ; 23(6): e28253, 2021 06 02.
Article in English | MEDLINE | ID: mdl-33900934

ABSTRACT

BACKGROUND: Before the advent of an effective vaccine, nonpharmaceutical interventions, such as mask-wearing, social distancing, and lockdowns, were the primary measures to combat the COVID-19 pandemic. Such measures are highly effective when there is high population-wide adherence, which requires information on the current risks posed by the pandemic alongside a clear exposition of the rules and guidelines in place. OBJECTIVE: Here, we analyzed online news media coverage of COVID-19. We quantified the total volume of COVID-19 articles, their sentiment polarization, and leading subtopics to act as a reference to inform future communication strategies. METHODS: We collected 26 million news articles from the front pages of 172 major online news sources in 11 countries (available online at SciRide). Using topic detection, we identified COVID-19-related content to quantify the proportion of total coverage the pandemic received in 2020. The sentiment analysis tool Vader was employed to stratify the emotional polarity of COVID-19 reporting. Further topic detection and sentiment analysis were performed on the COVID-19 coverage to reveal the leading themes in pandemic reporting and their respective emotional polarizations. RESULTS: We found that COVID-19 coverage accounted for approximately 25.3% of all front-page online news articles between January and October 2020. Sentiment analysis of English-language sources revealed that overall COVID-19 coverage was not exclusively negatively polarized, suggesting widely heterogeneous reporting of the pandemic. Within this heterogeneous coverage, 16% of COVID-19 news articles (or 4% of all English-language articles) can be classified as highly negatively polarized, citing issues such as death, fear, or crisis. CONCLUSIONS: The goal of COVID-19 public health communication is to increase understanding of distancing rules and to maximize the impact of governmental policy.
The extent to which the quantity and quality of information from different communication channels (eg, social media, government pages, and news) influence public understanding of public health measures remains to be established. Here we conclude that a quarter of all reporting in 2020 covered COVID-19, which is indicative of information overload. In this capacity, our data and analysis form a quantitative basis for informing health communication strategies along traditional news media channels to minimize the risks of COVID-19 while vaccination is rolled out.
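Vader is a lexicon- and rule-based scorer. The toy example below mimics only the basic idea (summing signed word scores, squashing into [-1, 1], and thresholding to flag highly negative articles); the word scores, scale, and threshold are invented, and the real tool additionally handles negation, intensifiers, capitalization, and punctuation:

```python
# Invented signed word scores (the real Vader lexicon has thousands of entries)
LEXICON = {"death": -3.0, "fear": -2.5, "crisis": -2.5,
           "recovery": 2.0, "hope": 1.9, "effective": 1.5}

def polarity(text, scale=15.0):
    """Sum word scores and squash crudely into [-1, 1]."""
    score = sum(LEXICON.get(w.strip(".,!?"), 0.0) for w in text.lower().split())
    return max(-1.0, min(1.0, score / scale))

headlines = [
    "fear and crisis as death toll rises",
    "vaccine effective, recovery brings hope",
]
scores = [polarity(h) for h in headlines]
# stratify: flag articles below an (arbitrary) negativity threshold
highly_negative = [h for h, s in zip(headlines, scores) if s <= -0.5]
print(scores, highly_negative)
```

Applied at scale, exactly this kind of thresholded score is what separates the "highly negatively polarized" 16% from the rest of the coverage.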


Subject(s)
COVID-19/epidemiology; Data Mining/methods; Mass Media/statistics & numerical data; Public Health/methods; Social Media/statistics & numerical data; Health Resources; Humans; Pandemics; SARS-CoV-2/isolation & purification
17.
Rapid Commun Mass Spectrom ; : e9087, 2021 Apr 16.
Article in English | MEDLINE | ID: mdl-33861485

ABSTRACT

The European Bioinformatics Community for Mass Spectrometry (EuBIC-MS; eubic-ms.org) was founded in 2014 to unite European computational mass spectrometry researchers and proteomics bioinformaticians working in academia and industry. EuBIC-MS maintains educational resources (proteomics-academy.org) and organises workshops at national and international conferences on proteomics and mass spectrometry. Furthermore, EuBIC-MS is actively involved in several community initiatives such as the Human Proteome Organization's Proteomics Standards Initiative (HUPO-PSI). Apart from these collaborations, EuBIC-MS has organised two Winter Schools and two Developers' Meetings that have contributed to the strengthening of the European mass spectrometry network and fostered international collaboration in this field, even beyond Europe. Moreover, EuBIC-MS is currently actively developing a community-driven standard dedicated to mass spectrometry data annotation (SDRF-Proteomics) that will facilitate data reuse and collaboration. This manuscript highlights what EuBIC-MS is, what it does, and what it has already achieved. A warm invitation is extended to new researchers at all career stages to join the EuBIC-MS community on its Slack channel (eubic.slack.com).

18.
J Proteome Res ; 20(4): 1821-1825, 2021 04 02.
Article in English | MEDLINE | ID: mdl-33720718

ABSTRACT

The large diversity of experimental methods in proteomics as well as their increasing usage across biological and clinical research has led to the development of hundreds if not thousands of software tools to aid in the analysis and interpretation of the resulting data. Detailed information about these tools needs to be collected, categorized, and validated to guarantee their optimal utilization. A tools registry like bio.tools enables users and developers to identify new tools with more powerful algorithms or to find tools with similar functions for comparison. Here we present the content of the registry, which now comprises more than 1000 proteomics tool entries. Furthermore, we discuss future applications and engagement with other community efforts resulting in a high impact on the bioinformatics landscape.


Subject(s)
Proteomics; Software; Algorithms; Computational Biology
19.
J Proteome Res ; 20(4): 2157-2165, 2021 04 02.
Article in English | MEDLINE | ID: mdl-33720735

ABSTRACT

The bio.tools registry is a major catalogue of computational tools in the life sciences. More than 17 000 tools have been registered by the international bioinformatics community. The bio.tools metadata schema includes semantic annotations of tool functions, that is, formal descriptions of tools' data types, formats, and operations with terms from the EDAM bioinformatics ontology. Such annotations enable the automated composition of tools into multistep pipelines or workflows. In this Technical Note, we revisit a previous case study on the automated composition of proteomics workflows. We use the same four workflow scenarios but, instead of using a small set of tools with carefully handcrafted annotations, we explore workflows directly on bio.tools. We use the Automated Pipeline Explorer (APE), a reimplementation and extension of the workflow composition method previously used. Moving "into the wild" opens up an unprecedented wealth of tools and a huge number of alternative workflows. Automated composition tools can be used to explore this space of possibilities systematically. Inevitably, the mixed quality of semantic annotations in bio.tools leads to unintended or erroneous tool combinations. However, our results also show that additional control mechanisms (tool filters, configuration options, and workflow constraints) can effectively guide the exploration toward smaller sets of more meaningful workflows.
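The essence of automated composition, chaining tools whose output type matches the next tool's input type, can be sketched as a graph search. The tool names and data types below are invented stand-ins for EDAM-annotated entries, and APE's actual model (multiple inputs/outputs, constraints, SAT-based synthesis) is far richer:

```python
from collections import deque

# Each tool consumes one data type and produces another (invented names)
TOOLS = {
    "peak_picker": ("raw_spectra", "peak_list"),
    "search_engine": ("peak_list", "psm_list"),
    "protein_inferencer": ("psm_list", "protein_list"),
    "plotter": ("protein_list", "figure"),
}

def compose(source_type, target_type):
    """Breadth-first search for the shortest tool chain from source to target."""
    queue = deque([(source_type, [])])
    seen = {source_type}
    while queue:
        current, chain = queue.popleft()
        if current == target_type:
            return chain
        for tool, (inp, out) in TOOLS.items():
            if inp == current and out not in seen:
                seen.add(out)
                queue.append((out, chain + [tool]))
    return None

print(compose("raw_spectra", "protein_list"))
# → ['peak_picker', 'search_engine', 'protein_inferencer']
```

With thousands of annotated tools, the same search explodes into huge numbers of candidate chains, which is why the filters and constraints described above are needed to keep results meaningful.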


Subject(s)
Proteomics; Software; Computational Biology; Registries; Workflow
20.
Gigascience ; 10(1)2021 01 27.
Article in English | MEDLINE | ID: mdl-33506265

ABSTRACT

BACKGROUND: Life scientists routinely face massive and heterogeneous data analysis tasks and must find and access the most suitable databases or software in a jungle of web-accessible resources. The diversity of information used to describe life-scientific digital resources presents an obstacle to their utilization. Although several standardization efforts are emerging, no information schema has been sufficiently detailed to enable uniform semantic and syntactic description-and cataloguing-of bioinformatics resources. FINDINGS: Here we describe biotoolsSchema, a formalized information model that balances the needs of conciseness for rapid adoption against the provision of rich technical information and scientific context. biotoolsSchema results from a series of community-driven workshops and is deployed in the bio.tools registry, providing the scientific community with >17,000 machine-readable and human-understandable descriptions of software and other digital life-science resources. We compare our approach to related initiatives and provide alignments to foster interoperability and reusability. CONCLUSIONS: biotoolsSchema supports the formalized, rigorous, and consistent specification of the syntax and semantics of bioinformatics resources, and enables cataloguing efforts such as bio.tools that help scientists to find, comprehend, and compare resources. The use of biotoolsSchema in bio.tools promotes the FAIRness of research software, a key element of open and reproducible developments for data-intensive sciences.
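What a formalized information model buys in practice is machine-checkable tool descriptions. The toy check below uses a handful of fields loosely in the spirit of biotoolsSchema; the field set, entry contents, and validator are invented for illustration and are not the real schema, which is far more detailed and carries EDAM annotations:

```python
# Toy required fields, loosely in the spirit of biotoolsSchema
REQUIRED_FIELDS = {"name", "description", "homepage", "function"}

tool_entry = {  # hypothetical registry entry
    "name": "ExampleTool",
    "description": "Peptide identification from MS/MS spectra.",
    "homepage": "https://example.org/exampletool",
    "function": [{"operation": "Peptide identification"}],
}

missing = REQUIRED_FIELDS - set(tool_entry)
valid = not missing
print("entry is valid" if valid else f"missing fields: {sorted(missing)}")
```

Because every registered entry passes such checks, a catalogue like bio.tools can guarantee that each of its >17,000 descriptions is uniformly machine-readable.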


Subject(s)
Biological Science Disciplines; Computational Biology; Databases, Factual; Humans; Semantics; Software