Results 1 - 20 of 40
1.
bioRxiv ; 2024 Jun 04.
Article in English | MEDLINE | ID: mdl-38895431

ABSTRACT

A pressing statistical challenge in the field of mass spectrometry proteomics is how to assess whether a given software tool provides accurate error control. Each software tool for searching such data uses its own internally implemented methodology for reporting and controlling the error. Many of these software tools are closed source, with incompletely documented methodology, and the strategies for validating the error are inconsistent across tools. In this work, we identify three different methods for validating false discovery rate (FDR) control in use in the field, one of which is invalid, one of which can only provide a lower bound rather than an upper bound, and one of which is valid but under-powered. The result is that the field has a very poor understanding of how well we are doing with respect to FDR control, particularly for the analysis of data-independent acquisition (DIA) data. We therefore propose a new, more powerful method for evaluating FDR control in this setting, and we then employ that method, along with an existing lower bounding technique, to characterize a variety of popular search tools. We find that the search tools for analysis of data-dependent acquisition (DDA) data generally seem to control the FDR at the peptide level, whereas none of the DIA search tools consistently controls the FDR at the peptide level across all the datasets we investigated. Furthermore, this problem becomes much worse when the latter tools are evaluated at the protein level. These results may have significant implications for various downstream analyses, since proper FDR control has the potential to reduce noise in discovery lists and thereby boost statistical power.

2.
J Proteome Res ; 23(6): 1894-1906, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38652578

ABSTRACT

Searching tandem mass spectrometry proteomics data against a database is a well-established method for assigning peptide sequences to observed spectra but typically cannot identify peptides harboring unexpected post-translational modifications (PTMs). Open modification searching aims to address this problem by allowing a spectrum to match a peptide even if the spectrum's precursor mass differs from the peptide mass. However, expanding the search space in this way can lead to a loss of statistical power to detect peptides. We therefore developed a method, called CONGA (combining open and narrow searches with group-wise analysis), that takes into account results from both types of searches (a traditional "narrow window" search and an open modification search) while carrying out rigorous false discovery rate control. The result is an algorithm that provides the best of both worlds: the ability to detect unexpected PTMs without a concomitant loss of power to detect unmodified peptides.


Subject(s)
Algorithms , Databases, Protein , Protein Processing, Post-Translational , Proteomics , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Proteomics/methods , Peptides/analysis , Peptides/chemistry , Humans , Software , Amino Acid Sequence
3.
J Proteome Res ; 23(6): 1907-1914, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38687997

ABSTRACT

Traditional database search methods for the analysis of bottom-up proteomics tandem mass spectrometry (MS/MS) data are limited in their ability to detect peptides with post-translational modifications (PTMs). Recently, "open modification" database search strategies, in which the requirement that the mass of the database peptide closely matches the observed precursor mass is relaxed, have become popular as ways to find a wider variety of types of PTMs. Indeed, in one study, Kong et al. reported that the open modification search tool MSFragger can achieve higher statistical power to detect peptides than a traditional "narrow window" database search. We investigated this claim empirically and, in the process, uncovered a potential general problem with false discovery rate (FDR) control in the machine learning postprocessors Percolator and PeptideProphet. This problem might have contributed to Kong et al.'s report that their empirical results suggest that FDR control in the narrow window setting might generally be compromised. Indeed, reanalyzing the same data while using a more standard form of target-decoy competition-based FDR control, we found that, after accounting for chimeric spectra as well as for the inherent difference in the number of candidates in open and narrow searches, the data do not provide sufficient evidence that FDR control in proteomics MS/MS database search is inherently problematic.


Subject(s)
Databases, Protein , Protein Processing, Post-Translational , Proteomics , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Proteomics/methods , Peptides/analysis , Peptides/chemistry , Machine Learning , Humans , Algorithms , Software
4.
Proteomics ; 24(8): e2300084, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38380501

ABSTRACT

Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.
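As an illustration of the kind of TDC estimate described above, here is a minimal Python sketch. It is not Crema's code or API: the function name `discoveries_at`, the input layout, and the use of the common "+1" correction in the numerator are assumptions made for this example, and score ties are ignored.

```python
from typing import List, Tuple


def discoveries_at(winners: List[Tuple[float, bool]], alpha: float) -> int:
    """Count target discoveries at an estimated FDR of at most alpha.

    `winners` holds one (score, is_decoy) pair per target-decoy competition
    winner (hypothetical layout). Walking down the score-sorted list, the FDR
    among targets above the current score is estimated as
    (decoys + 1) / targets, and the largest qualifying target count is kept.
    """
    ordered = sorted(winners, key=lambda w: w[0], reverse=True)
    targets = 0
    decoys = 0
    best = 0
    for _score, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR at this threshold, using the "+1" correction.
        if (decoys + 1) / max(targets, 1) <= alpha:
            best = targets
    return best


# Toy data: ten high-scoring targets, one decoy, then two more targets.
toy = [(float(s), False) for s in range(20, 10, -1)]
toy += [(10.0, True), (9.0, False), (8.0, False)]
print(discoveries_at(toy, 0.1))
```

With the toy data, the estimate after the first ten targets is 1/10, so a threshold of 0.1 admits exactly those ten.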


Subject(s)
Algorithms , Peptides , Databases, Protein , Peptides/chemistry , Proteins/analysis , Proteomics/methods
5.
J Proteome Res ; 22(7): 2172-2178, 2023 Jul 07.
Article in English | MEDLINE | ID: mdl-37261867

ABSTRACT

Controlling the false discovery rate (FDR) among discoveries from a tandem mass spectrometry proteomics experiment using target decoy competition (TDC) controls only the proportion of false discoveries in an average sense. Thus, for any particular analysis, even with a valid FDR control procedure, the proportion of false discoveries (the FDP) may be higher than the specified FDR threshold. We demonstrate this phenomenon using real data and describe two recently developed methods that help bridge the gap between controlling the expected or average rate of false discoveries and the empirical rate (FDP). The FDP Stepdown method controls the FDP at any desired confidence level, and the TDC Uniform Band provides a confidence, or upper prediction bound, on the FDP in TDC's list of discoveries.


Subject(s)
Algorithms , Proteomics , Databases, Protein , Proteomics/methods , Tandem Mass Spectrometry
6.
Biometrics ; 79(4): 3472-3484, 2023 Dec.
Article in English | MEDLINE | ID: mdl-36652258

ABSTRACT

Recently, Barber and Candès laid the theoretical foundation for a general framework for false discovery rate (FDR) control based on the notion of "knockoffs." A closely related FDR control methodology has long been employed in the analysis of mass spectrometry data, referred to there as "target-decoy competition" (TDC). However, any approach that aims to control the FDR, which is defined as the expected value of the false discovery proportion (FDP), suffers from a problem. Specifically, even when successfully controlling the FDR at level α, the FDP in the list of discoveries can significantly exceed α. We offer FDP-SD, a new procedure that rigorously controls the FDP in the knockoff/TDC competition setup by guaranteeing that the FDP is bounded by α at a desired confidence level. Compared with the recently published framework of Katsevich and Ramdas, FDP-SD generally delivers more power and often substantially so in simulated and real data.
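The distinction between FDR and FDP drawn above can be written out in standard notation. The symbols are assumptions introduced here and do not appear in the abstract: V is the number of false discoveries, R the total number of discoveries, and 1 - γ the desired confidence level.

```latex
\mathrm{FDP} = \frac{V}{\max(R, 1)}, \qquad
\mathrm{FDR} = \mathbb{E}\left[\mathrm{FDP}\right]
% FDR control at level alpha bounds only the expectation:
\mathbb{E}\left[\mathrm{FDP}\right] \le \alpha
% FDP control at confidence level 1 - gamma bounds the realized proportion:
\Pr\left(\mathrm{FDP} \le \alpha\right) \ge 1 - \gamma
```

The first guarantee holds on average over repeated experiments, so the FDP in any single list of discoveries can still exceed α; the second guarantee addresses that single list directly.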


Subject(s)
Algorithms , Mass Spectrometry , False Positive Reactions
7.
Methods Mol Biol ; 2426: 25-34, 2023.
Article in English | MEDLINE | ID: mdl-36308683

ABSTRACT

Target-decoy competition has been commonly used for over a decade to control the false discovery rate when analyzing tandem mass spectrometry (MS/MS) data. We recently developed a framework that uses multiple decoys to increase the number of detected peptides in MS/MS data. Here, we present a pipeline of Apache licensed, open-source software that allows the user to readily take advantage of our framework.


Subject(s)
Proteomics , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Proteomics/methods , Peptides/chemistry , Software , Databases, Protein , Algorithms
8.
J Proteome Res ; 21(10): 2412-2420, 2022 Oct 07.
Article in English | MEDLINE | ID: mdl-36166314

ABSTRACT

The analysis of shotgun proteomics data often involves generating lists of inferred peptide-spectrum matches (PSMs) and/or of peptides. The canonical approach for generating these discovery lists is by controlling the false discovery rate (FDR), most commonly through target-decoy competition (TDC). At the PSM level, TDC is implemented by competing each spectrum's best-scoring target (real) peptide match with its best match against a decoy database. This PSM-level procedure can be adapted to the peptide level by selecting the top-scoring PSM per peptide prior to FDR estimation. Here, we first highlight and empirically augment a little-known previous work by He et al., which showed that TDC-based PSM-level FDR estimates can be liberally biased. We thus propose that researchers instead focus on peptide-level analysis. We then investigate three ways to carry out peptide-level TDC and show that the most common method ("PSM-only") offers the lowest statistical power in practice. An alternative approach that carries out a double competition, first at the PSM and then at the peptide level ("PSM-and-peptide"), is the most powerful method, yielding an average increase of 17% more discovered peptides at a 1% FDR threshold relative to the PSM-only method.
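The two steps named in the abstract, a per-spectrum target-decoy competition followed by selection of the top-scoring PSM per peptide, might be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `PSM` record and both function names are hypothetical, and ties are resolved in favour of whichever match is seen first, whereas a careful implementation would break ties at random.

```python
from typing import Dict, Iterable, List, NamedTuple


class PSM(NamedTuple):
    """Hypothetical record for one peptide-spectrum match."""
    spectrum_id: str
    peptide: str
    score: float
    is_decoy: bool


def compete_per_spectrum(targets: Iterable[PSM], decoys: Iterable[PSM]) -> List[PSM]:
    """Keep, for each spectrum, the higher-scoring of its target and decoy matches."""
    best: Dict[str, PSM] = {}
    for psm in list(targets) + list(decoys):
        current = best.get(psm.spectrum_id)
        if current is None or psm.score > current.score:
            best[psm.spectrum_id] = psm
    return list(best.values())


def top_psm_per_peptide(psms: Iterable[PSM]) -> List[PSM]:
    """Reduce PSMs to one entry per peptide (its top-scoring PSM), best first."""
    best: Dict[str, PSM] = {}
    for psm in psms:
        current = best.get(psm.peptide)
        if current is None or psm.score > current.score:
            best[psm.peptide] = psm
    return sorted(best.values(), key=lambda p: p.score, reverse=True)


target_psms = [PSM("s1", "PEPA", 5.0, False), PSM("s2", "PEPA", 3.0, False),
               PSM("s3", "PEPB", 1.0, False)]
decoy_psms = [PSM("s1", "APEP", 2.0, True), PSM("s2", "BPEP", 1.0, True),
              PSM("s3", "CPEP", 4.0, True)]
peptides = top_psm_per_peptide(compete_per_spectrum(target_psms, decoy_psms))
```

The resulting peptide list, with its decoy labels, is what an FDR estimate would then be computed on.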


Subject(s)
Algorithms , Tandem Mass Spectrometry , Databases, Protein , Peptides/analysis , Proteomics/methods , Tandem Mass Spectrometry/methods
9.
Bioinformatics ; 38(Suppl_2): ii82-ii88, 2022 Sep 16.
Article in English | MEDLINE | ID: mdl-36124786

ABSTRACT

MOTIVATION: Target-decoy competition (TDC) is a commonly used method for false discovery rate (FDR) control in the analysis of tandem mass spectrometry data. This type of competition-based FDR control has recently gained significant popularity in other fields after Barber and Candès laid its theoretical foundation in a more general setting that included the feature selection problem. In both cases, the competition is based on a head-to-head comparison between an (observed) target score and a corresponding decoy (knockoff) score. However, the effectiveness of TDC depends on whether the data are homogeneous, which is often not the case: in many settings, the data consist of groups with different score profiles or different proportions of true nulls. In such cases, applying TDC while ignoring the group structure often yields imbalanced lists of discoveries, where some groups might include relatively many false discoveries and other groups relatively few. On the other hand, as we show, the alternative approach of applying TDC separately to each group does not rigorously control the FDR. RESULTS: We developed Group-walk, a procedure that controls the FDR in the target-decoy/knockoff setting while taking into account a given group structure. Group-walk is derived from the recently developed AdaPT, a general framework for controlling the FDR with side-information. We show using simulated and real datasets that when the data naturally divide into groups with different characteristics, Group-walk can deliver consistent power gains that in some cases are substantial. These groupings include the precursor charge state (4% more discovered peptides at a 1% FDR threshold), the peptide length (3.6% increase) and the mass difference due to modifications (26% increase). AVAILABILITY AND IMPLEMENTATION: Group-walk is available at https://cran.r-project.org/web/packages/groupwalk/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteomics , Tandem Mass Spectrometry , Peptides/chemistry , Proteomics/methods , Tandem Mass Spectrometry/methods
10.
J Proteome Res ; 21(6): 1382-1391, 2022 Jun 03.
Article in English | MEDLINE | ID: mdl-35549345

ABSTRACT

Advances in library-based methods for peptide detection from data-independent acquisition (DIA) mass spectrometry have made it possible to detect and quantify tens of thousands of peptides in a single mass spectrometry run. However, many of these methods rely on a comprehensive, high-quality spectral library containing information about the expected retention time and fragmentation patterns of peptides in the sample. Empirical spectral libraries are often generated through data-dependent acquisition and may suffer from biases as a result. Spectral libraries can be generated in silico, but these models are not trained to handle all possible post-translational modifications. Here, we propose a false discovery rate-controlled spectrum-centric search workflow to generate spectral libraries directly from gas-phase fractionated DIA tandem mass spectrometry data. We demonstrate that this strategy is able to detect phosphorylated peptides and can be used to generate a spectral library for accurate peptide detection and quantitation in wide-window DIA data. We compare the results of this search workflow to other library-free approaches and demonstrate that our search is competitive in terms of accuracy and sensitivity. These results demonstrate that the proposed workflow has the capacity to generate spectral libraries while avoiding the limitations of other methods.


Subject(s)
Peptides , Tandem Mass Spectrometry , Peptide Library , Peptides/analysis , Protein Processing, Post-Translational , Proteome/analysis , Tandem Mass Spectrometry/methods , Workflow
11.
J Proteome Res ; 20(8): 4153-4164, 2021 Aug 06.
Article in English | MEDLINE | ID: mdl-34236864

ABSTRACT

The standard proteomics database search strategy involves searching spectra against a peptide database and estimating the false discovery rate (FDR) of the resulting set of peptide-spectrum matches. One assumption of this protocol is that all the peptides in the database are relevant to the hypothesis being investigated. However, in settings where researchers are interested in a subset of peptides, alternative search and FDR control strategies are needed. Recently, two methods were proposed to address this problem: subset-search and all-sub. We show that both methods fail to control the FDR. For subset-search, this failure is due to the presence of "neighbor" peptides, which are defined as irrelevant peptides with a similar precursor mass and fragmentation spectrum as a relevant peptide. Not considering neighbors compromises the FDR estimate because a spectrum generated by an irrelevant peptide can incorrectly match well to a relevant peptide. Therefore, we have developed a new method, "subset-neighbor search" (SNS), that accounts for neighbor peptides. We show evidence that SNS controls the FDR when neighbors are present and that SNS outperforms group-FDR, the only other method that appears to control the FDR relative to a subset of relevant peptides.


Subject(s)
Algorithms , Tandem Mass Spectrometry , Databases, Protein , Humans , Peptides , Proteomics
12.
J Proteome Res ; 18(2): 585-593, 2019 Feb 01.
Article in English | MEDLINE | ID: mdl-30560673

ABSTRACT

Decoy database search with target-decoy competition (TDC) provides an intuitive, easy-to-implement method for estimating the false discovery rate (FDR) associated with spectrum identifications from shotgun proteomics data. However, the procedure can yield different results for a fixed data set analyzed with different decoy databases, and this decoy-induced variability is particularly problematic for smaller FDR thresholds, data sets, or databases. The average TDC (aTDC) protocol combats this problem by exploiting multiple independently shuffled decoy databases to provide an FDR estimate with reduced variability. We provide a tutorial introduction to aTDC, describe an improved variant of the protocol that offers increased statistical power, and discuss how to deploy aTDC in practice using the Crux software toolkit.


Subject(s)
Databases, Protein/standards , Proteomics/methods , Software , Datasets as Topic , Humans , Models, Statistical , Reproducibility of Results
13.
J Am Stat Assoc ; 113(523): 973-982, 2018.
Article in English | MEDLINE | ID: mdl-30546175

ABSTRACT

We consider the problem of controlling the FDR among discoveries from searching an incomplete database. This problem differs from the classical multiple testing setting because there are two different types of false discoveries: those arising from objects that have no match in the database and those that are incorrectly matched. We show that commonly used FDR controlling procedures are inadequate for this setup, a special case of which is tandem mass spectrum identification. We then derive a novel FDR controlling approach which extensive simulations suggest is unbiased. We also compare its performance with problem-specific as well as general FDR controlling procedures using both simulated and real mass spectrometry data.

15.
Res Comput Mol Biol ; 10229: 99-116, 2017 May.
Article in English | MEDLINE | ID: mdl-29326989

ABSTRACT

Estimating the false discovery rate (FDR) among a list of tandem mass spectrum identifications is mostly done through target-decoy competition (TDC). Here we offer two new methods that can use an arbitrarily small number of additional randomly drawn decoy databases to improve TDC. Specifically, "Partial Calibration" utilizes a new meta-scoring scheme that allows us to gradually benefit from the increase in the number of identifications that calibration yields, and "Averaged TDC" (a-TDC) reduces the liberal bias of TDC for small FDR values and its variability throughout. Combining a-TDC with "Progressive Calibration" (PC), which attempts to find the "right" number of decoys required for calibration, we see substantial impact in real datasets: when analyzing the Plasmodium falciparum data, it typically yields almost the entire 17% increase in discoveries that "full calibration" yields (at FDR level 0.05) using 60 times fewer decoys. Our methods are further validated using a novel realistic simulation scheme and, importantly, they apply more generally to the problem of controlling the FDR among discoveries from searching an incomplete database.

18.
J Comput Biol ; 23(6): 508-25, 2016 Jun.
Article in English | MEDLINE | ID: mdl-27138444

ABSTRACT

Young et al. (2010) showed that, due to gene length bias, the popular Fisher's Exact Test should not be used to study the association between a group of differentially expressed (DE) genes and a specific Gene Ontology (GO) category. Instead they suggest a test where one conditions on the genes in the GO category and draws the pseudo-DE genes according to a length-dependent distribution. The same model was presented in a different context by Kazemian et al. (2011), who went on to offer a dynamic programming (DP) algorithm to exactly compute the significance of the proposed test. Here we point out that while valid, the test proposed by these authors is no longer symmetric as Fisher's Exact Test is: one gets different answers if one conditions on the observed GO category than on the DE set. As an alternative we offer a symmetric generalization of Fisher's Exact Test and provide efficient algorithms to evaluate its significance.
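The symmetry of the classical (unweighted) Fisher's Exact Test mentioned above can be checked numerically. The sketch below computes the one-sided hypergeometric tail with exact rational arithmetic; the function name and argument names are assumptions for this example, and it covers only the standard test, not the length-aware tests the abstract discusses.

```python
from fractions import Fraction
from math import comb


def fisher_upper_tail(n_genes: int, n_category: int, n_de: int, overlap: int) -> Fraction:
    """One-sided Fisher's exact p-value P(X >= overlap).

    X counts category members among n_de genes drawn without replacement
    from n_genes genes, n_category of which belong to the category.
    """
    upper = min(n_category, n_de)
    favourable = sum(
        comb(n_category, i) * comb(n_genes - n_category, n_de - i)
        for i in range(overlap, upper + 1)
    )
    return Fraction(favourable, comb(n_genes, n_de))


# Conditioning on the category or on the DE set gives the same p-value.
p_given_category = fisher_upper_tail(20, 5, 6, 3)
p_given_de_set = fisher_upper_tail(20, 6, 5, 3)
print(p_given_category == p_given_de_set)
```

Swapping the roles of the category size and the DE set size leaves the result unchanged, which is the symmetry that a length-dependent sampling model gives up.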


Subject(s)
Computational Biology/methods , Gene Expression , Algorithms , Gene Ontology
19.
J Proteome Res ; 14(8): 3148-61, 2015 Aug 07.
Article in English | MEDLINE | ID: mdl-26152888

ABSTRACT

Interpreting the potentially vast number of hypotheses generated by a shotgun proteomics experiment requires a valid and accurate procedure for assigning statistical confidence estimates to identified tandem mass spectra. Despite the crucial role such procedures play in most high-throughput proteomics experiments, the scientific literature has not reached a consensus about the best confidence estimation methodology. In this work, we evaluate, using theoretical and empirical analysis, four previously proposed protocols for estimating the false discovery rate (FDR) associated with a set of identified tandem mass spectra: two variants of the target-decoy competition protocol (TDC) of Elias and Gygi and two variants of the separate target-decoy search protocol of Käll et al. Our analysis reveals significant biases in the two separate target-decoy search protocols. Moreover, the one TDC protocol that provides an unbiased FDR estimate among the target PSMs does so at the cost of forfeiting a random subset of high-scoring spectrum identifications. We therefore propose the mix-max procedure to provide unbiased, accurate FDR estimates in the presence of well-calibrated scores. The method avoids biases associated with the two separate target-decoy search protocols and also avoids the propensity for target-decoy competition to discard a random subset of high-scoring target identifications.


Subject(s)
Algorithms , Computational Biology , Peptides/metabolism , Proteomics/methods , Tandem Mass Spectrometry/methods , Amino Acid Sequence , Animals , Caenorhabditis elegans Proteins/metabolism , Computational Biology/methods , Databases, Protein , False Positive Reactions , Plasmodium falciparum/metabolism , Proteomics/standards , Protozoan Proteins/metabolism , Reproducibility of Results , Saccharomyces cerevisiae Proteins/metabolism , Tandem Mass Spectrometry/standards
20.
J Proteome Res ; 14(8): 3027-38, 2015 Aug 07.
Article in English | MEDLINE | ID: mdl-26084232

ABSTRACT

Accurate assignment of peptide sequences to observed fragmentation spectra is hindered by the large number of hypotheses that must be considered for each observed spectrum. A high score assigned to a particular peptide-spectrum match (PSM) may not end up being statistically significant after multiple testing correction. Researchers can mitigate this problem by controlling the hypothesis space in various ways: considering only peptides resulting from enzymatic cleavages, ignoring possible post-translational modifications or single nucleotide variants, etc. However, these strategies sacrifice identifications of spectra generated by rarer types of peptides. In this work, we introduce a statistical testing framework, cascade search, that directly addresses this problem. The method requires that the user specify a priori a statistical confidence threshold as well as a series of peptide databases. For instance, such a cascade of databases could include fully tryptic, semitryptic, and nonenzymatic peptides or peptides with increasing numbers of modifications. Cascade search then gradually expands the list of candidate peptides from more likely peptides toward rare peptides, sequestering at each stage any spectrum that is identified with a specified statistical confidence. We compare cascade search to a standard procedure that lumps all of the peptides into a single database, as well as to a previously described group FDR procedure that computes the FDR separately within each database. We demonstrate, using simulated and real data, that cascade search identifies more spectra at a fixed FDR threshold than with either the ungrouped or grouped approach. Cascade search thus provides a general method for maximizing the number of identified spectra in a statistically rigorous fashion.
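The control flow of the cascade described above (search one database at a time, sequester confidently identified spectra, pass the rest on) can be sketched as follows. This shows only the staging logic under stated assumptions: each stage is a hypothetical callable standing in for "search this database and keep spectra identified at the chosen confidence threshold", and the statistical procedure inside a stage is not modelled.

```python
from typing import Callable, Dict, Iterable, List, Set

# A stage receives the still-unidentified spectrum ids and returns the ids it
# identifies at the user-chosen confidence threshold (hypothetical interface).
Stage = Callable[[Set[str]], Set[str]]


def cascade(spectra: Iterable[str], stages: List[Stage]) -> Dict[str, int]:
    """Map each identified spectrum id to the index of the stage that identified it."""
    remaining = set(spectra)
    assigned: Dict[str, int] = {}
    for index, identify in enumerate(stages):
        # Only spectra not sequestered by an earlier stage are eligible.
        accepted = identify(set(remaining)) & remaining
        for spectrum_id in accepted:
            assigned[spectrum_id] = index
        remaining -= accepted
    return assigned


# Stub stages standing in for, e.g., fully tryptic and then semitryptic searches.
def stage_one(ids: Set[str]) -> Set[str]:
    return ids & {"a", "b"}


def stage_two(ids: Set[str]) -> Set[str]:
    return ids & {"b", "c"}


result = cascade(["a", "b", "c", "d"], [stage_one, stage_two])
```

In this toy run, spectrum "b" would also match in the second stage, but it never reaches that stage because the first one has already sequestered it.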


Subject(s)
Algorithms , Peptides/analysis , Proteomics/methods , Tandem Mass Spectrometry/methods , Cell Line , Computer Simulation , Databases, Protein , Humans , Peptides/metabolism , Protein Isoforms/analysis , Protein Isoforms/metabolism , Protein Processing, Post-Translational , Reproducibility of Results , Saccharomyces cerevisiae Proteins/analysis , Saccharomyces cerevisiae Proteins/metabolism