Estimating error rates for single molecule protein sequencing experiments.

Smith, Matthew Beauregard; VanderVelden, Kent; Blom, Thomas; Stout, Heather D; Mapes, James H; Folsom, Tucker M; Martin, Christopher; Bardo, Angela M; Marcotte, Edward M.

PLoS Comput Biol ; 20(7): e1012258, 2024 Jul.

Article in English | MEDLINE | ID: mdl-38968291

ABSTRACT

The practical application of new single molecule protein sequencing (SMPS) technologies requires accurate estimates of their associated sequencing error rates. Here, we describe the development and application of two distinct parameter estimation methods for analyzing SMPS reads produced by fluorosequencing. The first, a Hidden Markov Model (HMM) based approach, extends whatprot, in which we previously used HMMs for SMPS peptide-read matching. This extension offers a principled approach for estimating key parameters for fluorosequencing experiments, including missed amino acid cleavages, dye loss, and peptide detachment. Specifically, we adapted the Baum-Welch algorithm, a standard technique that uses expectation maximization to estimate the transition probabilities of an HMM, and modified it to estimate a small number of parameter values directly rather than estimating every transition probability independently. We demonstrate a high degree of accuracy on simulated data, but on experimental datasets we observed that the model needed to be augmented with an additional error type, N-terminal blocking. This, in combination with data pre-processing, results in reasonable parameterizations of experimental datasets that agree with controlled experimental perturbations. We also developed a second, independent implementation that uses a hybrid of DIRECT and Powell's method to reduce the root mean squared error (RMSE) between simulations and the real dataset. We compare these methods on both simulated and real data, finding that our Baum-Welch based approach outperforms DIRECT and Powell's method by most, but not all, criteria. Although some discrepancies between the results exist, we also find that both approaches provide similar error rate estimates from experimental single molecule fluorosequencing datasets.
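
The idea of estimating one shared parameter instead of every transition probability can be illustrated with a toy model. The sketch below is not the whatprot implementation; it assumes a deliberately simplified setting in which each peptide starts with a known number of dyes, each dye is lost independently with the same per-cycle probability, and the measured intensity is the remaining dye count plus Gaussian noise of known width. All function names and numeric settings are hypothetical. The E-step is the usual forward-backward pass; the M-step pools the expected dye losses over all transitions into a single estimate.

```python
import math
import random


def binom_pmf(n, j, p):
    """Probability that j of n dyes are lost in one cycle."""
    return math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)


def gauss_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normalize(v):
    s = sum(v)
    return [x / s for x in v]


def simulate_reads(n_reads, n_cycles, n_dyes, p_loss, sigma, rng):
    """Toy simulator (assumption): intensity = dyes remaining + Gaussian noise."""
    reads = []
    for _ in range(n_reads):
        k, read = n_dyes, []
        for _ in range(n_cycles):
            read.append(rng.gauss(k, sigma))
            k -= sum(rng.random() < p_loss for _ in range(k))
        reads.append(read)
    return reads


def em_update(reads, n_dyes, p_loss, sigma):
    """One EM iteration for the single shared dye-loss probability."""
    states = range(n_dyes + 1)  # hidden state = number of dyes remaining
    # Every entry of the transition matrix is a function of p_loss alone.
    T = [[binom_pmf(k, k - m, p_loss) if m <= k else 0.0 for m in states]
         for k in states]
    lost = at_risk = 0.0
    for read in reads:
        n = len(read)
        emit = [[gauss_pdf(x, k, sigma) for k in states] for x in read]
        # Forward pass, rescaled at each step; every read starts with all dyes.
        alpha = [normalize([emit[0][k] if k == n_dyes else 0.0 for k in states])]
        for t in range(1, n):
            alpha.append(normalize(
                [emit[t][m] * sum(alpha[t - 1][k] * T[k][m] for k in states)
                 for m in states]))
        # Backward pass, also rescaled.
        beta = [[1.0] * (n_dyes + 1) for _ in range(n)]
        for t in range(n - 2, -1, -1):
            beta[t] = normalize(
                [sum(T[k][m] * emit[t + 1][m] * beta[t + 1][m] for m in states)
                 for k in states])
        # Expected transition counts, pooled into two sufficient statistics.
        for t in range(n - 1):
            xi = [[alpha[t][k] * T[k][m] * emit[t + 1][m] * beta[t + 1][m]
                   for m in states] for k in states]
            total = sum(map(sum, xi))
            for k in states:
                for m in range(k + 1):
                    lost += (k - m) * xi[k][m] / total
                    at_risk += k * xi[k][m] / total
    # Tied M-step: expected dyes lost divided by expected dye-cycles at risk.
    return lost / at_risk


def fit_dye_loss(reads, n_dyes, sigma, p_init=0.5, n_iter=25):
    p = p_init
    for _ in range(n_iter):
        p = em_update(reads, n_dyes, p, sigma)
    return p


if __name__ == "__main__":
    rng = random.Random(0)
    reads = simulate_reads(300, 10, 2, 0.1, 0.2, rng)
    print(fit_dye_loss(reads, n_dyes=2, sigma=0.2))
```

Because the whole transition matrix is a function of one number, the M-step needs only two pooled totals. A full Baum-Welch update would instead estimate each row of the matrix separately. The real fluorosequencing model has further parameters (missed cleavage, detachment, N-terminal blocking) that this toy omits.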

Subject(s)

Algorithms , Markov Chains , Sequence Analysis, Protein , Sequence Analysis, Protein/methods , Proteins/chemistry , Computational Biology/methods , Single Molecule Imaging/methods , Computer Simulation

Estimating error rates for single molecule protein sequencing experiments.

Smith, Matthew Beauregard; VanderVelden, Kent; Blom, Thomas; Stout, Heather D; Mapes, James H; Folsom, Tucker M; Martin, Christopher; Bardo, Angela M; Marcotte, Edward M.

bioRxiv ; 2023 Jul 19.

Article in English | MEDLINE | ID: mdl-37502879

ABSTRACT

The practical application of new single molecule protein sequencing (SMPS) technologies requires accurate estimates of their associated sequencing error rates. Here, we describe the development and application of two distinct parameter estimation methods for analyzing SMPS reads produced by fluorosequencing. The first, a Hidden Markov Model (HMM) based approach, extends whatprot, in which we previously used HMMs for SMPS peptide-read matching. This extension offers a principled approach for estimating key parameters for fluorosequencing experiments, including missed amino acid cleavages, dye loss, and peptide detachment. Specifically, we adapted the Baum-Welch algorithm, a standard technique that uses expectation maximization to estimate the transition probabilities of an HMM, and modified it to estimate a small number of parameter values directly rather than estimating every transition probability independently, which should help prevent overfitting. We demonstrate a high degree of accuracy on simulated data, but on experimental datasets we observed that the model needed to be augmented with an additional error type, N-terminal blocking. This, in combination with data pre-processing, results in reasonable parameterizations of experimental datasets that agree with controlled experimental perturbations. We also developed a second, independent implementation that uses a hybrid of DIRECT and Powell's method to reduce the root mean squared error (RMSE) between simulations and the real dataset. We compare these methods on both simulated and real data, finding that our Baum-Welch based approach outperforms DIRECT and Powell's method by most, but not all, criteria. Although some discrepancies between the results exist, we also find that both approaches provide similar error rate estimates from experimental single molecule fluorosequencing datasets.

Amino acid sequence assignment from single molecule peptide sequencing data using a two-stage classifier.

Smith, Matthew Beauregard; Simpson, Zack Booth; Marcotte, Edward M.

PLoS Comput Biol ; 19(5): e1011157, 2023 May.

Article in English | MEDLINE | ID: mdl-37253025

ABSTRACT

We present a machine learning-based interpretive framework (whatprot) for analyzing single molecule protein sequencing data produced by fluorosequencing, a recently developed proteomics technology that determines sparse amino acid sequences for many individual peptide molecules in a highly parallelized fashion. Whatprot uses Hidden Markov Models (HMMs) to represent the states of each peptide undergoing the various chemical processes during fluorosequencing, and applies these in a Bayesian classifier, in combination with pre-filtering by a k-Nearest Neighbors (kNN) classifier trained on large volumes of simulated fluorosequencing data. We have found that by combining the HMM based Bayesian classifier with the kNN pre-filter, we are able to retain the benefits of both, achieving both tractable runtimes and acceptable precision and recall for identifying peptides and their parent proteins from complex mixtures, outperforming the capabilities of either classifier on its own. Whatprot's hybrid kNN-HMM approach enables the efficient interpretation of fluorosequencing data using a full proteome reference database and should now also enable improved sequencing error rate estimates.
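
The two-stage structure described in this abstract, a cheap nearest-neighbor shortlist followed by Bayesian scoring of only the shortlisted candidates, can be sketched as follows. This is an illustration under stated assumptions, not whatprot's code: the reference points are idealized error-free dye-count tracks rather than simulated reads, the likelihood is a plain Gaussian standing in for the HMM, and the peptide names, dye positions, and parameter values are invented for the example.

```python
import math


def ideal_track(dye_cycles, n_cycles):
    """Expected dye count per cycle if each dye leaves exactly at its cycle."""
    return [sum(1 for c in dye_cycles if c > t) for t in range(n_cycles)]


def knn_shortlist(read, tracks, k):
    """Stage 1: keep the k candidates whose tracks are nearest to the read."""
    return sorted(tracks, key=lambda name: math.dist(read, tracks[name]))[:k]


def log_likelihood(read, track, sigma):
    """Gaussian stand-in (assumption) for the HMM likelihood of a read."""
    norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((x - mu) / sigma) ** 2 - norm
               for x, mu in zip(read, track))


def classify(read, tracks, priors, k=2, sigma=0.2):
    """Stage 2: Bayesian posterior restricted to the shortlisted candidates."""
    shortlist = knn_shortlist(read, tracks, k)
    log_post = {name: math.log(priors[name])
                + log_likelihood(read, tracks[name], sigma)
                for name in shortlist}
    top = max(log_post.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(v - top) for name, v in log_post.items()}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}


if __name__ == "__main__":
    n_cycles = 6
    # Hypothetical peptides, given as the cycles at which their dyes are removed.
    peptides = {"A": [1, 3], "B": [2, 5], "C": [4], "D": [2, 6]}
    tracks = {name: ideal_track(c, n_cycles) for name, c in peptides.items()}
    priors = {name: 1.0 / len(peptides) for name in peptides}
    read = [2.1, 1.9, 1.0, 0.9, 1.1, 0.1]
    print(classify(read, tracks, priors))
```

The expensive likelihood is evaluated for only k candidates per read rather than for the whole reference set, which is where the runtime saving comes from. The posterior is normalized over the shortlist, so a candidate excluded by the first stage can never be recovered by the second.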

Subject(s)

Algorithms , Peptides , Amino Acid Sequence , Bayes Theorem , Peptides/genetics , Peptides/chemistry , Proteins/chemistry , Markov Chains
