Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 23
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
PeerJ Comput Sci ; 7: e501, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-34013031

RESUMO

Selection and sorting the Cartesian sum, X + Y, are classic and important problems. Here, a new algorithm is presented, which generates the top k values of the form X i + Y j . The algorithm relies on layer-ordered heaps, partial orderings of exponentially sized layers. The algorithm relies only on median-of-medians and is simple to implement. Furthermore, it uses data structures contiguous in memory, cache efficient, and fast in practice. The presented algorithm is demonstrated to be theoretically optimal.

2.
PeerJ Comput Sci ; 7: e483, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-33987456

RESUMO

Selection on the Cartesian product is a classic problem in computer science. Recently, an optimal algorithm for selection on A + B, based on soft heaps, was introduced. By combining this approach with layer-ordered heaps (LOHs), an algorithm using a balanced binary tree of A + B selections was proposed to perform selection on X 1 + X 2 + â‹¯ + X m in o(n⋅m + k⋅m), where X i have length n. Here, that o(n⋅m + k⋅m) algorithm is combined with a novel, optimal LOH-based algorithm for selection on A + B (without a soft heap). Performance of algorithms for selection on X 1 + X 2 + â‹¯ + X m are compared empirically, demonstrating the benefit of the algorithm proposed here.

3.
J Proteome Res ; 20(4): 1849-1854, 2021 04 02.
Artigo em Inglês | MEDLINE | ID: mdl-33529032

RESUMO

Nonparametric statistical tests are an integral part of scientific experiments in a diverse range of fields. When performing such tests, it is standard to sort values; however, this requires Ω(n log(n)) time to sort n values. Thus given enough data, sorting becomes the computational bottleneck, even with very optimized implementations such as the C++ standard library routine, std::sort. Frequently, a nonparametric statistical test is only used to partition values above and below a threshold in the sorted ordering, where the threshold corresponds to a significant statistical result. Linear-time selection and partitioning algorithms cannot be directly used because the selection and partitioning are performed on the transformed statistical significance values rather than on the sorted statistics. Usually, those transformed statistical significance values (e.g., the p value when investigating the family-wise error rate and q values when investigating the false discovery rate (FDR)) can only be computed at a threshold. Because this threshold is unknown, this leads to sorting the data. Layer-ordered heaps, which can be constructed in O(n), only partially sort values and thus can be used to get around the slow runtime required to fully sort. Here we introduce a layer-ordering-based method for selection and partitioning on the transformed values (e.g., p values or q values). We demonstrate the use of this method to partition peptides using an FDR threshold. This approach is applied to speed up Percolator, a postprocessing algorithm used in mass-spectrometry-based proteomics to evaluate the quality of peptide-spectrum matches (PSMs), by >70% on data sets with 100 million PSMs.


Assuntos
Proteômica , Espectrometria de Massas em Tandem , Algoritmos , Bases de Dados de Proteínas , Peptídeos , Software
4.
Anal Chem ; 92(15): 10613-10619, 2020 08 04.
Artigo em Inglês | MEDLINE | ID: mdl-32663022

RESUMO

Computation of the isotopic distribution of compounds is crucial to applications of mass spectrometry, particularly as machine precision continues to improve. In the past decade, several tools have been created for doing so. In this paper we present a novel algorithm for calculating either the most abundant k isotopologue peaks of a compound or the minimal set of isotopologue peaks which have a combined total abundance of at least p. The algorithm uses Serang's optimal method of selection on Cartesian products. The method is significantly faster than the state-of-the-art on large compounds (e.g., Titin protein) and on compounds whose elements have many isotopes (e.g., palladium alloys).

5.
J Proteome Res ; 19(3): 1060-1072, 2020 03 06.
Artigo em Inglês | MEDLINE | ID: mdl-31975601

RESUMO

Accurate protein inference in the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for large data sets. Here, we present a novel protein inference method, EPIFANY, combining a loopy belief propagation algorithm with convolution trees for efficient processing of Bayesian networks. We demonstrate that EPIFANY combines the reliable protein inference of Bayesian methods with significantly shorter runtimes. On the 2016 iPRG protein inference benchmark data, EPIFANY is the only tested method that finds all true-positive proteins at a 5% protein false discovery rate (FDR) without strict prefiltering on the peptide-spectrum match (PSM) level, yielding an increase in identification performance (+10% in the number of true positives and +14% in partial AUC) compared to previous approaches. Even very large data sets with hundreds of thousands of spectra (which are intractable with other Bayesian and some non-Bayesian tools) can be processed with EPIFANY within minutes. The increased inference quality including shared peptides results in better protein inference results and thus increased robustness of the biological hypotheses generated. EPIFANY is available as open-source software for all major platforms at https://OpenMS.de/epifany.


Assuntos
Algoritmos , Proteômica , Teorema de Bayes , Bases de Dados de Proteínas , Proteínas , Software
6.
J Proteome Res ; 18(9): 3268-3281, 2019 09 06.
Artigo em Inglês | MEDLINE | ID: mdl-31318211

RESUMO

In the metabolomics, glycomics, and mass spectrometry of structured small molecules, the combinatoric nature of the problem renders a database impossibly large, and thus de novo analysis is necessary. De novo analysis requires an alphabet of mass difference values used to link peaks in fragmentation spectra when they are different by a mass in the alphabet divided by a charge. Often, this alphabet is not known, prohibiting de novo analysis. A method is proposed that, given fragmentation mass spectra, identifies an alphabet of m/z differences that can build large connected graphs from many intense peaks in each spectrum from a collection. We then introduce a novel approach to efficiently find recurring substructures in the de novo graph results.


Assuntos
Glicômica/métodos , Espectrometria de Massas/métodos , Metabolômica/métodos , Proteômica/métodos , Sequência de Aminoácidos/genética , Bases de Dados de Proteínas/estatística & dados numéricos , Análise de Sequência de Proteína/métodos
7.
Biochim Biophys Acta ; 1864(10): 1363-71, 2016 10.
Artigo em Inglês | MEDLINE | ID: mdl-27426920

RESUMO

We describe in detail the usage of leucine metabolic labelling in yeast in order to monitor quantitative proteome alterations, e.g. upon removal of a protease. Since laboratory yeast strains are typically leucine auxotroph, metabolic labelling with trideuterated leucine (d3-leucine) is a straightforward, cost-effective, and ubiquitously applicable strategy for quantitative proteomic studies, similar to the widely used arginine/lysine metabolic labelling method for mammalian cells. We showcase the usage of advanced peptide quantification using the FeatureFinderMultiplex algorithm (part of the OpenMS software package) for robust and reliable quantification. Furthermore, we present an OpenMS bioinformatics data analysis workflow that combines accurate quantification with high proteome coverage. In order to enable visualization, peptide-mapping, and sharing of quantitative proteomic data, especially for membrane-spanning and cell-surface proteins, we further developed the web-application Proteator (http://proteator.appspot.com). Due to its simplicity and robustness, we expect metabolic leucine labelling in yeast to be of great interest to the research community. As an exemplary application, we show the identification of the copper transporter Ctr1 as a putative substrate of the ER-intramembrane protease Ypf1 by yeast membrane proteomics using d3-leucine isotopic labelling.


Assuntos
Retículo Endoplasmático/metabolismo , Leucina/metabolismo , Proteínas de Membrana/metabolismo , Membranas/metabolismo , Peptídeo Hidrolases/metabolismo , Proteoma/metabolismo , Leveduras/metabolismo , Biologia Computacional/métodos , Proteínas Fúngicas/metabolismo , Marcação por Isótopo/métodos , Mapeamento de Peptídeos/métodos , Peptídeos/metabolismo , Proteômica/métodos
8.
J Proteome Res ; 14(10): 4099-103, 2015 Oct 02.
Artigo em Inglês | MEDLINE | ID: mdl-26257019

RESUMO

In any high-throughput scientific study, it is often essential to estimate the percent of findings that are actually incorrect. This percentage is called the false discovery rate (abbreviated "FDR"), and it is an invariant (albeit, often unknown) quantity for any well-formed study. In proteomics, it has become common practice to incorrectly conflate the protein FDR (the percent of identified proteins that are actually absent) with protein-level target-decoy, a particular method for estimating the protein-level FDR. In this manner, the challenges of one approach have been used as the basis for an argument that the field should abstain from protein-level FDR analysis altogether or even the suggestion that the very notion of a protein FDR is flawed. As we demonstrate in simple but accurate simulations, not only is the protein-level FDR an invariant concept, when analyzing large data sets, the failure to properly acknowledge it or to correct for multiple testing can result in large, unrecognized errors, whereby thousands of absent proteins (and, potentially every protein in the FASTA database being considered) can be incorrectly identified.


Assuntos
Artefatos , Proteínas/isolamento & purificação , Proteômica/estatística & dados numéricos , Software , Espectrometria de Massas em Tandem/estatística & dados numéricos , Algoritmos , Bases de Dados de Proteínas , Humanos , Proteômica/métodos
9.
J Comput Biol ; 22(8): 770-83, 2015 Aug.
Artigo em Inglês | MEDLINE | ID: mdl-26161499

RESUMO

Observations depending on sums of random variables are common throughout many fields; however, no efficient solution is currently known for performing max-product inference on these sums of general discrete distributions (max-product inference can be used to obtain maximum a posteriori estimates). The limiting step to max-product inference is the max-convolution problem (sometimes presented in log-transformed form and denoted as "infimal convolution," "min-convolution," or "convolution on the tropical semiring"), for which no O(k log(k)) method is currently known. Presented here is an O(k log(k)) numerical method for estimating the max-convolution of two nonnegative vectors (e.g., two probability mass functions), where k is the length of the larger vector. This numerical max-convolution method is then demonstrated by performing fast max-product inference on a convolution tree, a data structure for performing fast inference given information on the sum of n discrete random variables in O(nk log(nk)log(n)) steps (where each random variable has an arbitrary prior distribution on k contiguous possible states). The numerical max-convolution method can be applied to specialized classes of hidden Markov models to reduce the runtime of computing the Viterbi path from nk(2) to nk log(k), and has potential application to the all-pairs shortest paths problem.


Assuntos
Teorema de Bayes , Biologia Computacional/métodos , Algoritmos , Cadeias de Markov , Modelos Estatísticos
10.
Methods Mol Biol ; 1245: 215-41, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-25373761

RESUMO

Accurate genotyping is essential for building genetic maps and performing genome assembly of polyploid species. Recent high-throughput techniques, such as Illumina GoldenGate™ and Sequenom iPLEX MassARRAY®, have made it possible to accurately estimate the relative abundances of different alleles even when the ploidy of the population is unknown. Here we describe the experimental methods for collecting these relative allele intensities and then demonstrate the practical concerns for inferring genotypes using Bayesian inference via the software package SuperMASSA.


Assuntos
Técnicas de Genotipagem/métodos , Polimorfismo de Nucleotídeo Único/genética , Poliploidia , Software , Alelos , Cruzamentos Genéticos , Loci Gênicos/genética , Genótipo , Saccharum/genética , Espectrometria de Massas por Ionização e Dessorção a Laser Assistida por Matriz , Estatística como Assunto
11.
PLoS One ; 9(3): e91507, 2014.
Artigo em Inglês | MEDLINE | ID: mdl-24626234

RESUMO

Exact Bayesian inference can sometimes be performed efficiently for special cases where a function has commutative and associative symmetry of its inputs (called "causal independence"). For this reason, it is desirable to exploit such symmetry on big data sets. Here we present a method to exploit a general form of this symmetry on probabilistic adder nodes by transforming those probabilistic adder nodes into a probabilistic convolution tree with which dynamic programming computes exact probabilities. A substantial speedup is demonstrated using an illustration example that can arise when identifying splice forms with bottom-up mass spectrometry-based proteomics. On this example, even state-of-the-art exact inference algorithms require a runtime more than exponential in the number of splice forms considered. By using the probabilistic convolution tree, we reduce the runtime to O(k log(k)2) and the space to O(k log(k)) where k is the number of variables joined by an additive or cardinal operator. This approach, which can also be used with junction tree inference, is applicable to graphs with arbitrary dependency on counting variables or cardinalities and can be used on diverse problems and fields like forward error correcting codes, elemental decomposition, and spectral demixing. The approach also trivially generalizes to multiple dimensions.


Assuntos
Teorema de Bayes , Cromatografia Líquida , Proteínas/química , Espectrometria de Massas em Tandem , Algoritmos , Processamento Alternativo , Modelos Estatísticos , Probabilidade , Proteômica/métodos , Processamento de Sinais Assistido por Computador
12.
Sci Rep ; 3: 3399, 2013 Dec 02.
Artigo em Inglês | MEDLINE | ID: mdl-24292365

RESUMO

Many plant species of great economic value (e.g., potato, wheat, cotton, and sugarcane) are polyploids. Despite the essential roles of autopolyploid plants in human activities, our genetic understanding of these species is still poor. Recent progress in instrumentation and biochemical manipulation has led to the accumulation of an incredible amount of genomic data. In this study, we demonstrate for the first time a successful genetic analysis in a highly polyploid genome (sugarcane) by the quantitative analysis of single-nucleotide polymorphism (SNP) allelic dosage and the application of a new data analysis framework. This study provides a better understanding of autopolyploid genomic structure and is a sound basis for genetic studies. The proposed methods can be employed to analyse the genome of any autopolyploid and will permit the future development of high-quality genetic maps to assist in the assembly of reference genome sequences for polyploid species.


Assuntos
Genoma de Planta/genética , Polimorfismo de Nucleotídeo Único/genética , Saccharum/genética , Alelos , Genótipo , Poliploidia
13.
J Proteome Res ; 12(10): 4556-65, 2013 Oct 04.
Artigo em Inglês | MEDLINE | ID: mdl-24024742

RESUMO

Arbitrary cutoffs are ubiquitous in quantitative computational proteomics: maximum acceptable MS/MS PSM or peptide q value, minimum ion intensity to calculate a fold change, the minimum number of peptides that must be available to trust the estimated protein fold change (or the minimum number of PSMs that must be available to trust the estimated peptide fold change), and the "significant" fold change cutoff. Here we introduce a novel experimental setup and nonparametric Bayesian algorithm for determining the statistical quality of a proposed differential set of proteins or peptides. By comparing putatively nonchanging case-control evidence to an empirical null distribution derived from a control-control experiment, we successfully avoid some of these common parameters. We then apply our method to evaluating different fold-change rules and find that for our data a 1.2-fold change is the most permissive of the plausible fold-change rules.


Assuntos
Interpretação Estatística de Dados , Proteoma/metabolismo , Actinas/metabolismo , Algoritmos , Teorema de Bayes , Estudos de Casos e Controles , Proteínas de Ciclo Celular/metabolismo , Células HeLa , Humanos , Proteômica , Estatísticas não Paramétricas , Espectrometria de Massas em Tandem , Tubulina (Proteína)/metabolismo
14.
Mol Cell Proteomics ; 12(6): 1735-40, 2013 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-23443135

RESUMO

The past 15 years have seen significant progress in LC-MS/MS peptide sequencing, including the advent of successful de novo and database search methods; however, analysis of glycopeptide and, more generally, glycoconjugate spectra remains a much more open problem, and much annotation is still performed manually. This is partly because glycans, unlike peptides, need not be linear chains and are instead described by trees. In this study, we introduce SweetSEQer, an extremely simple open source tool for identifying potential glycopeptide MS/MS spectra. We evaluate SweetSEQer on manually curated glycoconjugate spectra and on negative controls, and we demonstrate high quality filtering that can be easily improved for specific applications. We also demonstrate a high overlap between peaks annotated by experts and peaks annotated by SweetSEQer, as well as demonstrate inferred glycan graphs consistent with canonical glycan tree motifs. This study presents a novel tool for annotating spectra and producing glycan graphs from LC-MS/MS spectra. The tool is evaluated and shown to perform similarly to an expert on manually curated data.


Assuntos
Cromatografia Líquida/métodos , Glicoproteínas/isolamento & purificação , Anotação de Sequência Molecular/métodos , Polissacarídeos/isolamento & purificação , Software , Espectrometria de Massas em Tandem/métodos , Glicoproteínas/urina , Humanos , Lactente , Anotação de Sequência Molecular/normas , Polissacarídeos/urina , Análise de Sequência de Proteína
15.
Mol Cell Proteomics ; 12(4): 1017-25, 2013 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-23438733

RESUMO

Determining which glycan moieties occupy specific N-glycosylation sites is a highly challenging analytical task. Arguably, the most common approach involves LC-MS and LC-MS/MS analysis of glycopeptides generated by proteases with high cleavage site specificity; however, the depth achieved by this approach is modest. Nonglycosylated peptides are a major challenge to glycoproteomics, as they are preferentially selected for data-dependent MS/MS due to higher ionization efficiencies and higher stoichiometric levels in moderately complex samples. With the goal of improving glycopeptide coverage, a mass defect classifier was developed that discriminates between peptides and glycopeptides in complex mixtures based on accurate mass measurements of precursor peaks. By using the classifier, glycopeptides that were not fragmented in an initial data-dependent acquisition run may be targeted in a subsequent analysis without any prior knowledge of the glycan or protein species present in the mixture. Additionally, from probable glycopeptides that were poorly fragmented, tandem mass spectra may be reacquired using optimal glycopeptide settings. We demonstrate high sensitivity (0.892) and specificity (0.947) based on an in silico dataset spanning >100,000 tryptic entries. Comparable results were obtained using chymotryptic species. Further validation using published data and a fractionated tryptic digest of human urinary proteins was performed, yielding a sensitivity of 0.90 and a specificity of 0.93. Lists of glycopeptides may be generated from an initial proteomics experiment, and we show they may be efficiently targeted using the classifier. Considering the growing availability of high accuracy mass analyzers, this approach represents a simple and broadly applicable means of increasing the depth of MS/MS-based glycoproteomic analyses.


Assuntos
Glicoproteínas/química , Processamento de Proteína Pós-Traducional , Algoritmos , Animais , Configuração de Carboidratos , Sequência de Carboidratos , Simulação por Computador , Glicoproteínas/metabolismo , Glicosilação , Humanos , Anotação de Sequência Molecular , Dados de Sequência Molecular , Peso Molecular , Fragmentos de Peptídeos/química , Polissacarídeos/química , Polissacarídeos/metabolismo , Proteólise , Proteômica , Espectrometria de Massas em Tandem/métodos
16.
Mol Cell Proteomics ; 12(3): 807-12, 2013 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-23292186

RESUMO

This paper proposes a novel, automated method for evaluating sets of proteins identified using mass spectrometry. The remaining peptide-spectrum match score distributions of protein sets are compared to an empirical absent peptide-spectrum match score distribution, and a Bayesian non-parametric method reminiscent of the Dirichlet process is presented to accurately perform this comparison. Thus, for a given protein set, the process computes the likelihood that the proteins identified are correctly identified. First, the method is used to evaluate protein sets chosen using different protein-level false discovery rate (FDR) thresholds, assigning each protein set a likelihood. The protein set assigned the highest likelihood is used to choose a non-arbitrary protein-level FDR threshold. Because the method can be used to evaluate any protein identification strategy (and is not limited to mere comparisons of different FDR thresholds), we subsequently use the method to compare and evaluate multiple simple methods for merging peptide evidence over replicate experiments. The general statistical approach can be applied to other types of data (e.g. RNA sequencing) and generalizes to multivariate problems.


Assuntos
Teorema de Bayes , Espectrometria de Massas/métodos , Proteínas/análise , Estatísticas não Paramétricas , Células HeLa , Humanos , Modelos Estatísticos , Peptídeos/análise , Peptídeos/química , Probabilidade , Proteínas/química , Proteômica/métodos , Reprodutibilidade dos Testes
18.
J Proteome Res ; 11(12): 5586-91, 2012 Dec 07.
Artigo em Inglês | MEDLINE | ID: mdl-23148905

RESUMO

Parsimony and protein grouping are widely employed to enforce economy in the number of identified proteins, with the goal of increasing the quality and reliability of protein identifications; however, in a counterintuitive manner, parsimony and protein grouping may actually decrease the reproducibility and interpretability of protein identifications. We present a simple illustration demonstrating ways in which parsimony and protein grouping may lower the reproducibility or interpretability of results. We then provide an example of a data set where a probabilistic method increases the reproducibility and interpretability of identifications made on replicate analyses of Human Du145 prostate cancer cell lines.


Assuntos
Espectrometria de Massas/métodos , Proteínas de Neoplasias/isolamento & purificação , Proteômica/métodos , Linhagem Celular Tumoral , Biologia Computacional/métodos , Bases de Dados de Proteínas , Humanos , Masculino , Proteínas de Neoplasias/química , Peptídeos/química , Peptídeos/isolamento & purificação , Neoplasias da Próstata/química , Reprodutibilidade dos Testes , Incerteza
19.
PLoS One ; 7(8): e43706, 2012.
Artigo em Inglês | MEDLINE | ID: mdl-22952741

RESUMO

Linear programming (LP) problems are commonly used in analysis and resource allocation, frequently surfacing as approximations to more difficult problems. Existing approaches to LP have been dominated by a small group of methods, and randomized algorithms have not enjoyed popularity in practice. This paper introduces a novel randomized method of solving LP problems by moving along the facets and within the interior of the polytope along rays randomly sampled from the polyhedral cones defined by the bounding constraints. This conic sampling method is then applied to randomly sampled LPs, and its runtime performance is shown to compare favorably to the simplex and primal affine-scaling algorithms, especially on polytopes with certain characteristics. The conic sampling method is then adapted and applied to solve a certain quadratic program, which compute a projection onto a polytope; the proposed method is shown to outperform the proprietary software Mathematica on large, sparse QP problems constructed from mass spectometry-based proteomics.


Assuntos
Algoritmos , Programação Linear , Proteômica , Processos Estocásticos , Fatores de Tempo
20.
Stat Interface ; 5(1): 3-20, 2012.
Artigo em Inglês | MEDLINE | ID: mdl-22833779

RESUMO

Tandem mass spectrometry has emerged as a powerful tool for the characterization of complex protein samples, an increasingly important problem in biology. The effort to efficiently and accurately perform inference on data from tandem mass spectrometry experiments has resulted in several statistical methods. We use a common framework to describe the predominant methods and discuss them in detail. These methods are classified using the following categories: set cover methods, iterative methods, and Bayesian methods. For each method, we analyze and evaluate the outcome and methodology of published comparisons to other methods; we use this comparison to comment on the qualities and weaknesses, as well as the overall utility, of all methods. We discuss the similarities between these methods and suggest directions for the field that would help unify these similar assumptions in a more rigorous manner and help enable efficient and reliable protein inference.

SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...