Results 1 - 17 of 17
2.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36702456

ABSTRACT

MOTIVATION: Interpretation of newly acquired mass spectrometry data can be improved by identifying, from an online repository, previous mass spectrometry runs that resemble the new data. However, this retrieval task requires computing the similarity between an arbitrary pair of mass spectrometry runs. This is particularly challenging for runs acquired using different experimental protocols. RESULTS: We propose a method, MS1Connect, that calculates the similarity between a pair of runs by examining only the intact peptide (MS1) scans, and we show evidence that the MS1Connect score is accurate. Specifically, we show that MS1Connect outperforms several baseline methods on the task of predicting the species from which a given proteomics sample originated. In addition, we show that MS1Connect scores are highly correlated with similarities computed from fragment (MS2) scans, even though these data are not used by MS1Connect. AVAILABILITY AND IMPLEMENTATION: The MS1Connect software is available at https://github.com/bmx8177/MS1Connect. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Peptides , Software , Mass Spectrometry , Peptides/chemistry , Proteomics/methods
3.
Bioinformatics ; 38(Suppl_2): ii148-ii154, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124797

ABSTRACT

MOTIVATION: A wide variety of experimental methods are available to characterize different properties of single cells in a complex biosample. However, because these measurement techniques are typically destructive, researchers are often presented with complementary measurements from disjoint subsets of cells, providing a fragmented view of the cell's biological processes. This creates a need for computational tools capable of integrating disjoint multi-omics data. Because different measurements typically do not share any features, the problem requires the integration to be done in unsupervised fashion. Recently, several methods have been proposed that project the cell measurements into a common latent space and attempt to align the corresponding low-dimensional manifolds. RESULTS: In this study, we present an approach, Synmatch, which produces a direct matching of the cells between modalities by exploiting information about neighborhood structure in each modality. Synmatch relies on the intuition that cells which are close in one measurement space should be close in the other as well. This allows us to formulate the matching problem as a constrained supermodular optimization problem over neighborhood structures that can be solved efficiently. We show that our approach successfully matches cells in small real multi-omics datasets and performs favorably when compared with recently published state-of-the-art methods. Further, we demonstrate that Synmatch is capable of scaling to large datasets of thousands of cells. AVAILABILITY AND IMPLEMENTATION: The Synmatch code and data used in this manuscript are available at https://github.com/Noble-Lab/synmatch.


Subject(s)
Cells
4.
Nat Methods ; 19(6): 675-678, 2022 06.
Article in English | MEDLINE | ID: mdl-35637305

ABSTRACT

Computational methods that aim to exploit publicly available mass spectrometry repositories rely primarily on unsupervised clustering of spectra. Here we trained a deep neural network in a supervised fashion on the basis of previous assignments of peptides to spectra. The network, called 'GLEAMS', learns to embed spectra in a low-dimensional space in which spectra generated by the same peptide are close to one another. We applied GLEAMS for large-scale spectrum clustering, detecting groups of unidentified, proximal spectra representing the same peptide. We used these clusters to explore the dark proteome of repeatedly observed yet consistently unidentified mass spectra.


Subject(s)
Peptides , Tandem Mass Spectrometry , Algorithms , Cluster Analysis , Neural Networks, Computer , Peptides/chemistry , Proteome/analysis , Tandem Mass Spectrometry/methods
6.
Bioinformatics ; 37(4): 439-447, 2021 05 01.
Article in English | MEDLINE | ID: mdl-32966546

ABSTRACT

MOTIVATION: Successful science often involves not only performing experiments well, but also choosing well among many possible experiments. In a hypothesis generation setting, choosing an experiment well means choosing an experiment whose results are interesting or novel. In this work, we formalize this selection procedure in the context of genomics and epigenomics data generation. Specifically, we consider the task faced by a scientific consortium such as the National Institutes of Health ENCODE Consortium, whose goal is to characterize all of the functional elements in the human genome. Given a list of possible cell types or tissue types ('biosamples') and a list of possible high-throughput sequencing assays, where at least one experiment has been performed in each biosample and for each assay, we ask 'Which experiments should ENCODE perform next?' RESULTS: We demonstrate how to represent this task as a submodular optimization problem, where the goal is to choose a panel of experiments that maximize the facility location function. A key aspect of our approach is that we use imputed data, rather than experimental data, to directly answer the posed question. We find that, across several evaluations, our method chooses a panel of experiments that span a diversity of biochemical activity. Finally, we propose two modifications of the facility location function, including a novel submodular-supermodular function, that allow incorporation of domain knowledge or constraints into the optimization procedure. AVAILABILITY AND IMPLEMENTATION: Our method is available as a Python package at https://github.com/jmschrei/kiwano and can be installed using the command pip install kiwano. The source code used here and the similarity matrix can be found at http://doi.org/10.5281/zenodo.3708538. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
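The selection step described in this abstract can be illustrated with a minimal sketch of greedy maximization of the facility location function, f(S) = sum over all items i of the maximum similarity between i and any chosen item in S. The similarity values, the function names, and the six-experiment toy panel below are assumptions made for illustration; this is not the kiwano package's API, and the paper's own similarity matrix is derived from imputed data.

```python
from typing import List, Sequence


def facility_location(similarity: Sequence[Sequence[float]], chosen: Sequence[int]) -> float:
    """f(S) = sum over all items i of max_{j in S} similarity[i][j]; f(empty) = 0."""
    if not chosen:
        return 0.0
    return sum(max(row[j] for j in chosen) for row in similarity)


def greedy_select(similarity: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k items, each time adding the one with the largest marginal gain."""
    chosen: List[int] = []
    candidates = set(range(len(similarity)))
    for _ in range(min(k, len(similarity))):
        current = facility_location(similarity, chosen)
        best = max(
            sorted(candidates),
            key=lambda j: facility_location(similarity, chosen + [j]) - current,
        )
        chosen.append(best)
        candidates.remove(best)
    return chosen


# Hypothetical similarities: experiments 0-2 form one group, 3-5 another.
TOY_SIMILARITY = [
    [1.0, 0.9, 0.5, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.8, 0.1, 0.1, 0.1],
    [0.5, 0.8, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.7, 0.4],
    [0.1, 0.1, 0.1, 0.7, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.4, 0.7, 1.0],
]

if __name__ == "__main__":
    print(greedy_select(TOY_SIMILARITY, 2))  # prints [1, 4]
```

On this toy panel the greedy procedure picks one central experiment from each group, which is the kind of diversity the facility location objective rewards. For monotone submodular objectives such as this one, the greedy procedure is known to achieve at least a (1 - 1/e) fraction of the optimal value under a cardinality constraint.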


Subject(s)
Computational Biology , Epigenomics , Genomics , Humans , Software , Transcriptome
7.
Genome Biol ; 21(1): 282, 2020 11 19.
Article in English | MEDLINE | ID: mdl-33213499

ABSTRACT

Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
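The pitfall described here can be made concrete with a small simulation. The sketch below builds synthetic activity values (an assumption, not data from the paper) in which a strong per-locus effect is shared by all cell types, and then scores a "model" that simply predicts each locus as its average across the training cell types. Comparing a trained model against such an average-activity baseline is one plausible diagnostic; the abstract does not spell out the authors' specific recommendations, so treat this as illustrative only.

```python
import random
from statistics import mean
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_activity(n_cell_types: int, n_loci: int, seed: int = 0) -> List[List[float]]:
    """Synthetic activity: a strong per-locus effect plus weaker cell-type-specific noise."""
    rng = random.Random(seed)
    locus_effect = [rng.gauss(0.0, 3.0) for _ in range(n_loci)]
    return [
        [effect + rng.gauss(0.0, 1.0) for effect in locus_effect]
        for _ in range(n_cell_types)
    ]


def average_activity_baseline(train: List[List[float]]) -> List[float]:
    """Predict each locus as its mean activity across the training cell types."""
    return [mean(column) for column in zip(*train)]


# Same loci in training and test; only the cell type is held out.
activity = simulate_activity(n_cell_types=5, n_loci=200)
train_cell_types, held_out_cell_type = activity[:4], activity[4]
baseline_prediction = average_activity_baseline(train_cell_types)
baseline_r = pearson(baseline_prediction, held_out_cell_type)

if __name__ == "__main__":
    print(f"Baseline correlation on held-out cell type: {baseline_r:.2f}")
```

The baseline knows nothing about the held-out cell type, yet it correlates strongly with it because the loci are shared. A model evaluated this way should be judged by how much it improves on the baseline, not by its raw score.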


Subject(s)
Epigenomics , Genomics/methods , Machine Learning , Chromatin , Gene Expression , Humans
8.
Genome Biol ; 21(1): 81, 2020 03 30.
Article in English | MEDLINE | ID: mdl-32228704

ABSTRACT

The human epigenome has been experimentally characterized by thousands of measurements for every basepair in the human genome. We propose a deep neural network tensor factorization method, Avocado, that compresses this epigenomic data into a dense, information-rich representation. We use this learned representation to impute epigenomic data more accurately than previous methods, and we show that machine learning models that exploit this representation outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, replication timing, and an element of 3D chromatin architecture.


Subject(s)
Deep Learning , Epigenome , DNA Replication Timing , Enhancer Elements, Genetic , Gene Expression , Genomics , Humans , Promoter Regions, Genetic
9.
Genome Biol ; 21(1): 82, 2020 03 30.
Article in English | MEDLINE | ID: mdl-32228713

ABSTRACT

Recent efforts to describe the human epigenome have yielded thousands of epigenomic and transcriptomic datasets. However, due primarily to cost, the total number of such assays that can be performed is limited. Accordingly, we applied an imputation approach, Avocado, to a dataset of 3814 tracks of data derived from the ENCODE compendium, including measurements of chromatin accessibility, histone modification, transcription, and protein binding. Avocado shows significant improvements in imputing protein binding compared to the top models in the ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model.


Subject(s)
Epigenesis, Genetic , Chromatin , Chromatin Immunoprecipitation Sequencing , Histone Code , Humans , RNA-Seq , Transcription Factors/metabolism
10.
Genome Biol ; 20(1): 180, 2019 08 28.
Article in English | MEDLINE | ID: mdl-31462275

ABSTRACT

Semi-automated genome annotation methods such as Segway take as input a set of genome-wide measurements such as of histone modification or DNA accessibility and output an annotation of genomic activity in the target cell type. Here we present annotations of 164 human cell types using 1615 data sets. To produce these annotations, we automated the label interpretation step to produce a fully automated annotation strategy. Using these annotations, we developed a measure of the importance of each genomic position called the "conservation-associated activity score." We further combined all annotations into a single, cell type-agnostic encyclopedia that catalogs all human regulatory elements.



Subject(s)
DNA/genetics , Databases, Genetic , Molecular Sequence Annotation , Algorithms , Automation , Cell Line , Humans , Machine Learning , Phenotype , Transcription, Genetic
11.
IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1168-1181, 2019.
Article in English | MEDLINE | ID: mdl-29993658

ABSTRACT

MOTIVATION: Identification of spectra produced by a shotgun proteomics mass spectrometry experiment is commonly performed by searching the observed spectra against a peptide database. The heart of this search procedure is a score function that evaluates the quality of a hypothesized match between an observed spectrum and a theoretical spectrum corresponding to a particular peptide sequence. Accordingly, the success of a spectrum analysis pipeline depends critically upon this peptide-spectrum score function. We develop peptide-spectrum score functions that compute the maximum value of a submodular function under $m$ matroid constraints. We call this procedure a submodular generalized matching (SGM) since it generalizes bipartite matching. We use a greedy algorithm to compute maximization, which can achieve a solution whose objective is guaranteed to be at least $\frac{1}{1+m}$ of the true optimum. The advantage of the SGM framework is that known long-range properties of experimental spectra can be modeled by designing suitable submodular functions and matroid constraints. Experiments on four data sets from various organisms and mass spectrometry platforms show that the SGM approach leads to significantly improved performance compared to several state-of-the-art methods. Supplementary information, C++ source code, and data sets can be found at https://melodi-lab.github.io/SGM.
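The greedy procedure mentioned in this abstract can be sketched generically: repeatedly add the element with the largest positive marginal gain among those that keep the selection independent in every matroid. The example below is an illustration under stated assumptions, not the authors' C++ implementation. It uses bipartite matching between hypothetical observed peaks (`p1`, `p2`) and theoretical ions (`a`, `b`), expressed as m = 2 partition matroids, with an additive edge score, which is the simplest special case of a submodular function. The edge weights and all function names are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (observed peak, theoretical ion); hypothetical labels


def greedy_matroid_max(
    ground_set: Iterable[Edge],
    objective: Callable[[Set[Edge]], float],
    independence_oracles: Sequence[Callable[[Set[Edge]], bool]],
) -> List[Edge]:
    """Greedy maximization subject to every independence oracle accepting the set."""
    selected: List[Edge] = []
    remaining = sorted(ground_set)
    while True:
        current = objective(set(selected))
        best_edge: Optional[Edge] = None
        best_gain = 0.0
        for edge in remaining:
            candidate = set(selected) | {edge}
            if not all(is_independent(candidate) for is_independent in independence_oracles):
                continue  # adding this edge would violate a matroid constraint
            gain = objective(candidate) - current
            if gain > best_gain:
                best_edge, best_gain = edge, gain
        if best_edge is None:
            return selected
        selected.append(best_edge)
        remaining.remove(best_edge)


def each_endpoint_used_once(position: int) -> Callable[[Set[Edge]], bool]:
    """Partition matroid: no two chosen edges share the endpoint at `position`."""
    def is_independent(edges: Set[Edge]) -> bool:
        endpoints = [edge[position] for edge in edges]
        return len(endpoints) == len(set(endpoints))
    return is_independent


# Hypothetical match scores between observed peaks and theoretical ions.
WEIGHTS: Dict[Edge, float] = {("p1", "a"): 3.0, ("p1", "b"): 2.0, ("p2", "a"): 2.0}


def score(edges: Set[Edge]) -> float:
    """Additive (modular) score, a special case of a monotone submodular function."""
    return sum(WEIGHTS[edge] for edge in edges)


matching = greedy_matroid_max(
    WEIGHTS, score, [each_endpoint_used_once(0), each_endpoint_used_once(1)]
)

if __name__ == "__main__":
    print(matching, score(set(matching)))
```

On this toy instance the greedy matching is `[("p1", "a")]` with score 3, while the best feasible matching, `("p1", "b")` plus `("p2", "a")`, scores 4. The greedy result is therefore suboptimal but comfortably within the stated bound, which for m = 2 guarantees at least one third of the optimum.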


Subject(s)
Computational Biology/methods , Peptides/chemistry , Tandem Mass Spectrometry , Algorithms , Animals , Caenorhabditis elegans/chemistry , Calibration , Databases, Protein , Humans , Models, Statistical , Plasmodium falciparum/chemistry , Proteomics/methods , Saccharomyces cerevisiae/chemistry , Software
12.
Proteins ; 86(4): 454-466, 2018 04.
Article in English | MEDLINE | ID: mdl-29345009

ABSTRACT

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.


Subject(s)
Algorithms , Proteins/chemistry , Sequence Analysis, Protein/methods , Cluster Analysis , Proteomics/methods
13.
Bioinformatics ; 34(4): 669-671, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29028889

ABSTRACT

Summary: Segway performs semi-automated genome annotation, discovering joint patterns across multiple genomic signal datasets. We discuss a major new version of Segway and highlight its ability to model data with substantially greater accuracy. Major enhancements in Segway 2.0 include the ability to model data with a mixture of Gaussians, enabling capture of arbitrarily complex signal distributions, and minibatch training, leading to better learned parameters. Availability and implementation: Segway and its source code are freely available for download at http://segway.hoffmanlab.org. We have made available scripts (https://doi.org/10.5281/zenodo.802939) and datasets (https://doi.org/10.5281/zenodo.802906) for this paper's analysis. Contact: michael.hoffman@utoronto.ca. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics/methods , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods , Software , Eukaryota/genetics
14.
Sci Rep ; 7(1): 16943, 2017 12 05.
Article in English | MEDLINE | ID: mdl-29208983

ABSTRACT

A comprehensive characterization of tumor genetic heterogeneity is critical for understanding how cancers evolve and escape treatment. Although many algorithms have been developed for capturing tumor heterogeneity, they are designed for analyzing either a single type of genomic aberration or individual biopsies. Here we present THEMIS (Tumor Heterogeneity Extensible Modeling via an Integrative System), which allows for the joint analysis of different types of genomic aberrations from multiple biopsies taken from the same patient, using a dynamic graphical model. Simulation experiments demonstrate higher accuracy of THEMIS over its ancestor, TITAN. The heterogeneity analysis results from THEMIS are validated with single cell DNA sequencing from a clinical tumor biopsy. When THEMIS is used to analyze tumor heterogeneity among multiple biopsies from the same patient, it helps to reveal the mutation accumulation history, track cancer progression, and identify the mutations related to treatment resistance. We implement our model via an extensible modeling platform, which makes our approach open, reproducible, and easy for others to extend.


Subject(s)
Biopsy/methods , Models, Biological , Neoplasms/pathology , Triple Negative Breast Neoplasms/drug therapy , Triple Negative Breast Neoplasms/genetics , Algorithms , Bayes Theorem , Clonal Evolution , Computational Biology/methods , DNA Copy Number Variations , Female , Humans , Mutation , Neoplasms/genetics , Reproducibility of Results , Sequence Analysis, DNA , Single-Cell Analysis , Transcriptome , Triple Negative Breast Neoplasms/pathology
15.
Genome Biol ; 17(1): 229, 2016 11 15.
Article in English | MEDLINE | ID: mdl-27846892

ABSTRACT

Due to the high cost of sequencing-based genomics assays such as ChIP-seq and DNase-seq, the epigenomic characterization of a cell type is typically carried out using a small panel of assay types. Deciding a priori which assays to perform is, thus, a critical step in many studies. We present the submodular selection of assays (SSA), a method for choosing a diverse panel of genomic assays that leverages methods from submodular optimization. More generally, this application serves as a model for how submodular optimization can be applied to other discrete problems in biology.


Subject(s)
Genome , Genomics/methods , Binding Sites , Chromatin Immunoprecipitation , Databases, Nucleic Acid , Epigenomics/methods , Genomics/standards , High-Throughput Nucleotide Sequencing , Histones/metabolism , Humans , Protein Binding , Transcription Factors/metabolism
16.
Genome Res ; 25(4): 544-57, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25677182

ABSTRACT

The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation. Previous genomic studies have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly regulated genes expressed in only a small number of cell types, which we term "specific expression domains." We found that domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. Finally, we showed that GBR can be used to transfer information from well-studied cell types to less well-characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data.


Subject(s)
Chromatin/genetics , Computational Biology/methods , Genomics/methods , Molecular Conformation , Molecular Sequence Annotation/methods , Algorithms , Amino Acid Motifs/genetics , Cell Line, Tumor , Chromatin/metabolism , Chromosome Structures , Genome, Human/genetics , HeLa Cells , Hep G2 Cells , Human Umbilical Vein Endothelial Cells , Humans , Promoter Regions, Genetic/genetics
17.
Nucleic Acids Res ; 41(2): 827-41, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23221638

ABSTRACT

The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotations of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint and provide an unbiased approach for evaluating metrics of evolutionary constraint in human. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.


Subject(s)
Chromatin/chemistry , Genome, Human , Molecular Sequence Annotation , Regulatory Elements, Transcriptional , Enhancer Elements, Genetic , Genome-Wide Association Study , Humans , Insulator Elements , Promoter Regions, Genetic , Proteins/genetics , Terminator Regions, Genetic , Transcription, Genetic