Results 1 - 20 of 55
1.
Science ; 382(6671): 669-674, 2023 Nov 10.
Article in English | MEDLINE | ID: mdl-37943906

ABSTRACT

Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients without making any assumptions about the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals. Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning. The benefits of prediction-powered inference were demonstrated with datasets from proteomics, astronomy, genomics, remote sensing, census analysis, and ecology.
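The mean-estimation case described above can be illustrated with a minimal sketch. It assumes the commonly described form of the estimator (average prediction on the unlabeled data, corrected by the average prediction error on the labeled data) with a normal-approximation interval; the function name `ppi_mean_ci` and its arguments are hypothetical and not taken from the paper's software.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(y_labeled, preds_labeled, preds_unlabeled, alpha=0.1):
    """Point estimate and (1 - alpha) interval for a mean (illustrative sketch)."""
    n, big_n = len(y_labeled), len(preds_unlabeled)
    # Rectifier: how far the predictions are from the truth, measured on labeled data.
    residuals = [y - f for y, f in zip(y_labeled, preds_labeled)]
    estimate = fmean(preds_unlabeled) + fmean(residuals)
    # Standard error of the sum of two independent sample means.
    se = math.sqrt(variance(preds_unlabeled) / big_n + variance(residuals) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * se, estimate + z * se)


# Synthetic demo: a biased, noisy "model" predicting a quantity with true mean 5.
random.seed(0)
truth = [random.gauss(5.0, 1.0) for _ in range(2200)]
preds = [t + random.gauss(0.2, 0.5) for t in truth]
est, (lo, hi) = ppi_mean_ci(truth[:200], preds[:200], preds[200:])
print(f"estimate={est:.3f}, 90% CI=({lo:.3f}, {hi:.3f})")
```

In this sketch the interval narrows as the residuals shrink, which mirrors the abstract's statement that more accurate predictions yield smaller confidence intervals.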

2.
Nature ; 619(7970): 526-532, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37407824

ABSTRACT

Extreme precipitation is a considerable contributor to meteorological disasters and there is a great need to mitigate its socioeconomic effects through skilful nowcasting that has high resolution, long lead times and local details [1-3]. Current methods are subject to blur, dissipation, intensity or location errors, with physics-based numerical methods struggling to capture pivotal chaotic dynamics such as convective initiation [4] and data-driven learning methods failing to obey intrinsic physical laws such as advective conservation [5]. We present NowcastNet, a nonlinear nowcasting model for extreme precipitation that unifies physical-evolution schemes and conditional-learning methods into a neural-network framework with end-to-end forecast error optimization. On the basis of radar observations from the USA and China, our model produces physically plausible precipitation nowcasts with sharp multiscale patterns over regions of 2,048 km × 2,048 km and with lead times of up to 3 h. In a systematic evaluation by 62 professional meteorologists from across China, our model ranks first in 71% of cases against the leading methods. NowcastNet provides skilful forecasts at light-to-heavy rain rates, particularly for extreme-precipitation events accompanied by advective or convective processes that were previously considered intractable.

3.
Nat Methods ; 20(8): 1222-1231, 2023 08.
Article in English | MEDLINE | ID: mdl-37386189

ABSTRACT

Jointly profiling the transcriptome, chromatin accessibility and other molecular properties of single cells offers a powerful way to study cellular diversity. Here we present MultiVI, a probabilistic model to analyze such multiomic data and leverage it to enhance single-modality datasets. MultiVI creates a joint representation that allows an analysis of all modalities included in the multiomic input data, even for cells for which one or more modalities are missing. It is available at scvi-tools.org.


Subjects
Statistical Models, Transcriptome
4.
Proc Natl Acad Sci U S A ; 120(21): e2209124120, 2023 05 23.
Article in English | MEDLINE | ID: mdl-37192164

ABSTRACT

Detecting differentially expressed genes is important for characterizing subpopulations of cells. In scRNA-seq data, however, nuisance variation due to technical factors like sequencing depth and RNA capture efficiency obscures the underlying biological signal. Deep generative models have been extensively applied to scRNA-seq data, with a special focus on embedding cells into a low-dimensional latent space and correcting for batch effects. However, little attention has been paid to the problem of utilizing the uncertainty from the deep generative model for differential expression (DE). Furthermore, the existing approaches do not allow for controlling for effect size or the false discovery rate (FDR). Here, we present lvm-DE, a generic Bayesian approach for performing DE predictions from a fitted deep generative model, while controlling the FDR. We apply the lvm-DE framework to scVI and scSphere, two deep generative models. The resulting approaches outperform state-of-the-art methods at estimating the log fold change in gene expression levels as well as detecting differentially expressed genes between subpopulations of cells.


Subjects
RNA, Single-Cell Analysis, Bayes Theorem, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Gene Expression Profiling/methods
5.
Proc Natl Acad Sci U S A ; 119(43): e2204569119, 2022 10 25.
Article in English | MEDLINE | ID: mdl-36256807

ABSTRACT

Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting, one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data (that is, the designed sequences) has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.


Subjects
Algorithms, Machine Learning, Feedback, Uncertainty, Molecular Conformation
6.
Nat Biotechnol ; 40(9): 1360-1369, 2022 09.
Article in English | MEDLINE | ID: mdl-35449415

ABSTRACT

Most spatial transcriptomics technologies are limited by their resolution, with spot sizes larger than that of a single cell. Although joint analysis with single-cell RNA sequencing can alleviate this problem, current methods are limited to assessing discrete cell types, revealing the proportion of cell types inside each spot. To identify continuous variation of the transcriptome within cells of the same type, we developed Deconvolution of Spatial Transcriptomics profiles using Variational Inference (DestVI). Using simulations, we demonstrate that DestVI outperforms existing methods for estimating gene expression for every cell type inside every spot. Applied to a study of infected lymph nodes and of a mouse tumor model, DestVI provides high-resolution, accurate spatial characterization of the cellular organization of these tissues and identifies cell-type-specific changes in gene expression between different tissue regions or between conditions. DestVI is available as part of the open-source software package scvi-tools (https://scvi-tools.org).


Subjects
Neoplasms, Transcriptome, Animals, Gene Expression Profiling/methods, Mice, Neoplasms/genetics, Single-Cell Analysis/methods, Software, Transcriptome/genetics, Exome Sequencing
8.
Mol Syst Biol ; 17(1): e9620, 2021 01.
Article in English | MEDLINE | ID: mdl-33491336

ABSTRACT

As the number of single-cell transcriptomics datasets grows, the natural next step is to integrate the accumulating data to achieve a common ontology of cell types and states. However, it is not straightforward to compare gene expression levels across datasets and to automatically assign cell type labels in a new dataset based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of scRNA-seq data, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage existing cell state annotations. We demonstrate that scVI and scANVI compare favorably to state-of-the-art methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings. In contrast to existing methods, scVI and scANVI integrate multiple datasets with a single generative model that can be directly used for downstream tasks, such as differential expression. Both methods are easily accessible through scvi-tools.


Subjects
Computational Biology/methods, Single-Cell Analysis/methods, Genetic Databases, Gene Expression Profiling, Humans, Molecular Sequence Annotation, RNA Sequence Analysis, Supervised Machine Learning
9.
Annu Int Conf IEEE Eng Med Biol Soc ; 2020: 528-531, 2020 07.
Article in English | MEDLINE | ID: mdl-33018043

ABSTRACT

Current seizure detection systems rely on machine learning classifiers that are trained offline and subsequently require manual retraining to maintain high detection accuracy over long periods of time. For a true deploy-and-forget implantable seizure detection system, a low-power, at-the-edge, online learning algorithm can be employed to dynamically adapt to neural signal drifts over time. This work proposes SOUL: Stochastic-gradient-descent-based Online Unsupervised Logistic regression classifier, which provides continuous unsupervised online updates to a model that was initially trained offline with labels. SOUL was tested on two datasets: the CHB-MIT scalp EEG dataset and a long (>250 hours) human ECoG dataset from the University of Melbourne. SOUL achieves an average cumulative sensitivity of 97.5% and 97.9% for the two datasets, respectively, while maintaining <1.2 false alarms per day. Compared with the state of the art, a moderate sensitivity improvement of 1-3% is observed on the majority of subjects and a large sensitivity improvement of >12% is observed on three subjects, with <1% impact on specificity.
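The idea of unsupervised online updates to a logistic regression classifier can be sketched as follows. This is only an illustration under stated assumptions: the model labels each incoming sample itself and takes one stochastic-gradient step when it is confident. The function name, the learning rate `lr` and the threshold `confidence` are hypothetical choices, not details from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def online_unsupervised_update(w, b, x, lr=0.01, confidence=0.9):
    """One SGD step on a self-generated pseudo-label (illustrative sketch).

    Returns the (possibly updated) weights, bias, and predicted probability.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # Skip the update when the current model is unsure about this sample.
    if max(p, 1.0 - p) < confidence:
        return w, b, p
    pseudo_label = 1.0 if p >= 0.5 else 0.0
    grad = p - pseudo_label  # derivative of the log-loss w.r.t. the logit
    w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    b = b - lr * grad
    return w, b, p
```

Starting from weights obtained by ordinary supervised training, calling this function on each new feature vector lets the decision boundary follow slow drifts in the signal without any new labels.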


Subjects
Distance Education, Algorithms, Electroencephalography, Humans, Seizures/diagnosis, Sensitivity and Specificity
10.
Proc Natl Acad Sci U S A ; 116(42): 20881-20885, 2019 10 15.
Article in English | MEDLINE | ID: mdl-31570618

ABSTRACT

Optimization algorithms and Monte Carlo sampling algorithms have provided the computational foundations for the rapid growth in applications of statistical machine learning in recent years. There is, however, limited theoretical understanding of the relationships between these 2 kinds of methodology, and limited understanding of relative strengths and weaknesses. Moreover, existing results have been obtained primarily in the setting of convex functions (for optimization) and log-concave functions (for sampling). In this setting, where local properties determine global properties, optimization algorithms are unsurprisingly more efficient computationally than sampling algorithms. We instead examine a class of nonconvex objective functions that arise in mixture modeling and multistable systems. In this nonconvex setting, we find that the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.

11.
IEEE Trans Pattern Anal Mach Intell ; 41(12): 3071-3085, 2019 Dec.
Article in English | MEDLINE | ID: mdl-30188813

ABSTRACT

Domain adaptation studies learning algorithms that generalize across source domains and target domains that exhibit different distributions. Recent studies reveal that deep neural networks can learn transferable features that generalize well to similar novel tasks. However, as deep features eventually transition from general to specific along the network, feature transferability drops significantly in higher task-specific layers with increasing domain discrepancy. To formally reduce the effects of this discrepancy and enhance feature transferability in task-specific layers, we develop a novel framework for deep adaptation networks that extends deep convolutional neural networks to domain adaptation problems. The framework embeds the deep features of all task-specific layers into reproducing kernel Hilbert spaces (RKHSs) and optimally matches different domain distributions. The deep features are made more transferable by exploiting low-density separation of target-unlabeled data in very deep architectures, while the domain discrepancy is further reduced via the use of multiple kernel learning that enhances the statistical power of kernel embedding matching. The overall framework is cast in a minimax game setting. Extensive empirical evidence shows that the proposed networks yield state-of-the-art results on standard visual domain-adaptation benchmarks.
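Matching distributions through kernel embeddings is often measured with the maximum mean discrepancy (MMD). The sketch below assumes that choice, using a fixed convex combination of Gaussian kernels as a stand-in for a learned multiple-kernel weighting. The bandwidths, weights and function names are illustrative assumptions, not the paper's settings.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths, weights):
    # Convex combination of Gaussian kernels with different bandwidths.
    return sum(w * gaussian_kernel(x, y, s) for w, s in zip(weights, bandwidths))


def mmd_squared(source, target, bandwidths=(0.5, 1.0, 2.0), weights=None):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    if weights is None:
        weights = [1.0 / len(bandwidths)] * len(bandwidths)

    def mean_kernel(a, b):
        total = sum(multi_kernel(x, y, bandwidths, weights) for x in a for y in b)
        return total / (len(a) * len(b))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))
```

In a deep adaptation setting, a quantity of this kind would be computed on the activations of each task-specific layer for a source batch and a target batch, then added to the training loss as a penalty.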

12.
Nat Methods ; 15(12): 1053-1058, 2018 12.
Article in English | MEDLINE | ID: mdl-30504886

ABSTRACT

Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.


Subjects
Computational Biology/methods, High-Throughput Nucleotide Sequencing/methods, Biological Models, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Transcriptome, Algorithms, Animals, Brain/cytology, Brain/metabolism, Cluster Analysis, Genetic Variation, Hematopoietic Stem Cells/cytology, Hematopoietic Stem Cells/metabolism, Humans, Mononuclear Leukocytes/cytology, Mononuclear Leukocytes/metabolism, Mice
13.
J Mach Learn Res ; 18, 2018 Apr.
Article in English | MEDLINE | ID: mdl-31007630

ABSTRACT

We extend the adaptive regression spline model by incorporating saturation, the natural requirement that a function extend as a constant outside a certain range. We fit saturating splines to data via a convex optimization problem over a space of measures, which we solve using an efficient algorithm based on the conditional gradient method. Unlike many existing approaches, our algorithm solves the original infinite-dimensional (for splines of degree at least two) optimization problem without pre-specified knot locations. We then adapt our algorithm to fit generalized additive models with saturating splines as coordinate functions and show that the saturation requirement allows our model to simultaneously perform feature selection and nonlinear function fitting. Finally, we briefly sketch how the method can be extended to higher order splines and to different requirements on the extension outside the data range.

14.
IEEE Trans Image Process ; 26(1): 172-184, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27723590

ABSTRACT

Segmenting objects of interest from 3D data sets is a common problem encountered in biological data. Small field of view and intrinsic biological variability combined with optically subtle changes of intensity, resolution, and low contrast in images make the task of segmentation difficult, especially for microscopy of unstained living or freshly excised thick tissues. Incorporating shape information in addition to the appearance of the object of interest can often help improve segmentation performance. However, the shapes of objects in tissue can be highly variable and design of a flexible shape model that encompasses these variations is challenging. To address such complex segmentation problems, we propose a unified probabilistic framework that can incorporate the uncertainty associated with complex shapes, variable appearance, and unknown locations. The driving application that inspired the development of this framework is a biologically important segmentation problem: the task of automatically detecting and segmenting the dermal-epidermal junction (DEJ) in 3D reflectance confocal microscopy (RCM) images of human skin. RCM imaging allows noninvasive observation of cellular, nuclear, and morphological detail. The DEJ is an important morphological feature as it is where disorder, disease, and cancer usually start. Detecting the DEJ is challenging, because it is a 2D surface in a 3D volume which has a strong but highly variable number of irregularly spaced and variably shaped "peaks and valleys." In addition, RCM imaging resolution, contrast, and intensity vary with depth. Thus, a prior model needs to incorporate the intrinsic structure while allowing variability in essentially all its parameters. We propose a model which can incorporate objects of interest with complex shapes and variable appearance in an unsupervised setting by utilizing domain knowledge to build appropriate priors of the model. Our novel strategy to model this structure combines a spatial Poisson process with shape priors and performs inference using Gibbs sampling. Experimental results show that the proposed unsupervised model is able to automatically detect the DEJ with physiologically relevant accuracy in the range 10-20 µm.


Subjects
Dermis/diagnostic imaging, Epidermis/diagnostic imaging, Three-Dimensional Imaging/methods, Confocal Microscopy/methods, Algorithms, Humans, Poisson Distribution
15.
Proc Natl Acad Sci U S A ; 113(47): E7351-E7358, 2016 11 22.
Article in English | MEDLINE | ID: mdl-27834219

ABSTRACT

Accelerated gradient methods play a central role in optimization, achieving optimal rates in many settings. Although many generalizations and extensions of Nesterov's original acceleration method have been proposed, it is not yet clear what is the natural scope of the acceleration concept. In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the Bregman Lagrangian, which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods corresponds to traveling the same curve in spacetime at different speeds. From this perspective, Nesterov's technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms.
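For orientation, the discrete-time starting point the abstract refers to can be written in a few lines. The sketch below shows plain gradient descent next to a standard form of Nesterov's accelerated gradient method with momentum coefficient (k - 1)/(k + 2). It illustrates the classical algorithm only, not the Bregman Lagrangian construction, and the function names are illustrative.

```python
def gradient_descent(grad, x0, step, iters):
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def nesterov_agd(grad, x0, step, iters):
    """Nesterov's accelerated gradient method for a smooth convex objective."""
    x = list(x0)  # main iterate
    y = list(x0)  # extrapolated ("look-ahead") point
    for k in range(1, iters + 1):
        g = grad(y)
        x_next = [yi - step * gi for yi, gi in zip(y, g)]
        momentum = (k - 1) / (k + 2)
        y = [xn + momentum * (xn - xi) for xn, xi in zip(x_next, x)]
        x = x_next
    return x


# Ill-conditioned quadratic f(x) = 0.5 * (x1^2 + 0.01 * x2^2); its gradient is 1-Lipschitz.
def quad_grad(x):
    return [x[0], 0.01 * x[1]]


x_gd = gradient_descent(quad_grad, [1.0, 1.0], step=1.0, iters=100)
x_agd = nesterov_agd(quad_grad, [1.0, 1.0], step=1.0, iters=100)
print(x_gd, x_agd)
```

On this example the accelerated iterate ends much closer to the minimizer along the poorly conditioned coordinate, consistent with the O(1/k^2) versus O(1/k) worst-case rates of the two methods.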

16.
IEEE Trans Pattern Anal Mach Intell ; 37(2): 256-70, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26353240

ABSTRACT

We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP generalizes the nested Chinese restaurant process (nCRP) to allow each word to follow its own path to a topic node according to a per-document distribution over the paths on a shared tree. This alleviates the rigid, single-path formulation assumed by the nCRP, allowing documents to easily express complex thematic borrowings. We derive a stochastic variational inference algorithm for the model, which enables efficient inference for massive collections of text documents. We demonstrate our algorithm on 1.8 million documents from The New York Times and 2.7 million documents from Wikipedia.

17.
IEEE Trans Pattern Anal Mach Intell ; 37(2): 290-306, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26353242

ABSTRACT

We develop a Bayesian nonparametric approach to a general family of latent class problems in which individuals can belong simultaneously to multiple classes and where each class can be exhibited multiple times by an individual. We introduce a combinatorial stochastic process known as the negative binomial process (NBP) as an infinite-dimensional prior appropriate for such problems. We show that the NBP is conjugate to the beta process, and we characterize the posterior distribution under the beta-negative binomial process (BNBP) and hierarchical models based on the BNBP (the HBNBP). We study the asymptotic properties of the BNBP and develop a three-parameter extension of the BNBP that exhibits power-law behavior. We derive MCMC algorithms for posterior inference under the HBNBP, and we present experiments using these algorithms in the domains of image segmentation, object recognition, and document analysis.


Subjects
Cluster Analysis, Informatics/methods, Algorithms, Bayes Theorem, Computer Simulation, Computer-Assisted Image Processing, Theoretical Models, Nonparametric Statistics
18.
Bioinformatics ; 30(19): 2787-95, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24894505

ABSTRACT

MOTIVATION: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. RESULTS: We propose SMaSH, a benchmarking methodology for evaluating germline variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on these benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single-nucleotide polymorphism, indel and structural variant calling algorithms. AVAILABILITY AND IMPLEMENTATION: We provide free and open access online to the SMaSH tool kit, along with detailed documentation, at smash.cs.berkeley.edu


Subjects
Computational Biology/methods, Human Genome, Genomics/methods, INDEL Mutation, Algorithms, Statistical Data Interpretation, Genetic Databases, High-Throughput Nucleotide Sequencing, Humans, Single Nucleotide Polymorphism, Software
19.
IEEE Trans Pattern Anal Mach Intell ; 36(7): 1340-53, 2014 Jul.
Article in English | MEDLINE | ID: mdl-26353307

ABSTRACT

Complex data can be grouped and interpreted in many different ways. Most existing clustering algorithms, however, only find one clustering solution, and provide little guidance to data analysts who may not be satisfied with that single clustering and may wish to explore alternatives. We introduce a novel approach that provides several clustering solutions to the user for the purposes of exploratory data analysis. Our approach additionally captures the notion that alternative clusterings may reside in different subspaces (or views). We present an algorithm that simultaneously finds these subspaces and the corresponding clusterings. The algorithm is based on an optimization procedure that incorporates terms for cluster quality and novelty relative to previously discovered clustering solutions. We present a range of experiments that compare our approach to alternatives and explore the connections between simultaneous and iterative modes of discovery of multiple clusterings.

20.
Proteins ; 81(9): 1593-609, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23671031

ABSTRACT

The subfamily Iα aminotransferases are typically categorized as having narrow specificity toward carboxylic amino acids (AATases), or broad specificity that includes aromatic amino acid substrates (TATases). Because of their general role in central metabolism and, more specifically, their association with liver-related diseases in humans, this subfamily is biologically interesting. The substrate specificities for only a few members of this subfamily have been reported, and the reliable prediction of substrate specificity from protein sequence has remained elusive. In this study, a diverse set of aminotransferases was chosen for characterization based on a scoring system that measures the sequence divergence of the active site. The enzymes that were experimentally characterized include both narrow-specificity AATases and broad-specificity TATases, as well as AATases with broader-specificity and TATases with narrower-specificity than the previously known family members. Molecular function and phylogenetic analyses underscored the complexity of this family's evolution as the TATase function does not follow a single evolutionary thread, but rather appears independently multiple times during the evolution of the subfamily. The additional functional characterizations described in this article, alongside a detailed sequence and phylogenetic analysis, provide some novel clues to understanding the evolutionary mechanisms at work in this family.


Subjects
Transaminases/chemistry, Transaminases/metabolism, Amino Acid Sequence, Animals, Bacterial Proteins, Fungal Proteins, Kinetics, Molecular Sequence Data, Phylogeny, Sequence Alignment, Substrate Specificity, Transaminases/classification, Transaminases/genetics