Results 1 - 20 of 26
1.
J Biomed Inform ; 155: 104656, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38782170

ABSTRACT

OBJECTIVE: Healthcare continues to grapple with the persistent issue of treatment disparities, sparking concerns regarding the equitable allocation of treatments in clinical practice. While various fairness metrics have emerged to assess fairness in decision-making processes, a growing focus has been on causality-based fairness concepts due to their capacity to mitigate confounding effects and reason about bias. However, the application of causal fairness notions in evaluating the fairness of clinical decision-making with electronic health record (EHR) data remains an understudied domain. This study aims to address the methodological gap in assessing the causal fairness of treatment allocation with EHR data. In addition, we investigate the impact of social determinants of health on the assessment of causal fairness of treatment allocation.

METHODS: We propose a causal fairness algorithm to assess fairness in clinical decision-making. Our algorithm accounts for the heterogeneity of patient populations and identifies potential unfairness in treatment allocation by conditioning on patients who have the same likelihood to benefit from the treatment. We apply this framework to a patient cohort with coronary artery disease derived from an EHR database to evaluate the fairness of treatment decisions.

RESULTS: Our analysis reveals notable disparities in coronary artery bypass grafting (CABG) allocation among different patient groups. Women were found to be 4.4%-7.7% less likely to receive CABG than men in two out of four treatment response strata. Similarly, Black or African American patients were 5.4%-8.7% less likely to receive CABG than others in three out of four response strata. These results were similar when social determinants of health (insurance and area deprivation index) were dropped from the algorithm. These findings highlight the presence of disparities in treatment allocation among similar patients, suggesting potential unfairness in the clinical decision-making process.

CONCLUSION: This study introduces a novel approach for assessing the fairness of treatment allocation in healthcare. By incorporating responses to treatment into the fairness framework, our method explores the potential of quantifying fairness from a causal perspective using EHR data. Our research advances the methodological development of fairness assessment in healthcare and highlights the importance of causality in determining treatment fairness.
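
To make the stratify-then-compare idea concrete, here is a minimal sketch, not the paper's algorithm: bin patients by an estimated likelihood of benefiting from treatment, then compare treatment rates across a sensitive attribute within each stratum. The column names (benefit_score, cabg, sex) and the synthetic data are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm): compare treatment rates across a
# sensitive attribute among patients with a similar estimated benefit.
import numpy as np
import pandas as pd

def stratified_treatment_gap(df, benefit_col="benefit_score",
                             treat_col="cabg", group_col="sex", n_strata=4):
    """Difference in treatment rates between groups, within benefit strata."""
    strata = pd.qcut(df[benefit_col], q=n_strata, labels=False, duplicates="drop")
    rows = []
    for s in sorted(strata.dropna().unique()):
        sub = df[strata == s]
        rates = sub.groupby(group_col)[treat_col].mean()
        rows.append({"stratum": int(s), "n": len(sub),
                     "gap": rates.max() - rates.min(),
                     **{f"rate_{g}": r for g, r in rates.items()}})
    return pd.DataFrame(rows)

# Synthetic example: a treatment-rate gap that does not depend on benefit.
rng = np.random.default_rng(0)
df = pd.DataFrame({"benefit_score": rng.uniform(0, 1, 2000),
                   "sex": rng.choice(["F", "M"], 2000)})
df["cabg"] = rng.binomial(1, 0.3 + 0.05 * (df["sex"] == "M"))
print(stratified_treatment_gap(df))
```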


Subject(s)
Algorithms , Electronic Health Records , Humans , Male , Female , Clinical Decision-Making , Coronary Artery Disease/therapy , Healthcare Disparities , Middle Aged , Social Determinants of Health , Causality
2.
Nat Biotechnol ; 2024 Mar 21.
Article in English | MEDLINE | ID: mdl-38514799

ABSTRACT

Spatially resolved gene expression profiling provides insight into tissue organization and cell-cell crosstalk; however, sequencing-based spatial transcriptomics (ST) lacks single-cell resolution. Current ST analysis methods require single-cell RNA sequencing data as a reference for rigorous interpretation of cell states, mostly do not use associated histology images and are not capable of inferring shared neighborhoods across multiple tissues. Here we present Starfysh, a computational toolbox using a deep generative model that incorporates archetypal analysis and any known cell type markers to characterize known or new tissue-specific cell states without a single-cell reference. Starfysh improves the characterization of spatial dynamics in complex tissues using histology images and enables the comparison of niches as spatial hubs across tissues. Integrative analysis of primary estrogen receptor (ER)-positive breast cancer, triple-negative breast cancer (TNBC) and metaplastic breast cancer (MBC) tissues led to the identification of spatial hubs with patient- and disease-specific cell type compositions and revealed metabolic reprogramming shaping immunosuppressive hubs in aggressive MBC.

3.
bioRxiv ; 2023 Nov 15.
Article in English | MEDLINE | ID: mdl-38014231

ABSTRACT

Single-cell genomics has the potential to map cell states and their dynamics in an unbiased way in response to perturbations like disease. However, elucidating the cell-state transitions from healthy to disease requires analyzing data from perturbed samples jointly with unperturbed reference samples. Existing methods for integrating and jointly visualizing single-cell datasets from distinct contexts tend to remove key biological differences or do not correctly harmonize shared mechanisms. We present Decipher, a model that combines variational autoencoders with deep exponential families to reconstruct derailed trajectories (https://github.com/azizilab/decipher). Decipher jointly represents normal and perturbed single-cell RNA-seq datasets, revealing shared and disrupted dynamics. It further introduces a novel approach to visualize data, without the need for methods such as UMAP or TSNE. We demonstrate Decipher on data from acute myeloid leukemia patient bone marrow specimens, showing that it successfully characterizes the divergence from normal hematopoiesis and identifies transcriptional programs that become disrupted in each patient when they acquire NPM1 driver mutations.
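
The actual model is available at https://github.com/azizilab/decipher; the sketch below is only a generic illustration of the underlying idea of embedding normal and perturbed cells in one shared low-dimensional latent space with a variational autoencoder, so shared and disrupted structure can be compared (and visualized) in that space. The Gaussian likelihood, network sizes, and toy data are simplifying assumptions, not Decipher's architecture.

```python
# Generic joint-VAE sketch (not Decipher): one shared 2-D latent space for
# cells from both conditions; the latent means can be plotted directly.
import torch
import torch.nn as nn

class JointVAE(nn.Module):
    def __init__(self, n_genes, latent_dim=2, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_genes))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.decoder(z), mu, logvar

def loss_fn(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()                 # Gaussian stand-in
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon + kl

# Toy usage: x holds log-normalized expression for both conditions, concatenated.
x = torch.randn(512, 1000)                                       # 512 cells x 1000 genes
model = JointVAE(n_genes=1000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                                               # a few illustrative steps
    x_hat, mu, logvar = model(x)
    loss = loss_fn(x, x_hat, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()
```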

4.
Sci Adv ; 8(42): eade6585, 2022 Oct 21.
Article in English | MEDLINE | ID: mdl-36260667

ABSTRACT

Statistical and machine learning methods help social scientists and other researchers make causal inferences from texts.

5.
J Biomed Inform ; 134: 104204, 2022 10.
Article in English | MEDLINE | ID: mdl-36108816

ABSTRACT

Confounding remains one of the major challenges to causal inference with observational data. This problem is paramount in medicine, where we would like to answer causal questions from large observational datasets like electronic health records (EHRs) and administrative claims. Modern medical data typically contain tens of thousands of covariates. Such a large set carries hope that many of the confounders are directly measured, and further hope that others are indirectly measured through their correlation with measured covariates. How can we exploit these large sets of covariates for causal inference? To help answer this question, this paper examines the performance of the large-scale propensity score (LSPS) approach on causal analysis of medical data. We demonstrate that LSPS may adjust for indirectly measured confounders by including tens of thousands of covariates that may be correlated with them. We present conditions under which LSPS removes bias due to indirectly measured confounders, and we show that LSPS may avoid bias when inadvertently adjusting for variables (like colliders) that otherwise can induce bias. We demonstrate the performance of LSPS with both simulated medical data and real medical data.
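
The general recipe can be sketched as follows; this is a simplified illustration on synthetic data, not any specific package's LSPS implementation: fit a regularized propensity model on a very wide covariate matrix, stratify on the estimated propensity score, and average within-stratum outcome differences. The use of scikit-learn and the particular regularization strength are assumptions.

```python
# Simplified large-scale propensity score workflow on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 5000, 2000                           # many covariates, as in EHR/claims data
X = rng.binomial(1, 0.05, size=(n, p)).astype(float)
treated = rng.binomial(1, 1 / (1 + np.exp(-(X[:, :50].sum(axis=1) - 2.5))))
outcome = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * treated + X[:, :50].sum(axis=1) - 3))))

# L1-regularized propensity model over the full covariate set.
ps_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
ps = ps_model.fit(X, treated).predict_proba(X)[:, 1]

# Stratify on the propensity score and average within-stratum risk differences.
edges = np.quantile(ps, np.linspace(0, 1, 6))
strata = np.digitize(ps, edges[1:-1])
effects, weights = [], []
for s in range(5):
    m = strata == s
    if treated[m].mean() in (0.0, 1.0):
        continue                             # skip strata lacking both arms
    effects.append(outcome[m][treated[m] == 1].mean()
                   - outcome[m][treated[m] == 0].mean())
    weights.append(m.sum())
print("stratified risk difference:", np.average(effects, weights=weights))
```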


Subject(s)
Confounding Factors, Epidemiologic , Bias , Causality , Propensity Score
6.
Biostatistics ; 23(2): 643-665, 2022 04 13.
Article in English | MEDLINE | ID: mdl-33417699

ABSTRACT

Personalized cancer treatments based on the molecular profile of a patient's tumor are an emerging and exciting class of treatments in oncology. As genomic tumor profiling is becoming more common, targeted treatments for specific molecular alterations are gaining traction. To discover new potential therapeutics that may apply to broad classes of tumors matching some molecular pattern, experimentalists and pharmacologists rely on high-throughput, in vitro screens of many compounds against many different cell lines. We propose a hierarchical Bayesian model of how cancer cell lines respond to drugs in these experiments and develop a method for fitting the model to real-world high-throughput screening data. Through a case study, the model is shown to capture nontrivial associations between molecular features and drug response, such as requiring both wild type TP53 and overexpression of MDM2 to be sensitive to Nutlin-3(a). In quantitative benchmarks, the model outperforms a standard approach in biology, with approximately 20% lower predictive error on held out data. When combined with a conditional randomization testing procedure, the model discovers markers of therapeutic response that recapitulate known biology and suggest new avenues for investigation. All code for the article is publicly available at https://github.com/tansey/deep-dose-response.
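
The paper's hierarchical Bayesian model is at https://github.com/tansey/deep-dose-response; the following is only a much simpler, non-Bayesian stand-in for one piece of the workflow: fit a logistic (Hill) dose-response curve per cell line, then relate the fitted sensitivity to a molecular marker. The marker, doses, and data below are synthetic assumptions.

```python
# Simplified per-cell-line dose-response fitting (not the paper's model).
import numpy as np
from scipy.optimize import curve_fit

def hill(dose, log_ic50, slope, top, bottom):
    """Four-parameter logistic curve, parameterized by log10(IC50)."""
    return bottom + (top - bottom) / (1 + 10 ** (slope * (np.log10(dose) - log_ic50)))

rng = np.random.default_rng(2)
doses = np.logspace(-3, 1, 9)                        # 9-point dilution series (µM)
cell_lines = []
for i in range(20):
    mutant = i < 10                                  # pretend half carry a marker
    true_ic50 = 0.05 if mutant else 1.0              # marker confers sensitivity
    viability = hill(doses, np.log10(true_ic50), 1.0, 1.0, 0.05) \
        + rng.normal(0, 0.05, doses.size)
    popt, _ = curve_fit(hill, doses, viability, p0=[0.0, 1.0, 1.0, 0.0], maxfev=10000)
    cell_lines.append({"mutant": mutant, "log_ic50": popt[0]})

mut = np.array([c["log_ic50"] for c in cell_lines if c["mutant"]])
wt = np.array([c["log_ic50"] for c in cell_lines if not c["mutant"]])
print(f"mean log10 IC50: mutant {mut.mean():.2f} vs wild type {wt.mean():.2f}")
```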


Subject(s)
Antineoplastic Agents , Neoplasms , Antineoplastic Agents/pharmacology , Bayes Theorem , Drug Evaluation, Preclinical/methods , Early Detection of Cancer , High-Throughput Screening Assays , Humans , Neoplasms/drug therapy , Neoplasms/genetics
7.
Int Stat Rev ; 88(Suppl 1): S91-S113, 2020 Dec.
Article in English | MEDLINE | ID: mdl-35356801

ABSTRACT

Analyzing data from large-scale, multi-experiment studies requires scientists to both analyze each experiment and to assess the results as a whole. In this article, we develop double empirical Bayes testing (DEBT), an empirical Bayes method for analyzing multi-experiment studies when many covariates are gathered per experiment. DEBT is a two-stage method: in the first stage, it reports which experiments yielded significant outcomes; in the second stage, it hypothesizes which covariates drive the experimental significance. In both of its stages, DEBT builds on Efron (2008), which lays out an elegant empirical Bayes approach to testing. DEBT enhances this framework by learning a series of black box predictive models to boost power and control the false discovery rate (FDR). In Stage 1, it uses a deep neural network prior to report which experiments yielded significant outcomes. In Stage 2, it uses an empirical Bayes version of the knockoff filter (Candes et al., 2018) to select covariates that have significant predictive power of Stage-1 significance. In both simulated and real data, DEBT increases the proportion of discovered significant outcomes and selects more features when signals are weak. In a real study of cancer cell lines, DEBT selects a robust set of biologically-plausible genomic drivers of drug sensitivity and resistance in cancer.

8.
Mol Syst Biol ; 15(2): e8557, 2019 02 22.
Article in English | MEDLINE | ID: mdl-30796088

ABSTRACT

Common approaches to gene signature discovery in single-cell RNA-sequencing (scRNA-seq) depend upon predefined structures like clusters or pseudo-temporal order, require prior normalization, or do not account for the sparsity of single-cell data. We present single-cell hierarchical Poisson factorization (scHPF), a Bayesian factorization method that adapts hierarchical Poisson factorization (Gopalan et al, 2015, Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 326) for de novo discovery of both continuous and discrete expression patterns from scRNA-seq. scHPF does not require prior normalization and captures statistical properties of single-cell data better than other methods in benchmark datasets. Applied to scRNA-seq of the core and margin of a high-grade glioma, scHPF uncovers marked differences in the abundance of glioma subpopulations across tumor regions and regionally associated expression biases within glioma subpopulations. scHPF revealed an expression signature that was spatially biased toward the glioma-infiltrated margins and associated with inferior survival in glioblastoma.
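
For intuition, here is a sketch of a much simpler, non-Bayesian relative of the same idea (not scHPF itself): Poisson-flavored matrix factorization, i.e. NMF with a KL loss, applied directly to raw counts so that no prior normalization step is needed. The synthetic count matrix and the use of scikit-learn are assumptions.

```python
# Non-Bayesian Poisson-style factorization of raw counts (a simplified stand-in).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(3)
cells, genes, k = 500, 300, 5
# Generate counts from a low-rank Poisson model (crude stand-in for scRNA-seq).
theta = rng.gamma(shape=1.0, scale=1.0, size=(cells, k))
beta = rng.gamma(shape=1.0, scale=1.0, size=(k, genes))
counts = rng.poisson(theta @ beta)

model = NMF(n_components=k, beta_loss="kullback-leibler", solver="mu",
            init="random", max_iter=500, random_state=0)
cell_scores = model.fit_transform(counts)      # cells x factors ("cell loadings")
gene_scores = model.components_                # factors x genes ("gene programs")

# Top-ranked genes per factor play the role of a de novo gene signature.
top = np.argsort(gene_scores, axis=1)[:, ::-1][:, :10]
print("top gene indices for factor 0:", top[0])
```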


Subject(s)
Glioma/genetics , High-Throughput Nucleotide Sequencing/methods , Single-Cell Analysis , Transcriptome/genetics , Bayes Theorem , Gene Expression Regulation, Neoplastic/genetics , Glioma/pathology , Humans , Poisson Distribution
9.
PLoS One ; 13(4): e0195024, 2018.
Article in English | MEDLINE | ID: mdl-29630604

ABSTRACT

OBJECTIVE: Hospital readmissions impose substantial costs on healthcare systems every year. Many hospital readmissions are avoidable, and excessive hospital readmissions can also be harmful to patients. Accurate prediction of hospital readmission can effectively help reduce readmission risk. However, the complex relationship between readmission and potential risk factors makes readmission prediction a difficult task. The main goal of this paper is to explore deep learning models to distill such complex relationships and make accurate predictions.

MATERIALS AND METHODS: We propose CONTENT, a deep model that predicts hospital readmissions by learning interpretable patient representations, capturing both local and global contexts from patient Electronic Health Records (EHR) through a hybrid Topic Recurrent Neural Network (TopicRNN) model. The experiment was conducted using the EHR of a real-world Congestive Heart Failure (CHF) cohort of 5,393 patients.

RESULTS: The proposed model outperforms state-of-the-art methods in readmission prediction (e.g., ROC-AUC of 0.6103 ± 0.0130 vs. 0.5998 ± 0.0124 for the second-best method). The derived patient representations were further utilized for patient phenotyping. The learned phenotypes provide a more precise understanding of readmission risks.

DISCUSSION: Embedding both local and global context in the patient representation not only improves prediction performance but also yields interpretable insights into readmission risks for heterogeneous chronic clinical conditions.

CONCLUSION: This is the first model of its kind to integrate the power of both conventional deep neural networks and probabilistic generative models for highly interpretable deep patient representation learning. Experimental results and case studies demonstrate the improved performance and interpretability of the model.
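
As a point of reference, here is a bare-bones sketch of the sequence-modeling ingredient alone (not the CONTENT/TopicRNN architecture): a GRU over per-visit code vectors that outputs a readmission probability. Shapes, code vocabulary, and data are illustrative assumptions.

```python
# Minimal GRU readmission classifier over multi-hot visit sequences (a sketch).
import torch
import torch.nn as nn

class VisitGRU(nn.Module):
    def __init__(self, n_codes, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Linear(n_codes, emb)       # multi-hot visit -> dense vector
        self.gru = nn.GRU(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, visits):                     # visits: (batch, n_visits, n_codes)
        h = torch.relu(self.embed(visits))
        _, last = self.gru(h)                      # final hidden state summarizes history
        return torch.sigmoid(self.head(last[-1])).squeeze(-1)

# Toy training step on random multi-hot visit data.
batch, n_visits, n_codes = 32, 10, 500
x = (torch.rand(batch, n_visits, n_codes) < 0.02).float()
y = torch.randint(0, 2, (batch,)).float()
model = VisitGRU(n_codes)
loss = nn.functional.binary_cross_entropy(model(x), y)
loss.backward()                                    # gradients flow; add an optimizer to train
```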


Subject(s)
Electronic Health Records/statistics & numerical data , Heart Failure/therapy , Models, Statistical , Patient Discharge/standards , Patient Readmission , Humans , Risk Factors
10.
Proc Natl Acad Sci U S A ; 115(13): 3308-3313, 2018 03 27.
Article in English | MEDLINE | ID: mdl-29531061

ABSTRACT

Assessing scholarly influence is critical for understanding the collective system of scholarship and the history of academic inquiry. Influence is multifaceted, and citations reveal only part of it. Citation counts exhibit preferential attachment and follow a rigid "news cycle" that can miss sustained and indirect forms of influence. Building on dynamic topic models that track distributional shifts in discourse over time, we introduce a variant that incorporates features, such as authorship, affiliation, and publication venue, to assess how these contexts interact with content to shape future scholarship. We perform in-depth analyses on collections of physics research (500,000 abstracts; 102 years) and scholarship generally (JSTOR repository: 2 million full-text articles; 130 years). Our measure of document influence helps predict citations and shows how outcomes, such as winning a Nobel Prize or affiliation with a highly ranked institution, boost influence. Analysis of citations alongside discursive influence reveals that citations tend to credit authors who persist in their fields over time and discount credit for works that are influential over many topics or are "ahead of their time." In this way, our measures provide a way to acknowledge diverse contributions that take longer and travel farther to achieve scholarly appreciation, enabling us to correct citation biases and enhance sensitivity to the full spectrum of scholarly impact.

11.
Neuroimage ; 180(Pt A): 243-252, 2018 10 15.
Article in English | MEDLINE | ID: mdl-29448074

ABSTRACT

Recent research shows that the covariance structure of functional magnetic resonance imaging (fMRI) data - commonly described as functional connectivity - can change as a function of the participant's cognitive state (for review see Turk-Browne, 2013). Here we present a Bayesian hierarchical matrix factorization model, termed hierarchical topographic factor analysis (HTFA), for efficiently discovering full-brain networks in large multi-subject neuroimaging datasets. HTFA approximates each subject's network by first re-representing each brain image in terms of the activities of a set of localized nodes, and then computing the covariance of the activity time series of these nodes. The number of nodes, along with their locations, sizes, and activities (over time) are learned from the data. Because the number of nodes is typically substantially smaller than the number of fMRI voxels, HTFA can be orders of magnitude more efficient than traditional voxel-based functional connectivity approaches. In one case study, we show that HTFA recovers the known connectivity patterns underlying a collection of synthetic datasets. In a second case study, we illustrate how HTFA may be used to discover dynamic full-brain activity and connectivity patterns in real fMRI data, collected as participants listened to a story. In a third case study, we carried out a similar series of analyses on fMRI data collected as participants viewed an episode of a television show. In these latter case studies, we found that the HTFA-derived activity and connectivity patterns can be used to reliably decode which moments in the story or show the participants were experiencing. Further, we found that these two classes of patterns contained partially non-overlapping information, such that decoders trained on combinations of activity-based and dynamic connectivity-based features performed better than decoders trained on activity or connectivity patterns alone. We replicated this latter result with two additional (previously developed) methods for efficiently characterizing full-brain activity and connectivity patterns.
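
The basic recipe HTFA builds on can be sketched as follows; this is not HTFA itself (which learns the node centers, widths, and number from the data), but an assumed simplification with fixed Gaussian nodes: re-express each volume as weights on a small spatial basis, then correlate the node weight time series.

```python
# Fixed-basis sketch of the node-then-covariance recipe (not HTFA).
import numpy as np

rng = np.random.default_rng(4)
n_voxels, n_timepoints, n_nodes = 2000, 150, 12
voxel_xyz = rng.uniform(0, 100, size=(n_voxels, 3))       # toy voxel coordinates
data = rng.normal(size=(n_timepoints, n_voxels))          # toy fMRI time series

# Spatial basis: one Gaussian "node" per center (HTFA instead learns these).
centers = rng.uniform(0, 100, size=(n_nodes, 3))
width = 15.0
d2 = ((voxel_xyz[None, :, :] - centers[:, None, :]) ** 2).sum(-1)   # nodes x voxels
F = np.exp(-d2 / (2 * width ** 2))

# Least-squares node activities for every time point: data ~ W @ F.
W, *_ = np.linalg.lstsq(F.T, data.T, rcond=None)           # nodes x timepoints
connectivity = np.corrcoef(W)                              # nodes x nodes
print(connectivity.shape)
```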


Subject(s)
Brain Mapping/methods , Brain/physiology , Nerve Net/physiology , Factor Analysis, Statistical , Humans , Image Processing, Computer-Assisted , Magnetic Resonance Imaging/methods
12.
Proc Natl Acad Sci U S A ; 114(33): 8689-8692, 2017 Aug 15.
Article in English | MEDLINE | ID: mdl-28784795

ABSTRACT

Data science has attracted a lot of attention, promising to turn vast amounts of data into useful predictions and insights. In this article, we ask why scientists should care about data science. To answer, we discuss data science from three perspectives: statistical, computational, and human. Although each of the three is a critical component of data science, we argue that the effective combination of all three components is the essence of what data science is about.

13.
Nat Genet ; 48(12): 1587-1590, 2016 12.
Article in English | MEDLINE | ID: mdl-27819665

ABSTRACT

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. The aggregated number of genotyped humans is currently on the order of millions of individuals, and existing methods do not scale to data of this size. To solve this problem, we developed TeraStructure, an algorithm to fit Bayesian models of genetic variation in structured human populations on tera-sample-sized data sets (10^12 observed genotypes; for example, 1 million individuals at 1 million SNPs). TeraStructure is a scalable approach to Bayesian inference in which subsamples of markers are used to update an estimate of the latent population structure among individuals. We demonstrate that TeraStructure performs as well as existing methods on current globally sampled data, and we show using simulations that TeraStructure continues to be accurate and is the only method that can scale to tera-sample sizes.


Subject(s)
Algorithms , Computational Biology/methods , Disease/genetics , Genetic Markers/genetics , Genetic Predisposition to Disease , Models, Statistical , Polymorphism, Single Nucleotide/genetics , Bayes Theorem , Genetics, Population , Humans
14.
IEEE Trans Pattern Anal Mach Intell ; 37(2): 256-70, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26353240

ABSTRACT

We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP generalizes the nested Chinese restaurant process (nCRP) to allow each word to follow its own path to a topic node according to a per-document distribution over the paths on a shared tree. This alleviates the rigid, single-path formulation assumed by the nCRP, allowing documents to easily express complex thematic borrowings. We derive a stochastic variational inference algorithm for the model, which enables efficient inference for massive collections of text documents. We demonstrate our algorithm on 1.8 million documents from The New York Times and 2.7 million documents from Wikipedia.

15.
IEEE Trans Pattern Anal Mach Intell ; 37(2): 334-45, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26353245

ABSTRACT

Latent feature models are widely used to decompose data into a small number of components. Bayesian nonparametric variants of these models, which use the Indian buffet process (IBP) as a prior over latent features, allow the number of features to be determined from the data. We present a generalization of the IBP, the distance dependent Indian buffet process (dd-IBP), for modeling non-exchangeable data. It relies on distances defined between data points, biasing nearby data to share more features. The choice of distance measure allows for many kinds of dependencies, including temporal and spatial. Further, the original IBP is a special case of the dd-IBP. We develop the dd-IBP and theoretically characterize its feature-sharing properties. We derive a Markov chain Monte Carlo sampler for a linear Gaussian model with a dd-IBP prior and study its performance on real-world non-exchangeable data.

16.
IEEE Trans Pattern Anal Mach Intell ; 37(2): 346-58, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26353246

ABSTRACT

Super-resolution methods form high-resolution images from low-resolution images. In this paper, we develop a new Bayesian nonparametric model for super-resolution. Our method uses a beta-Bernoulli process to learn a set of recurring visual patterns, called dictionary elements, from the data. Because it is nonparametric, the number of elements found is also determined from the data. We test the results on both benchmark and natural images, comparing with several other models from the research literature. We perform large-scale human evaluation experiments to assess the visual quality of the results. In a first implementation, we use Gibbs sampling to approximate the posterior. However, this algorithm is not feasible for large-scale data. To circumvent this, we then develop an online variational Bayes (VB) algorithm. This algorithm finds high quality dictionaries in a fraction of the time needed by the Gibbs sampler.


Subject(s)
Image Processing, Computer-Assisted/methods , Algorithms , Bayes Theorem , Humans , Statistics, Nonparametric
17.
Proc Natl Acad Sci U S A ; 112(26): E3441-50, 2015 Jun 30.
Article in English | MEDLINE | ID: mdl-26071445

ABSTRACT

Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of fit of a statistical model to a specific dataset. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the admixture model fit to four qualitatively different population genetic datasets: the population reference sample (POPRES) European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.
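
The PPC mechanics can be illustrated with a deliberately simple example (this is not the paper's five population-genetic statistics): fit an oversimplified model, simulate replicated datasets from its posterior, and compare a discrepancy statistic on the replicates to its observed value. The toy allele-count model below is an assumption for illustration only.

```python
# Generic posterior predictive check on a toy, misspecified allele-frequency model.
import numpy as np

rng = np.random.default_rng(5)
n_loci, n_chrom = 200, 100
true_freqs = rng.beta(0.3, 0.3, n_loci)              # heterogeneous truth
observed = rng.binomial(n_chrom, true_freqs)

# Misspecified model: every locus shares one frequency f ~ Beta(1, 1).
post_a = 1 + observed.sum()
post_b = 1 + n_loci * n_chrom - observed.sum()

def statistic(counts):
    return np.var(counts / n_chrom)                  # across-locus variance

rep_stats = []
for _ in range(1000):
    f_rep = rng.beta(post_a, post_b)                 # posterior draw of the shared f
    rep_stats.append(statistic(rng.binomial(n_chrom, f_rep, size=n_loci)))

ppc_pvalue = np.mean(np.array(rep_stats) >= statistic(observed))
print(f"posterior predictive p-value: {ppc_pvalue:.3f}")   # near 0: the model misfits
```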


Subject(s)
Models, Theoretical , Bayes Theorem , Genetic Variation , Humans , Linkage Disequilibrium , Uncertainty
18.
J Am Med Inform Assoc ; 22(4): 872-80, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25896647

ABSTRACT

BACKGROUND: As adoption of electronic health records continues to increase, there is an opportunity to incorporate clinical documentation as well as laboratory values and demographics into risk prediction modeling.

OBJECTIVE: The authors develop a risk prediction model for chronic kidney disease (CKD) progression from stage III to stage IV that includes longitudinal data and features drawn from clinical documentation.

METHODS: The study cohort consisted of 2908 primary-care clinic patients who had at least three visits prior to January 1, 2013 and developed CKD stage III during their documented history. Development and validation cohorts were randomly selected from this cohort, and the study datasets included longitudinal inpatient and outpatient data from these populations. Time series analysis (Kalman filter) and survival analysis (Cox proportional hazards) were combined to produce a range of risk models. These models were evaluated using concordance, a discriminatory statistic.

RESULTS: A risk model incorporating longitudinal data on clinical documentation and laboratory test results (concordance 0.849) predicts progression from stage III CKD to stage IV CKD more accurately than a similar model without laboratory test results (concordance 0.733, P < .001), a model that considers only the most recent laboratory test results (concordance 0.819, P < .031), and a model based on estimated glomerular filtration rate (concordance 0.779, P < .001).

CONCLUSIONS: A risk prediction model that takes longitudinal laboratory test results and clinical documentation into consideration can predict CKD progression from stage III to stage IV more accurately than three models that do not take all of these variables into consideration.
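
A simplified sketch of the two ingredients combined here (not the study's actual pipeline): smooth each patient's longitudinal lab values with a local-level Kalman filter, then feed the smoothed summaries into a Cox proportional hazards model. The synthetic eGFR trajectories, derived features, and the use of the lifelines package are assumptions.

```python
# Kalman-smoothed longitudinal labs feeding a Cox model (illustrative sketch).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def kalman_local_level(y, q=0.1, r=1.0):
    """Filtered estimates of a latent level observed with noise."""
    x, p, out = y[0], 1.0, []
    for obs in y:
        p = p + q                        # predict
        k = p / (p + r)                  # Kalman gain
        x = x + k * (obs - x)            # update
        p = (1 - k) * p
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(6)
rows = []
for pid in range(300):
    decline = rng.uniform(0, 2)                                # latent eGFR decline rate
    egfr = 50 - decline * np.arange(6) + rng.normal(0, 4, 6)   # 6 noisy lab values
    smoothed = kalman_local_level(egfr)
    rows.append({"last_smoothed_egfr": smoothed[-1],
                 "slope": (smoothed[-1] - smoothed[0]) / 5,
                 "T": rng.exponential(scale=10 / (1 + decline)),   # time to stage IV
                 "E": 1})                                          # event observed
df = pd.DataFrame(rows)
cph = CoxPHFitter().fit(df, duration_col="T", event_col="E")
print("concordance:", round(cph.concordance_index_, 3))
```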


Subject(s)
Electronic Health Records , Renal Insufficiency, Chronic/physiopathology , Risk Assessment , Aged , Cohort Studies , Disease Progression , Female , Glomerular Filtration Rate , Humans , Longitudinal Studies , Male , Middle Aged , Models, Theoretical , Primary Health Care , Proportional Hazards Models , Survival Analysis , Time
19.
PLoS One ; 9(5): e94914, 2014.
Article in English | MEDLINE | ID: mdl-24804795

ABSTRACT

The neural patterns recorded during a neuroscientific experiment reflect complex interactions between many brain regions, each comprising millions of neurons. However, the measurements themselves are typically abstracted from that underlying structure. For example, functional magnetic resonance imaging (fMRI) datasets comprise a time series of three-dimensional images, where each voxel in an image (roughly) reflects the activity of the brain structure(s), located at the corresponding point in space, at the time the image was collected. fMRI data often exhibit strong spatial correlations, whereby nearby voxels behave similarly over time as the underlying brain structure modulates its activity. Here we develop topographic factor analysis (TFA), a technique that exploits spatial correlations in fMRI data to recover the underlying structure that the images reflect. Specifically, TFA casts each brain image as a weighted sum of spatial functions. The parameters of those spatial functions, which may be learned by applying TFA to an fMRI dataset, reveal the locations and sizes of the brain structures activated while the data were collected, as well as the interactions between those structures.


Subject(s)
Bayes Theorem , Brain/physiology , Factor Analysis, Statistical , Humans , Magnetic Resonance Imaging , Models, Neurological
20.
Neuroimage ; 98: 91-102, 2014 Sep.
Article in English | MEDLINE | ID: mdl-24791745

ABSTRACT

This paper extends earlier work on spatial modeling of fMRI data to the temporal domain, providing a framework for analyzing high temporal resolution brain imaging modalities such as electroencephalography (EEG). The central idea is to decompose brain imaging data into a covariate-dependent superposition of functions defined over continuous time and space (what we refer to as topographic latent sources). The continuous formulation allows us to parametrically model spatiotemporally localized activations. To make group-level inferences, we elaborate the model hierarchically by sharing sources across subjects. We describe a variational algorithm for parameter estimation that scales efficiently to large data sets. Applied to three EEG data sets, we find that the model produces good predictive performance and reproduces a number of classic findings. Our results suggest that topographic latent sources serve as an effective hypothesis space for interpreting spatiotemporal brain imaging data.


Subject(s)
Brain Mapping , Brain/physiology , Electroencephalography , Models, Neurological , Models, Statistical , Adolescent , Adult , Algorithms , Event-Related Potentials, P300 , Humans , Time Factors , Young Adult