Search | VHL Regional Portal

1.

Random-effects substitution models for phylogenetics via scalable gradient approximations.

Magee, Andrew F; Holbrook, Andrew J; Pekar, Jonathan E; Caviedes-Solis, Itzue W; Matsen Iv, Fredrick A; Baele, Guy; Wertheim, Joel O; Ji, Xiang; Lemey, Philippe; Suchard, Marc A.

Syst Biol ; 2024 May 07.

Article in English | MEDLINE | ID: mdl-38712512

ABSTRACT

Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.

2.

ANTsX neuroimaging-derived structural phenotypes of UK Biobank.

Tustison, Nicholas J; Yassa, Michael A; Rizvi, Batool; Cook, Philip A; Holbrook, Andrew J; Sathishkumar, Mithra T; Tustison, Mia G; Gee, James C; Stone, James R; Avants, Brian B.

Sci Rep ; 14(1): 8848, 2024 04 17.

Article in English | MEDLINE | ID: mdl-38632390

ABSTRACT

UK Biobank is a large-scale epidemiological resource for investigating prospective correlations between various lifestyle, environmental, and genetic factors with health and disease progression. In addition to individual subject information obtained through surveys and physical examinations, a comprehensive neuroimaging battery consisting of multiple modalities provides imaging-derived phenotypes (IDPs) that can serve as biomarkers in neuroscience research. In this study, we augment the existing set of UK Biobank neuroimaging structural IDPs, obtained from well-established software libraries such as FSL and FreeSurfer, with related measurements acquired through the Advanced Normalization Tools Ecosystem. This includes previously established cortical and subcortical measurements defined, in part, based on the Desikan-Killiany-Tourville atlas. Also included are morphological measurements from two recent developments: medial temporal lobe parcellation of hippocampal and extra-hippocampal regions in addition to cerebellum parcellation and thickness based on the Schmahmann anatomical labeling. Through predictive modeling, we assess the clinical utility of these IDP measurements, individually and in combination, using commonly studied phenotypic correlates including age, fluid intelligence, numeric memory, and several other sociodemographic variables. The predictive accuracy of these IDP-based models, in terms of root-mean-squared-error or area-under-the-curve for continuous and categorical variables, respectively, provides comparative insights between software libraries as well as potential clinical interpretability. Results demonstrate varied performance between package-based IDP sets and their combination, emphasizing the need for careful consideration in their selection and utilization.

Subject(s)

Biological Specimen Banks , UK Biobank , Ecosystem , Prospective Studies , Neuroimaging/methods , Phenotype , Magnetic Resonance Imaging/methods , Brain

3.

On the surprising effectiveness of a simple matrix exponential derivative approximation, with application to global SARS-CoV-2.

Didier, Gustavo; Glatt-Holtz, Nathan E; Holbrook, Andrew J; Magee, Andrew F; Suchard, Marc A.

Proc Natl Acad Sci U S A ; 121(3): e2318989121, 2024 Jan 16.

Article in English | MEDLINE | ID: mdl-38215186

ABSTRACT

The continuous-time Markov chain (CTMC) is the mathematical workhorse of evolutionary biology. Learning CTMC model parameters using modern, gradient-based methods requires the derivative of the matrix exponential evaluated at the CTMC's infinitesimal generator (rate) matrix. Motivated by the derivative's extreme computational complexity as a function of state space cardinality, recent work demonstrates the surprising effectiveness of a naive, first-order approximation for a host of problems in computational biology. In response to this empirical success, we obtain rigorous deterministic and probabilistic bounds for the error accrued by the naive approximation and establish a "blessing of dimensionality" result that is universal for a large class of rate matrices with random entries. Finally, we apply the first-order approximation within surrogate-trajectory Hamiltonian Monte Carlo for the analysis of the early spread of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) across 44 geographic regions that comprise a state space of unprecedented dimensionality for unstructured (flexible) CTMC models within evolutionary biology.

Subject(s)

COVID-19 , SARS-CoV-2 , Humans , Algorithms , COVID-19/epidemiology , Markov Chains

4.

A quantum parallel Markov chain Monte Carlo.

Holbrook, Andrew J.

J Comput Graph Stat ; 32(4): 1402-1415, 2023.

Article in English | MEDLINE | ID: mdl-38127472

ABSTRACT

We propose a novel hybrid quantum computing strategy for parallel MCMC algorithms that generate multiple proposals at each step. This strategy makes the rate-limiting step within parallel MCMC amenable to quantum parallelization by using the Gumbel-max trick to turn the generalized accept-reject step into a discrete optimization problem. When combined with new insights from the parallel MCMC literature, such an approach allows us to embed target density evaluations within a well-known extension of Grover's quantum search algorithm. Letting PdenotethenumberofproposalsinasingleMCMCiteration,thecombinedstrategyreducesthenumberoftargetevaluationsrequiredfromðª(P)toðªP1/2. In the following, we review the rudiments of quantum computing, quantum search and the Gumbel-max trick in order to elucidate their combination for as wide a readership as possible.

5.

ANTsX neuroimaging-derived structural phenotypes of UK Biobank.

Tustison, Nicholas J; Yassa, Michael A; Rizvi, Batool; Cook, Philip A; Holbrook, Andrew J; Sathishkumar, Mithra T; Tustison, Mia G; Gee, James C; Stone, James R; Avants, Brian B.

Res Sq ; 2023 Oct 30.

Article in English | MEDLINE | ID: mdl-37961236

ABSTRACT

UK Biobank is a large-scale epidemiological resource for investigating prospective correlations between various lifestyle, environmental, and genetic factors with health and disease progression. In addition to individual subject information obtained through surveys and physical examinations, a comprehensive neuroimaging battery consisting of multiple modalities provides imaging-derived phenotypes (IDPs) that can serve as biomarkers in neuroscience research. In this study, we augment the existing set of UK Biobank neuroimaging structural IDPs, obtained from well-established software libraries such as FSL and FreeSurfer, with related measurements acquired through the Advanced Normalization Tools Ecosystem. This includes previously established cortical and subcortical measurements defined, in part, based on the Desikan-Killiany-Tourville atlas. Also included are morphological measurements from two recent developments: medial temporal lobe parcellation of hippocampal and extra-hippocampal regions in addition to cerebellum parcellation and thickness based on the Schmahmann anatomical labeling. Through predictive modeling, we assess the clinical utility of these IDP measurements, individually and in combination, using commonly studied phenotypic correlates including age, fluid intelligence, numeric memory, and several other sociodemographic variables. The predictive accuracy of these IDP-based models, in terms of root-mean-squared-error or area-under-the-curve for continuous and categorical variables, respectively, provides comparative insights between software libraries as well as potential clinical interpretability. Results demonstrate varied performance between package-based IDP sets and their combination, emphasizing the need for careful consideration in their selection and utilization.

6.

Generating MCMC proposals by randomly rotating the regular simplex.

Holbrook, Andrew J.

J Multivar Anal ; 1942023 Mar.

Article in English | MEDLINE | ID: mdl-37799825

ABSTRACT

We present the simplicial sampler, a class of parallel MCMC methods that generate and choose from multiple proposals at each iteration. The algorithm's multiproposal randomly rotates a simplex connected to the current Markov chain state in a way that inherently preserves symmetry between proposals. As a result, the simplicial sampler leads to a simplified acceptance step: it simply chooses from among the simplex nodes with probability proportional to their target density values. We also investigate a multivariate Gaussian-based symmetric multiproposal mechanism and prove that it also enjoys the same simplified acceptance step. This insight leads to significant theoretical and practical speedups. While both algorithms enjoy natural parallelizability, we show that conventional implementations are sufficient to confer efficiency gains across an array of dimensions and a number of target distributions.

7.

Accelerating Bayesian inference of dependency between mixed-type biological traits.

Zhang, Zhenyu; Nishimura, Akihiko; Trovão, Nídia S; Cherry, Joshua L; Holbrook, Andrew J; Ji, Xiang; Lemey, Philippe; Suchard, Marc A.

PLoS Comput Biol ; 19(8): e1011419, 2023 08.

Article in English | MEDLINE | ID: mdl-37639445

ABSTRACT

Inferring dependencies between mixed-type biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck-integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.

Subject(s)

Influenza A Virus, H1N1 Subtype , Bayes Theorem , Influenza A Virus, H1N1 Subtype/genetics , Phylogeny , Flowers , Glycosylation

8.

Random-effects substitution models for phylogenetics via scalable gradient approximations.

Magee, Andrew F; Holbrook, Andrew J; Pekar, Jonathan E; Caviedes-Solis, Itzue W; Iv, Fredrick A Matsen; Baele, Guy; Wertheim, Joel O; Ji, Xiang; Lemey, Philippe; Suchard, Marc A.

ArXiv ; 2023 Sep 25.

Article in English | MEDLINE | ID: mdl-36994154

ABSTRACT

Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.

9.

BAYESIAN MITIGATION OF SPATIAL COARSENING FOR A HAWKES MODEL APPLIED TO GUNFIRE, WILDFIRE AND VIRAL CONTAGION.

Holbrook, Andrew J; Ji, Xiang; Suchard, Marc A.

Ann Appl Stat ; 16(1): 573-595, 2022 Mar.

Article in English | MEDLINE | ID: mdl-36211254

ABSTRACT

Self-exciting spatiotemporal Hawkes processes have found increasing use in the study of large-scale public health threats, ranging from gun violence and earthquakes to wildfires and viral contagion. Whereas many such applications feature locational uncertainty, that is, the exact spatial positions of individual events are unknown, most Hawkes model analyses to date have ignored spatial coarsening present in the data. Three particular 21st century public health crises-urban gun violence, rural wildfires and global viral spread-present qualitatively and quantitatively varying uncertainty regimes that exhibit: (a) different collective magnitudes of spatial coarsening, (b) uniform and mixed magnitude coarsening, (c) differently shaped uncertainty regions and-less orthodox-(d) locational data distributed within the "wrong" effective space. We explicitly model such uncertainties in a Bayesian manner and jointly infer unknown locations together with all parameters of a reasonably flexible Hawkes model, obtaining results that are practically and statistically distinct from those obtained while ignoring spatial coarsening. This work also features two different secondary contributions: first, to facilitate Bayesian inference of locations and background rate parameters, we make a subtle yet crucial change to an established kernel-based rate model, and second, to facilitate the same Bayesian inference at scale, we develop a massively parallel implementation of the model's log-likelihood gradient with respect to locations and thus avoid its quadratic computational cost in the context of Hamiltonian Monte Carlo. Our examples involve thousands of observations and allow us to demonstrate practicality at moderate scales.

10.

From viral evolution to spatial contagion: a biologically modulated Hawkes model.

Holbrook, Andrew J; Ji, Xiang; Suchard, Marc A.

Bioinformatics ; 38(7): 1846-1856, 2022 03 28.

Article in English | MEDLINE | ID: mdl-35040956

ABSTRACT

SUMMARY: Mutations sometimes increase contagiousness for evolving pathogens. During an epidemic, scientists use viral genome data to infer a shared evolutionary history and connect this history to geographic spread. We propose a model that directly relates a pathogen's evolution to its spatial contagion dynamics-effectively combining the two epidemiological paradigms of phylogenetic inference and self-exciting process modeling-and apply this phylogenetic Hawkes process to a Bayesian analysis of 23â421 viral cases from the 2014 to 2016 Ebola outbreak in West Africa. The proposed model is able to detect individual viruses with significantly elevated rates of spatiotemporal propagation for a subset of 1610 samples that provide genome data. Finally, to facilitate model application in big data settings, we develop massively parallel implementations for the gradient and Hessian of the log-likelihood and apply our high-performance computing framework within an adaptively pre-conditioned Hamiltonian Monte Carlo routine. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Hemorrhagic Fever, Ebola , Humans , Bayes Theorem , Phylogeny , Disease Outbreaks , Genome, Viral

11.

Principled, practical, flexible, fast: a new approach to phylogenetic factor analysis.

Hassler, Gabriel W; Gallone, Brigida; Aristide, Leandro; Allen, William L; Tolkoff, Max R; Holbrook, Andrew J; Baele, Guy; Lemey, Philippe; Suchard, Marc A.

Methods Ecol Evol ; 13(10): 2181-2197, 2022 Oct.

Article in English | MEDLINE | ID: mdl-36908682

ABSTRACT

Biological phenotypes are products of complex evolutionary processes in which selective forces influence multiple biological trait measurements in unknown ways. Phylogenetic comparative methods seek to disentangle these relationships across the evolutionary history of a group of organisms. Unfortunately, most existing methods fail to accommodate high-dimensional data with dozens or even thousands of observations per taxon. Phylogenetic factor analysis offers a solution to the challenge of dimensionality. However, scientists seeking to employ this modeling framework confront numerous modeling and implementation decisions, the details of which pose computational and replicability challenges.We develop new inference techniques that increase both the computational efficiency and modeling flexibility of phylogenetic factor analysis. To facilitate adoption of these new methods, we present a practical analysis plan that guides researchers through the web of complex modeling decisions. We codify this analysis plan in an automated pipeline that distills the potentially overwhelming array of decisions into a small handful of (typically binary) choices.We demonstrate the utility of these methods and analysis plan in four real-world problems of varying scales. Specifically, we study floral phenotype and pollination in columbines, domestication in industrial yeast, life history in mammals, and brain morphology in New World monkeys.General and impactful community employment of these methods requires a data scientific analysis plan that balances flexibility, speed and ease of use, while minimizing model and algorithm tuning. Even in the presence of non-trivial phylogenetic model constraints, we show that one may analytically address latent factor uncertainty in a way that (a) aids model flexibility, (b) accelerates computation (by as much as 500-fold) and (c) decreases required tuning. These efforts coalesce to create an accessible Bayesian approach to high-dimensional phylogenetic comparative methods on large trees.

12.

Scalable Bayesian inference for self-excitatory stochastic processes applied to big American gunfire data.

Holbrook, Andrew J; Loeffler, Charles E; Flaxman, Seth R; Suchard, Marc A.

Stat Comput ; 31(1)2021 Jan.

Article in English | MEDLINE | ID: mdl-34354329

ABSTRACT

The Hawkes process and its extensions effectively model self-excitatory phenomena including earthquakes, viral pandemics, financial transactions, neural spike trains and the spread of memes through social networks. The usefulness of these stochastic process models within a host of economic sectors and scientific disciplines is undercut by the processes' computational burden: complexity of likelihood evaluations grows quadratically in the number of observations for both the temporal and spatiotemporal Hawkes processes. We show that, with care, one may parallelize these calculations using both central and graphics processing unit implementations to achieve over 100-fold speedups over single-core processing. Using a simple adaptive Metropolis-Hastings scheme, we apply our high-performance computing framework to a Bayesian analysis of big gunshot data generated in Washington D.C. between the years of 2006 and 2019, thereby extending a past analysis of the same data from under 10,000 to over 85,000 observations. To encourage widespread use, we provide hpHawkes, an open-source R package, and discuss high-level implementation and program design for leveraging aspects of computational hardware that become necessary in a big data setting.

13.

Massive parallelization boosts big Bayesian multidimensional scaling.

Holbrook, Andrew J; Lemey, Philippe; Baele, Guy; Dellicour, Simon; Brockmann, Dirk; Rambaut, Andrew; Suchard, Marc A.

J Comput Graph Stat ; 30(1): 11-24, 2021.

Article in English | MEDLINE | ID: mdl-34168419

ABSTRACT

Big Bayes is the computationally intensive co-application of big data and large, expressive Bayesian models for the analysis of complex phenomena in scientific inference and statistical learning. Standing as an example, Bayesian multidimensional scaling (MDS) can help scientists learn viral trajectories through space-time, but its computational burden prevents its wider use. Crucial MDS model calculations scale quadratically in the number of observations. We partially mitigate this limitation through massive parallelization using multi-core central processing units, instruction-level vectorization and graphics processing units (GPUs). Fitting the MDS model using Hamiltonian Monte Carlo, GPUs can deliver more than 100-fold speedups over serial calculations and thus extend Bayesian MDS to a big data setting. To illustrate, we employ Bayesian MDS to infer the rate at which different seasonal influenza virus subtypes use worldwide air traffic to spread around the globe. We examine 5392 viral sequences and their associated 14 million pairwise distances arising from the number of commercial airline seats per year between viral sampling locations. To adjust for shared evolutionary history of the viruses, we implement a phylogenetic extension to the MDS model and learn that subtype H3N2 spreads most effectively, consistent with its epidemic success relative to other seasonal influenza subtypes. Finally, we provide MassiveMDS, an open-source, stand-alone C++ library and rudimentary R package, and discuss program design and high-level implementation with an emphasis on important aspects of computing architecture that become relevant at scale.

14.

The ANTsX ecosystem for quantitative biological and medical imaging.

Tustison, Nicholas J; Cook, Philip A; Holbrook, Andrew J; Johnson, Hans J; Muschelli, John; Devenyi, Gabriel A; Duda, Jeffrey T; Das, Sandhitsu R; Cullen, Nicholas C; Gillen, Daniel L; Yassa, Michael A; Stone, James R; Gee, James C; Avants, Brian B.

Sci Rep ; 11(1): 9068, 2021 04 27.

Article in English | MEDLINE | ID: mdl-33907199

ABSTRACT

The Advanced Normalizations Tools ecosystem, known as ANTsX, consists of multiple open-source software libraries which house top-performing algorithms used worldwide by scientific and research communities for processing and analyzing biological and medical imaging data. The base software library, ANTs, is built upon, and contributes to, the NIH-sponsored Insight Toolkit. Founded in 2008 with the highly regarded Symmetric Normalization image registration framework, the ANTs library has since grown to include additional functionality. Recent enhancements include statistical, visualization, and deep learning capabilities through interfacing with both the R statistical project (ANTsR) and Python (ANTsPy). Additionally, the corresponding deep learning extensions ANTsRNet and ANTsPyNet (built on the popular TensorFlow/Keras libraries) contain several popular network architectures and trained models for specific applications. One such comprehensive application is a deep learning analog for generating cortical thickness data from structural T1-weighted brain MRI, both cross-sectionally and longitudinally. These pipelines significantly improve computational efficiency and provide comparable-to-superior accuracy over multiple criteria relative to the existing ANTs workflows and simultaneously illustrate the importance of the comprehensive ANTsX approach as a framework for medical image analysis.

Subject(s)

Algorithms , Brain/anatomy & histology , Ecosystem , Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Neuroimaging/methods , Adult , Aged , Humans , Male , Middle Aged , Software

15.

Anterolateral entorhinal cortex thickness as a new biomarker for early detection of Alzheimer's disease.

Holbrook, Andrew J; Tustison, Nicholas J; Marquez, Freddie; Roberts, Jared; Yassa, Michael A; Gillen, Daniel L.

Alzheimers Dement (Amst) ; 12(1): e12068, 2020.

Article in English | MEDLINE | ID: mdl-32875052

ABSTRACT

INTRODUCTION: Loss of entorhinal cortex (EC) layer II neurons represents the earliest Alzheimer's disease (AD) lesion in the brain. Research suggests differing functional roles between two EC subregions, the anterolateral EC (aLEC) and the posteromedial EC (pMEC). METHODS: We use joint label fusion to obtain aLEC and pMEC cortical thickness measurements from serial magnetic resonance imaging scans of 775 ADNI-1 participants (219 healthy; 380 mild cognitive impairment; 176 AD) and use linear mixed-effects models to analyze longitudinal associations among cortical thickness, disease status, and cognitive measures. RESULTS: Group status is reliably predicted by aLEC thickness, which also exhibits greater associations with cognitive outcomes than does pMEC thickness. Change in aLEC thickness is also associated with cerebrospinal fluid amyloid and tau levels. DISCUSSION: Thinning of aLEC is a sensitive structural biomarker that changes over short durations in the course of AD and tracks disease severity-it is a strong candidate biomarker for detection of early AD.

16.

Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics.

Ji, Xiang; Zhang, Zhenyu; Holbrook, Andrew; Nishimura, Akihiko; Baele, Guy; Rambaut, Andrew; Lemey, Philippe; Suchard, Marc A.

Mol Biol Evol ; 37(10): 3047-3060, 2020 10 01.

Article in English | MEDLINE | ID: mdl-32458974

ABSTRACT

Calculation of the log-likelihood stands as the computational bottleneck for many statistical phylogenetic algorithms. Even worse is its gradient evaluation, often used to target regions of high probability. Order O(N)-dimensional gradient calculations based on the standard pruning algorithm require O(N2) operations, where N is the number of sampled molecular sequences. With the advent of high-throughput sequencing, recent phylogenetic studies have analyzed hundreds to thousands of sequences, with an apparent trend toward even larger data sets as a result of advancing technology. Such large-scale analyses challenge phylogenetic reconstruction by requiring inference on larger sets of process parameters to model the increasing data heterogeneity. To make these analyses tractable, we present a linear-time algorithm for O(N)-dimensional gradient evaluation and apply it to general continuous-time Markov processes of sequence substitution on a phylogenetic tree without a need to assume either stationarity or reversibility. We apply this approach to learn the branch-specific evolutionary rates of three pathogenic viruses: West Nile virus, Dengue virus, and Lassa virus. Our proposed algorithm significantly improves inference efficiency with a 126- to 234-fold increase in maximum-likelihood optimization and a 16- to 33-fold computational performance increase in a Bayesian framework.

Subject(s)

Evolution, Molecular , Models, Genetic , Phylogeny , Algorithms , Flavivirus/genetics , Lassa virus/genetics

17.

Flexible Bayesian Dynamic Modeling of Correlation and Covariance Matrices.

Lan, Shiwei; Holbrook, Andrew; Elias, Gabriel A; Fortin, Norbert J; Ombao, Hernando; Shahbaba, Babak.

Bayesian Anal ; 15(4): 1199-1228, 2020 Dec.

Article in English | MEDLINE | ID: mdl-33868547

ABSTRACT

Modeling correlation (and covariance) matrices can be challenging due to the positive-definiteness constraint and potential high-dimensionality. Our approach is to decompose the covariance matrix into the correlation and variance matrices and propose a novel Bayesian framework based on modeling the correlations as products of unit vectors. By specifying a wide range of distributions on a sphere (e.g. the squared-Dirichlet distribution), the proposed approach induces flexible prior distributions for covariance matrices (that go beyond the commonly used inverse-Wishart prior). For modeling real-life spatio-temporal processes with complex dependence structures, we extend our method to dynamic cases and introduce unit-vector Gaussian process priors in order to capture the evolution of correlation among components of a multivariate time series. To handle the intractability of the resulting posterior, we introduce the adaptive Δ-Spherical Hamiltonian Monte Carlo. We demonstrate the validity and flexibility of our proposed framework in a simulation study of periodic processes and an analysis of rat's local field potential activity in a complex sequence memory task.

18.

Neural network gradient Hamiltonian Monte Carlo.

Li, Lingge; Holbrook, Andrew; Shahbaba, Babak; Baldi, Pierre.

Comput Stat ; 34(1): 281-299, 2019 Mar.

Article in English | MEDLINE | ID: mdl-31695242

ABSTRACT

Hamiltonian Monte Carlo is a widely used algorithm for sampling from posterior distributions of complex Bayesian models. It can efficiently explore high-dimensional parameter spaces guided by simulated Hamiltonian flows. However, the algorithm requires repeated gradient calculations, and these computations become increasingly burdensome as data sets scale. We present a method to substantially reduce the computation burden by using a neural network to approximate the gradient. First, we prove that the proposed method still maintains convergence to the true distribution though the approximated gradient no longer comes from a Hamiltonian system. Second, we conduct experiments on synthetic examples and real data to validate the proposed method.

19.

Longitudinal Mapping of Cortical Thickness Measurements: An Alzheimer's Disease Neuroimaging Initiative-Based Evaluation Study.

Tustison, Nicholas J; Holbrook, Andrew J; Avants, Brian B; Roberts, Jared M; Cook, Philip A; Reagh, Zachariah M; Duda, Jeffrey T; Stone, James R; Gillen, Daniel L; Yassa, Michael A.

J Alzheimers Dis ; 71(1): 165-183, 2019.

Article in English | MEDLINE | ID: mdl-31356207

ABSTRACT

Longitudinal studies of development and disease in the human brain have motivated the acquisition of large neuroimaging data sets and the concomitant development of robust methodological and statistical tools for quantifying neurostructural changes. Longitudinal-specific strategies for acquisition and processing have potentially significant benefits including more consistent estimates of intra-subject measurements while retaining predictive power. Using the first phase of the Alzheimer's Disease Neuroimaging Initiative (ADNI-1) data, comprising over 600 subjects with multiple time points from baseline to 36 months, we evaluate the utility of longitudinal FreeSurfer and Advanced Normalization Tools (ANTs) surrogate thickness values in the context of a linear mixed-effects (LME) modeling strategy. Specifically, we estimate the residual variability and between-subject variability associated with each processing stream as it is known from the statistical literature that minimizing the former while simultaneously maximizing the latter leads to greater scientific interpretability in terms of tighter confidence intervals in calculated mean trends, smaller prediction intervals, and narrower confidence intervals for determining cross-sectional effects. This strategy is evaluated over the entire cortex, as defined by the Desikan-Killiany-Tourville labeling protocol, where comparisons are made with the cross-sectional and longitudinal FreeSurfer processing streams. Subsequent linear mixed effects modeling for identifying diagnostic groupings within the ADNI cohort is provided as supporting evidence for the utility of the proposed ANTs longitudinal framework which provides unbiased structural neuroimage processing and competitive to superior power for longitudinal structural change detection.

Subject(s)

Alzheimer Disease/diagnostic imaging , Biomarkers , Brain/diagnostic imaging , Brain/pathology , Cross-Sectional Studies , Disease Progression , Female , Humans , Linear Models , Longitudinal Studies , Male , Neuroimaging

20.

Geodesic Lagrangian Monte Carlo over the space of positive definite matrices: with application to Bayesian spectral density estimation.

Holbrook, Andrew; Lan, Shiwei; Vandenberg-Rodes, Alexander; Shahbaba, Babak.

J Stat Comput Simul ; 88(5): 982-1002, 2018.

Article in English | MEDLINE | ID: mdl-31105358

ABSTRACT

We present geodesic Lagrangian Monte Carlo, an extension of Hamiltonian Monte Carlo for sampling from posterior distributions defined on general Riemannian manifolds. We apply this new algorithm to Bayesian inference on symmetric or Hermitian positive definite matrices. To do so, we exploit the Riemannian structure induced by Cartan's canonical metric. The geodesics that correspond to this metric are available in closed-form and-within the context of Lagrangian Monte Carlo-provide a principled way to travel around the space of positive definite matrices. Our method improves Bayesian inference on such matrices by allowing for a broad range of priors, so we are not limited to conjugate priors only. In the context of spectral density estimation, we use the (non-conjugate) complex reference prior as an example modeling option made available by the algorithm. Results based on simulated and real-world multivariate time series are presented in this context, and future directions are outlined.

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL