Strain Tracking with Uncertainty Quantification.

Kim, Younhun; Worby, Colin J; Acharya, Sawal; van Dijk, Lucas R; Alfonsetti, Daniel; Gromko, Zackary; Azimzadeh, Philippe; Dodson, Karen; Gerber, Georg; Hultgren, Scott; Earl, Ashlee M; Berger, Bonnie; Gibson, Travis E.

bioRxiv ; 2023 Jan 26.

Article in English | MEDLINE | ID: mdl-36747646

ABSTRACT

The ability to detect and quantify microbiota over time has a plethora of clinical, basic science, and public health applications. One of the primary means of tracking microbiota is through sequencing technologies. When the microorganism of interest is well characterized or known a priori, targeted sequencing is often used. In many applications, however, untargeted bulk (shotgun) sequencing is more appropriate; for instance, the tracking of infection transmission events and nucleotide variants across multiple genomic loci, or studying the role of multiple genes in a particular phenotype. Given these applications, and the observation that pathogens (e.g. Clostridioides difficile, Escherichia coli, Salmonella enterica) and other taxa of interest can reside at low relative abundance in the gastrointestinal tract, there is a critical need for algorithms that accurately track low-abundance taxa with strain level resolution. Here we present a sequence quality- and time-aware model, ChronoStrain, that introduces uncertainty quantification to gauge low-abundance species and significantly outperforms the current state-of-the-art on both real and synthetic data. ChronoStrain leverages sequences' quality scores and the samples' temporal information to produce a probability distribution over abundance trajectories for each strain tracked in the model. We demonstrate ChronoStrain's improved performance in capturing post-antibiotic E. coli strain blooms among women with recurrent urinary tract infections (UTIs) from the UTI Microbiome (UMB) Project. Other strain tracking models on the same data either show inconsistent temporal colonization or can only track consistently using very coarse groupings. In contrast, our probabilistic outputs can reveal the relationship between low-confidence strains present in the sample that cannot be reliably assigned a single reference label (either due to poor coverage or novelty) while simultaneously calling high-confidence strains that can be unambiguously assigned a label. We also include and analyze newly sequenced cultured samples from the UMB Project.

How Many Subpopulations Is Too Many? Exponential Lower Bounds for Inferring Population Histories.

Kim, Younhun; Koehler, Frederic; Moitra, Ankur; Mossel, Elchanan; Ramnarayan, Govind.

J Comput Biol ; 27(4): 613-625, 2020 04.

Article in English | MEDLINE | ID: mdl-31794679

ABSTRACT

Reconstruction of population histories is a central problem in population genetics. Existing coalescent-based methods, such as the seminal work of Li and Durbin, attempt to solve this problem using sequence data but have no rigorous guarantees. Determining the amount of data needed to correctly reconstruct population histories is a major challenge. Using a variety of tools from information theory, the theory of extremal polynomials, and approximation theory, we prove new sharp information-theoretic lower bounds on the problem of reconstructing population structure: the history of multiple subpopulations that merge, split, and change sizes over time. Our lower bounds are exponential in the number of subpopulations, even when reconstructing recent histories. We demonstrate the sharpness of our lower bounds by providing algorithms for distinguishing and learning population histories with matching dependence on the number of subpopulations. Along the way and of independent interest, we essentially determine the optimal number of samples needed to learn an exponential mixture distribution information-theoretically, proving the upper bound by analyzing natural (and efficient) algorithms for this problem.

Subject(s)

Computational Biology; Genetics, Population; Information Theory; Models, Genetic; Algorithms; Computer Simulation; Polymorphism, Single Nucleotide/genetics

Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes.

Leiserson, Mark D M; Vandin, Fabio; Wu, Hsin-Ta; Dobson, Jason R; Eldridge, Jonathan V; Thomas, Jacob L; Papoutsaki, Alexandra; Kim, Younhun; Niu, Beifang; McLellan, Michael; Lawrence, Michael S; Gonzalez-Perez, Abel; Tamborero, David; Cheng, Yuwei; Ryslik, Gregory A; Lopez-Bigas, Nuria; Getz, Gad; Ding, Li; Raphael, Benjamin J.

Nat Genet ; 47(2): 106-14, 2015 Feb.

Article in English | MEDLINE | ID: mdl-25501392

ABSTRACT

Cancers exhibit extensive mutational heterogeneity, and the resulting long-tail phenomenon complicates the discovery of genes and pathways that are significantly mutated in cancer. We perform a pan-cancer analysis of mutated networks in 3,281 samples from 12 cancer types from The Cancer Genome Atlas (TCGA) using HotNet2, a new algorithm to find mutated subnetworks that overcomes the limitations of existing single-gene, pathway and network approaches. We identify 16 significantly mutated subnetworks that comprise well-known cancer signaling pathways as well as subnetworks with less characterized roles in cancer, including cohesin, condensin and others. Many of these subnetworks exhibit co-occurring mutations across samples. These subnetworks contain dozens of genes with rare somatic mutations across multiple cancers; many of these genes have additional evidence supporting a role in cancer. By illuminating these rare combinations of mutations, pan-cancer network analyses provide a roadmap to investigate new diagnostic and therapeutic opportunities across cancer types.

Subject(s)

Algorithms; Computational Biology/methods; Gene Regulatory Networks/genetics; Genome/genetics; Neoplasms/genetics; Signal Transduction/genetics; Databases, Genetic; Humans; Multiprotein Complexes/genetics; Mutation; Neoplasms/diagnosis
