1.
Bioinformatics ; 40(Supplement_1): i189-i198, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940152

ABSTRACT

MOTIVATION: Multimodal profiling strategies promise to produce more informative insights into biomedical cohorts via the integration of the information each modality contributes. To perform this integration, however, the development of novel analytical strategies is needed. Multimodal profiling strategies often come at the expense of lower sample numbers, which can challenge methods to uncover shared signals across a cohort. Thus, factor analysis approaches are commonly used for the analysis of high-dimensional data in molecular biology, however, they typically do not yield representations that are directly interpretable, whereas many research questions often center around the analysis of pathways associated with specific observations. RESULTS: We develop PathFA, a novel approach for multimodal factor analysis over the space of pathways. PathFA produces integrative and interpretable views across multimodal profiling technologies, which allow for the derivation of concrete hypotheses. PathFA combines a pathway-learning approach with integrative multimodal capability under a Bayesian procedure that is efficient, hyper-parameter free, and able to automatically infer observation noise from the data. We demonstrate strong performance on small sample sizes within our simulation framework and on matched proteomics and transcriptomics profiles from real tumor samples taken from the Swiss Tumor Profiler consortium. On a subcohort of melanoma patients, PathFA recovers pathway activity that has been independently associated with poor outcome. We further demonstrate the ability of this approach to identify pathways associated with the presence of specific cell-types as well as tumor heterogeneity. Our results show that we capture known biology, making it well suited for analyzing multimodal sample cohorts. AVAILABILITY AND IMPLEMENTATION: The tool is implemented in python and available at https://github.com/ratschlab/path-fa.


Subjects
Bayes Theorem , Humans , Proteomics/methods , Factor Analysis, Statistical , Gene Expression Profiling/methods , Melanoma/metabolism , Algorithms , Computational Biology/methods
2.
Bioinformatics ; 40(Supplement_1): i337-i346, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940164

ABSTRACT

MOTIVATION: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
AVAILABILITY AND IMPLEMENTATION: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.


Subjects
Algorithms , Sequence Alignment , Sequence Alignment/methods , Software , Computational Biology/methods , Sequence Analysis, DNA/methods , Databases, Genetic
3.
Bioinformatics ; 40(Supplement_1): i247-i256, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940165

ABSTRACT

MOTIVATION: Acute kidney injury (AKI) is a syndrome that affects a large fraction of all critically ill patients, and early diagnosis to receive adequate treatment is as imperative as it is challenging to make early. Consequently, machine learning approaches have been developed to predict AKI ahead of time. However, the prevalence of AKI is often underestimated in state-of-the-art approaches, as they rely on an AKI event annotation solely based on creatinine, ignoring urine output. We construct and evaluate early warning systems for AKI in a multi-disciplinary ICU setting, using the complete KDIGO definition of AKI. We propose several variants of gradient-boosted decision tree (GBDT)-based models, including a novel time-stacking based approach. A state-of-the-art LSTM-based model previously proposed for AKI prediction is used as a comparison, which was not specifically evaluated in ICU settings yet. RESULTS: We find that optimal performance is achieved by using GBDT with the time-based stacking technique (AUPRC = 65.7%, compared with the LSTM-based model's AUPRC = 62.6%), which is motivated by the high relevance of time since ICU admission for this task. Both models show mildly reduced performance in the limited training data setting, perform fairly across different subcohorts, and exhibit no issues in gender transfer. Following the official KDIGO definition substantially increases the number of annotated AKI events. In our study GBDTs outperform LSTM models for AKI prediction. Generally, we find that both model types are robust in a variety of challenging settings arising for ICU data. AVAILABILITY AND IMPLEMENTATION: The code to reproduce the findings of our manuscript can be found at: https://github.com/ratschlab/AKI-EWS.
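
The abstract above reports model quality as AUPRC (area under the precision-recall curve). As a reading aid, here is a minimal Python sketch of one common estimator of that quantity, the step-wise "average precision". The function name and the toy labels are illustrative assumptions; this is not code from the AKI-EWS repository, and it does not treat tied scores specially.

```python
def average_precision(y_true, y_score):
    """Step-wise estimate of the area under the precision-recall curve.

    y_true  : list of 0/1 labels (1 = event, e.g. an annotated AKI onset)
    y_score : list of predicted risk scores, higher = more likely an event
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank samples from highest to lowest predicted risk.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)

    true_positives = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            # Precision at the rank where this positive is recovered.
            precision_sum += true_positives / rank

    # Average the precision over all positives (each adds 1/n_pos recall).
    return precision_sum / n_pos


# Toy example: two events, ranked 1st and 3rd out of four samples.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Because precision depends on how many positives exist, the baseline AUPRC of an uninformative model is roughly the event prevalence. This is one reason the choice of event annotation (creatinine only versus the full KDIGO definition) matters when comparing reported values.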


Subjects
Acute Kidney Injury , Intensive Care Units , Humans , Machine Learning , Male , Female , Decision Trees , Aged , Middle Aged
4.
Bioinformatics ; 40(5), 2024 May 02.
Article in English | MEDLINE | ID: mdl-38603597

ABSTRACT

MOTIVATION: The Oxford Nanopore Technologies (ONT) ReadUntil API enables selective sequencing, which aims to selectively favor interesting over uninteresting reads, e.g. to deplete or enrich certain genomic regions. The performance gain depends on the selective sequencing decision-making algorithm (SSDA) which decides whether to reject a read, stop receiving a read, or wait for more data. Since real runs are time-consuming and costly, simulating the ONT sequencer with support for the ReadUntil API is highly beneficial for comparing and optimizing new SSDAs. Existing software like MinKNOW and UNCALLED only return raw signal data, are memory-intensive, require huge and often unavailable multi-fast5 files (≥100GB) and are not clearly documented. RESULTS: We present the ONT device simulator SimReadUntil that takes a set of full reads as input, distributes them to channels and plays them back in real time including mux scans, channel gaps and blockages, and allows to reject reads as well as stop receiving data from them. Our modified ReadUntil API provides the basecalled reads rather than the raw signal, reducing computational load and focusing on the SSDA rather than on basecalling. Tuning the parameters of tools like ReadFish and ReadBouncer becomes easier because a GPU for basecalling is no longer required. We offer various methods to extract simulation parameters from a sequencing summary file and adapt ReadFish to replicate one of their enrichment experiments. SimReadUntil's gRPC interface allows standardized interaction with a wide range of programming languages. AVAILABILITY AND IMPLEMENTATION: Code and fully worked examples are available on GitHub (https://github.com/ratschlab/sim_read_until).
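
To make the three possible outcomes of a selective sequencing decision-making algorithm (reject, stop receiving, wait) concrete, the following Python sketch shows a toy decision rule for an enrichment scenario. Everything here is a hypothetical illustration: the `Decision` enum, the `decide` function, the k-mer lookup, and all thresholds are assumptions and are not part of SimReadUntil, ReadFish, or the ONT ReadUntil API.

```python
from enum import Enum


class Decision(Enum):
    REJECT = "reject"                  # eject the read from the pore
    STOP_RECEIVING = "stop_receiving"  # keep sequencing, stop streaming chunks
    WAIT = "wait"                      # undecided, ask for more data


def decide(read_prefix, target_kmers, k=5, min_len=20, max_len=60,
           hit_fraction=0.5):
    """Toy enrichment rule on the basecalled prefix of a read.

    All parameters are illustrative assumptions, not tuned values.
    """
    if len(read_prefix) < min_len:
        return Decision.WAIT  # too little sequence to judge

    kmers = [read_prefix[i:i + k] for i in range(len(read_prefix) - k + 1)]
    hits = sum(kmer in target_kmers for kmer in kmers)

    if hits / len(kmers) >= hit_fraction:
        return Decision.STOP_RECEIVING  # looks on-target: let it finish
    if len(read_prefix) >= max_len:
        return Decision.REJECT          # enough data and still off-target
    return Decision.WAIT


# Usage with a tiny made-up target region.
target = "ACGTACGTACGTACGTACGTACGT"
target_kmers = {target[i:i + 5] for i in range(len(target) - 4)}
print(decide("ACGT", target_kmers))        # Decision.WAIT
print(decide(target[:20], target_kmers))   # Decision.STOP_RECEIVING
print(decide("T" * 60, target_kmers))      # Decision.REJECT
```

A rule of this shape only needs basecalled sequence, which is the kind of input the abstract says the modified API provides instead of raw signal.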


Subjects
Algorithms , Benchmarking , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Nanopore Sequencing/methods
5.
NPJ Digit Med ; 7(1): 64, 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38467710

ABSTRACT

Multiple sclerosis (MS) is a neurological disease of the central nervous system that is the leading cause of non-traumatic disability in young adults. Clinical laboratory tests and neuroimaging studies are the standard methods to diagnose and monitor MS. However, due to infrequent clinic visits, it is fundamental to identify remote and frequent approaches for monitoring MS, which enable timely diagnosis, early access to treatment, and slowing down disease progression. In this work, we investigate the most reliable, clinically useful, and available features derived from mobile and wearable devices as well as their ability to distinguish people with MS (PwMS) from healthy controls, recognize MS disability and fatigue levels. To this end, we formalize clinical knowledge and derive behavioral markers to characterize MS. We evaluate our approach on a dataset we collected from 55 PwMS and 24 healthy controls for a total of 489 days conducted in free-living conditions. The dataset contains wearable sensor data - e.g., heart rate - collected using an arm-worn device, smartphone data - e.g., phone locks - collected through a mobile application, patient health records - e.g., MS type - obtained from the hospital, and self-reports - e.g., fatigue level - collected using validated questionnaires administered via the mobile application. Our results demonstrate the feasibility of using features derived from mobile and wearable sensors to monitor MS. Our findings open up opportunities for continuous monitoring of MS in free-living conditions and can be used to evaluate and guide the effectiveness of treatments, manage the disease, and identify participants for clinical trials.

7.
Elife ; 12, 2023 10 12.
Article in English | MEDLINE | ID: mdl-37823551

ABSTRACT

The splicing factor SF3B1 is recurrently mutated in various tumors, including pancreatic ductal adenocarcinoma (PDAC). The impact of the hotspot mutation SF3B1K700E on the PDAC pathogenesis, however, remains elusive. Here, we demonstrate that Sf3b1K700E alone is insufficient to induce malignant transformation of the murine pancreas, but that it increases aggressiveness of PDAC if it co-occurs with mutated KRAS and p53. We further show that Sf3b1K700E already plays a role during early stages of pancreatic tumor progression and reduces the expression of TGF-β1-responsive epithelial-mesenchymal transition (EMT) genes. Moreover, we found that SF3B1K700E confers resistance to TGF-β1-induced cell death in pancreatic organoids and cell lines, partly mediated through aberrant splicing of Map3k7. Overall, our findings demonstrate that SF3B1K700E acts as an oncogenic driver in PDAC, and suggest that it promotes the progression of early stage tumors by impeding the cellular response to tumor suppressive effects of TGF-β.


Subjects
Carcinoma, Pancreatic Ductal , Pancreatic Neoplasms , Animals , Humans , Mice , Carcinoma, Pancreatic Ductal/pathology , Cell Line, Tumor , Mutation , Pancreatic Ducts/metabolism , Pancreatic Neoplasms/pathology , Phosphoproteins/metabolism , RNA Splicing Factors/metabolism , Transcription Factors/metabolism , Transforming Growth Factor beta1/metabolism
8.
Nat Methods ; 20(11): 1759-1768, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37770709

ABSTRACT

Understanding and predicting molecular responses in single cells upon chemical, genetic or mechanical perturbations is a core question in biology. Obtaining single-cell measurements typically requires the cells to be destroyed. This makes learning heterogeneous perturbation responses challenging as we only observe unpaired distributions of perturbed or non-perturbed cells. Here we leverage the theory of optimal transport and the recent advent of input convex neural architectures to present CellOT, a framework for learning the response of individual cells to a given perturbation by mapping these unpaired distributions. CellOT outperforms current methods at predicting single-cell drug responses, as profiled by scRNA-seq and a multiplexed protein-imaging technology. Further, we illustrate that CellOT generalizes well on unseen settings by (1) predicting the scRNA-seq responses of holdout patients with lupus exposed to interferon-β and patients with glioblastoma to panobinostat; (2) inferring lipopolysaccharide responses across different species; and (3) modeling the hematopoietic developmental trajectories of different subpopulations.


Subjects
Gene Expression Profiling , Single-Cell Analysis , Humans , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods
9.
PLoS Comput Biol ; 19(5): e1011001, 2023 05.
Article in English | MEDLINE | ID: mdl-37126495

ABSTRACT

The number of published metagenome assemblies is rapidly growing due to advances in sequencing technologies. However, sequencing errors, variable coverage, repetitive genomic regions, and other factors can produce misassemblies, which are challenging to detect for taxonomically novel genomic data. Assembly errors can affect all downstream analyses of the assemblies. Accuracy for the state of the art in reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data. Here, we present the Residual neural network for Misassembled Contig identification (ResMiCo), a deep learning approach for reference-free identification of misassembled contigs. To develop ResMiCo, we first generated a training dataset of unprecedented size and complexity that can be used for further benchmarking and developments in the field. Through rigorous validation, we show that ResMiCo is substantially more accurate than the state of the art, and the model is robust to novel taxonomic diversity and varying assembly methods. ResMiCo estimated 7% misassembled contigs per metagenome across multiple real-world datasets. We demonstrate how ResMiCo can be used to optimize metagenome assembly hyperparameters to improve accuracy, instead of optimizing solely for contiguity. The accuracy, robustness, and ease-of-use of ResMiCo make the tool suitable for general quality control of metagenome assemblies and assembly methodology optimization.


Subjects
Deep Learning , Metagenome , Metagenome/genetics , Genomics/methods , Sequence Analysis, DNA/methods , Metagenomics , Software
10.
Genome Res ; 33(7): 1208-1217, 2023 07.
Article in English | MEDLINE | ID: mdl-37072187

ABSTRACT

Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text] For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.


Subjects
Algorithms , Computational Biology , Computational Biology/methods , Sequence Alignment , Sequence Analysis, DNA/methods
11.
Genomics ; 115(2): 110587, 2023 03.
Article in English | MEDLINE | ID: mdl-36796655

ABSTRACT

Precision oncology relies on the accurate identification of somatic mutations in cancer patients. While the sequencing of the tumoral tissue is frequently part of routine clinical care, the healthy counterparts are rarely sequenced. We previously published PipeIT, a somatic variant calling workflow specific for Ion Torrent sequencing data enclosed in a Singularity container. PipeIT combines user-friendly execution, reproducibility and reliable mutation identification, but relies on matched germline sequencing data to exclude germline variants. Expanding on the original PipeIT, here we describe PipeIT2 to address the clinical need to define somatic mutations in the absence of germline control. We show that PipeIT2 achieves a > 95% recall for variants with variant allele fraction >10%, reliably detects driver and actionable mutations and filters out most of the germline mutations and sequencing artifacts. With its performance, reproducibility, and ease of execution, PipeIT2 is a valuable addition to molecular diagnostics laboratories.
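
The headline figure above (recall above 95% for variants with variant allele fraction above 10%) is a recall restricted to a VAF stratum. The Python sketch below shows how such a stratified recall could be computed from a truth set and a call set. The function name, the variant tuples, and the numbers are made-up illustrations and do not come from PipeIT2 or its evaluation data.

```python
def recall_above_vaf(truth_vaf, called, min_vaf=0.10):
    """Recall restricted to true variants whose VAF exceeds min_vaf.

    truth_vaf : dict mapping a variant key to its true allele fraction
    called    : set of variant keys reported by the caller
    """
    eligible = {variant for variant, vaf in truth_vaf.items() if vaf > min_vaf}
    if not eligible:
        raise ValueError("no true variants above the VAF threshold")
    return len(eligible & called) / len(eligible)


# Hypothetical variants keyed as (chromosome, position, ref, alt).
truth_vaf = {
    ("chr1", 100, "A", "T"): 0.35,
    ("chr2", 200, "G", "C"): 0.12,
    ("chr3", 300, "C", "T"): 0.05,  # below threshold, ignored for recall
    ("chr7", 400, "G", "A"): 0.50,
}
called = {("chr1", 100, "A", "T"), ("chr7", 400, "G", "A"),
          ("chr3", 300, "C", "T")}

print(recall_above_vaf(truth_vaf, called))  # 2 of the 3 eligible variants
```

Low-VAF variants are excluded from the denominator, so a figure of this kind says nothing about sensitivity below the threshold.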


Subjects
Neoplasms , Humans , Neoplasms/diagnosis , Neoplasms/genetics , Pathology, Molecular , Workflow , Reproducibility of Results , Precision Medicine , Mutation , High-Throughput Nucleotide Sequencing
13.
Nat Metab ; 5(1): 80-95, 2023 01.
Article in English | MEDLINE | ID: mdl-36717752

ABSTRACT

Methylmalonic aciduria (MMA) is an inborn error of metabolism with multiple monogenic causes and a poorly understood pathogenesis, leading to the absence of effective causal treatments. Here we employ multi-layered omics profiling combined with biochemical and clinical features of individuals with MMA to reveal a molecular diagnosis for 177 out of 210 (84%) cases, the majority (148) of whom display pathogenic variants in methylmalonyl-CoA mutase (MMUT). Stratification of these data layers by disease severity shows dysregulation of the tricarboxylic acid cycle and its replenishment (anaplerosis) by glutamine. The relevance of these disturbances is evidenced by multi-organ metabolomics of a hemizygous Mmut mouse model as well as through identification of physical interactions between MMUT and glutamine anaplerotic enzymes. Using stable-isotope tracing, we find that treatment with dimethyl-oxoglutarate restores deficient tricarboxylic acid cycling. Our work highlights glutamine anaplerosis as a potential therapeutic intervention point in MMA.


Subjects
Metabolism, Inborn Errors , Methylmalonyl-CoA Mutase , Mice , Animals , Methylmalonyl-CoA Mutase/genetics , Methylmalonyl-CoA Mutase/metabolism , Glutamine , Multiomics , Metabolism, Inborn Errors/genetics
14.
Neuro Oncol ; 25(4): 662-673, 2023 04 06.
Article in English | MEDLINE | ID: mdl-36124685

ABSTRACT

BACKGROUND: Adult-type diffuse gliomas, CNS WHO grade 4 are the most aggressive primary brain tumors and represent a particular challenge for therapeutic intervention. METHODS: In a single-center retrospective study of matched pairs of initial and post-therapeutic glioma cases with a recurrence period greater than 1 year, we performed whole exome sequencing combined with mRNA and microRNA expression profiling to identify processes that are altered in recurrent gliomas. RESULTS: Mutational analysis of recurrent gliomas revealed early branching evolution in 75% of the patients. High plasticity was confirmed at the mRNA and miRNA levels. SBS1 signature was reduced and SBS11 was elevated, demonstrating the effect of alkylating agent therapy on the mutational landscape. There was no evidence for secondary genomic alterations driving therapy resistance. ALK7/ACVR1C and LTBP1 were upregulated, whereas LEFTY2 was downregulated, pointing towards enhanced Transforming Growth Factor β (TGF-β) signaling in recurrent gliomas. Consistently, altered microRNA expression profiles pointed towards enhanced Nuclear Factor Kappa B and Wnt signaling that, cooperatively with TGF-β, induces epithelial to mesenchymal transition (EMT), migration, and stemness. TGF-β-induced expression of pro-apoptotic proteins and repression of antiapoptotic proteins were uncoupled in the recurrent tumor. CONCLUSIONS: Our results suggest an important role of TGF-β signaling in recurrent gliomas. This may have clinical implications since TGF-β inhibitors have entered clinical phase studies and may potentially be used in combination therapy to interfere with chemoradiation resistance. Recurrent gliomas show high incidence of early branching evolution. High tumor plasticity is confirmed at the level of microRNA and mRNA expression profiles.


Subjects
Brain Neoplasms , Glioma , MicroRNAs , Humans , Adult , Up-Regulation , Epithelial-Mesenchymal Transition/genetics , Retrospective Studies , Glioma/pathology , Transforming Growth Factor beta/genetics , Transforming Growth Factor beta/metabolism , MicroRNAs/genetics , Recurrence , RNA, Messenger/metabolism , Brain Neoplasms/metabolism , Cell Line, Tumor , Activin Receptors, Type I/genetics , Activin Receptors, Type I/metabolism
15.
Cell Rep ; 40(8): 111266, 2022 08 23.
Article in English | MEDLINE | ID: mdl-36001976

ABSTRACT

Mutations in the splicing factor SF3B1 are frequently occurring in various cancers and drive tumor progression through the activation of cryptic splice sites in multiple genes. Recent studies also demonstrate a positive correlation between the expression levels of wild-type SF3B1 and tumor malignancy. Here, we demonstrate that SF3B1 is a hypoxia-inducible factor (HIF)-1 target gene that positively regulates HIF1 pathway activity. By physically interacting with HIF1α, SF3B1 facilitates binding of the HIF1 complex to hypoxia response elements (HREs) to activate target gene expression. To further validate the relevance of this mechanism for tumor progression, we show that a reduction in SF3B1 levels via monoallelic deletion of Sf3b1 impedes tumor formation and progression via impaired HIF signaling in a mouse model for pancreatic cancer. Our work uncovers an essential role of SF3B1 in HIF1 signaling, thereby providing a potential explanation for the link between high SF3B1 expression and aggressiveness of solid tumors.


Subjects
Pancreatic Neoplasms , Signal Transduction , Animals , Cell Line, Tumor , Hypoxia/metabolism , Hypoxia-Inducible Factor 1/metabolism , Hypoxia-Inducible Factor 1, alpha Subunit/genetics , Hypoxia-Inducible Factor 1, alpha Subunit/metabolism , Mice , Pancreatic Neoplasms/genetics , Phosphoproteins/genetics , Phosphoproteins/metabolism , RNA Splice Sites , RNA Splicing Factors/genetics , RNA Splicing Factors/metabolism
16.
J Immunol ; 209(6): 1189-1199, 2022 09 15.
Article in English | MEDLINE | ID: mdl-36002234

ABSTRACT

The activation of memory T cells is a very rapid and concerted cellular response that requires coordination between cellular processes in different compartments and on different time scales. In this study, we use ribosome profiling and deep RNA sequencing to define the acute mRNA translation changes in CD8 memory T cells following initial activation events. We find that initial translation enables subsequent events of human and mouse T cell activation and expansion. Briefly, early events in the activation of Ag-experienced CD8 T cells are insensitive to transcriptional blockade with actinomycin D, and instead depend on the translation of pre-existing mRNAs and are blocked by cycloheximide. Ribosome profiling identifies ∼92 mRNAs that are recruited into ribosomes following CD8 T cell stimulation. These mRNAs typically have structured GC and pyrimidine-rich 5' untranslated regions and they encode key regulators of T cell activation and proliferation such as Notch1, Ifngr1, Il2rb, and serine metabolism enzymes Psat1 and Shmt2 (serine hydroxymethyltransferase 2), as well as translation factors eEF1a1 (eukaryotic elongation factor α1) and eEF2 (eukaryotic elongation factor 2). The increased production of receptors of IL-2 and IFN-γ precedes the activation of gene expression and augments cellular signals and T cell activation. Taken together, we identify an early RNA translation program that acts in a feed-forward manner to enable the rapid and dramatic process of CD8 memory T cell expansion and activation.


Subjects
Glycine Hydroxymethyltransferase , Interleukin-2 , 5' Untranslated Regions , Animals , CD8-Positive T-Lymphocytes , Cycloheximide/metabolism , Dactinomycin/metabolism , Glycine Hydroxymethyltransferase/genetics , Glycine Hydroxymethyltransferase/metabolism , Humans , Immunologic Memory , Interleukin-2/metabolism , Lymphocyte Activation , Memory T Cells , Mice , Peptide Elongation Factor 2/genetics , Peptide Elongation Factor 2/metabolism , Peptide Elongation Factors/genetics , Pyrimidines/metabolism , RNA, Messenger/genetics , Serine/genetics
17.
J Comput Biol ; 29(8): 857-866, 2022 08.
Article in English | MEDLINE | ID: mdl-35776515

ABSTRACT

With the constant increase of large-scale genomic data projects, automated and high-throughput quality assessment becomes a crucial component of any analysis. Whereas small projects often have a more homogeneous design and a manageable structure allowing for a manual per-sample analysis of quality, large-scale studies tend to be much more heterogeneous and complex. Many quality metrics have been developed to assess the quality of an individual sample on the raw read level. Degradation effects are typically assessed based on the RNA integrity (RIN) score, or on postalignment data. In this study, we show that single commonly used quality criteria such as the RIN score alone are not sufficient to ensure RNA sample quality. We developed a new approach and provide an efficient tool that estimates RNA sample degradation by computing the 5'/3' bias based on all genes in an alignment-free manner. That enables degradation assessment right after data generation and not during the analysis procedure allowing for early intervention in the sample handling process. Our analysis shows that this strategy is fast, robust to annotation and differences in library size, and provides complementary quality information to RIN scores enabling the accurate identification of degraded samples.


Subjects
RNA Stability , RNA , Genomics , RNA/chemistry , RNA/genetics , Sequence Analysis, RNA/methods
18.
Bioinformatics ; 38(18): 4293-4300, 2022 09 15.
Article in English | MEDLINE | ID: mdl-35900151

ABSTRACT

MOTIVATION: Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing. RESULTS: We developed Single Cell Data Tumor Clusterer (SECEDO, lat. 'to separate'), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. We applied SECEDO to a synthetic dataset simulating 7250 cells and eight tumor subclones from a single patient and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to five real single-cell sequencing datasets from a breast cancer patient, each consisting of ≈2000 cells, SECEDO was able to recover the major clonal composition in each dataset at the original coverage of 0.03×, achieving an Adjusted Rand Index (ARI) score of ≈0.6. The current state-of-the-art SNV-based clustering method achieved an ARI score of ≈0, even after merging cells to create higher coverage data (factor 10 increase), and was only able to match SECEDO's performance when pooling data from all five datasets, in addition to artificially increasing the sequencing coverage by a factor of 7.
Variant calling on the resulting clusters recovered more than twice as many SNVs as would have been detected if calling on all cells together. Further, the allelic ratio of the called SNVs on each subcluster was more than double relative to the allelic ratio of the SNVs called without clustering, thus demonstrating that calling variants on subclones, in addition to both increasing sensitivity of SNV detection and attaching SNVs to subclones, significantly increases the confidence of the called variants. AVAILABILITY AND IMPLEMENTATION: SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo. Instructions to download the data and the evaluation code to reproduce the findings in this paper are available at: https://github.com/ratschlab/secedo-evaluation. The code and data of the submitted version are archived at: https://doi.org/10.5281/zenodo.6516955. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
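
The clustering results above are scored with the Adjusted Rand Index (ARI), which is 1 for identical partitions and close to 0 for a chance-level assignment. For reference, the standard formula in terms of the contingency counts $n_{ij}$, row sums $a_i$, column sums $b_j$ and sample size $n$ is

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\big/\binom{n}{2}}{\tfrac{1}{2}\left[\sum_i\binom{a_i}{2}+\sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\big/\binom{n}{2}}$$

A minimal Python sketch of this formula follows. It is a generic illustration and not code taken from the SECEDO repositories; the function name and the toy labels are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")

    # Pairs of items placed together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # chance agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (together_both - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabelled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance
```

The score is invariant to how cluster labels are named, which is why it suits comparing inferred subclones against a reference grouping.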


Subjects
High-Throughput Nucleotide Sequencing , Neoplasms , Humans , High-Throughput Nucleotide Sequencing/methods , Bayes Theorem , Sequence Analysis, DNA , Genome , Base Sequence , Neoplasms/genetics , Polymorphism, Single Nucleotide
19.
Methods Mol Biol ; 2493: 167-193, 2022.
Article in English | MEDLINE | ID: mdl-35751815

ABSTRACT

Alternative splicing (AS) is a regulatory process during mRNA maturation that shapes higher eukaryotes' complex transcriptomes. High-throughput sequencing of RNA (RNA-Seq) allows for measurements of AS transcripts at an unprecedented depth and diversity. The ever-expanding catalog of known AS events provides biological insights into gene regulation, population genetics, or in the context of disease. Here, we present an overview on the usage of SplAdder, a graph-based alternative splicing toolbox, which can integrate an arbitrarily large number of RNA-Seq alignments and a given annotation file to augment the shared annotation based on RNA-Seq evidence. The shared augmented annotation graph is then used to identify, quantify, and confirm alternative splicing events based on the RNA-Seq data. Splice graphs for individual alignments can also be tested for significant quantitative differences between other samples or groups of samples.


Subjects
Alternative Splicing , RNA , High-Throughput Nucleotide Sequencing , RNA/genetics , RNA-Seq , Sequence Analysis, RNA
20.
Genome Res ; 2022 May 24.
Article in English | MEDLINE | ID: mdl-35609994

ABSTRACT

Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and is faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
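
The core notion above, attaching a quantitative attribute such as a k-mer count to each node-label relation of an annotated de Bruijn graph, can be illustrated with a deliberately naive Python sketch. It uses plain dictionaries rather than the compressed representations the abstract refers to, and all function names and the toy reads are assumptions made for illustration only.

```python
from collections import defaultdict


def build_counting_index(samples, k):
    """Map each k-mer (graph node) to {sample label: count}.

    samples : dict mapping a sample label to a list of read strings
    """
    index = defaultdict(lambda: defaultdict(int))
    for label, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                index[read[i:i + k]][label] += 1
    return index


def successors(index, kmer):
    """Outgoing de Bruijn edges: indexed k-mers overlapping by k-1 characters."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in index]


def query_counts(index, sequence, k):
    """Per-k-mer label counts along a query; {} where the k-mer is absent."""
    kmers = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [dict(index[kmer]) if kmer in index else {} for kmer in kmers]


samples = {"s1": ["ACGTAC"], "s2": ["CGTACG", "ACGT", "ACGT"]}
index = build_counting_index(samples, k=4)
print(dict(index["ACGT"]))                  # {'s1': 1, 's2': 2}
print(successors(index, "ACGT"))            # ['CGTA']
print(query_counts(index, "ACGTT", k=4))    # [{'s1': 1, 's2': 2}, {}]
```

A dictionary of dictionaries like this grows linearly with the number of distinct k-mer and label pairs, which is the cost that succinct graph and annotation encodings aim to reduce.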
