Results 1 - 20 of 85
1.
bioRxiv ; 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38496477

ABSTRACT

The emergence of single-cell time-series datasets enables modeling of changes in various types of cellular profiles over time. However, due to the disruptive nature of single-cell measurements, it is impossible to capture the full temporal trajectory of a particular cell. Furthermore, single-cell profiles can be collected at mismatched time points across different conditions (e.g., sex, batch, disease) and data modalities (e.g., scRNA-seq, scATAC-seq), which makes modeling challenging. Here we propose a joint modeling framework, Sunbear, for integrating multi-condition and multi-modal single-cell profiles across time. Sunbear can be used to impute single-cell temporal profile changes, align multi-dataset and multi-modal profiles across time, and extrapolate single-cell profiles in a missing modality. We applied Sunbear to reveal sex-biased transcription during mouse embryonic development and predict dynamic relationships between epigenetic priming and transcription for cells in which multi-modal profiles are unavailable. Sunbear thus enables the projection of single-cell time-series snapshots to multi-modal and multi-condition views of cellular trajectories.

2.
Bioinformatics ; 39(7), 2023 07 01.
Article in English | MEDLINE | ID: mdl-37421399

ABSTRACT

MOTIVATION: Modality matching in single-cell omics data analysis (i.e., matching cells across datasets collected using different types of genomic assays) has become an important problem, because unifying perspectives across different technologies holds the promise of yielding biological and clinical discoveries. However, single-cell dataset sizes can now reach hundreds of thousands to millions of cells, which remain out of reach for most multimodal computational methods. RESULTS: We propose LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. In LSMMD-MA, we reformulate the MMD-MA optimization problem using linear algebra and solve it with KeOps, a CUDA framework for symbolic matrix computation in Python. We show that LSMMD-MA scales to a million cells in each modality, two orders of magnitude greater than existing implementations. AVAILABILITY AND IMPLEMENTATION: LSMMD-MA is freely available at https://github.com/google-research/large_scale_mmdma and archived at https://doi.org/10.5281/zenodo.8076311.


Subject(s)
Genome , Genomics , Genomics/methods , Research Design , Data Analysis , Single-Cell Analysis , Software
3.
Genome Biol ; 24(1): 143, 2023 06 20.
Article in English | MEDLINE | ID: mdl-37340307

ABSTRACT

BACKGROUND: Single-cell histone post-translational modification (scHPTM) assays such as scCUT&Tag or scChIP-seq allow single-cell mapping of diverse epigenomic landscapes within complex tissues and are likely to unlock our understanding of various mechanisms involved in development or diseases. Running scHPTM experiments and analyzing the data produced remains challenging since few consensus guidelines currently exist regarding good practices for experimental design and data analysis pipelines. RESULTS: We perform a computational benchmark to assess the impact of experimental parameters and data analysis pipelines on the ability of the cell representation to recapitulate known biological similarities. We run more than ten thousand experiments to systematically study the impact of coverage and number of cells, of the count matrix construction method, of feature selection and normalization, and of the dimension reduction algorithm used. This allows us to identify key experimental parameters and computational choices to obtain a good representation of single-cell HPTM data. We show in particular that the count matrix construction step has a strong influence on the quality of the representation and that using fixed-size bin counts outperforms annotation-based binning. Dimension reduction methods based on latent semantic indexing outperform others, and feature selection is detrimental, while keeping only high-quality cells has little influence on the final representation as long as enough cells are analyzed. CONCLUSIONS: This benchmark provides a comprehensive study on how experimental parameters and computational choices affect the representation of single-cell HPTM data. We propose a series of recommendations regarding matrix construction, feature and cell selection, and dimensionality reduction algorithms.
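
As a rough illustration of what "fixed-size bin counts" means in the abstract above, the following minimal sketch builds a cell-by-bin count matrix from read positions on a single chromosome. It is not the benchmark's pipeline: the function name `fixed_bin_counts`, the 0-based positions, and the toy data are assumptions made for this example.

```python
from collections import Counter

def fixed_bin_counts(cells, bin_size, chrom_length):
    """Return a cell-by-bin count matrix for one chromosome.

    cells: list of lists of 0-based read positions, one list per cell
           (hypothetical input format, assumed for illustration).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = []
    for positions in cells:
        # Each read falls into the bin given by integer division.
        counts = Counter(pos // bin_size for pos in positions)
        matrix.append([counts.get(b, 0) for b in range(n_bins)])
    return matrix

# Toy data: two cells, a 1,000 bp chromosome, 250 bp bins.
cells = [[10, 120, 130, 999], [5, 505, 510]]
matrix = fixed_bin_counts(cells, bin_size=250, chrom_length=1000)
for row in matrix:
    print(row)
```

Annotation-based binning would instead count reads over intervals such as genes or peaks; the fixed-size variant needs no annotation, only a bin width.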


Subject(s)
Benchmarking , Histone Code , Algorithms , Cluster Analysis , Single-Cell Analysis
5.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36594573

ABSTRACT

MOTIVATION: We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first, convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other approaches use a maximum likelihood approach, modeling the contact counts between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data is equal to the mean, a relationship that is often too restrictive to properly model count data. RESULTS: We first confirm the presence of overdispersion in several real Hi-C datasets, and we show that the overdispersion arises even in simulated datasets. We then propose a new model, called Pastis-NB, where we replace the Poisson model of contact counts by a negative binomial one, which is parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently from the mean, thus better modeling overdispersed data. We compare the results of Pastis-NB to those of several previously published algorithms, both MDS-based and statistical methods. We show that the negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions. AVAILABILITY AND IMPLEMENTATION: A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
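
The variance argument above can be made concrete with a small sketch. This is not the Pastis-NB implementation; it only compares the log-likelihood of a few overdispersed toy counts under a Poisson model and under a negative binomial with the same mean. The parametrization (variance = mu + mu^2 / r), the method-of-moments estimate of r, and the counts themselves are assumptions chosen for illustration.

```python
import math

def poisson_logpmf(k, mu):
    # Poisson: variance is forced to equal the mean mu.
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def negbin_logpmf(k, mu, r):
    # Negative binomial with mean mu and dispersion r:
    # variance = mu + mu**2 / r, so r tunes the variance separately.
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

counts = [0, 1, 0, 12, 3, 0, 25, 2, 0, 7]  # toy overdispersed counts
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)

# Method-of-moments dispersion; only valid when var > mean.
r = mean ** 2 / (var - mean)

ll_pois = sum(poisson_logpmf(c, mean) for c in counts)
ll_nb = sum(negbin_logpmf(c, mean, r) for c in counts)
print(f"mean={mean:.2f} var={var:.2f} r={r:.3f}")
print(f"Poisson log-likelihood: {ll_pois:.2f}")
print(f"NegBin  log-likelihood: {ll_nb:.2f}")
```

On these toy counts the sample variance is far larger than the mean, and the negative binomial assigns them a higher likelihood than the Poisson does.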


Subject(s)
Algorithms , Genome , Likelihood Functions
6.
Nat Methods ; 20(1): 104-111, 2023 01.
Article in English | MEDLINE | ID: mdl-36522501

ABSTRACT

Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or threefold the alignment correctness over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.


Subject(s)
Algorithms , Proteins , Amino Acid Sequence , Proteins/genetics , Proteins/chemistry , Sequence Alignment , Genomics
7.
Nat Biotechnol ; 41(2): 232-238, 2023 02.
Article in English | MEDLINE | ID: mdl-36050551

ABSTRACT

Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10-25 kilobases), accurate 'HiFi' reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer-encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained for the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
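
For readers unfamiliar with the Q20/Q30/Q40 thresholds quoted above, the sketch below converts between quality scores and error probabilities, assuming the abstract uses the standard Phred scale (Q = -10 log10 p). The function names are illustrative and do not come from DeepConsensus or pbccs.

```python
import math

def phred_to_error(q):
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred-scaled quality score for an error probability p."""
    return -10 * math.log10(p)

for q in (20, 30, 40):
    p = phred_to_error(q)
    print(f"Q{q}: about 1 error in {round(1 / p)} bases")
```

Under this scale, each step of 10 in Q corresponds to a tenfold drop in error probability.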


Subject(s)
High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA
8.
J Comput Biol ; 29(11): 1198-1212, 2022 11.
Article in English | MEDLINE | ID: mdl-36251758

ABSTRACT

Single-cell multi-omics technologies enable comprehensive interrogation of cellular regulation, yet most single-cell assays measure only one type of activity (such as transcription, chromatin accessibility, DNA methylation, or 3D chromatin architecture) for each cell. To enable a multimodal view for individual cells, we propose Polarbear, a semi-supervised machine learning framework that facilitates missing modality profile prediction and single-cell cross-modality alignment. Polarbear learns to translate between modalities by using data from co-assay measurements coupled with the large quantity of single-assay data available in public databases. This semi-supervised scheme mitigates issues related to low cell quantities and high sparsity in co-assay data. Polarbear first pre-trains a beta-variational autoencoder for each modality using both co-assay and single-assay profiles to learn robust representations of individual cells, and it then uses the co-assay labels to train a translator between these cell representations. This semi-supervised framework enables us to predict missing modality profiles and match single cells across modalities with improved accuracy compared with fully supervised methods, thus facilitating multimodal data integration.


Subject(s)
Chromatin , Supervised Machine Learning , Databases, Factual
9.
Nat Commun ; 12(1): 5352, 2021 09 09.
Article in English | MEDLINE | ID: mdl-34504064

ABSTRACT

Systematic DNA sequencing of cancer samples has highlighted the importance of two aspects of cancer genomics: intra-tumor heterogeneity (ITH) and mutational processes. These two aspects may not always be independent, as different mutational processes could be involved in different stages or regions of the tumor, but existing computational approaches to study them largely ignore this potential dependency. Here, we present CloneSig, a computational method to jointly infer ITH and mutational processes in a tumor from bulk-sequencing data. Extensive simulations show that CloneSig outperforms current methods for ITH inference and detection of mutational processes when the distribution of mutational signatures changes between clones. Applied to a large cohort of 8,951 tumors with whole-exome sequencing data from The Cancer Genome Atlas, and to a pan-cancer dataset of 2,632 whole-genome sequencing tumor samples from the Pan-Cancer Analysis of Whole Genomes initiative, CloneSig obtains results that are overall consistent with previous studies.


Subject(s)
Algorithms , Computational Biology/methods , Genetic Heterogeneity , Mutation , Neoplasms/genetics , Whole Genome Sequencing/methods , Genomics/methods , Humans , Polymorphism, Single Nucleotide , ROC Curve , Reproducibility of Results
10.
Nat Commun ; 12(1): 1173, 2021 02 19.
Article in English | MEDLINE | ID: mdl-33608509

ABSTRACT

Antimicrobial resistance is a major global health threat and its development is promoted by antibiotic misuse. While disk diffusion antibiotic susceptibility testing (AST, also called antibiogram) is broadly used to test for antibiotic resistance in bacterial infections, it faces strong criticism because of inter-operator variability and the complexity of interpretative reading. Automatic reading systems address these issues, but are not always adapted or available to resource-limited settings. We present an artificial intelligence (AI)-based, offline smartphone application for antibiogram analysis. The application captures images with the phone's camera, and the user is guided throughout the analysis on the same device by a user-friendly graphical interface. An embedded expert system validates the coherence of the antibiogram data and provides interpreted results. The fully automatic measurement procedure of our application's reading system achieves an overall agreement of 90% on susceptibility categorization against a hospital-standard automatic system and 98% against manual measurement (gold standard), with reduced inter-operator variability. The application's performance showed that the automatic reading of antibiotic resistance testing is entirely feasible on a smartphone. Moreover, our application is suited for resource-limited settings, and therefore has the potential to significantly increase patients' access to AST worldwide.


Subject(s)
Artificial Intelligence , Drug Resistance, Microbial , Microbial Sensitivity Tests/methods , Mobile Applications , Smartphone , Anti-Bacterial Agents/pharmacology , Bacterial Infections , Drug Resistance, Microbial/drug effects , Humans , Image Processing, Computer-Assisted , Machine Learning , Software
11.
PLoS One ; 15(11): e0242927, 2020.
Article in English | MEDLINE | ID: mdl-33253293

ABSTRACT

More and more genome-wide association studies are being designed to uncover the full genetic basis of common diseases. Nonetheless, the resulting loci are often insufficient to fully recover the observed heritability. Epistasis, or gene-gene interaction, is one of many hypotheses put forward to explain this missing heritability. In the present work, we propose epiGWAS, a new approach for epistasis detection that identifies interactions between a target SNP and the rest of the genome. This contrasts with the classical strategy of epistasis detection through exhaustive pairwise SNP testing. We draw inspiration from causal inference in randomized clinical trials, which allows us to take into account linkage disequilibrium. EpiGWAS encompasses several methods, which we compare to state-of-the-art techniques for epistasis detection on simulated and real data. The promising results demonstrate empirically the benefits of epiGWAS for identifying pairwise interactions.


Subject(s)
Epistasis, Genetic/genetics , Genome-Wide Association Study/statistics & numerical data , Linkage Disequilibrium/genetics , Models, Genetic , Algorithms , Humans , Polymorphism, Single Nucleotide/genetics
12.
Bioinformatics ; 36(18): 4774-4780, 2020 09 15.
Article in English | MEDLINE | ID: mdl-33026066

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) offers new possibilities to infer gene regulatory networks (GRNs) for biological processes involving a notion of time, such as cell differentiation or cell cycles. It also raises many challenges due to the destructive measurements inherent to the technology. RESULTS: In this work, we propose a new method named GRISLI for de novo GRN inference from scRNA-seq data. GRISLI infers a velocity vector field in the space of scRNA-seq data from profiles of individual cells, and models the dynamics of cell trajectories with a linear ordinary differential equation to reconstruct the underlying GRN with a sparse regression procedure. We show on real data that GRISLI outperforms a recently proposed state-of-the-art method for GRN reconstruction from scRNA-seq data. AVAILABILITY AND IMPLEMENTATION: The MATLAB code of GRISLI is available at: https://github.com/PCAubin/GRISLI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Gene Regulatory Networks , RNA-Seq , Sequence Analysis, RNA
13.
Genome Biol ; 21(1): 212, 2020 08 24.
Article in English | MEDLINE | ID: mdl-32831127

ABSTRACT

BACKGROUND: Many computational methods have been developed recently to analyze single-cell RNA-seq (scRNA-seq) data. Several benchmark studies have compared these methods on their ability for dimensionality reduction, clustering, or differential analysis, often relying on default parameters. Yet, given the biological diversity of scRNA-seq datasets, parameter tuning might be essential for the optimal usage of methods, and determining how to tune parameters remains an unmet need. RESULTS: Here, we propose a benchmark to assess the performance of five methods, systematically varying their tunable parameters, for dimension reduction of scRNA-seq data, a common first step to many downstream applications such as cell type identification or trajectory inference. We run a total of 1.5 million experiments to assess the influence of parameter changes on the performance of each method, and propose two strategies to automatically tune parameters for methods that need it. CONCLUSIONS: We find that principal component analysis (PCA)-based methods like scran and Seurat are competitive with default parameters but do not benefit much from parameter tuning, while more complex models like ZinbWave, DCA, and scVI can reach better performance, but only after parameter tuning.


Subject(s)
RNA-Seq , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Benchmarking , Biodiversity , Cluster Analysis , Humans , Principal Component Analysis , RNA, Small Cytoplasmic
14.
Cell Rep ; 30(6): 1767-1779.e6, 2020 02 11.
Article in English | MEDLINE | ID: mdl-32049009

ABSTRACT

EWSR1-FLI1, the chimeric oncogene specific for Ewing sarcoma (EwS), induces a cascade of signaling events leading to cell transformation. However, it remains elusive how genetically homogeneous EwS cells can drive the heterogeneity of transcriptional programs. Here, we combine independent component analysis of single-cell RNA sequencing data from diverse cell types and model systems with time-resolved mapping of EWSR1-FLI1 binding sites and of open chromatin regions to characterize dynamic cellular processes associated with EWSR1-FLI1 activity. We thus define an exquisitely specific and direct enhancer-driven EWSR1-FLI1 program. In EwS tumors, cell proliferation and strong oxidative phosphorylation metabolism are associated with a well-defined range of EWSR1-FLI1 activity. In contrast, a subpopulation of cells from below and above the intermediary EWSR1-FLI1 activity is characterized by increased hypoxia. Overall, our study reveals sources of intratumoral heterogeneity within EwS tumors.


Subject(s)
Gene Expression Regulation, Neoplastic/genetics , RNA-Binding Protein EWS/metabolism , Sarcoma, Ewing/genetics , Transcription, Genetic/genetics , Cell Line, Tumor , Humans , Signal Transduction
15.
Nucleic Acids Res ; 48(5): 2303-2311, 2020 03 18.
Article in English | MEDLINE | ID: mdl-32034421

ABSTRACT

Chromatin conformation assays such as Hi-C cannot directly measure differences in 3D architecture between cell types or cell states. For this purpose, two or more Hi-C experiments must be carried out, but direct comparison of the resulting Hi-C matrices is confounded by several features of Hi-C data. Most notably, the genomic distance effect, whereby contacts between pairs of genomic loci that are proximal along the chromosome exhibit many more Hi-C contacts than distal pairs of loci, dominates every Hi-C matrix. Furthermore, the form that this distance effect takes often varies between different Hi-C experiments, even between replicate experiments. Thus, a statistical confidence measure designed to identify differential Hi-C contacts must accurately account for the genomic distance effect or risk being misled by large-scale but artifactual differences. ACCOST (Altered Chromatin COnformation STatistics) accomplishes this goal by extending the statistical model employed by DESeq, re-purposing the 'size factors,' which were originally developed to account for differences in read depth between samples, to instead model the genomic distance effect. We show via analysis of simulated and real data that ACCOST provides unbiased statistical confidence estimates that compare favorably with competing methods such as diffHiC, FIND and HiCcompare. ACCOST is freely available with an Apache license at https://bitbucket.org/noblelab/accost.
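
The genomic distance effect described above can be visualized with a minimal sketch that averages contact counts along each diagonal of a Hi-C matrix and divides observed counts by that distance-specific expectation. This is a common descriptive normalization, not the ACCOST statistical model; the function names and the toy 4 x 4 matrix are assumptions made for illustration.

```python
def mean_contacts_by_distance(matrix):
    """Average contact count for each genomic distance d = |i - j|."""
    n = len(matrix)
    means = []
    for d in range(n):
        # The d-th diagonal holds all pairs of bins separated by d.
        diagonal = [matrix[i][i + d] for i in range(n - d)]
        means.append(sum(diagonal) / len(diagonal))
    return means

def observed_over_expected(matrix):
    """Divide each entry by the mean count at its genomic distance."""
    expected = mean_contacts_by_distance(matrix)
    n = len(matrix)
    return [
        [matrix[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]

# Toy symmetric contact matrix: counts fall off away from the diagonal.
toy = [
    [50, 20, 5, 1],
    [20, 48, 22, 4],
    [5, 22, 52, 18],
    [1, 4, 18, 50],
]
print(mean_contacts_by_distance(toy))
```

Because the decay curve can differ between experiments and even between replicates, a comparison of two matrices would compute this expectation separately for each one.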


Subject(s)
Chromatin/chemistry , DNA/chemistry , Genetic Loci , Genome , Software , Animals , Cell Line , Chromatin/metabolism , DNA/metabolism , Epistasis, Genetic , Epithelial Cells/cytology , Epithelial Cells/metabolism , Humans , Lymphocytes/cytology , Lymphocytes/metabolism , Mice , Molecular Conformation , Plasmodium falciparum/genetics , Sporozoites/genetics , Trophozoites/genetics
16.
PLoS One ; 14(11): e0224143, 2019.
Article in English | MEDLINE | ID: mdl-31697689

ABSTRACT

Tumors are made of evolving and heterogeneous populations of cells which arise from successive appearance and expansion of subclonal populations, following acquisition of mutations conferring them a selective advantage. Those subclonal populations can be sensitive or resistant to different treatments, and provide information about tumor aetiology and future evolution. Hence, it is important to be able to assess the level of heterogeneity of tumors with high reliability for clinical applications. In the past few years, a large number of methods have been proposed to estimate intra-tumor heterogeneity from whole exome sequencing (WES) data, but the accuracy and robustness of these methods on real data remain elusive. Here we systematically apply and compare 6 computational methods to estimate tumor heterogeneity on 1,697 WES samples from The Cancer Genome Atlas (TCGA) covering 3 cancer types (breast invasive carcinoma, bladder urothelial carcinoma, and head and neck squamous cell carcinoma), and two distinct input mutation sets. We observe significant differences between the estimates produced by different methods, and identify several likely confounding factors in heterogeneity assessment for the different methods. We further show that the prognostic value of tumor heterogeneity for survival prediction is limited in those datasets, and find no evidence that it improves over prognosis based on other clinical variables. In conclusion, heterogeneity inference from WES data on a single sample, and its use in cancer prognosis, should be considered with caution. Other approaches to assess intra-tumoral heterogeneity such as those based on multiple samples may be preferable for clinical applications.


Subject(s)
DNA Copy Number Variations/genetics , Exome Sequencing , Genetic Heterogeneity , Genome, Human/genetics , Algorithms , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Computational Biology , Exome/genetics , Female , Humans , Mutation , Squamous Cell Carcinoma of Head and Neck/genetics , Squamous Cell Carcinoma of Head and Neck/pathology , Urinary Bladder Neoplasms/genetics , Urinary Bladder Neoplasms/pathology
17.
PLoS Comput Biol ; 15(9): e1007381, 2019 09.
Article in English | MEDLINE | ID: mdl-31568528

ABSTRACT

Cancer driver genes, i.e., oncogenes and tumor suppressor genes, are involved in the acquisition of important functions in tumors, providing a selective growth advantage, allowing uncontrolled proliferation and avoiding apoptosis. It is therefore important to identify these driver genes, both for the fundamental understanding of cancer and to help find new therapeutic targets or biomarkers. Although the most frequently mutated driver genes have been identified, it is believed that many more remain to be discovered, particularly for driver genes specific to some cancer types. In this paper, we propose a new computational method called LOTUS to predict new driver genes. LOTUS is a machine learning-based approach that integrates various types of data in a versatile manner, including information about gene mutations and protein-protein interactions. In addition, LOTUS can predict cancer driver genes in a pan-cancer setting as well as for specific cancer types, using a multitask learning strategy to share information across cancer types. We empirically show that LOTUS outperforms five other state-of-the-art driver gene prediction methods, both in terms of intrinsic consistency and prediction accuracy, and provide predictions of new cancer genes across many cancer types.


Subject(s)
Algorithms , Computational Biology/methods , Machine Learning , Neoplasms/genetics , Oncogenes/genetics , Software , Humans , Models, Statistical
18.
Nat Commun ; 10(1): 646, 2019 02 04.
Article in English | MEDLINE | ID: mdl-30718493

ABSTRACT

The original PDF version of this Article contained errors in two equations. In Eq. (1), all Γ symbols were inadvertently omitted. In the second equation in the subsection entitled '1. Dispersion optimization' within the Methods section 'ZINB-WaVE estimation procedure', all Ψ symbols were inadvertently omitted. These errors have been corrected in the PDF version of the Article; the HTML version was correct from the time of publication.

19.
J Comput Biol ; 26(6): 509-518, 2019 06.
Article in English | MEDLINE | ID: mdl-30785347

ABSTRACT

We propose a new model for fast classification of DNA sequences output by next-generation sequencing machines. The model, which we call fastDNA, embeds DNA sequences in a vector space by learning continuous low-dimensional representations of the k-mers they contain. We show on metagenomics benchmarks that it outperforms the state-of-the-art methods in terms of accuracy and scalability.
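
To make the idea of embedding a read through its k-mers more tangible, here is a minimal sketch that splits a sequence into overlapping k-mers and averages one vector per k-mer. It is not the fastDNA code: averaging is an assumption borrowed from bag-of-words text models, the random vectors stand in for representations that a real model would learn, and the names `kmers` and `embed` are hypothetical.

```python
import random

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed(seq, k, table, dim, rng):
    """Average the vectors of the k-mers in seq.

    table maps k-mer -> vector. Unknown k-mers get a random vector here,
    a placeholder for a learned embedding (assumption).
    """
    words = kmers(seq, k)
    vec = [0.0] * dim
    for w in words:
        if w not in table:
            table[w] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for j, x in enumerate(table[w]):
            vec[j] += x
    return [v / len(words) for v in vec] if words else vec

rng = random.Random(0)
table = {}
read_vector = embed("ACGTACGT", k=3, table=table, dim=4, rng=rng)
print(kmers("ACGTAC", 3))
print(len(read_vector), len(table))
```

A classifier would then be trained on top of such sequence vectors, with the k-mer table updated jointly during training.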


Subject(s)
Metagenomics/methods , Sequence Analysis, DNA/methods , DNA/genetics , High-Throughput Nucleotide Sequencing/methods
20.
Algorithms Bioinform ; 143, 2019 Sep 03.
Article in English | MEDLINE | ID: mdl-34632462

ABSTRACT

Many single-cell sequencing technologies are now available, but it is still difficult to apply multiple sequencing technologies to the same single cell. In this paper, we propose an unsupervised manifold alignment algorithm, MMD-MA, for integrating multiple measurements carried out on disjoint aliquots of a given population of cells. Effectively, MMD-MA performs an in silico co-assay by embedding cells measured in different ways into a learned latent space. In the MMD-MA algorithm, single-cell data points from multiple domains are aligned by optimizing an objective function with three components: (1) a maximum mean discrepancy (MMD) term to encourage the differently measured points to have similar distributions in the latent space, (2) a distortion term to preserve the structure of the data between the input space and the latent space, and (3) a penalty term to avoid collapse to a trivial solution. Notably, MMD-MA does not require any correspondence information across data modalities, either between the cells or between the features. Furthermore, MMD-MA's weak distributional requirements for the domains to be aligned allow the algorithm to integrate heterogeneous types of single cell measures, such as gene expression, DNA accessibility, chromatin organization, methylation, and imaging data. We demonstrate the utility of MMD-MA in simulation experiments and using a real data set involving single-cell gene expression and methylation data.
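
The first of the three objective components listed above, the maximum mean discrepancy term, can be sketched as follows. This shows only a biased estimate of squared MMD between two point sets in a shared latent space; the distortion and penalty terms, the learned mappings, and the optimization are not shown. The Gaussian kernel, its bandwidth `sigma`, and the toy points are assumptions, not details taken from the abstract.

```python
import math

def gaussian_kernel(x, y, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2 * mean_kernel(xs, ys, sigma))

# Toy latent points from two "modalities".
a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
b_close = [[0.1, 0.0], [0.9, 0.1], [0.0, 1.1]]
b_far = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
print(mmd2(a, b_close), mmd2(a, b_far))
```

The term is small when the two sets of embedded points have similar distributions and grows as they separate, which is why minimizing it pulls the modalities together in the latent space.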
