Search | VHL Regional Portal (test)

1.

Collaborative learning from distributed data with differentially private synthetic data.

Prediger, Lukas; Jälkö, Joonas; Honkela, Antti; Kaski, Samuel.

BMC Med Inform Decis Mak ; 24(1): 167, 2024 Jun 14.

Article in English | MEDLINE | ID: mdl-38877563

ABSTRACT

BACKGROUND: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank. METHODS: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores. RESULTS: We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups. CONCLUSIONS: Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.

Subjects

Information Dissemination , Humans , United Kingdom , Cooperative Behavior , Confidentiality/standards , Privacy , Biological Specimen Banks , Prospective Studies

2.

Digital public health leadership in the global fight for health security.

AlKnawy, Bandar; Kozlakidis, Zisis; Tarkoma, Sasu; Bates, David; Honkela, Antti; Crooks, George; Rhee, Kyu; McKillop, Mollie.

BMJ Glob Health ; 8(2)2023 02.

Article in English | MEDLINE | ID: mdl-36792230

ABSTRACT

The COVID-19 pandemic highlighted the need to prioritise mature digital health and data governance at both national and supranational levels to guarantee future health security. The Riyadh Declaration on Digital Health was a call to action to create the infrastructure needed to share effective digital health evidence-based practices and high-quality, real-time data locally and globally to provide actionable information to more health systems and countries. The declaration proposed nine key recommendations for data and digital health that need to be adopted by the global health community to address future pandemics and health threats. Here, we expand on each recommendation and provide an evidence-based roadmap for their implementation. This policy document serves as a resource and toolkit that all stakeholders in digital health and disaster preparedness can follow to develop digital infrastructure and protocols in readiness for future health threats through robust digital public health leadership.

Subjects

COVID-19 , Public Health , Humans , Leadership , Pandemics/prevention & control , Global Health

3.

Strong pathogen competition in neonatal gut colonisation.

Mäklin, Tommi; Thorpe, Harry A; Pöntinen, Anna K; Gladstone, Rebecca A; Shao, Yan; Pesonen, Maiju; McNally, Alan; Johnsen, Pål J; Samuelsen, Ørjan; Lawley, Trevor D; Honkela, Antti; Corander, Jukka.

Nat Commun ; 13(1): 7417, 2022 12 01.

Article in English | MEDLINE | ID: mdl-36456554

ABSTRACT

Opportunistic bacterial pathogen species and their strains that colonise the human gut are generally understood to compete against both each other and the commensal species colonising this ecosystem. Currently we are lacking a population-wide quantification of strain-level colonisation dynamics and the relationship of colonisation potential to prevalence in disease, and how ecological factors might be modulating these. Here, using a combination of latest high-resolution metagenomics and strain-level genomic epidemiology methods we performed a characterisation of the competition and colonisation dynamics for a longitudinal cohort of neonatal gut microbiomes. We found strong inter- and intra-species competition dynamics in the gut colonisation process, but also a number of synergistic relationships among several species belonging to genus Klebsiella, which includes the prominent human pathogen Klebsiella pneumoniae. No evidence of preferential colonisation by hospital-adapted pathogen lineages in either vaginal or caesarean section birth groups was detected. Our analysis further enabled unbiased assessment of strain-level colonisation potential of extra-intestinal pathogenic Escherichia coli (ExPEC) in comparison with their propensity to cause bloodstream infections. Our study highlights the importance of systematic surveillance of bacterial gut pathogens, not only from disease but also from carriage state, to better inform therapies and preventive medicine in the future.

Subjects

Cesarean Section , Ecosystem , Female , Pregnancy , Infant, Newborn , Humans , Klebsiella , Metagenomics , Parturition , Escherichia coli/genetics

4.

Bacterial genomic epidemiology with mixed samples.

Mäklin, Tommi; Kallonen, Teemu; Alanko, Jarno; Samuelsen, Ørjan; Hegstad, Kristin; Mäkinen, Veli; Corander, Jukka; Heinz, Eva; Honkela, Antti.

Microb Genom ; 7(11)2021 11.

Article in English | MEDLINE | ID: mdl-34779765

ABSTRACT

Genomic epidemiology is a tool for tracing transmission of pathogens based on whole-genome sequencing. We introduce the mGEMS pipeline for genomic epidemiology with plate sweeps representing mixed samples of a target pathogen, opening the possibility to sequence all colonies on selective plates with a single DNA extraction and sequencing step. The pipeline includes the novel mGEMS read binner for probabilistic assignments of sequencing reads, and the scalable pseudoaligner Themisto. We demonstrate the effectiveness of our approach using closely related samples in a nosocomial setting, obtaining results that are comparable to those based on single-colony picks. Our results lend firm support to more widespread consideration of genomic epidemiology with mixed infection samples.

Subjects

Genome, Bacterial , Genomics , Sequence Analysis , Whole Genome Sequencing

5.

Privacy-preserving data sharing via probabilistic modeling.

Jälkö, Joonas; Lagerspetz, Eemil; Haukka, Jari; Tarkoma, Sasu; Honkela, Antti; Kaski, Samuel.

Patterns (N Y) ; 2(7): 100271, 2021 Jul 09.

Article in English | MEDLINE | ID: mdl-34286296

ABSTRACT

Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing what kind of synthetic data. We propose formulating the problem of private data release through probabilistic modeling. This approach transforms the problem of designing the synthetic data into choosing a model for the data, allowing also the inclusion of prior knowledge, which improves the quality of the synthetic data. We demonstrate empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data. We expect the method to have broad use in creating high-quality anonymized data twins of key datasets for research.
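The statement that repeated accesses incur increasing loss can be illustrated with a minimal sketch. It assumes the Laplace mechanism and basic sequential composition (the epsilons of successive queries add up); the names `PrivacyAccountant` and `noisy_count` are hypothetical and are not taken from the paper, whose own method releases synthetic data instead of answering queries one by one.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class PrivacyAccountant:
    """Tracks cumulative privacy loss under basic sequential composition."""

    def __init__(self, budget):
        self.budget = budget  # total epsilon the data holder is willing to spend
        self.spent = 0.0      # epsilon consumed by the queries answered so far

    def charge(self, epsilon):
        # Under basic composition the losses of successive accesses add up.
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def noisy_count(records, predicate, epsilon, accountant, rng):
    """Answer a counting query with epsilon-differential privacy.

    A count changes by at most 1 when one record is added or removed,
    so Laplace noise with scale 1/epsilon suffices.
    """
    accountant.charge(epsilon)
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

With a budget of 1.0, four queries at epsilon 0.25 each use the budget up, and a fifth raises an error. A synthetic data set released once under the same budget could instead be queried any number of times at no further privacy cost.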

6.

Closing Clostridium botulinum Group III Genomes Using Long-Read Sequencing.

Woudstra, Cedric; Mäklin, Tommi; Derman, Yagmur; Bano, Luca; Skarin, Hanna; Mazuet, Christelle; Honkela, Antti; Lindström, Miia.

Microbiol Resour Announc ; 10(22): e0136420, 2021 Jun 03.

Article in English | MEDLINE | ID: mdl-34080898

ABSTRACT

Clostridium botulinum group III is the anaerobic Gram-positive bacterium producing the deadly neurotoxin responsible for animal botulism. Here, we used long-read sequencing to produce four complete genomes from Clostridium botulinum group III neurotoxin types C, D, C/D, and D/C. The protocol for obtaining high-molecular-weight DNA from C. botulinum group III is described.

7.

High-resolution sweep metagenomics using fast probabilistic inference.

Mäklin, Tommi; Kallonen, Teemu; David, Sophia; Boinett, Christine J; Pascoe, Ben; Méric, Guillaume; Aanensen, David M; Feil, Edward J; Baker, Stephen; Parkhill, Julian; Sheppard, Samuel K; Corander, Jukka; Honkela, Antti.

Wellcome Open Res ; 5: 14, 2020.

Article in English | MEDLINE | ID: mdl-34746439

ABSTRACT

Determining the composition of bacterial communities beyond the level of a genus or species is challenging because of the considerable overlap between genomes representing close relatives. Here, we present the mSWEEP pipeline for identifying and estimating the relative sequence abundances of bacterial lineages from plate sweeps of enrichment cultures. mSWEEP leverages biologically grouped sequence assembly databases, applying probabilistic modelling, and provides controls for false positive results. Using sequencing data from major pathogens, we demonstrate significant improvements in lineage quantification and detection accuracy. Our pipeline facilitates investigating cultures comprising mixtures of bacteria, and opens up a new field of plate sweep metagenomics.

8.

Representation transfer for differentially private drug sensitivity prediction.

Niinimäki, Teppo; Heikkilä, Mikko A; Honkela, Antti; Kaski, Samuel.

Bioinformatics ; 35(14): i218-i224, 2019 07 15.

Article in English | MEDLINE | ID: mdl-31510659

ABSTRACT

MOTIVATION: Human genomic datasets often contain sensitive information that limits use and sharing of the data. In particular, simple anonymization strategies fail to provide sufficient level of protection for genomic data, because the data are inherently identifiable. Differentially private machine learning can help by guaranteeing that the published results do not leak too much information about any individual data point. Recent research has reached promising results on differentially private drug sensitivity prediction using gene expression data. Differentially private learning with genomic data is challenging because it is more difficult to guarantee privacy in high dimensions. Dimensionality reduction can help, but if the dimension reduction mapping is learned from the data, then it needs to be differentially private too, which can carry a significant privacy cost. Furthermore, the selection of any hyperparameters (such as the target dimensionality) needs to also avoid leaking private information. RESULTS: We study an approach that uses a large public dataset of similar type to learn a compact representation for differentially private learning. We compare three representation learning methods: variational autoencoders, principal component analysis and random projection. We solve two machine learning tasks on gene expression of cancer cell lines: cancer type classification, and drug sensitivity prediction. The experiments demonstrate significant benefit from all representation learning methods with variational autoencoders providing the most accurate predictions most often. Our results significantly improve over previous state-of-the-art in accuracy of differentially private drug sensitivity prediction. AVAILABILITY AND IMPLEMENTATION: Code used in the experiments is available at https://github.com/DPBayes/dp-representation-transfer.

Subjects

Machine Learning , Humans , Neoplasms

9.

Seasonal Variation in Genome-Wide DNA Methylation Patterns and the Onset of Seasonal Timing of Reproduction in Great Tits.

Viitaniemi, Heidi M; Verhagen, Irene; Visser, Marcel E; Honkela, Antti; van Oers, Kees; Husby, Arild.

Genome Biol Evol ; 11(3): 970-983, 2019 03 01.

Article in English | MEDLINE | ID: mdl-30840074

ABSTRACT

In seasonal environments, timing of reproduction is a trait with important fitness consequences, but we know little about the molecular mechanisms that underlie the variation in this trait. Recently, several studies put forward DNA methylation as a mechanism regulating seasonal timing of reproduction in both plants and animals. To understand the involvement of DNA methylation in seasonal timing of reproduction, it is necessary to examine within-individual temporal changes in DNA methylation, but such studies are very rare. Here, we use a temporal sampling approach to examine changes in DNA methylation throughout the breeding season in female great tits (Parus major) that were artificially selected for early timing of breeding. These females were housed in climate-controlled aviaries and subjected to two contrasting temperature treatments. Reduced representation bisulfite sequencing on red blood cell derived DNA showed genome-wide temporal changes in more than 40,000 out of the 522,643 CpG sites examined. Although most of these changes were relatively small (mean within-individual change of 6%), the sites that showed a temporal and treatment-specific response in DNA methylation are candidate sites of interest for future studies trying to understand the link between DNA methylation patterns and timing of reproduction.

Subjects

DNA Methylation , Reproduction , Seasons , Songbirds/metabolism , Animals , Epigenesis, Genetic , Female , Songbirds/genetics , Temperature

10.

GPrank: an R package for detecting dynamic elements from genome-wide time series.

Topa, Hande; Honkela, Antti.

BMC Bioinformatics ; 19(1): 367, 2018 Oct 04.

Article in English | MEDLINE | ID: mdl-30286713

ABSTRACT

BACKGROUND: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis. RESULTS: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by data under two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and temporally most dynamic elements can be identified. CONCLUSIONS: Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of temporally most active dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes.
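The model comparison described in this abstract can be sketched in a few lines of Python. This is an illustration only, not the GPrank implementation (GPrank is an R package). The sketch assumes a squared-exponential kernel for the time-dependent model, a noise-only covariance for the time-independent model, and fixed hyperparameters (`signal_var`, `length_scale`, `white_var` are hypothetical defaults), whereas a real analysis would fit them to the data.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def log_marginal_likelihood(y, cov):
    """log N(y | 0, cov), computed through the Cholesky factor."""
    n = len(y)
    low = cholesky(cov)
    z = []  # forward substitution: solve low @ z = y
    for i in range(n):
        z.append((y[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (quad + logdet + n * math.log(2.0 * math.pi))


def log_bayes_factor(t, y, obs_var, signal_var=1.0, length_scale=1.0,
                     white_var=0.01):
    """Log Bayes factor of the time-dependent over the time-independent model.

    obs_var holds the per-time-point variances obtained during pre-processing;
    they enter both models as fixed diagonal noise terms.
    """
    n = len(t)
    mean = sum(y) / n
    yc = [v - mean for v in y]  # centre the series so a zero-mean GP applies
    noise = [[obs_var[i] + white_var if i == j else 0.0 for j in range(n)]
             for i in range(n)]
    dependent = [[noise[i][j] + signal_var
                  * math.exp(-0.5 * ((t[i] - t[j]) / length_scale) ** 2)
                  for j in range(n)] for i in range(n)]
    return log_marginal_likelihood(yc, dependent) - log_marginal_likelihood(yc, noise)
```

Ranking genomic elements by this quantity places series with a clear temporal trend above flat ones, and series with large `obs_var` need stronger trends to reach the same score.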

Subjects

Genetic Variation/genetics , Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software

11.

Identifying Bacterial Strains from Sequencing Data.

Mäklin, Tommi; Corander, Jukka; Honkela, Antti.

Methods Mol Biol ; 1807: 1-7, 2018.

Article in English | MEDLINE | ID: mdl-30030799

ABSTRACT

Environmental and clinical settings can host a wide variety of both bacterial species and strains in a single colony but accurate identification of the organisms is difficult. We describe BIB, a probabilistic method for estimating the relative abundances of species or strains contained in mixed samples analyzed by short read high-throughput sequencing. By grouping closely related strains together in clusters, the BIB pipeline is capable of estimating the relative abundances of the clusters contained in a sequencing sample.

Subjects

Bacteria/genetics , High-Throughput Nucleotide Sequencing/methods , Base Sequence , Genome, Bacterial , Sequence Alignment

12.

Efficient differentially private learning improves drug sensitivity prediction.

Honkela, Antti; Das, Mrinal; Nieminen, Arttu; Dikmen, Onur; Kaski, Samuel.

Biol Direct ; 13(1): 1, 2018 02 06.

Article in English | MEDLINE | ID: mdl-29409513

ABSTRACT

BACKGROUND: Users of a personalised recommendation system face a dilemma: recommendations can be improved by learning from data, but only if other users are willing to share their private information. Good personalised predictions are vitally important in precision medicine, but genomic information on which the predictions are based is also particularly sensitive, as it directly identifies the patients and hence cannot easily be anonymised. Differential privacy has emerged as a potentially promising solution: privacy is considered sufficient if presence of individual patients cannot be distinguished. However, differentially private learning with current methods does not improve predictions with feasible data sizes and dimensionalities. RESULTS: We show that useful predictors can be learned under powerful differential privacy guarantees, and even from moderately-sized data sets, by demonstrating significant improvements in the accuracy of private drug sensitivity prediction with a new robust private regression method. Our method matches the predictive accuracy of the state-of-the-art non-private lasso regression using only 4x more samples under relatively strong differential privacy guarantees. Good performance with limited data is achieved by limiting the sharing of private information by decreasing the dimensionality and by projecting outliers to fit tighter bounds, therefore needing to add less noise for equal privacy. CONCLUSIONS: The proposed differentially private regression method combines theoretical appeal and asymptotic efficiency with good prediction accuracy even with moderate-sized data. As already the simple-to-implement method shows promise on the challenging genomic data, we anticipate rapid progress towards practical applications in many fields. REVIEWERS: This article was reviewed by Zoltan Gaspari and David Kreil.
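The idea of projecting outliers to tighter bounds so that less noise is needed can be illustrated with a toy example. This is not the paper's regression method: it is a sketch of a differentially private sum under the Laplace mechanism, assuming neighbouring data sets differ by adding or removing one record. The function names are hypothetical.

```python
import random


def clip(value, bound):
    """Project a value onto the interval [-bound, bound]."""
    return max(-bound, min(bound, value))


def private_sum(values, bound, epsilon, rng):
    """Return (noisy sum, noise scale) for an epsilon-DP sum of clipped values.

    After clipping, one record can change the sum by at most `bound`,
    so the Laplace noise scale is bound / epsilon.
    """
    scale = bound / epsilon
    # Laplace(0, scale) sampled as the difference of two exponential draws.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(clip(v, bound) for v in values) + noise, scale
```

Halving `bound` halves the noise scale at the same epsilon, at the price of some bias from the clipped outliers. Choosing the bound is therefore a trade-off between the two.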

Subjects

Models, Theoretical , Algorithms , Female , Humans , Male , Privacy

13.

Predicting stimulation-dependent enhancer-promoter interactions from ChIP-Seq time course data.

Dzida, Tomasz; Iqbal, Mudassar; Charapitsa, Iryna; Reid, George; Stunnenberg, Henk; Matarese, Filomena; Grote, Korbinian; Honkela, Antti; Rattray, Magnus.

PeerJ ; 5: e3742, 2017.

Article in English | MEDLINE | ID: mdl-28970965

ABSTRACT

We have developed a machine learning approach to predict stimulation-dependent enhancer-promoter interactions using evidence from changes in genomic protein occupancy over time. The occupancy of estrogen receptor alpha (ERα), RNA polymerase (Pol II) and histone marks H2AZ and H3K4me3 were measured over time using ChIP-Seq experiments in MCF7 cells stimulated with estrogen. A Bayesian classifier was developed which uses the correlation of temporal binding patterns at enhancers and promoters and genomic proximity as features to predict interactions. This method was trained using experimentally determined interactions from the same system and was shown to achieve much higher precision than predictions based on the genomic proximity of nearest ERα binding. We use the method to identify a genome-wide confident set of ERα target genes and their regulatory enhancers genome-wide. Validation with publicly available GRO-Seq data demonstrates that our predicted targets are much more likely to show early nascent transcription than predictions based on genomic ERα binding proximity alone.

14.

Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes.

Lees, John A; Vehkala, Minna; Välimäki, Niko; Harris, Simon R; Chewapreecha, Claire; Croucher, Nicholas J; Marttinen, Pekka; Davies, Mark R; Steer, Andrew C; Tong, Steven Y C; Honkela, Antti; Parkhill, Julian; Bentley, Stephen D; Corander, Jukka.

Nat Commun ; 7: 12797, 2016 09 16.

Article in English | MEDLINE | ID: mdl-27633831

ABSTRACT

Bacterial genomes vary extensively in terms of both gene content and gene sequence. This plasticity hampers the use of traditional SNP-based methods for identifying all genetic associations with phenotypic variation. Here we introduce a computationally scalable and widely applicable statistical method (SEER) for the identification of sequence elements that are significantly enriched in a phenotype of interest. SEER is applicable to tens of thousands of genomes by counting variable-length k-mers using a distributed string-mining algorithm. Robust options are provided for association analysis that also correct for the clonal population structure of bacteria. Using large collections of genomes of the major human pathogens Streptococcus pneumoniae and Streptococcus pyogenes, SEER identifies relevant previously characterized resistance determinants for several antibiotics and discovers potential novel factors related to the invasiveness of S. pyogenes. We thus demonstrate that our method can answer important biologically and medically relevant questions.
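A heavily simplified sketch of k-mer enrichment testing follows. It is not SEER itself: it assumes fixed-length k-mers, a binary phenotype, and a one-sided Fisher exact test on presence/absence counts, and it omits both the population-structure correction and any multiple-testing adjustment that a real analysis needs. All function names are hypothetical.

```python
from math import comb


def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact p-value for enrichment among cases.

    a: cases with the k-mer, b: cases without,
    c: controls with the k-mer, d: controls without.
    """
    n_cases, n_with, total = a + b, a + c, a + b + c + d
    upper = min(n_cases, n_with)
    tail = sum(comb(n_cases, x) * comb(total - n_cases, n_with - x)
               for x in range(a, upper + 1))
    return tail / comb(total, n_with)


def enriched_kmers(genomes, phenotypes, k, alpha=0.05):
    """Map each k-mer with p-value <= alpha to that p-value."""
    presence = [kmers(g, k) for g in genomes]
    n_cases = sum(1 for y in phenotypes if y)
    n_controls = len(phenotypes) - n_cases
    hits = {}
    for kmer in set().union(*presence):
        a = sum(1 for p, y in zip(presence, phenotypes) if y and kmer in p)
        c = sum(1 for p, y in zip(presence, phenotypes) if not y and kmer in p)
        p_value = fisher_enrichment_p(a, n_cases - a, c, n_controls - c)
        if p_value <= alpha:
            hits[kmer] = p_value
    return hits
```

For example, a k-mer present in all four of four case genomes and in none of four controls gets a p-value of 1/70.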

Subjects

DNA, Bacterial/genetics , Streptococcus pneumoniae/genetics , Streptococcus pyogenes/genetics , Computer Simulation , Genome, Bacterial , Genome-Wide Association Study , Models, Genetic , Nucleic Acid Amplification Techniques

15.

Analysis of differential splicing suggests different modes of short-term splicing regulation.

Topa, Hande; Honkela, Antti.

Bioinformatics ; 32(12): i147-i155, 2016 06 15.

Article in English | MEDLINE | ID: mdl-27307611

ABSTRACT

MOTIVATION: Alternative splicing is an important mechanism in which the regions of pre-mRNAs are differentially joined in order to form different transcript isoforms. Alternative splicing is involved in the regulation of normal physiological functions but also linked to the development of diseases such as cancer. We analyse differential expression and splicing using RNA-sequencing time series in three different settings: overall gene expression levels, absolute transcript expression levels and relative transcript expression levels. RESULTS: Using estrogen receptor α signaling response as a model system, our Gaussian process-based test identifies genes with differential splicing and/or differentially expressed transcripts. We discover genes with consistent changes in alternative splicing independent of changes in absolute expression and genes where some transcripts change whereas others stay constant in absolute level. The results suggest classes of genes with different modes of alternative splicing regulation during the experiment. AVAILABILITY AND IMPLEMENTATION: R and Matlab codes implementing the method are available at https://github.com/PROBIC/diffsplicing. An interactive browser for viewing all model fits is available at http://users.ics.aalto.fi/hande/splicingGP/. CONTACT: hande.topa@helsinki.fi or antti.honkela@helsinki.fi. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Alternative Splicing , Gene Expression Profiling , Humans , Protein Isoforms , RNA Precursors , Sequence Analysis, RNA

16.

Bayesian identification of bacterial strains from sequencing data.

Sankar, Aravind; Malone, Brandon; Bayliss, Sion C; Pascoe, Ben; Méric, Guillaume; Hitchings, Matthew D; Sheppard, Samuel K; Feil, Edward J; Corander, Jukka; Honkela, Antti.

Microb Genom ; 2(8): e000075, 2016 08.

Article in English | MEDLINE | ID: mdl-28348870

ABSTRACT

Rapidly assaying the diversity of a bacterial species present in a sample obtained from a hospital patient or an environmental source has become possible after recent technological advances in DNA sequencing. For several applications it is important to accurately identify the presence and estimate relative abundances of the target organisms from short sequence reads obtained from a sample. This task is particularly challenging when the set of interest includes very closely related organisms, such as different strains of pathogenic bacteria, which can vary considerably in terms of virulence, resistance and spread. Using advanced Bayesian statistical modelling and computation techniques we introduce a novel pipeline for bacterial identification that is shown to outperform the currently leading pipeline for this purpose. Our approach enables fast and accurate sequence-based identification of bacterial strains while using only modest computational resources. Hence it provides a useful tool for a wide spectrum of applications, including rapid clinical diagnostics to distinguish among closely related strains causing nosocomial infections. The software implementation is available at https://github.com/PROBIC/BIB.

Subjects

Bacteria/classification , Bacteria/genetics , Bacterial Typing Techniques/methods , Software , Bacterial Typing Techniques/standards , Bayes Theorem , DNA, Bacterial/genetics , Genome, Bacterial/genetics , Humans , Sequence Analysis, DNA

17.

On the inconsistency of ℓ₁-penalised sparse precision matrix estimation.

Heinävaara, Otte; Leppä-Aho, Janne; Corander, Jukka; Honkela, Antti.

BMC Bioinformatics ; 17(Suppl 16): 448, 2016 Dec 13.

Article in English | MEDLINE | ID: mdl-28105909

ABSTRACT

BACKGROUND: Various ℓ₁-penalised estimation methods such as graphical lasso and CLIME are widely used for sparse precision matrix estimation and learning of undirected network structure from data. Many of these methods have been shown to be consistent under various quantitative assumptions about the underlying true covariance matrix. Intuitively, these conditions are related to situations where the penalty term will dominate the optimisation. RESULTS: We explore the consistency of ℓ₁-based methods for a class of bipartite graphs motivated by the structure of models commonly used for gene regulatory networks. We show that all ℓ₁-based methods fail dramatically for models with nearly linear dependencies between the variables. We also study the consistency on models derived from real gene expression data and note that the assumptions needed for consistency never hold even for modest sized gene networks and ℓ₁-based methods also become unreliable in practice for larger networks. CONCLUSIONS: Our results demonstrate that ℓ₁-penalised undirected network structure learning methods are unable to reliably learn many sparse bipartite graph structures, which arise often in gene expression data. Users of such methods should be aware of the consistency criteria of the methods and check if they are likely to be met in their application of interest.

Subjects

Computational Biology/methods , Gene Regulatory Networks , Machine Learning , Transcriptome , Animals , Humans , Models, Statistical

18.

Genome-wide modeling of transcription kinetics reveals patterns of RNA production delays.

Honkela, Antti; Peltonen, Jaakko; Topa, Hande; Charapitsa, Iryna; Matarese, Filomena; Grote, Korbinian; Stunnenberg, Hendrik G; Reid, George; Lawrence, Neil D; Rattray, Magnus.

Proc Natl Acad Sci U S A ; 112(42): 13115-20, 2015 Oct 20.

Article in English | MEDLINE | ID: mdl-26438844

ABSTRACT

Genes with similar transcriptional activation kinetics can display very different temporal mRNA profiles because of differences in transcription time, degradation rate, and RNA-processing kinetics. Recent studies have shown that a splicing-associated RNA production delay can be significant. To investigate this issue more generally, it is useful to develop methods applicable to genome-wide datasets. We introduce a joint model of transcriptional activation and mRNA accumulation that can be used for inference of transcription rate, RNA production delay, and degradation rate given data from high-throughput sequencing time course experiments. We combine a mechanistic differential equation model with a nonparametric statistical modeling approach allowing us to capture a broad range of activation kinetics, and we use Bayesian parameter estimation to quantify the uncertainty in estimates of the kinetic parameters. We apply the model to data from estrogen receptor α activation in the MCF-7 breast cancer cell line. We use RNA polymerase II ChIP-Seq time course data to characterize transcriptional activation and mRNA-Seq time course data to quantify mature transcripts. We find that 11% of genes with a good signal in the data display a delay of more than 20 min between completing transcription and mature mRNA production. The genes displaying these long delays are significantly more likely to be short. We also find a statistical association between high delay and late intron retention in pre-mRNA data, indicating significant splicing-associated production delays in many genes.

Subjects

Genome, Human , Models, Genetic , RNA/biosynthesis , Transcription, Genetic , Estrogen Receptor alpha/metabolism , Humans , Kinetics , MCF-7 Cells , RNA/genetics , Signal Transduction

19.

Fast and accurate approximate inference of transcript expression from RNA-seq data.

Hensman, James; Papastamoulis, Panagiotis; Glaus, Peter; Honkela, Antti; Rattray, Magnus.

Bioinformatics ; 31(24): 3881-9, 2015 Dec 15.

Article in English | MEDLINE | ID: mdl-26315907

ABSTRACT

MOTIVATION: Assigning RNA-seq reads to their transcript of origin is a fundamental task in transcript expression estimation. Where ambiguities in assignments exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem can be solved through probabilistic inference. Bayesian methods have been shown to provide accurate transcript abundance estimates compared with competing methods. However, exact Bayesian inference is intractable and approximate methods such as Markov chain Monte Carlo and Variational Bayes (VB) are typically used. While providing a high degree of accuracy and modelling flexibility, standard implementations can be prohibitively slow for large datasets and complex transcriptome annotations. RESULTS: We propose a novel approximate inference scheme based on VB and apply it to an existing model of transcript expression inference from RNA-seq data. Recent advances in VB algorithmics are used to improve the convergence of the algorithm beyond the standard Variational Bayes Expectation Maximization algorithm. We apply our algorithm to simulated and biological datasets, demonstrating a significant increase in speed with only very small loss in accuracy of expression level estimation. We carry out a comparative study against seven popular alternative methods and demonstrate that our new algorithm provides excellent accuracy and inter-replicate consistency while remaining competitive in computation time. AVAILABILITY AND IMPLEMENTATION: The methods were implemented in R and C++, and are available as part of the BitSeq project at github.com/BitSeq. The method is also available through the BitSeq Bioconductor package. The source code to reproduce all simulation results can be accessed via github.com/BitSeq/BitSeqVB_benchmarking.
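The probabilistic assignment of ambiguous reads can be sketched with a plain expectation-maximisation (EM) loop. This is a simpler stand-in chosen for illustration, not the variational Bayes scheme the abstract proposes and not BitSeq code. It ignores transcript lengths, and the input format (one dictionary per read, mapping a transcript index to an alignment likelihood) is an assumption.

```python
def em_abundances(read_compat, n_transcripts, n_iter=200):
    """Estimate relative transcript abundances from ambiguous read alignments.

    read_compat: one dict per read, {transcript index: alignment likelihood}.
    Returns a list of abundance proportions that sums to 1.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from uniform abundances
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        for compat in read_compat:
            # E-step: posterior probability that the read came from each transcript.
            weights = {t: theta[t] * lik for t, lik in compat.items()}
            total = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: abundances are the normalised expected read counts.
        theta = [e / len(read_compat) for e in expected]
    return theta
```

With six reads unique to transcript 0, two unique to transcript 1 and two shared equally between them, the estimates converge to 0.75 and 0.25: the shared reads are split in proportion to the abundances supported by the unique reads.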

Subjects

Algorithms , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Bayes Theorem , Humans , Markov Chains , Monte Carlo Method

20.

Probe Region Expression Estimation for RNA-Seq Data for Improved Microarray Comparability.

Uziela, Karolis; Honkela, Antti.

PLoS One ; 10(5): e0126545, 2015.

Article in English | MEDLINE | ID: mdl-25966034

ABSTRACT

Rapidly growing public gene expression databases contain a wealth of data for building an unprecedentedly detailed picture of human biology and disease. This data comes from many diverse measurement platforms that make integrating it all difficult. Although RNA-sequencing (RNA-seq) is attracting the most attention, at present, the rate of new microarray studies submitted to public databases far exceeds the rate of new RNA-seq studies. There is clearly a need for methods that make it easier to combine data from different technologies. In this paper, we propose a new method for processing RNA-seq data that yields gene expression estimates that are much more similar to corresponding estimates from microarray data, hence greatly improving cross-platform comparability. The method we call PREBS is based on estimating the expression from RNA-seq reads overlapping the microarray probe regions, and processing these estimates with standard microarray summarisation algorithms. Using paired microarray and RNA-seq samples from TCGA LAML data set we show that PREBS expression estimates derived from RNA-seq are more similar to microarray-based expression estimates than those from other RNA-seq processing methods. In an experiment to retrieve paired microarray samples from a database using an RNA-seq query sample, gene signatures defined based on PREBS expression estimates were found to be much more accurate than those from other methods. PREBS also allows new ways of using RNA-seq data, such as expression estimation for microarray probe sets. An implementation of the proposed method is available in the Bioconductor package "prebs."

Subjects

Gene Expression Profiling/methods , Gene Expression Regulation , Oligonucleotide Array Sequence Analysis/methods , RNA/biosynthesis , Base Sequence , Databases, Genetic , Humans , RNA/genetics , Sequence Analysis, RNA
