Results 1 - 20 of 32
1.
Blood ; 143(3): 243-257, 2024 Jan 18.
Article in English | MEDLINE | ID: mdl-37922454

ABSTRACT

Regulation of lineage biases in hematopoietic stem and progenitor cells (HSPCs) is pivotal for balanced hematopoietic output. However, little is known about the mechanism behind lineage choice in HSPCs. Here, we show that messenger RNA (mRNA) decay factors regnase-1 (Reg1; Zc3h12a) and regnase-3 (Reg3; Zc3h12c) are essential for determining lymphoid fate and restricting myeloid differentiation in HSPCs. Loss of Reg1 and Reg3 resulted in severe impairment of lymphopoiesis and a mild increase in myelopoiesis in the bone marrow. Single-cell RNA sequencing analysis revealed that Reg1 and Reg3 regulate lineage directions in HSPCs via the control of a set of myeloid-related genes. Reg1- and Reg3-mediated control of mRNA encoding Nfkbiz, a transcriptional and epigenetic regulator, was essential for balancing lymphoid/myeloid lineage output in HSPCs in vivo. Furthermore, single-cell assay for transposase-accessible chromatin sequencing analysis revealed that Reg1 and Reg3 control the epigenetic landscape on myeloid-related gene loci in early stage HSPCs via Nfkbiz. Consistently, an antisense oligonucleotide designed to inhibit Reg1- and Reg3-mediated Nfkbiz mRNA degradation primed hematopoietic stem cells toward myeloid lineages by enhancing Nfkbiz expression. Collectively, the collaboration between posttranscriptional control and chromatin remodeling by the Reg1/Reg3-Nfkbiz axis governs HSPC lineage biases, ultimately dictating the fate of lymphoid vs myeloid differentiation.


Subject(s)
Bone Marrow , Hematopoietic Stem Cells , Cell Lineage/genetics , Hematopoietic Stem Cells/metabolism , Bone Marrow/metabolism , Hematopoiesis/genetics , RNA, Messenger/metabolism , Cell Differentiation/genetics
2.
Commun Biol ; 6(1): 1290, 2023 Dec 28.
Article in English | MEDLINE | ID: mdl-38155269

ABSTRACT

Single-cell RNA-seq analysis coupled with CRISPR-based perturbation has enabled the inference of gene regulatory networks with causal relationships. However, a snapshot of single-cell CRISPR data may not lead to an accurate inference, since a gene knockout can influence multiple layers of downstream genes over time. Here, we developed RENGE, a computational method that infers gene regulatory networks using a time-series single-cell CRISPR dataset. RENGE models the propagation process of the effects elicited by a gene knockout on its regulatory network. It can distinguish between direct and indirect regulations, which allows for the inference of regulations by genes that are not knocked out. RENGE therefore outperforms current methods in the accuracy of inferring gene regulatory networks. When used on a dataset we derived from human induced pluripotent stem cells, RENGE yielded a network consistent with multiple databases and literature. Accurate inference of gene regulatory networks by RENGE would enable the identification of key factors for various biological systems.
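
The distinction between direct and indirect regulation can be illustrated with a toy calculation. The sketch below is not RENGE's model; it assumes, purely for illustration, that effects are linear and that the effect of a knockout reaching a gene within k steps is the sum of the first k powers of a direct-effect matrix. The matrix values and the function names (`matmul`, `total_effects`) are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def total_effects(direct, max_steps):
    """Return direct^1 + ... + direct^max_steps.

    Entry [i][j] is the accumulated effect of regulator j on target i
    along all paths of length at most max_steps (toy assumption).
    """
    n = len(direct)
    power = [row[:] for row in direct]
    total = [row[:] for row in direct]
    for _ in range(max_steps - 1):
        power = matmul(power, direct)
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total


# Hypothetical 3-gene chain: gene 0 activates gene 1, gene 1 represses gene 2.
# direct[i][j] is the direct effect of regulator j on target i.
direct = [
    [0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [0.0, -0.5, 0.0],
]

one_hop = total_effects(direct, 1)  # direct regulation only
two_hop = total_effects(direct, 2)  # direct plus one layer of indirect effects
```

In this toy chain, gene 0 has no direct effect on gene 2 (`one_hop[2][0]` is zero), but an indirect effect appears once a second step is allowed (`two_hop[2][0]`), which is why a single snapshot can confound the two.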


Subject(s)
Gene Regulatory Networks , Single-Cell Gene Expression Analysis , Humans , Gene Knockout Techniques , Time Factors
3.
Sci Adv ; 9(22): eadf1814, 2023 Jun 02.
Article in English | MEDLINE | ID: mdl-37267354

ABSTRACT

Embryonic development proceeds as a series of orderly cell state transitions built upon noisy molecular processes. We defined gene expression and cell motion states using single-cell RNA sequencing data and in vivo time-lapse cell tracking data of the zebrafish tailbud. We performed a parallel identification of these states using dimensional reduction methods and a change point detection algorithm. Both types of cell states were quantitatively mapped onto embryos, and we used the cell motion states to study the dynamics of biological state transitions over time. The time average pattern of cell motion states is reproducible among embryos. However, individual embryos exhibit transient deviations from the time average forming left-right asymmetries in collective cell motion. Thus, the reproducible pattern of cell states and bilateral symmetry arise from temporal averaging. In addition, collective cell behavior can be a source of asymmetry rather than a buffer against noisy individual cell behavior.


Subject(s)
Zebrafish Proteins , Zebrafish , Animals , Zebrafish/metabolism , Time-Lapse Imaging , Zebrafish Proteins/metabolism , Cell Tracking/methods , Embryonic Development
4.
Sci Rep ; 13(1): 6817, 2023 Apr 26.
Article in English | MEDLINE | ID: mdl-37100862

ABSTRACT

Alzheimer's disease (AD) is the most prevalent dementia disorder globally, and there are still no effective interventions for slowing or stopping the underlying pathogenic mechanisms. There is strong evidence implicating neural oxidative stress (OS) and ensuing neuroinflammation in the progressive neurodegeneration observed in the AD brain both during and prior to symptom emergence. Thus, OS-related biomarkers may be valuable for prognosis and provide clues to therapeutic targets during the early presymptomatic phase. In the current study, we gathered brain RNA-seq data of AD patients and matched controls from the Gene Expression Omnibus (GEO) to identify differentially expressed OS-related genes (OSRGs). These OSRGs were analyzed for cellular functions using the Gene Ontology (GO) database and used to construct a weighted gene co-expression network (WGCN) and protein-protein interaction (PPI) network. Receiver operating characteristic (ROC) curves were then constructed to identify network hub genes. A diagnostic model was established based on these hub genes using Least Absolute Shrinkage and Selection Operator (LASSO) and ROC analyses. Immune-related functions were examined by assessing correlations between hub gene expression and immune cell brain infiltration scores. Further, target drugs were predicted using the Drug-Gene Interaction database, while regulatory miRNAs and transcription factors were predicted using miRNet. In total, 156 candidate genes were identified among 11046 differentially expressed genes, 7098 genes in WGCN modules, and 446 OSRGs, and 5 hub genes (MAPK9, FOXO1, BCL2, ETS1, and SP1) were identified by ROC curve analyses. These hub genes were enriched in GO annotations "Alzheimer's disease pathway," "Parkinson's Disease," "Ribosome," and "Chronic myeloid leukemia." In addition, 78 drugs were predicted to target FOXO1, SP1, MAPK9, and BCL2, including fluorouracil, cyclophosphamide, and epirubicin. A hub gene-miRNA regulatory network with 43 miRNAs and hub gene-transcription factor (TF) network with 36 TFs were also generated. These hub genes may serve as biomarkers for AD diagnosis and provide clues to novel potential treatment targets.


Subject(s)
Alzheimer Disease , MicroRNAs , Humans , Alzheimer Disease/diagnosis , Alzheimer Disease/genetics , Oxidative Stress/genetics , Brain , Proto-Oncogene Proteins c-bcl-2 , Gene Regulatory Networks , Computational Biology , Gene Expression Profiling
5.
Methods Mol Biol ; 2586: 35-48, 2023.
Article in English | MEDLINE | ID: mdl-36705897

ABSTRACT

Information on RNA secondary structure has been widely applied to the inference of RNA function. However, classical prediction methods are not feasible for long RNAs such as mRNAs because of computational time and numerical errors. To overcome these problems, sliding window methods have been applied, but their results are not directly comparable to global RNA structure prediction. In this chapter, we introduce ParasoR, a method designed for parallel computation of genome-wide RNA secondary structures. To enable genome-wide prediction, ParasoR distributes the dynamic programming (DP) matrices required for structure prediction to multiple computational nodes. Using a database that stores the ratios of DP variables rather than the original variables, ParasoR can locally compute structure scores such as stem probability or accessibility on demand. A comprehensive analysis of local secondary structures by ParasoR is expected to be a promising way to detect the statistical constraints on long RNAs.


Subject(s)
Algorithms , RNA , RNA/genetics , RNA/chemistry , Nucleic Acid Conformation , Computational Biology/methods , RNA, Messenger
6.
Methods Mol Biol ; 2586: 107-120, 2023.
Article in English | MEDLINE | ID: mdl-36705901

ABSTRACT

A point mutation in coding RNA can cause not only an amino acid substitution but also a dynamic change of RNA secondary structure, leading to a dysfunctional RNA. Although in silico structure prediction has been used to detect structure-disrupting point mutations known as riboSNitches, exhaustive simulation of long RNAs is needed to detect a significant enrichment or depletion of riboSNitches in functional RNAs. Here, we have developed a novel algorithm Radiam (RNA secondary structure Analysis with Deletion, Insertion, And substitution Mutations) for a comprehensive riboSNitch analysis of long RNAs. Radiam is based on the ParasoR framework, which efficiently computes local RNA secondary structures for long RNAs. ParasoR can compute a variety of structure scores over globally consistent structures with maximal span constraints for the base pair distance. Using the reusable structure database made by ParasoR, Radiam performs an efficient recomputation of the secondary structures for mutated sequences. An exhaustive simulation of Radiam is expected to find reliable riboSNitch candidates on long RNAs by evaluating their statistical significance in terms of the change of local structure stability.


Subject(s)
Algorithms , RNA , RNA/genetics , RNA/chemistry , Nucleic Acid Conformation , RNA, Untranslated , Nucleotides , Sequence Analysis, RNA
7.
Brief Bioinform ; 23(6), 2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36094092

ABSTRACT

The identification of cancer subtypes can help researchers understand hidden genomic mechanisms, enhance diagnostic accuracy and improve clinical treatments. With the development of high-throughput techniques, researchers can access large amounts of data from multiple sources. Because of the high dimensionality and complexity of multiomics and clinical data, research into the integration of multiomics data is needed, and developing effective tools for such purposes remains a challenge for researchers. In this work, we proposed an entirely unsupervised clustering method without harnessing any prior knowledge (MODEC). We used manifold optimization and deep-learning techniques to integrate multiomics data for the identification of cancer subtypes and the analysis of significant clinical variables. Since there is nonlinearity in the gene-level datasets, we used manifold optimization methodology to extract essential information from the original omics data to obtain a low-dimensional latent subspace. Then, MODEC uses a deep learning-based clustering module to iteratively define cluster centroids and assign cluster labels to each sample by minimizing the Kullback-Leibler divergence loss. MODEC was applied to six public cancer datasets from The Cancer Genome Atlas database and outperformed eight competing methods in terms of the accuracy and reliability of the subtyping results. MODEC was extremely competitive in the identification of survival patterns and significant clinical features, which could help doctors monitor disease progression and provide more suitable treatment strategies.
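
The clustering step described above (iteratively assigning labels by minimizing a Kullback-Leibler divergence) can be sketched as follows. The abstract does not specify the form of the soft assignments, so this example borrows the Student's t kernel and the sharpened target distribution from the Deep Embedded Clustering formulation as an assumption; it is not MODEC's implementation. The latent points, centroids, and function names are hypothetical.

```python
import math


def soft_assign(points, centroids):
    """Soft cluster assignments q[i][j] from a Student's t kernel (1 dof).

    Assumed kernel; each row is normalized to sum to 1.
    """
    q = []
    for p in points:
        weights = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(p, c)))
                   for c in centroids]
        total = sum(weights)
        q.append([w / total for w in weights])
    return q


def target_distribution(q):
    """Sharpened targets p[i][j] proportional to q[i][j]^2 / f_j, f_j = sum_i q[i][j]."""
    k = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(k)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(k)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow)
               if pij > 0)


# Hypothetical 2-D latent representations of four samples and two centroids.
latent = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assign(latent, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)  # the quantity a training loop would minimize
labels = [row.index(max(row)) for row in q]
```

In a full method the loss would be minimized with respect to both the centroids and the network producing the latent representation; here only a single evaluation of the loss is shown.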


Subject(s)
Algorithms , Neoplasms , Humans , Reproducibility of Results , Cluster Analysis , Genomics/methods , Neoplasms/genetics
8.
Commun Biol ; 3(1): 434, 2020 Aug 13.
Article in English | MEDLINE | ID: mdl-32792557

ABSTRACT

Recent high-throughput approaches have revealed a vast number of transcripts with unknown functions. Many of these transcripts are long noncoding RNAs (lncRNAs), and intergenic region-derived lncRNAs are classified as long intergenic noncoding RNAs (lincRNAs). Although Myosin heavy chain 6 (Myh6) encoding primary contractile protein is down-regulated in stressed hearts, the underlying mechanisms are not fully clarified especially in terms of lincRNAs. Here, we screen upregulated lincRNAs in pressure overloaded hearts and identify a muscle-abundant lincRNA termed Lionheart. Compared with controls, deletion of the Lionheart in mice leads to decreased systolic function and a reduction in MYH6 protein levels following pressure overload. We reveal decreased MYH6 results from an interaction between Lionheart and Purine-rich element-binding protein A after pressure overload. Furthermore, human LIONHEART levels in left ventricular biopsy specimens positively correlate with cardiac systolic function. Our results demonstrate Lionheart plays a pivotal role in cardiac remodeling via regulation of MYH6.


Subject(s)
Heart/physiopathology , Pressure , RNA, Long Noncoding/genetics , Systole/genetics , Animals , Biopsy , Dependovirus/metabolism , Heart Ventricles/ultrastructure , Humans , Mice, Inbred C57BL , Mice, Knockout , Phenotype , Promoter Regions, Genetic/genetics , RNA, Long Noncoding/metabolism , Rats , Up-Regulation/genetics
9.
Bioinformatics ; 36(1): 221-231, 2020 Jan 01.
Article in English | MEDLINE | ID: mdl-31218366

ABSTRACT

MOTIVATION: Evolve and resequence (E&R) experiments show promise in capturing real-time evolution at genome-wide scales, enabling the assessment of allele frequency changes of SNPs in evolving populations and thus the estimation of population genetic parameters in the Wright-Fisher model (WF) that quantify the selection on SNPs. Currently, these analyses face two key difficulties: the numerous SNPs in E&R data and the frequent unreliability of estimates. Hence, a methodology for efficiently estimating WF parameters is needed to understand the evolutionary processes that shape genomes. RESULTS: We developed a novel method for estimating WF parameters (EMWER), by applying an expectation maximization algorithm to the Kolmogorov forward equation associated with the WF model diffusion approximation. EMWER was used to infer the effective population size, selection coefficients and dominance parameters from E&R data. Of the methods examined, EMWER was the most efficient method for selection strength estimation in multi-core computing environments, estimating both selection and dominance with accurate confidence intervals. We applied EMWER to E&R data from experimental Drosophila populations adapting to thermally fluctuating environments and found a common selection affecting allele frequency of many SNPs within the cosmopolitan In(3R)P inversion. Furthermore, this application indicated that many of the beneficial alleles in this experiment are dominant. AVAILABILITY AND IMPLEMENTATION: Our C++ implementation of 'EMWER' is available at https://github.com/kojikoji/EMWER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
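
For readers unfamiliar with the Wright-Fisher model with selection and dominance, a minimal forward simulation may help. This sketch only generates allele frequency trajectories of the kind an E&R experiment observes; it is not the EMWER estimator, which works with the diffusion approximation and an expectation maximization algorithm. The genotype fitness parameterization (1 + s, 1 + h*s, 1) is a common textbook convention assumed here, and all parameter values are hypothetical.

```python
import random


def expected_next_freq(p, s, h):
    """Allele frequency after selection, before drift.

    Assumed genotype fitnesses: AA = 1 + s, Aa = 1 + h * s, aa = 1,
    where s is the selection coefficient and h the dominance parameter.
    """
    q = 1.0 - p
    w_hom, w_het, w_ref = 1.0 + s, 1.0 + h * s, 1.0
    mean_w = p * p * w_hom + 2.0 * p * q * w_het + q * q * w_ref
    return (p * p * w_hom + p * q * w_het) / mean_w


def simulate_wf(p0, pop_size, s, h, generations, rng):
    """Simulate allele frequencies in a diploid population of pop_size individuals."""
    freqs = [p0]
    n_copies = 2 * pop_size
    for _ in range(generations):
        p_sel = expected_next_freq(freqs[-1], s, h)
        # Drift: binomial sampling of 2N gene copies, drawn one copy at a time.
        count = sum(1 for _copy in range(n_copies) if rng.random() < p_sel)
        freqs.append(count / n_copies)
    return freqs


rng = random.Random(0)  # fixed seed so the run is repeatable
trajectory = simulate_wf(p0=0.1, pop_size=200, s=0.1, h=0.5,
                         generations=50, rng=rng)
```

An estimator in the spirit of the abstract would run in the opposite direction: given observed trajectories such as `trajectory`, it would search for the population size, `s`, and `h` that make them most likely.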


Subject(s)
Algorithms , Computational Biology , Models, Genetic , Animals , Computational Biology/methods , Drosophila/genetics , Evolution, Molecular , Gene Frequency , Genetics, Population , Population Density
10.
Algorithms Mol Biol ; 14: 23, 2019.
Article in English | MEDLINE | ID: mdl-31832082

ABSTRACT

BACKGROUND: As the number of sequenced genomes grows, researchers have access to an increasingly rich source for discovering detailed evolutionary information. However, the computational technologies for inferring biologically important evolutionary events are not sufficiently developed. RESULTS: We present algorithms to estimate the evolutionary time (t_MRS) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. As the confidence in estimated t_MRS values varies depending on gap fractions and nucleotide patterns of alignment columns, we also compute the standard deviation σ of t_MRS by using a dynamic programming algorithm. We identified a number of human genomic sites at which the last substitutions occurred between two speciation events in the human lineage with confidence. A large fraction of such sites have substitutions that occurred between the concestor nodes of Hominoidea and Euarchontoglires. We investigated the correlation between tissue-specific transcribed enhancers and the distribution of the sites with specific substitution time intervals, and found that brain-specific transcribed enhancers are threefold enriched in the density of substitutions in the human lineage relative to expectations. CONCLUSIONS: We have presented algorithms to estimate the evolutionary time (t_MRS) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. Our algorithms will be useful for Evo-Devo studies, as they facilitate screening potential genomic sites that have played an important role in the acquisition of unique biological features by target species.

11.
BMC Bioinformatics ; 20(Suppl 3): 130, 2019 Mar 29.
Article in English | MEDLINE | ID: mdl-30925857

ABSTRACT

BACKGROUND: Recently, next-generation sequencing techniques have been applied for the detection of RNA secondary structures, which is referred to as high-throughput RNA structural (HTS) analyses, and many different protocols have been used to detect comprehensive RNA structures at single-nucleotide resolution. However, the existing computational analyses heavily depend on the experimental methodology to generate data, which results in difficulties associated with statistically sound comparisons or combining the results obtained using different HTS methods. RESULTS: Here, we introduced a statistical framework, reactIDR, which can be applied to the experimental data obtained using multiple HTS methodologies. Using this approach, nucleotides are classified into three structural categories: loop, stem/background, and unmapped. reactIDR uses the irreproducible discovery rate (IDR) with a hidden Markov model to accurately discriminate between the true and spurious signals obtained in the replicated HTS experiments, and it is able to incorporate an expectation-maximization algorithm and supervised learning for efficient parameter optimization. The results of our analyses of the real-life HTS data showed that reactIDR had the highest accuracy in the classification of ribosomal RNA stem/loop structures when using both individual and integrated HTS datasets, and its results corresponded best to the three-dimensional structures. CONCLUSIONS: We have developed novel software, reactIDR, for the prediction of stem/loop regions from the HTS analysis datasets. For the rRNA structure analyses, reactIDR was shown to have robust accuracy across different datasets by using the reproducibility criterion, suggesting its potential for increasing the value of existing HTS datasets. reactIDR is publicly available at https://github.com/carushi/reactIDR .


Subject(s)
Algorithms , High-Throughput Nucleotide Sequencing/methods , Nucleic Acid Conformation , RNA/chemistry , Statistics as Topic , Area Under Curve , Machine Learning , Markov Chains , Nucleotides , RNA, Ribosomal/chemistry , RNA, Ribosomal/genetics , ROC Curve , Reproducibility of Results
12.
Bioinformatics ; 33(15): 2314-2321, 2017 Aug 01.
Article in English | MEDLINE | ID: mdl-28379368

ABSTRACT

MOTIVATION: The analysis of RNA-Seq data from individual differentiating cells enables us to reconstruct the differentiation process and the degree of differentiation (in pseudo-time) of each cell. Such analyses can reveal detailed expression dynamics and functional relationships for differentiation. To further elucidate differentiation processes, more insight into gene regulatory networks is required. The pseudo-time can be regarded as time information and, therefore, single-cell RNA-Seq data are time-course data with high time resolution. Although time-course data are useful for inferring networks, conventional inference algorithms for such data suffer from high time complexity when the number of samples and genes is large. Therefore, a novel algorithm is necessary to infer networks from single-cell RNA-Seq during differentiation. RESULTS: In this study, we developed the novel and efficient algorithm SCODE to infer regulatory networks, based on ordinary differential equations. We applied SCODE to three single-cell RNA-Seq datasets and confirmed that SCODE can reconstruct observed expression dynamics. We evaluated SCODE by comparing its inferred networks with use of a DNaseI-footprint based network. The performance of SCODE was best for two of the datasets and nearly best for the remaining dataset. We also compared the runtimes and showed that the runtimes for SCODE are significantly shorter than for alternatives. Thus, our algorithm provides a promising approach for further single-cell differentiation analyses. AVAILABILITY AND IMPLEMENTATION: The R source code of SCODE is available at https://github.com/hmatsu1226/SCODE. CONTACT: hirotaka.matsumoto@riken.jp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
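
The idea of inferring a regulatory network from an ordinary differential equation can be illustrated with a small least-squares fit. This is not the SCODE algorithm; it assumes, for illustration only, a linear model dx/dt = A x for two genes, noise-free expression values ordered in pseudo-time, and finite-difference derivatives. The matrix `true_A`, the step size, and the function names are hypothetical.

```python
def euler_simulate(A, x0, dt, steps):
    """Generate expression values along pseudo-time with forward Euler steps."""
    xs = [list(x0)]
    n = len(x0)
    for _ in range(steps):
        x = xs[-1]
        dx = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        xs.append([x[i] + dt * dx[i] for i in range(n)])
    return xs


def fit_linear_ode_2genes(xs, dt):
    """Least-squares estimate of A in dx/dt = A x for exactly two genes.

    Solves A * Sxx = Sdx, where Sxx = sum x x^T and Sdx = sum (dx/dt) x^T,
    with dx/dt approximated by finite differences.
    """
    sxx = [[0.0, 0.0], [0.0, 0.0]]
    sdx = [[0.0, 0.0], [0.0, 0.0]]
    for x, nxt in zip(xs, xs[1:]):
        d = [(nxt[i] - x[i]) / dt for i in range(2)]
        for i in range(2):
            for j in range(2):
                sxx[i][j] += x[i] * x[j]
                sdx[i][j] += d[i] * x[j]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    if abs(det) < 1e-12:
        raise ValueError("trajectory does not span both genes; A is not identifiable")
    inv = [[sxx[1][1] / det, -sxx[0][1] / det],
           [-sxx[1][0] / det, sxx[0][0] / det]]
    return [[sum(sdx[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical network: gene 1 activates gene 0; both genes decay.
true_A = [[-1.0, 0.5], [0.0, -0.5]]
trajectory = euler_simulate(true_A, x0=[1.0, 2.0], dt=0.05, steps=100)
estimated_A = fit_linear_ode_2genes(trajectory, dt=0.05)
```

With real single-cell data the derivatives are not observed and the number of genes is large, which is where the efficiency concerns raised in the abstract come in.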


Subject(s)
Cell Differentiation/genetics , Gene Regulatory Networks , Sequence Analysis, RNA/methods , Software , Algorithms , Animals , Humans , Mice , Single-Cell Analysis/methods
13.
BMC Bioinformatics ; 17(1): 232, 2016 Jun 08.
Article in English | MEDLINE | ID: mdl-27277014

ABSTRACT

BACKGROUND: Single-cell technologies make it possible to quantify the comprehensive states of individual cells, and have the power to shed light on cellular differentiation in particular. Although several methods have been developed to fully analyze the single-cell expression data, there is still room for improvement in the analysis of differentiation. RESULTS: In this paper, we propose a novel method, SCOUP, to elucidate the differentiation process. Unlike previous dimension reduction-based approaches, SCOUP describes the dynamics of gene expression throughout differentiation directly, including the degree of differentiation of a cell (in pseudo-time) and cell fate. SCOUP is superior to previous methods with respect to pseudo-time estimation, especially for single-cell RNA-seq. SCOUP also estimates cell lineage more accurately than previous methods, especially for cells at an early stage of bifurcation. In addition, SCOUP can be applied to various downstream analyses. As an example, we propose a novel correlation calculation method for elucidating regulatory relationships among genes. We apply this method to single-cell RNA-seq data and detect a candidate key regulator of differentiation and clusters in a correlation network that are not detected with conventional correlation analysis. CONCLUSIONS: We develop a stochastic process-based method, SCOUP, to analyze single-cell expression data throughout differentiation. SCOUP can estimate pseudo-time and cell lineage more accurately than previous methods. We also propose a novel correlation calculation method based on SCOUP. SCOUP is a promising approach for further single-cell analysis and is available at https://github.com/hmatsu1226/SCOUP.


Subject(s)
Algorithms , Cell Differentiation/genetics , Gene Expression Regulation , Models, Statistical , Single-Cell Analysis/methods , Area Under Curve , Cell Line , Cell Lineage , Cluster Analysis , Gene Ontology , Gene Regulatory Networks , Humans , Principal Component Analysis , Reproducibility of Results , Stochastic Processes , Time Factors , Transcription Factors/metabolism
14.
BMC Bioinformatics ; 17(1): 203, 2016 May 06.
Article in English | MEDLINE | ID: mdl-27153986

ABSTRACT

BACKGROUND: RNA secondary structure around splice sites is known to assist normal splicing by promoting spliceosome recognition. However, analyzing the structural properties of entire intronic regions or pre-mRNA sequences has been difficult hitherto, owing to serious experimental and computational limitations, such as low read coverage and numerical problems. RESULTS: Our novel software, "ParasoR", is designed to run on a computer cluster and enables the exact computation of various structural features of long RNA sequences under the constraint of maximal base-pairing distance. ParasoR divides dynamic programming (DP) matrices into smaller pieces, such that each piece can be computed by a separate computer node without losing the connectivity information between the pieces. ParasoR directly computes the ratios of DP variables to avoid the reduction of numerical precision caused by the cancellation of a large number of Boltzmann factors. The structural preferences of mRNAs computed by ParasoR show a high concordance with those determined by high-throughput sequencing analyses. Using ParasoR, we investigated the global structural preferences of transcribed regions in the human genome. A genome-wide folding simulation indicated that transcribed regions are significantly more structural than intergenic regions after removing repeat sequences and k-mer frequency bias. In particular, we observed a highly significant preference for base pairing over entire intronic regions as compared to their antisense sequences, as well as to intergenic regions. A comparison between pre-mRNAs and mRNAs showed that coding regions become more accessible after splicing, indicating constraints for translational efficiency. Such changes are correlated with gene expression levels, as well as GC content, and are enriched among genes associated with cytoskeleton and kinase functions. CONCLUSIONS: We have shown that ParasoR is very useful for analyzing the structural properties of long RNA sequences such as mRNAs, pre-mRNAs, and long non-coding RNAs whose lengths can be more than a million bases in the human genome. In our analyses, transcribed regions including introns are indicated to be subject to various types of structural constraints that cannot be explained from simple sequence composition biases. ParasoR is freely available at https://github.com/carushi/ParasoR .
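
Why storing ratios of DP variables helps numerically can be seen in a toy recursion. The example below is not ParasoR's recursion; it assumes a deliberately simple partition-function-like recurrence Z_i = Z_{i-1} + w * Z_{i-2}, where w stands in for a Boltzmann factor, and tracks only the ratio r_i = Z_i / Z_{i-1}, which obeys r_i = 1 + w / r_{i-1}. The function names and the value of w are hypothetical.

```python
import math


def z_naive(n, w):
    """Compute Z_n directly; the value overflows to inf for long sequences."""
    prev, cur = 1.0, 1.0  # Z_0 and Z_1
    for _ in range(2, n + 1):
        prev, cur = cur, cur + w * prev
    return cur


def log_z_via_ratios(n, w):
    """Compute log Z_n while storing only the ratio r_i = Z_i / Z_{i-1}."""
    log_z = 0.0
    r = 1.0  # r_1 = Z_1 / Z_0
    for _ in range(2, n + 1):
        r = 1.0 + w / r       # ratio recursion derived from Z_i = Z_{i-1} + w * Z_{i-2}
        log_z += math.log(r)  # log Z_n is the sum of the log ratios
    return log_z


w = math.exp(5.0)  # hypothetical Boltzmann factor for a favorable pairing

short_naive = math.log(z_naive(20, w))
short_ratio = log_z_via_ratios(20, w)

long_naive = z_naive(2000, w)          # exceeds the double-precision range
long_ratio = log_z_via_ratios(2000, w)  # remains finite
```

For short inputs the two routes agree; for long inputs only the ratio form stays within floating-point range, and each ratio depends only on its neighbor, which is also what makes it natural to split the computation into pieces.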


Subject(s)
Computational Biology/methods , Genome, Human , Nucleic Acid Conformation , RNA/chemistry , RNA/genetics , Animals , Area Under Curve , Base Sequence , Computer Simulation , Gene Ontology , Humans , Mice , Microfilament Proteins/metabolism , Propensity Score , RNA Precursors/genetics , RNA Precursors/metabolism , RNA Splicing/genetics , RNA, Messenger/genetics , RNA, Messenger/metabolism , Reproducibility of Results , Software , Transcription, Genetic
15.
Nucleic Acids Res ; 44(W1): W302-7, 2016 Jul 08.
Article in English | MEDLINE | ID: mdl-27131356

ABSTRACT

The secondary structures, as well as the nucleotide sequences, are important features of RNA molecules that characterize their functions. According to the thermodynamic model, however, the probability of any secondary structure is very small. As a consequence, any tool to predict the secondary structures of RNAs has limited accuracy. On the other hand, there are a few tools to compensate for imperfect predictions by calculating and visualizing the secondary structural information from RNA sequences. It is desirable to obtain the rich information from those tools through a friendly interface. We implemented a web server of the tools to predict secondary structures and to calculate various structural features based on the energy models of secondary structures. By just giving an RNA sequence to the web server, the user can get the different types of solutions of the secondary structures, the marginal probabilities such as base-pairing probabilities, loop probabilities and accessibilities of the local bases, the energy changes by arbitrary base mutations as well as the measures for validations of the predicted secondary structures. The web server is available at http://rtools.cbrc.jp, which integrates software tools, CentroidFold, CentroidHomfold, IPKnot, CapR, Raccess, Rchange and RintD.


Subject(s)
Nucleic Acid Conformation , RNA Folding , RNA/chemistry , Software , Algorithms , Base Pairing , Base Sequence , Computer Graphics , Internet , Mutation , RNA/genetics , Sequence Analysis, RNA , Thermodynamics
16.
BMC Genomics ; 15: 733, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25167975

ABSTRACT

BACKGROUND: Haplotype information is useful for many genetic analyses and haplotypes are usually inferred using computational approaches. Among such approaches, the importance of single individual haplotyping (SIH), which infers individual haplotypes from sequence fragments, has been increasing with the advent of novel sequencing techniques, such as dilution-based sequencing. These techniques could produce virtual long read fragments by separating DNA fragments into multiple low-concentration aliquots, sequencing and mapping each aliquot, and merging clustered short reads. Although these experimental techniques are sophisticated, they have the problem of producing chimeric fragments whose left and right parts match different chromosomes. In our previous research, we found that chimeric fragments significantly decrease the accuracy of SIH. Although chimeric fragments can be removed by using haplotypes which are determined from pedigree genotypes, pedigree genotypes are generally not available. The length of read clusters and heterozygous calls have also been used to detect chimeric fragments. Although some chimeric fragments will be removed with these features, a considerable number of chimeric fragments will remain undetected because of the dispersion of the length and the absence of SNPs in the overlapped regions. For these reasons, a general method to detect and remove chimeric fragments is needed. RESULTS: In this paper, we propose a general method to detect chimeric fragments. The basis of our method is that a chimeric fragment would correspond to an artificial recombinant haplotype and would differ from biological haplotypes. To detect differences from biological haplotypes, we integrated statistical phasing, which is a haplotype inference approach from population genotypes, into our method. We applied our method to two datasets and detected chimeric fragments with high AUC. AUC values of our method are higher than those of just using cluster length and heterozygous calls. We then used multiple SIH algorithms to compare the accuracy of SIH before and after removing the chimeric fragment candidates. The accuracy of assembled haplotypes increased significantly after removing chimeric fragment candidates. CONCLUSIONS: Our method is useful for detecting chimeric fragments and improving SIH accuracy. The Ruby script is available at https://sites.google.com/site/hmatsu1226/software/csp.


Subject(s)
Genotype , Haplotypes , Sequence Analysis, DNA , Algorithms , Chimera , Genetics, Population , Genotyping Techniques , Heterozygote , High-Throughput Nucleotide Sequencing , Humans , Models, Statistical , Polymorphism, Single Nucleotide , ROC Curve , Reproducibility of Results , Sequence Analysis, DNA/methods
17.
Genome Biol ; 15(1): R16, 2014 Jan 21.
Article in English | MEDLINE | ID: mdl-24447569

ABSTRACT

RNA-binding proteins (RBPs) bind to their target RNA molecules by recognizing specific RNA sequences and structural contexts. The development of CLIP-seq and related protocols has made it possible to exhaustively identify RNA fragments that bind to RBPs. However, no efficient bioinformatics method exists to reveal the structural specificities of RBP-RNA interactions using these data. We present CapR, an efficient algorithm that calculates the probability that each RNA base position is located within each secondary structural context. Using CapR, we demonstrate that several RBPs bind to their target RNA molecules under specific structural contexts. CapR is available at https://sites.google.com/site/fukunagatsu/software/capr.


Subject(s)
Algorithms , Computational Biology , RNA-Binding Proteins/metabolism , Sequence Analysis, RNA/methods , Animals , Binding Sites/genetics , Databases, Genetic , Humans , Immunoprecipitation , Mice , Models, Genetic , Nucleic Acid Conformation , Sensitivity and Specificity
18.
BMC Genomics ; 14 Suppl 2: S5, 2013.
Article in English | MEDLINE | ID: mdl-23445519

ABSTRACT

BACKGROUND: Haplotype information is useful for various genetic analyses, including genome-wide association studies. Determining haplotypes experimentally is difficult and there are several computational approaches that infer haplotypes from genomic data. Among such approaches, single individual haplotyping or haplotype assembly, which infers two haplotypes of an individual from aligned sequence fragments, has been attracting considerable attention. To avoid incorrect results in downstream analyses, it is important not only to assemble haplotypes as long as possible but also to provide means to extract highly reliable haplotype regions. Although there are several efficient algorithms for solving haplotype assembly, there is no efficient method that allows for extracting the regions assembled with high confidence. RESULTS: We develop a probabilistic model, called MixSIH, for solving the haplotype assembly problem. The model has two mixture components representing two haplotypes. Based on the optimized model, a quality score is defined, which we call the 'minimum connectivity' (MC) score, for each segment in the haplotype assembly. Because existing accuracy measures for haplotype assembly are designed to compare the efficiency between the algorithms and are not suitable for evaluating the quality of the set of partially assembled haplotype segments, we develop an accuracy measure based on the pairwise consistency and evaluate the accuracy on simulated and real data. By using the MC scores, our algorithm can extract highly accurate haplotype segments. We also show evidence that an existing experimental dataset contains chimeric read fragments derived from different haplotypes, which significantly degrade the quality of assembled haplotypes. CONCLUSIONS: We develop a novel method for solving the haplotype assembly problem. We also define the quality score which is based on our model and indicates the accuracy of the haplotype segments. In our evaluation, MixSIH has successfully extracted reliable haplotype segments. The C++ source code of MixSIH is available at https://sites.google.com/site/hmatsu1226/software/mixsih.
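
The abstract states only that the model has two mixture components, one per haplotype. The sketch below illustrates that general idea under simplifying assumptions that are not from the paper: biallelic sites coded 0/1, the second haplotype taken as the complement of the first, equal mixture weights, and a single hypothetical per-site error rate. It is not MixSIH's actual parameterization or its MC score.

```python
def fragment_posterior(fragment, haplotype, error_rate=0.1):
    """Two-component mixture likelihood of one fragment.

    fragment:   dict mapping site index -> observed allele (0 or 1).
    haplotype:  list of alleles (0 or 1) for haplotype 1; haplotype 2 is
                assumed to be its complement at every heterozygous site.
    error_rate: hypothetical per-site probability of a wrong allele call.
    Returns (posterior that the fragment came from haplotype 1,
             marginal likelihood of the fragment).
    """
    like1 = like2 = 1.0
    for site, allele in fragment.items():
        like1 *= (1 - error_rate) if allele == haplotype[site] else error_rate
        like2 *= (1 - error_rate) if allele == 1 - haplotype[site] else error_rate
    marginal = 0.5 * like1 + 0.5 * like2   # equal mixture weights (assumption)
    return 0.5 * like1 / marginal, marginal
```

With `haplotype = [0, 1, 1, 0]`, a clean fragment such as `{0: 0, 1: 1, 2: 1}` is assigned to haplotype 1 with posterior above 0.99, whereas a fragment that switches haplotypes midway, `{0: 0, 1: 1, 2: 0, 3: 1}`, gets a posterior of 0.5 and a low marginal likelihood. This is one way to see why chimeric fragments degrade assembly quality.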


Subject(s)
Algorithms , Computational Biology/methods , Haplotypes , Models, Statistical , Humans , Sequence Analysis, DNA/methods
19.
Bioinformatics ; 28(8): 1093-101, 2012 Apr 15.
Article in English | MEDLINE | ID: mdl-22373787

ABSTRACT

MOTIVATION: Measuring the effects of base mutations is a powerful tool for functional and evolutionary analyses of RNA structures. To date, only a few methods have been developed for systematically computing the thermodynamic changes of RNA secondary structures in response to base mutations. RESULTS: We have developed algorithms for computing the changes of the ensemble free energy, mean energy and the thermodynamic entropy of RNA secondary structures for exhaustive patterns of single and double mutations. The computational complexities are O(NW^2) (where N is sequence length and W is maximal base pair span) for single mutations and O(N^2W^2) for double mutations with large constant factors. We show that the changes are relatively insensitive to GC composition and the maximal span constraint. The mean free energy changes are bounded ~7-9 kcal/mol and depend only weakly on position if sequence lengths are sufficiently large. For tRNA sequences, the most stabilizing mutations come from the change of the 5'-most base of the anticodon loop. We also show that most of the base changes in the acceptor stem destabilize the structures, indicating that the nucleotide sequence in the acceptor stem is highly optimized for secondary structure stability. We investigate the 22 tRNA genes in the human mitochondrial genome and show that non-pathogenic polymorphisms tend to cause smaller changes in thermodynamic variables than generic mutations, suggesting that a mutation which largely increases thermodynamic variables has a higher possibility of being a pathogenic or lethal mutation. AVAILABILITY AND IMPLEMENTATION: The C++ source code of the Rchange software is available at http://www.ncrna.org/software/rchange/.
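
The three quantities named in the abstract have standard statistical-mechanics definitions, which the abstract itself does not spell out: with partition function Z = sum over structures of exp(-E/RT), the ensemble free energy is G = -RT ln Z, the mean energy is the Boltzmann-weighted average of E, and the entropy is S = (mean energy - G)/T. The sketch below evaluates them by brute force over an explicit list of structure energies. This is an assumption made for illustration; Rchange uses dynamic programming rather than enumeration, and all function names here are hypothetical.

```python
import math

R_KCAL = 1.98720425864083e-3  # gas constant in kcal/(mol*K)

def ensemble_thermodynamics(energies, temperature=310.15):
    """Return (ensemble free energy, mean energy, entropy) for a list of
    structure energies in kcal/mol. The list is a toy stand-in for the full
    secondary-structure ensemble."""
    rt = R_KCAL * temperature
    e_min = min(energies)                       # shift for numerical stability
    weights = [math.exp(-(e - e_min) / rt) for e in energies]
    z_shifted = sum(weights)
    probs = [w / z_shifted for w in weights]    # Boltzmann probabilities
    free_energy = e_min - rt * math.log(z_shifted)
    mean_energy = sum(p * e for p, e in zip(probs, energies))
    entropy = (mean_energy - free_energy) / temperature
    return free_energy, mean_energy, entropy

def mutation_effect(wild_type_energies, mutant_energies, temperature=310.15):
    """Change (mutant minus wild type) in each of the three quantities."""
    wt = ensemble_thermodynamics(wild_type_energies, temperature)
    mt = ensemble_thermodynamics(mutant_energies, temperature)
    return tuple(m - w for m, w in zip(mt, wt))
```

Enumerating structures is exponential in sequence length, so this form is only useful for checking definitions on toy inputs.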


Subject(s)
Algorithms , Nucleic Acid Conformation , RNA/chemistry , Anticodon , Base Composition , Point Mutation , RNA/genetics , RNA, Transfer/chemistry , RNA, Transfer/genetics , Software , Thermodynamics
20.
Bioinformatics ; 27(17): 2346-53, 2011 Sep 01.
Article in English | MEDLINE | ID: mdl-21757463

ABSTRACT

MOTIVATION: Measuring evolutionary conservation is a routine step in the identification of functional elements in genome sequences. Although a number of studies have proposed methods that use the continuous time Markov models (CTMMs) to find evolutionarily constrained elements, their probabilistic structures have been less frequently investigated. RESULTS: In this article, we investigate a sufficient statistic for CTMMs. The statistic is composed of the fractional duration of nucleotide characters over evolutionary time, F_d, and the number of substitutions occurring in phylogenetic trees, N_s. We first derive basic properties of the sufficient statistic. Then, we derive an expectation maximization (EM) algorithm for estimating the parameters of a phylogenetic model, which iteratively computes the expectation values of the sufficient statistic. We show that the EM algorithm exhibits much faster convergence than other optimization methods that use numerical gradient descent algorithms. Finally, we investigate the genome-wide distribution of fractional duration time F_d which, unlike the number of substitutions N_s, has rarely been investigated. We show that F_d has evolutionary information that is distinct from that in N_s, which may be useful for detecting novel types of evolutionary constraints existing in the human genome. AVAILABILITY: The C++ source code of the 'Fdur' software is available at http://www.ncrna.org/software/fdur/ CONTACT: kiryu-h@k.u-tokyo.ac.jp SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
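
For a fully observed continuous-time Markov path, the likelihood depends on the path only through the time spent in each state and the number of each type of substitution, and for an unconstrained rate matrix the maximum-likelihood rate from a to b is the count of a-to-b substitutions divided by the time spent in a. The sketch below computes these quantities for a single hypothetical lineage. In the phylogenetic setting described in the abstract the path is unobserved, so an EM algorithm would use expected values in place of the observed ones; that expectation step is not shown, and the function names are illustrative rather than taken from the Fdur software.

```python
from collections import defaultdict

def path_statistics(path):
    """Summarize one observed substitution history.

    path: list of (state, dwell_time) segments in order along a lineage;
          consecutive segments are assumed to have different states.
    Returns (dwell time per state, substitution counts per (from, to) pair,
             fractional duration per state).
    """
    dwell = defaultdict(float)
    counts = defaultdict(int)
    for i, (state, duration) in enumerate(path):
        dwell[state] += duration
        if i + 1 < len(path):
            counts[(state, path[i + 1][0])] += 1   # one substitution event
    total = sum(dwell.values())
    fractional = {s: t / total for s, t in dwell.items()}
    return dict(dwell), dict(counts), fractional

def rate_estimates(dwell, counts):
    """Maximum-likelihood rates for an unconstrained rate matrix:
    substitutions a->b divided by total time spent in a."""
    return {(a, b): n / dwell[a] for (a, b), n in counts.items()}
```

For example, the path `[("A", 0.5), ("G", 1.0), ("A", 0.5)]` spends half of its time in each of A and G and contains one A-to-G and one G-to-A substitution, giving an estimated rate of 1.0 in each direction.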


Subject(s)
Algorithms , Phylogeny , Evolution, Molecular , Genome, Human , Genomics/methods , Humans , Markov Chains