Search | VHL Regional Portal

1.

Correction: Reproducible big data science: A case study in continuous FAIRness.

Madduri, Ravi; Chard, Kyle; D'Arcy, Mike; Jung, Segun C; Rodriguez, Alexis; Sulakhe, Dinanath; Deutsch, Eric; Funk, Cory; Heavner, Ben; Richards, Matthew; Shannon, Paul; Glusman, Gustavo; Price, Nathan; Kesselman, Carl; Foster, Ian.

PLoS One ; 18(11): e0294883, 2023.

Article in English | MEDLINE | ID: mdl-37988378

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pone.0213013.].

2.

Integration of genomics and transcriptomics predicts diabetic retinopathy susceptibility genes.

Skol, Andrew D; Jung, Segun C; Sokovic, Ana Marija; Chen, Siquan; Fazal, Sarah; Sosina, Olukayode; Borkar, Poulami P; Lin, Amy; Sverdlov, Maria; Cao, Dingcai; Swaroop, Anand; Bebu, Ionut; Stranger, Barbara E; Grassi, Michael A.

Elife ; 9, 2020 11 09.

Article in English | MEDLINE | ID: mdl-33164750

ABSTRACT

We determined differential gene expression in response to high glucose in lymphoblastoid cell lines derived from matched individuals with type 1 diabetes with and without retinopathy. Those genes exhibiting the largest difference in glucose response were assessed for association with diabetic retinopathy in a genome-wide association study meta-analysis. Expression quantitative trait loci (eQTLs) of the glucose response genes were tested for association with diabetic retinopathy. We detected an enrichment of the eQTLs from the glucose response genes among small association p-values and identified folliculin (FLCN) as a susceptibility gene for diabetic retinopathy. Expression of FLCN in response to glucose was greater in individuals with diabetic retinopathy. Independent cohorts of individuals with diabetes revealed an association of FLCN eQTLs with diabetic retinopathy. Mendelian randomization confirmed a direct positive effect of increased FLCN expression on retinopathy. Integrating genetic association with gene expression implicated FLCN as a disease gene for diabetic retinopathy.

One of the side effects of diabetes is loss of vision from diabetic retinopathy, which is caused by injury to the light sensing tissue in the eye, the retina. Almost all individuals with diabetes develop diabetic retinopathy to some extent, and it is the leading cause of irreversible vision loss in working-age adults in the United States. How long a person has been living with diabetes, the extent of increased blood sugars and genetics all contribute to the risk and severity of diabetic retinopathy. Unfortunately, virtually no genes associated with diabetic retinopathy have yet been identified. When a gene is activated, it produces messenger molecules known as mRNA that are used by cells as instructions to produce proteins. The analysis of mRNA molecules, as well as genes themselves, can reveal the role of certain genes in disease. The studies of all genes and their associated mRNAs are respectively called genomics and transcriptomics. Genomics reveals what genes are present, while transcriptomics shows how active genes are in different cells. Skol et al. developed methods to study genomics and transcriptomics together to help discover genes that cause diabetic retinopathy. Genes involved in how cells respond to high blood sugar were first identified using cells grown in the lab. By comparing the activity of these genes in people with and without retinopathy the study identified genes associated with an increased risk of retinopathy in diabetes. In people with retinopathy, the activity of the folliculin gene (FLCN) increased more in response to high blood sugar. This was further verified with independent groups of people and using computer models to estimate the effect of different versions of the folliculin gene. The methods used here could be applied to understand complex genetics in other diseases. The results provide new understanding of the effects of diabetes. 
They may also help in the development of new treatments for diabetic retinopathy, which are likely to improve on the current approach of using laser surgery or injections into the eye.

Subject(s)

Diabetes Mellitus, Type 1/genetics , Diabetic Retinopathy/genetics , Gene Expression Profiling , Glucose/toxicity , Lymphocytes/drug effects , Polymorphism, Single Nucleotide , Proto-Oncogene Proteins/genetics , Transcriptome , Tumor Suppressor Proteins/genetics , Adult , Case-Control Studies , Cell Line, Transformed , Diabetes Mellitus, Type 1/complications , Diabetes Mellitus, Type 1/diagnosis , Diabetes Mellitus, Type 1/metabolism , Diabetic Retinopathy/diagnosis , Diabetic Retinopathy/metabolism , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Lymphocytes/metabolism , Male , Mendelian Randomization Analysis , Proto-Oncogene Proteins/metabolism , Quantitative Trait Loci , Tumor Suppressor Proteins/metabolism , Young Adult

3.

Atlas of Transcription Factor Binding Sites from ENCODE DNase Hypersensitivity Data across 27 Tissue Types.

Funk, Cory C; Casella, Alex M; Jung, Segun; Richards, Matthew A; Rodriguez, Alex; Shannon, Paul; Donovan-Maiye, Rory; Heavner, Ben; Chard, Kyle; Xiao, Yukai; Glusman, Gustavo; Ertekin-Taner, Nilufer; Golde, Todd E; Toga, Arthur; Hood, Leroy; Van Horn, John D; Kesselman, Carl; Foster, Ian; Madduri, Ravi; Price, Nathan D; Ament, Seth A.

Cell Rep ; 32(7): 108029, 2020 08 18.

Article in English | MEDLINE | ID: mdl-32814038

ABSTRACT

Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.
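
The validation described in this abstract, checking whether predicted footprints overlap ChIP-seq binding sites, can be illustrated with a minimal interval-overlap sketch. The function name, the half-open (start, end) coordinate convention, the assumption of non-overlapping peaks, and the toy intervals are all hypothetical choices for illustration; this is not the authors' workflow.

```python
from bisect import bisect_right


def fraction_overlapping(footprints, peaks):
    """Return the fraction of footprints that overlap at least one peak.

    Both inputs are lists of half-open (start, end) intervals on a single
    chromosome. Peaks are assumed (for this sketch) not to overlap each other.
    """
    if not footprints:
        return 0.0
    peaks = sorted(peaks)
    starts = [start for start, _ in peaks]
    hits = 0
    for fp_start, fp_end in footprints:
        # Last peak that starts before the footprint ends.
        i = bisect_right(starts, fp_end - 1) - 1
        # It overlaps only if it also ends after the footprint starts.
        if i >= 0 and peaks[i][1] > fp_start:
            hits += 1
    return hits / len(footprints)


# Toy data: two of the three footprints fall inside a peak.
toy_footprints = [(10, 20), (50, 60), (100, 110)]
toy_peaks = [(15, 30), (105, 200)]
overlap_fraction = fraction_overlapping(toy_footprints, toy_peaks)
```

Sorting the peaks once and using binary search keeps each lookup logarithmic, which matters when the footprint list is genome-scale.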

Subject(s)

Binding Sites/genetics , Deoxyribonucleases/metabolism , Genomics/methods , Transcription Factors/metabolism , Humans

4.

Genetic deletion of Sphk2 confers protection against Pseudomonas aeruginosa mediated differential expression of genes related to virulent infection and inflammation in mouse lung.

Ebenezer, David L; Fu, Panfeng; Krishnan, Yashaswin; Maienschein-Cline, Mark; Hu, Hong; Jung, Segun; Madduri, Ravi; Arbieva, Zarema; Harijith, Anantha; Natarajan, Viswanathan.

BMC Genomics ; 20(1): 984, 2019 Dec 16.

Article in English | MEDLINE | ID: mdl-31842752

ABSTRACT

BACKGROUND: Pseudomonas aeruginosa (PA) is an opportunistic Gram-negative bacterium that causes serious life-threatening and nosocomial infections including pneumonia. PA has the ability to alter the host genome to facilitate its invasion, thus increasing the virulence of the organism. Sphingosine-1-phosphate (S1P), a bioactive lipid, is known to play a key role in facilitating infection. Sphingosine kinases (SPHK) 1&2 phosphorylate sphingosine to generate S1P in mammalian cells. We reported earlier that Sphk2-/- mice offered significant protection against lung inflammation, compared to wild type (WT) animals. Therefore, we profiled the differential expression of genes between the protected group of Sphk2-/- and the wild type controls to better understand the underlying protective mechanisms related to the Sphk2 deletion in lung inflammatory injury. Whole transcriptome shotgun sequencing (RNA-Seq) was performed on mouse lung tissue using the NextSeq 500 sequencing system. RESULTS: Two-way analysis of variance (ANOVA) was performed and differentially expressed genes following PA infection were identified using whole transcriptome of Sphk2-/- mice and their WT counterparts. Pathway (PW) enrichment analyses of the RNA-Seq data identified several signaling pathways that are likely to play a crucial role in pneumonia caused by PA such as those involved in: 1. Immune response to PA infection and NF-κB signal transduction; 2. PKC signal transduction; 3. Impact on epigenetic regulation; 4. Epithelial sodium channel pathway; 5. Mucin expression; and 6. Bacterial infection related pathways. Our genomic data suggest a potential role for SPHK2 in PA-induced pneumonia through elevated expression of inflammatory genes in lung tissue. Further, validation by RT-PCR on 10 differentially expressed genes showed 100% concordance in terms of directional changes as well as significant fold change.
CONCLUSION: Using Sphk2-/- mice and differential gene expression analysis, we have shown here that S1P/SPHK2 signaling could play a key role in promoting PA pneumonia. The identified genes promote inflammation and suppress others that naturally inhibit inflammation and host defense. Thus, targeting SPHK2/S1P signaling in PA-induced lung inflammation could serve as a potential therapy to combat PA-induced pneumonia.

Subject(s)

Gene Deletion , Gene Expression Profiling/methods , Gene Regulatory Networks , Phosphotransferases (Alcohol Group Acceptor)/genetics , Pseudomonas Infections/prevention & control , Pseudomonas aeruginosa/pathogenicity , Analysis of Variance , Animals , Disease Models, Animal , Female , Gene Expression Regulation , High-Throughput Nucleotide Sequencing , Lung/immunology , Lung/microbiology , Mice , Pseudomonas Infections/genetics , Pseudomonas Infections/immunology , RNA-Seq , Virulence

5.

Reproducible big data science: A case study in continuous FAIRness.

Madduri, Ravi; Chard, Kyle; D'Arcy, Mike; Jung, Segun C; Rodriguez, Alexis; Sulakhe, Dinanath; Deutsch, Eric; Funk, Cory; Heavner, Ben; Richards, Matthew; Shannon, Paul; Glusman, Gustavo; Price, Nathan; Kesselman, Carl; Foster, Ian.

PLoS One ; 14(4): e0213013, 2019.

Article in English | MEDLINE | ID: mdl-30973881

ABSTRACT

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.

Subject(s)

Big Data , Data Science/statistics & numerical data , Databases, Factual/statistics & numerical data , Algorithms , Humans , Information Dissemination , Longitudinal Studies , Software

6.

A novel MERTK mutation causing retinitis pigmentosa.

Al-Khersan, Hasenin; Shah, Kaanan P; Jung, Segun C; Rodriguez, Alex; Madduri, Ravi K; Grassi, Michael A.

Graefes Arch Clin Exp Ophthalmol ; 255(8): 1613-1619, 2017 Aug.

Article in English | MEDLINE | ID: mdl-28462455

ABSTRACT

PURPOSE: Retinitis pigmentosa (RP) is a genetically heterogeneous inherited retinal dystrophy. To date, over 80 genes have been implicated in RP. However, the disease demonstrates significant locus and allelic heterogeneity not entirely captured by current testing platforms. The purpose of the present study was to characterize the underlying mutation in a patient with RP without a molecular diagnosis after initial genetic testing. METHODS: Whole-exome sequencing of the affected proband was performed. Candidate gene mutations were selected based on adherence to expected genetic inheritance pattern and predicted pathogenicity. Sanger sequencing of MERTK was completed on the patient's unaffected mother, affected brother, and unaffected sister to determine genetic phase. RESULTS: Eight sequence variants were identified in the proband in known RP-associated genes. Sequence analysis revealed that the proband was a compound heterozygote with two independent mutations in MERTK, a novel nonsense mutation (c.2179C > T) and a previously reported missense variant (c.2530C > T). The proband's affected brother also had both mutations. Predicted phase was confirmed in unaffected family members. CONCLUSION: Our study identifies a novel nonsense mutation in MERTK in a family with RP and no prior molecular diagnosis. The present study also demonstrates the clinical value of exome sequencing in determining the genetic basis of Mendelian diseases when standard genetic testing is unsuccessful.

Subject(s)

DNA/genetics , Mutation , Retinitis Pigmentosa/genetics , c-Mer Tyrosine Kinase/genetics , DNA Mutational Analysis , Exome , Female , Humans , Male , Ophthalmoscopy , Pedigree , Retina/pathology , Retinitis Pigmentosa/diagnosis , Retinitis Pigmentosa/metabolism , c-Mer Tyrosine Kinase/metabolism

7.

Development of Bioinformatics Infrastructure for Genomics Research.

Mulder, Nicola J; Adebiyi, Ezekiel; Adebiyi, Marion; Adeyemi, Seun; Ahmed, Azza; Ahmed, Rehab; Akanle, Bola; Alibi, Mohamed; Armstrong, Don L; Aron, Shaun; Ashano, Efejiro; Baichoo, Shakuntala; Benkahla, Alia; Brown, David K; Chimusa, Emile R; Fadlelmola, Faisal M; Falola, Dare; Fatumo, Segun; Ghedira, Kais; Ghouila, Amel; Hazelhurst, Scott; Isewon, Itunuoluwa; Jung, Segun; Kassim, Samar Kamal; Kayondo, Jonathan K; Mbiyavanga, Mamana; Meintjes, Ayton; Mohammed, Somia; Mosaku, Abayomi; Moussa, Ahmed; Muhammd, Mustafa; Mungloo-Dilmohamud, Zahra; Nashiru, Oyekanmi; Odia, Trust; Okafor, Adaobi; Oladipo, Olaleye; Osamor, Victor; Oyelade, Jellili; Sadki, Khalid; Salifu, Samson Pandam; Soyemi, Jumoke; Panji, Sumir; Radouani, Fouzia; Souiai, Oussama; Tastan Bishop, Özlem.

Glob Heart ; 12(2): 91-98, 2017 06.

Article in English | MEDLINE | ID: mdl-28302555

ABSTRACT

BACKGROUND: Although pockets of bioinformatics excellence have developed in Africa, generally, large-scale genomic data analysis has been limited by the availability of expertise and infrastructure. H3ABioNet, a pan-African bioinformatics network, was established to build capacity specifically to enable H3Africa (Human Heredity and Health in Africa) researchers to analyze their data in Africa. Since the inception of the H3Africa initiative, H3ABioNet's role has evolved in response to changing needs from the consortium and the African bioinformatics community. OBJECTIVES: H3ABioNet set out to develop core bioinformatics infrastructure and capacity for genomics research in various aspects of data collection, transfer, storage, and analysis. METHODS AND RESULTS: Various resources have been developed to address genomic data management and analysis needs of H3Africa researchers and other scientific communities on the continent. NetMap was developed and used to build an accurate picture of network performance within Africa and between Africa and the rest of the world, and Globus Online has been rolled out to facilitate data transfer. A participant recruitment database was developed to monitor participant enrollment, and data is being harmonized through the use of ontologies and controlled vocabularies. The standardized metadata will be integrated to provide a search facility for H3Africa data and biospecimens. Because H3Africa projects are generating large-scale genomic data, facilities for analysis and interpretation are critical. H3ABioNet is implementing several data analysis platforms that provide a large range of bioinformatics tools or workflows, such as Galaxy, the Job Management System, and eBiokits. A set of reproducible, portable, and cloud-scalable pipelines to support the multiple H3Africa data types are also being developed and dockerized to enable execution on multiple computing infrastructures. 
In addition, new tools have been developed for analysis of the uniquely divergent African data and for downstream interpretation of prioritized variants. To provide support for these and other bioinformatics queries, an online bioinformatics helpdesk backed by broad consortium expertise has been established. Further support is provided by means of various modes of bioinformatics training. CONCLUSIONS: For the past 4 years, the development of infrastructure support and human capacity through H3ABioNet has significantly contributed to the establishment of African scientific networks, data analysis facilities, and training programs. Here, we describe the infrastructure and how it has affected genomics and bioinformatics research in Africa.

Subject(s)

Biomedical Research/methods , Computational Biology/trends , Genomics/methods , Africa , Humans

8.

Identification of Genetic and Epigenetic Variants Associated with Breast Cancer Prognosis by Integrative Bioinformatics Analysis.

Shilpi, Arunima; Bi, Yingtao; Jung, Segun; Patra, Samir K; Davuluri, Ramana V.

Cancer Inform ; 16: 1-13, 2017.

Article in English | MEDLINE | ID: mdl-28096648

ABSTRACT

INTRODUCTION: Breast cancer is a multifaceted disease with a wide spectrum of histological and molecular variability in tumors. However, the identification of these variations is complicated by the interplay between inherited genetic and epigenetic aberrations. Therefore, this study examines the relationship between DNA methylation and single-nucleotide polymorphisms (SNPs) in relation to the identification of prognostic markers in breast cancer. The effect of these SNPs on methylation is defined as methylation quantitative trait loci (meQTL). MATERIALS AND METHODS: We developed a novel method to identify prognostic gene signatures for breast cancer by integrating genomic and epigenomic data. This is based on the hypothesis that multiple sources of evidence pointing to the same gene or pathway are likely to lead to reduced false positives. We also apply random resampling to reduce overfitting noise by dividing samples into training and testing data sets. Specifically, the common samples between Illumina 450 DNA methylation, Affymetrix SNP array, and clinical data sets obtained from the Cancer Genome Atlas (TCGA) for breast invasive carcinoma (BRCA) were randomly divided into training and test sets. An intensive statistical analysis based on the log-rank test and Cox proportional hazard model established a significant association between differential methylation and the stratification of breast cancer patients into high- and low-risk groups. RESULTS: The comprehensive assessment based on the conjoint effect of CpG-SNP pairs guided the stratification of breast cancer patients into high- and low-risk groups. In particular, the most significant association was found for the cg05370838-rs2230576, cg00956490-rs940453, and cg11340537-rs2640785 CpG-SNP pairs. These CpG-SNP pairs were strongly associated with differential expression of ADAM8, CREB5, and EXPH5 genes, respectively. Besides, the exclusive effect of SNPs such as rs10101376, rs140679, and rs1538146 also holds significant prognostic value. CONCLUSIONS: Thus, the analysis based on DNA methylation and SNPs has resulted in the identification of novel susceptibility loci that hold prognostic relevance in breast cancer.

9.

Identification and validation of regulatory SNPs that modulate transcription factor chromatin binding and gene expression in prostate cancer.

Jin, Hong-Jian; Jung, Segun; DebRoy, Auditi R; Davuluri, Ramana V.

Oncotarget ; 7(34): 54616-54626, 2016 08 23.

Article in English | MEDLINE | ID: mdl-27409348

ABSTRACT

Prostate cancer (PCa) is the second most common solid tumor for cancer-related deaths in American men. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with the increased risk of PCa. Because most of the susceptibility SNPs are located in noncoding regions, little is known about their functional mechanisms. We hypothesize that functional SNPs reside in cell type-specific regulatory elements that mediate the binding of critical transcription factors (TFs), which in turn result in changes in target gene expression. Using PCa-specific functional genomics data, here we identify 38 regulatory candidate SNPs and their target genes in PCa. Through risk analysis by incorporating gene expression and clinical data, we identify 6 target genes (ZG16B, ANKRD5, RERE, FAM96B, NAALADL2 and GTPBP10) as significant predictors of PCa biochemical recurrence. In addition, 5 SNPs (rs2659051, rs10936845, rs9925556, rs6057110 and rs2742624) are selected for experimental validation using chromatin immunoprecipitation (ChIP) and dual-luciferase reporter assays in LNCaP cells, showing allele-specific enhancer activity. Furthermore, we delete the rs2742624-containing region using CRISPR/Cas9 genome editing and observe the drastic downregulation of its target gene UPK3A. Taken together, our results illustrate that this new methodology can be applied to identify regulatory SNPs and their target genes that likely impact PCa risk. We suggest that similar studies can be performed to characterize regulatory variants in other diseases.

Subject(s)

Chromatin/metabolism , Gene Expression Regulation, Neoplastic , Polymorphism, Single Nucleotide , Prostatic Neoplasms/genetics , Transcription Factors/metabolism , Alleles , Base Sequence , Cell Line, Tumor , Chromatin/genetics , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Humans , Kaplan-Meier Estimate , Male , Prostatic Neoplasms/metabolism , Prostatic Neoplasms/pathology , Protein Binding

10.

Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping.

Jung, Segun; Bi, Yingtao; Davuluri, Ramana V.

BMC Genomics ; 16 Suppl 11: S3, 2015.

Article in English | MEDLINE | ID: mdl-26576613

ABSTRACT

BACKGROUND: Many supervised learning algorithms have been applied in deriving gene signatures for patient stratification from gene expression data. However, transferring the multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods (Equal-width binning, Equal-frequency binning, and k-means clustering) in accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on the isoform-level gene expression profiles from exon-array platform and tested on the corresponding profiles from RNA-seq data. RESULTS: We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, the addition of a data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, the addition of a data discretization step dramatically improved the classification accuracies, with Equal-frequency binning showing the highest improvement and more than 90% accuracy for all the models with features chosen by Random Forest based feature selection. Overall, the SVM classifier coupled with Equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, however, only 73.6% accuracy was achieved at most. CONCLUSIONS: The classification algorithms, trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, when dealing with cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with highest classification accuracies on data transformed by Equal-frequency binning. The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms.
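
Two of the discretization methods named in this abstract, Equal-width and Equal-frequency binning, can be contrasted with a short sketch. The function names, the rank-based handling of ties, and the toy values are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts.

    Bins are assigned by rank, so tied values may be split across bins.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


# Toy expression values with two large outliers.
toy_values = [1, 2, 3, 4, 100, 200]
width_labels = equal_width_bins(toy_values, 3)      # [0, 0, 0, 0, 1, 2]
freq_labels = equal_frequency_bins(toy_values, 3)   # [0, 0, 1, 1, 2, 2]
```

On the toy data, Equal-width binning crowds the four small values into one bin, while Equal-frequency binning depends only on rank. That rank dependence is one plausible reason it could transfer better between platforms with different measurement scales, though the abstract does not state a mechanism.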

Subject(s)

Computational Biology/methods , Gene Expression Profiling/methods , Glioblastoma/classification , Glioblastoma/genetics , Machine Learning , RNA Isoforms/genetics , Algorithms , Cluster Analysis , Humans

11.

Graph-based sampling for approximating global helical topologies of RNA.

Kim, Namhee; Laing, Christian; Elmetwaly, Shereef; Jung, Segun; Curuksu, Jeremy; Schlick, Tamar.

Proc Natl Acad Sci U S A ; 111(11): 4079-84, 2014 Mar 18.

Article in English | MEDLINE | ID: mdl-24591615

ABSTRACT

A current challenge in RNA structure prediction is the description of global helical arrangements compatible with a given secondary structure. Here we address this problem by developing a hierarchical graph sampling/data mining approach to reduce conformational space and accelerate global sampling of candidate topologies. Starting from a 2D structure, we construct an initial graph from size measures deduced from solved RNAs and junction topologies predicted by our data-mining algorithm RNAJAG trained on known RNAs. We sample these graphs in 3D space guided by knowledge-based statistical potentials derived from bending and torsion measures of internal loops as well as radii of gyration for known RNAs. Graph sampling results for 30 representative RNAs are analyzed and compared with reference graphs from both solved structures and predicted structures by available programs. This comparison indicates promise for our graph-based sampling approach for characterizing global helical arrangements in large RNAs: graph rmsds range from 2.52 to 28.24 Å for RNAs of size 25-158 nucleotides, and more than half of our graph predictions improve upon other programs. The efficiency in graph sampling, however, implies an additional step of translating candidate graphs into atomic models. Such models can be built with the same idea of graph partitioning and build-up procedures we used for RNA design.

Subject(s)

Computational Biology/methods , Models, Molecular , Nucleic Acid Conformation , RNA Folding/genetics , RNA/chemistry , Algorithms , Data Mining

12.

Interconversion between parallel and antiparallel conformations of a 4H RNA junction in domain 3 of foot-and-mouth disease virus IRES captured by dynamics simulations.

Jung, Segun; Schlick, Tamar.

Biophys J ; 106(2): 447-58, 2014 Jan 21.

Article in English | MEDLINE | ID: mdl-24461020

ABSTRACT

RNA junctions are common secondary structural elements present in a wide range of RNA species. They play crucial roles in directing the overall folding of RNA molecules as well as in a variety of biological functions. In particular, there has been great interest in the dynamics of RNA junctions, including conformational pathways of fully base-paired 4-way (4H) RNA junctions. In such constructs, all nucleotides participate in one of the four double-stranded stem regions, with no connecting loops. Dynamical aspects of these 4H RNAs are interesting because frequent interchanges between parallel and antiparallel conformations are thought to occur without binding of other factors. Gel electrophoresis and single-molecule fluorescence resonance energy transfer experiments have suggested two possible pathways: one involves a helical rearrangement via disruption of coaxial stacking, and the other occurs by a rotation between the helical axes of coaxially stacked conformers. Employing molecular dynamics simulations, we explore this conformational variability in a 4H junction derived from domain 3 of the foot-and-mouth disease virus internal ribosome entry site (IRES); this junction contains highly conserved motifs for RNA-RNA and RNA-protein interactions, important for IRES activity. Our simulations capture transitions of the 4H junction between parallel and antiparallel conformations. The interconversion is virtually barrier-free and occurs via a rotation between the axes of coaxially stacked helices with a transient perpendicular intermediate. We characterize this transition, with various interhelical orientations, by pseudodihedral angle and interhelical distance measures. The high flexibility of the junction, as also demonstrated experimentally, is suitable for IRES activity. 
Because foot-and-mouth disease virus IRES structure depends on long-range interactions involving domain 3, the perpendicular intermediate, which maintains coaxial stacking of helices and thereby consensus primary and secondary structure information, may be beneficial for guiding the overall organization of the RNA system in domain 3.
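
The pseudodihedral angle used in this abstract to characterize interhelical orientation can be illustrated with a generic four-point dihedral calculation. How the four points are chosen on the helices is not stated in the abstract, so the sketch below assumes two points per helical axis and uses toy coordinates; the helper names are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p1, p2, p3, p4):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal to the plane through p1, p2, p3
    n2 = _cross(b2, b3)  # normal to the plane through p2, p3, p4
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Assumed setup: p1 -> p2 lies along one helical axis, p3 -> p4 along the other.
parallel = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
perpendicular = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
antiparallel = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

With these toy coordinates the three cases give angles of 0, 90, and 180 degrees in magnitude, matching the parallel, perpendicular, and antiparallel arrangements the abstract describes.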

Subject(s)

Foot-and-Mouth Disease Virus , Molecular Dynamics Simulation , Nucleic Acid Conformation , RNA, Viral/chemistry , Base Sequence , Peptide Chain Initiation, Translational , Principal Component Analysis , RNA, Viral/genetics

13.

Predicting helical topologies in RNA junctions as tree graphs.

Laing, Christian; Jung, Segun; Kim, Namhee; Elmetwaly, Shereef; Zahran, Mai; Schlick, Tamar.

PLoS One ; 8(8): e71947, 2013.

Article in English | MEDLINE | ID: mdl-23991010

ABSTRACT

RNA molecules are important cellular components involved in many fundamental biological processes. Understanding the mechanisms behind their functions requires knowledge of their tertiary structures. Though computational RNA folding approaches exist, they often require manual manipulation and expert intuition; predicting global long-range tertiary contacts remains challenging. Here we develop a computational approach and associated program module (RNAJAG) to predict helical arrangements/topologies in RNA junctions. Our method has two components: junction topology prediction and graph modeling. First, junction topologies are determined by a data mining approach from a given secondary structure of the target RNAs; second, the predicted topology is used to construct a tree graph consistent with geometric preferences analyzed from solved RNAs. The predicted graphs, which model the helical arrangements of RNA junctions for a large set of 200 junctions using a cross validation procedure, yield fairly good representations compared to the helical configurations in native RNAs, and can be further used to develop all-atom models as we show for two examples. Because junctions are among the most complex structural elements in RNA, this work advances folding structure prediction methods of large RNAs. The RNAJAG module is available to academic users upon request.

Subject(s)

Models, Molecular , Nucleic Acid Conformation , RNA Folding , RNA/chemistry , Base Sequence , Computational Biology/methods , Molecular Sequence Data , RNA/genetics , Reproducibility of Results

14.

Candidate RNA structures for domain 3 of the foot-and-mouth-disease virus internal ribosome entry site.

Jung, Segun; Schlick, Tamar.

Nucleic Acids Res ; 41(3): 1483-95, 2013 Feb 01.

Article in English | MEDLINE | ID: mdl-23275533

ABSTRACT

The foot-and-mouth-disease virus (FMDV) utilizes non-canonical translation initiation for viral protein synthesis, by forming a specific RNA structure called internal ribosome entry site (IRES). Domain 3 in FMDV IRES is phylogenetically conserved and highly structured; it contains four-way junctions where intramolecular RNA-RNA interactions serve as a scaffold for the RNA to fold for efficient IRES activity. Although the 3D structure of domain 3 is crucial to exploring and deciphering the initiation mechanism of translation, little is known. Here, we employ a combination of various modeling approaches to propose candidate tertiary structures for the apical region of domain 3, thought to be crucial for IRES function. We begin by modeling junction topology candidates and build atomic 3D models consistent with available experimental data. We then investigate each of the four candidate 3D structures by molecular dynamics simulations to determine the most energetically favorable configurations and to analyze specific tertiary interactions. Only one model emerges as viable containing not only the specific binding site for the GNRA tetraloop but also helical arrangements which enhance the stability of domain 3. These collective findings, together with available experimental data, suggest a plausible theoretical tertiary structure of the apical region in FMDV IRES domain 3.

Subject(s)

Foot-and-Mouth Disease Virus/genetics , RNA, Viral/chemistry , Untranslated Regions , Base Sequence , Conserved Sequence , Models, Molecular , Molecular Dynamics Simulation , Nucleic Acid Conformation , Peptide Chain Initiation, Translational

15.

Biomolecular modeling and simulation: a field coming of age.

Schlick, Tamar; Collepardo-Guevara, Rosana; Halvorsen, Leif Arthur; Jung, Segun; Xiao, Xia.

Q Rev Biophys ; 44(2): 191-228, 2011 May.

Article in English | MEDLINE | ID: mdl-21226976

ABSTRACT

We assess the progress in biomolecular modeling and simulation, focusing on structure prediction and dynamics, by presenting the field's history, metrics for its rise in popularity, early expressed expectations, and current significant applications. The increases in computational power combined with improvements in algorithms and force fields have led to considerable success, especially in protein folding, specificity of ligand/biomolecule interactions, and interpretation of complex experimental phenomena (e.g. NMR relaxation, protein-folding kinetics and multiple conformational states) through the generation of structural hypotheses and pathway mechanisms. Although far from a general automated tool, structure prediction is notable for proteins and RNA that preceded the experiment, especially by knowledge-based approaches. Thus, despite early unrealistic expectations and the realization that computer technology alone will not quickly bridge the gap between experimental and theoretical time frames, ongoing improvements to enhance the accuracy and scope of modeling and simulation are propelling the field onto a productive trajectory to become a full partner with experiment and a field in its own right.

Subject(s)

Models, Molecular , Molecular Biology/methods , Molecular Dynamics Simulation , Proteins/chemistry , RNA/chemistry , Algorithms , Humans , Molecular Biology/trends , Molecular Dynamics Simulation/trends

16.

Tertiary motifs revealed in analyses of higher-order RNA junctions.

Laing, Christian; Jung, Segun; Iqbal, Abdul; Schlick, Tamar.

J Mol Biol ; 393(1): 67-82, 2009 Oct 16.

Article in English | MEDLINE | ID: mdl-19660472

ABSTRACT

RNA junctions are secondary-structure elements formed when three or more helices come together. They are present in diverse RNA molecules with various fundamental functions in the cell. To better understand the intricate architecture of three-dimensional (3D) RNAs, we analyze currently solved 3D RNA junctions in terms of base-pair interactions and 3D configurations. First, we study base-pair interaction diagrams for solved RNA junctions with 5 to 10 helices and discuss common features. Second, we compare these higher-order junctions to those containing 3 or 4 helices and identify global motif patterns such as coaxial stacking and parallel and perpendicular helical configurations. These analyses show that higher-order junctions organize their helical components in parallel and helical configurations similar to lower-order junctions. Their sub-junctions also resemble local helical configurations found in three- and four-way junctions and are stabilized by similar long-range interaction preferences such as A-minor interactions. Furthermore, loop regions within junctions are high in adenine but low in cytosine, and in agreement with previous studies, we suggest that coaxial stacking between helices likely forms when the common single-stranded loop is small in size; however, other factors such as stacking interactions involving noncanonical base pairs and proteins can greatly determine or disrupt coaxial stacking. Finally, we introduce the ribo-base interactions: when combined with the along-groove packing motif, these ribo-base interactions form novel motifs involved in perpendicular helix-helix interactions. Overall, these analyses suggest recurrent tertiary motifs that stabilize junction architecture, pack helices, and help form helical configurations that occur as sub-elements of larger junction networks. The frequent occurrence of similar helical motifs suggests nature's finite and perhaps limited repertoire of RNA helical conformation preferences. More generally, studies of RNA junctions and tertiary building blocks can ultimately help in the difficult task of RNA 3D structure prediction.

Subject(s)

Nucleic Acid Conformation , RNA/chemistry , Base Pairing , Models, Chemical , Models, Molecular

17.

Learning from positive examples when the negative class is undetermined--microRNA gene identification.

Yousef, Malik; Jung, Segun; Showe, Louise C; Showe, Michael K.

Algorithms Mol Biol ; 3: 2, 2008 Jan 28.

Article in English | MEDLINE | ID: mdl-18226233

ABSTRACT

BACKGROUND: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designation of the negative class can be problematic and if it is not properly done can affect the performance of the classifier dramatically and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species. RESULTS: Of all methods tested, we found that two-class naïve Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One-class methods showed average accuracies of 70-80% versus 90% for the two two-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs. CONCLUSION: One-class and two-class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one-class methods is that they eliminate guessing at the optimal features for the negative class when they are not well defined. In these cases one-class methods can be superior to two-class methods when the features which are chosen as representative of that positive class are well defined.

18.

Naïve Bayes for microRNA target predictions--machine learning for microRNA targets.

Yousef, Malik; Jung, Segun; Kossenkov, Andrew V; Showe, Louise C; Showe, Michael K.

Bioinformatics ; 23(22): 2987-92, 2007 Nov 15.

Article in English | MEDLINE | ID: mdl-17925304

ABSTRACT

MOTIVATION: Most computational methodologies for miRNA:mRNA target gene prediction use the seed segment of the miRNA and require cross-species sequence conservation in this region of the mRNA target. Methods that do not rely on conservation generate numbers of predictions, which are too large to validate. We describe a target prediction method (NBmiRTar) that does not require sequence conservation, using instead, machine learning by a naïve Bayes classifier. It generates a model from sequence and miRNA:mRNA duplex information from validated targets and artificially generated negative examples. Both the 'seed' and 'out-seed' segments of the miRNA:mRNA duplex are used for target identification. RESULTS: The application of machine-learning techniques to the features we have used is a useful and general approach for microRNA target gene prediction. Our technique produces fewer false positive predictions and fewer target candidates to be tested. It exhibits higher sensitivity and specificity than algorithms that rely on conserved genomic regions to decrease false positive predictions.

Subject(s)

Artificial Intelligence , Gene Targeting/methods , MicroRNAs/genetics , Pattern Recognition, Automated/methods , RNA Probes/genetics , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Algorithms , Base Sequence , Bayes Theorem , Molecular Sequence Data

19.

Recursive cluster elimination (RCE) for classification and feature selection from gene expression data.

Yousef, Malik; Jung, Segun; Showe, Louise C; Showe, Michael K.

BMC Bioinformatics ; 8: 144, 2007 May 02.

Article in English | MEDLINE | ID: mdl-17474999

ABSTRACT

BACKGROUND: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. The selection of those genes that are important for distinguishing the different sample classes being compared poses a challenging problem in high dimensional data analysis. We describe a new procedure for selecting significant genes as recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE. RESULTS: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Utilization of gene clusters, rather than individual genes, enhances the supervised classification accuracy of the same data as compared to the accuracy when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) are used to remove genes based on their individual discriminant weights. CONCLUSION: SVM-RCE provides improved classification accuracy with complex microarray data sets when it is compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. SVM-RCE identifies clusters of correlated genes that when considered together provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
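
The cluster-then-eliminate loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: to stay dependency-free it assumes Euclidean K-means with a deterministic farthest-point start (the abstract mentions a correlation metric), and it swaps the SVM for a leave-one-out nearest-centroid classifier as the cluster scorer. The function names `kmeans`, `loo_accuracy`, and `rce`, and the toy expression matrix, are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Lloyd's K-means with farthest-point initialization (an assumption made
    here so the result is deterministic). Returns one cluster label per point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def loo_accuracy(expr, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted to
    `genes`. Stand-in for the SVM scorer; needs at least 2 samples per class."""
    samples = [[expr[g][s] for g in genes] for s in range(len(y))]
    correct = 0
    for i, x in enumerate(samples):
        centroids = {}
        for cls in set(y):
            rest = [samples[j] for j in range(len(y)) if j != i and y[j] == cls]
            centroids[cls] = [sum(col) / len(rest) for col in zip(*rest)]
        pred = min(centroids, key=lambda cls: sqdist(x, centroids[cls]))
        correct += pred == y[i]
    return correct / len(y)

def rce(expr, y, k, keep):
    """Cluster genes once, then repeatedly drop the lowest-scoring cluster
    until `keep` clusters remain. Returns the surviving gene indices."""
    labels = kmeans(expr, k)
    clusters = {c: [g for g, l in enumerate(labels) if l == c] for c in set(labels)}
    while len(clusters) > keep:
        scores = {c: loo_accuracy(expr, y, genes) for c, genes in clusters.items()}
        del clusters[min(scores, key=scores.get)]
    return sorted(g for genes in clusters.values() for g in genes)

# Toy data: rows are genes, columns are samples. Genes 0-1 track the class
# label; genes 2-3 follow an unrelated alternating pattern.
expr = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    [1.1, 0.9, 1.0, 3.2, 2.8, 3.0],
    [2.0, 0.0, 2.0, 0.0, 2.0, 0.0],
    [2.1, 0.1, 1.9, 0.1, 2.1, 0.0],
]
y = [0, 0, 0, 1, 1, 1]
print(rce(expr, y, k=2, keep=1))  # [0, 1]
```

On this toy matrix the two class-tracking genes form one cluster and survive, while the uninformative cluster is eliminated. Scoring whole clusters rather than single genes is the design point the abstract emphasizes; a real analysis would substitute an SVM and a correlation-based grouping.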

Subject(s)

Databases, Genetic/classification , Gene Expression Profiling/classification , Gene Expression Regulation, Neoplastic/genetics , Multigene Family/genetics , Databases, Genetic/statistics & numerical data , Gene Expression/genetics , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Head and Neck Neoplasms/genetics , Humans , Male , Prostatic Neoplasms/genetics
