Search | VHL Regional Portal

1.

Phenome-wide identification of therapeutic genetic targets, leveraging knowledge graphs, graph neural networks, and UK Biobank data.

Middleton, Lawrence; Melas, Ioannis; Vasavda, Chirag; Raies, Arwa; Rozemberczki, Benedek; Dhindsa, Ryan S; Dhindsa, Justin S; Weido, Blake; Wang, Quanli; Harper, Andrew R; Edwards, Gavin; Petrovski, Slavé; Vitsios, Dimitrios.

Sci Adv ; 10(19): eadj1424, 2024 May 10.

Article in English | MEDLINE | ID: mdl-38718126

ABSTRACT

The ongoing expansion of human genomic datasets propels therapeutic target identification; however, extracting gene-disease associations from gene annotations remains challenging. Here, we introduce Mantis-ML 2.0, a framework integrating AstraZeneca's Biological Insights Knowledge Graph and numerous tabular datasets, to assess gene-disease probabilities throughout the phenome. We use graph neural networks, capturing the graph's holistic structure, and train them on hundreds of balanced datasets via a robust semi-supervised learning framework to provide gene-disease probabilities across the human exome. Mantis-ML 2.0 incorporates natural language processing to automate disease-relevant feature selection for thousands of diseases. The enhanced models demonstrate a 6.9% average classification power boost, achieving a median receiver operating characteristic (ROC) area under curve (AUC) score of 0.90 across 5220 diseases from Human Phenotype Ontology, OpenTargets, and Genomics England. Notably, Mantis-ML 2.0 prioritizes associations from an independent UK Biobank phenome-wide association study (PheWAS), providing a stronger form of triaging and mitigating against underpowered PheWAS associations. Results are exposed through an interactive web resource.

Subject(s)

Biological Specimen Banks , Neural Networks, Computer , Humans , Genome-Wide Association Study/methods , Phenotype , United Kingdom , Phenomics/methods , Genetic Predisposition to Disease , Genomics/methods , Databases, Genetic , Algorithms , Computational Biology/methods , UK Biobank

2.

Rare variant associations with plasma protein levels in the UK Biobank.

Dhindsa, Ryan S; Burren, Oliver S; Sun, Benjamin B; Prins, Bram P; Matelska, Dorota; Wheeler, Eleanor; Mitchell, Jonathan; Oerton, Erin; Hristova, Ventzislava A; Smith, Katherine R; Carss, Keren; Wasilewski, Sebastian; Harper, Andrew R; Paul, Dirk S; Fabre, Margarete A; Runz, Heiko; Viollet, Coralie; Challis, Benjamin; Platt, Adam; Vitsios, Dimitrios; Ashley, Euan A; Whelan, Christopher D; Pangalos, Menelas N; Wang, Quanli; Petrovski, Slavé.

Nature ; 622(7982): 339-347, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37794183

ABSTRACT

Integrating human genomics and proteomics can help elucidate disease mechanisms, identify clinical biomarkers and discover drug targets [1-4]. Because previous proteogenomic studies have focused on common variation via genome-wide association studies, the contribution of rare variants to the plasma proteome remains largely unknown. Here we identify associations between rare protein-coding variants and 2,923 plasma protein abundances measured in 49,736 UK Biobank individuals. Our variant-level exome-wide association study identified 5,433 rare genotype-protein associations, of which 81% were undetected in a previous genome-wide association study of the same cohort [5]. We then looked at aggregate signals using gene-level collapsing analysis, which revealed 1,962 gene-protein associations. Of the 691 gene-level signals from protein-truncating variants, 99.4% were associated with decreased protein levels. STAB1 and STAB2, encoding scavenger receptors involved in plasma protein clearance, emerged as pleiotropic loci, with 77 and 41 protein associations, respectively. We demonstrate the utility of our publicly accessible resource through several applications. These include detailing an allelic series in NLRC4, identifying potential biomarkers for a fatty liver disease-associated variant in HSD17B13 and bolstering phenome-wide association studies by integrating protein quantitative trait loci with protein-truncating variants in collapsing analyses. Finally, we uncover distinct proteomic consequences of clonal haematopoiesis (CH), including an association between TET2-CH and increased FLT3 levels. Our results highlight a considerable role for rare variation in plasma protein abundance and the value of proteogenomics in therapeutic discovery.

Subject(s)

Biological Specimen Banks , Blood Proteins , Genetic Association Studies , Genomics , Proteomics , Humans , Alleles , Biomarkers/blood , Blood Proteins/analysis , Blood Proteins/genetics , Databases, Factual , Exome/genetics , Hematopoiesis , Mutation , Plasma/chemistry , United Kingdom

3.

Author Correction: DrugnomeAI is an ensemble machine-learning framework for predicting druggability of candidate drug targets.

Raies, Arwa; Tulodziecka, Ewa; Stainer, James; Middleton, Lawrence; Dhindsa, Ryan S; Hill, Pamela; Engkvist, Ola; Harper, Andrew R; Petrovski, Slavé; Vitsios, Dimitrios.

Commun Biol ; 6(1): 710, 2023 Jul 11.

Article in English | MEDLINE | ID: mdl-37433831

4.

Effects of protein-coding variants on blood metabolite measurements and clinical biomarkers in the UK Biobank.

Nag, Abhishek; Dhindsa, Ryan S; Middleton, Lawrence; Jiang, Xiao; Vitsios, Dimitrios; Wigmore, Eleanor; Allman, Erik L; Reznichenko, Anna; Carss, Keren; Smith, Katherine R; Wang, Quanli; Challis, Benjamin; Paul, Dirk S; Harper, Andrew R; Petrovski, Slavé.

Am J Hum Genet ; 110(3): 487-498, 2023 03 02.

Article in English | MEDLINE | ID: mdl-36809768

ABSTRACT

Genome-wide association studies (GWASs) have established the contribution of common and low-frequency variants to metabolic blood measurements in the UK Biobank (UKB). To complement existing GWAS findings, we assessed the contribution of rare protein-coding variants in relation to 355 metabolic blood measurements, including 325 predominantly lipid-related nuclear magnetic resonance (NMR)-derived blood metabolite measurements (Nightingale Health Plc) and 30 clinical blood biomarkers, using 412,393 exome sequences from four genetically diverse ancestries in the UKB. Gene-level collapsing analyses were conducted to evaluate a diverse range of rare-variant architectures for the metabolic blood measurements. Altogether, we identified significant associations (p < 1 × 10⁻⁸) for 205 distinct genes that involved 1,968 significant relationships for the Nightingale blood metabolite measurements and 331 for the clinical blood biomarkers. These include associations for rare non-synonymous variants in PLIN1 and CREB3L3 with lipid metabolite measurements and SYT7 with creatinine, among others, which may not only provide insights into novel biology but also deepen our understanding of established disease mechanisms. Of the study-wide significant clinical biomarker associations, 40% were not previously detected on analyzing coding variants in a GWAS in the same cohort, reinforcing the importance of studying rare variation to fully understand the genetic architecture of metabolic blood measurements.

Subject(s)

Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Biological Specimen Banks , Biomarkers , Lipids , United Kingdom , Polymorphism, Single Nucleotide

5.

A minimal role for synonymous variation in human disease.

Dhindsa, Ryan S; Wang, Quanli; Vitsios, Dimitrios; Burren, Oliver S; Hu, Fengyuan; DiCarlo, James E; Kruglyak, Leonid; MacArthur, Daniel G; Hurles, Matthew E; Petrovski, Slavé.

Am J Hum Genet ; 109(12): 2105-2109, 2022 12 01.

Article in English | MEDLINE | ID: mdl-36459978

ABSTRACT

Synonymous mutations change the DNA sequence of a gene without affecting the amino acid sequence of the encoded protein. Although some synonymous mutations can affect RNA splicing, translational efficiency, and mRNA stability, studies in human genetics, mutagenesis screens, and other experiments and evolutionary analyses have repeatedly shown that most synonymous variants are neutral or only weakly deleterious, with some notable exceptions. Based on a recent study in yeast, there have been claims that synonymous mutations could be as important as nonsynonymous mutations in causing disease, assuming the yeast findings hold up and translate to humans. Here, we argue that there is insufficient evidence to overturn the large, coherent body of knowledge establishing the predominant neutrality of synonymous variants in the human genome.

Subject(s)

Biological Evolution , Saccharomyces cerevisiae , Humans , Mutation/genetics , Amino Acid Sequence , Genome, Human/genetics

6.

Human genetics uncovers MAP3K15 as an obesity-independent therapeutic target for diabetes.

Nag, Abhishek; Dhindsa, Ryan S; Mitchell, Jonathan; Vasavda, Chirag; Harper, Andrew R; Vitsios, Dimitrios; Ahnmark, Andrea; Bilican, Bilada; Madeyski-Bengtson, Katja; Zarrouki, Bader; Zoghbi, Anthony W; Wang, Quanli; Smith, Katherine R; Alegre-Díaz, Jesus; Kuri-Morales, Pablo; Berumen, Jaime; Tapia-Conyer, Roberto; Emberson, Jonathan; Torres, Jason M; Collins, Rory; Smith, David M; Challis, Benjamin; Paul, Dirk S; Bohlooly-Y, Mohammad; Snowden, Mike; Baker, David; Fritsche-Danielson, Regina; Pangalos, Menelas N; Petrovski, Slavé.

Sci Adv ; 8(46): eadd5430, 2022 11 18.

Article in English | MEDLINE | ID: mdl-36383675

ABSTRACT

We performed collapsing analyses on 454,796 UK Biobank (UKB) exomes to detect gene-level associations with diabetes. Recessive carriers of nonsynonymous variants in MAP3K15 were 30% less likely to develop diabetes (P = 5.7 × 10⁻¹⁰) and had lower glycosylated hemoglobin (β = -0.14 SD units, P = 1.1 × 10⁻²⁴). These associations were independent of body mass index, suggesting protection against insulin resistance even in the setting of obesity. We replicated these findings in 96,811 Admixed Americans in the Mexico City Prospective Study (P < 0.05). Moreover, the protective effect of MAP3K15 variants was stronger in individuals who did not carry the Latino-enriched SLC16A11 risk haplotype (P = 6.0 × 10⁻⁴). Separately, we identified a Finnish-enriched MAP3K15 protein-truncating variant associated with decreased odds of both type 1 and type 2 diabetes (P < 0.05) in FinnGen. No adverse phenotypes were associated with protein-truncating MAP3K15 variants in the UKB, supporting this gene as a therapeutic target for diabetes.

Subject(s)

Diabetes Mellitus, Type 2 , MAP Kinase Kinase Kinases , Humans , Diabetes Mellitus, Type 2/genetics , Genetic Predisposition to Disease , Monocarboxylic Acid Transporters/genetics , Obesity/genetics , Prospective Studies , MAP Kinase Kinase Kinases/genetics

7.

DrugnomeAI is an ensemble machine-learning framework for predicting druggability of candidate drug targets.

Raies, Arwa; Tulodziecka, Ewa; Stainer, James; Middleton, Lawrence; Dhindsa, Ryan S; Hill, Pamela; Engkvist, Ola; Harper, Andrew R; Petrovski, Slavé; Vitsios, Dimitrios.

Commun Biol ; 5(1): 1291, 2022 11 24.

Article in English | MEDLINE | ID: mdl-36434048

ABSTRACT

The druggability of targets is a crucial consideration in drug target selection. Here, we adopt a stochastic semi-supervised ML framework to develop DrugnomeAI, which estimates the druggability likelihood for every protein-coding gene in the human exome. DrugnomeAI integrates gene-level properties from 15 sources resulting in 324 features. The tool generates exome-wide predictions based on labelled sets of known drug targets (median AUC: 0.97), highlighting features from protein-protein interaction networks as top predictors. DrugnomeAI provides generic as well as specialised models stratified by disease type or drug therapeutic modality. The top-ranking DrugnomeAI genes were significantly enriched for genes previously selected for clinical development programs (p value < 1 × 10⁻³⁰⁸) and for genes achieving genome-wide significance in phenome-wide association studies of 450 K UK Biobank exomes for binary (p value = 1.7 × 10⁻⁵) and quantitative traits (p value = 1.6 × 10⁻⁷). We accompany our method with a web application (http://drugnomeai.public.cgr.astrazeneca.com) to visualise the druggability predictions and the key features that define gene druggability, per disease type and modality.

Subject(s)

Machine Learning , Software , Humans , Drug Delivery Systems

8.

Cancer-driving mutations are enriched in genic regions intolerant to germline variation.

Vitsios, Dimitrios; Dhindsa, Ryan S; Matelska, Dorota; Mitchell, Jonathan; Zou, Xuequing; Armenia, Joshua; Hu, Fengyuan; Wang, Quanli; Sidders, Ben; Harper, Andrew R; Petrovski, Slavé.

Sci Adv ; 8(34): eabo6371, 2022 08 26.

Article in English | MEDLINE | ID: mdl-36026442

ABSTRACT

Large reference datasets of protein-coding variation in human populations have allowed us to determine which genes and genic subregions are intolerant to germline genetic variation. There is also a growing number of genes implicated in severe Mendelian diseases that overlap with genes implicated in cancer. We hypothesized that cancer-driving mutations might be enriched in genic subregions that are depleted of germline variation relative to somatic variation. We introduce a new metric, OncMTR (oncology missense tolerance ratio), which uses 125,748 exomes in the Genome Aggregation Database (gnomAD) to identify these genic subregions. We demonstrate that OncMTR can significantly predict driver mutations implicated in hematologic malignancies. Divergent OncMTR regions were enriched for cancer-relevant protein domains, and overlaying OncMTR scores on protein structures identified functionally important protein residues. Last, we performed a rare variant, gene-based collapsing analysis on an independent set of 394,694 exomes from the UK Biobank and find that OncMTR markedly improves genetic signals for hematologic malignancies.

Subject(s)

Germ-Line Mutation , Hematologic Neoplasms , Germ Cells , Hematologic Neoplasms/genetics , Humans

9.

Gene-SCOUT: identifying genes with similar continuous trait fingerprints from phenome-wide association analyses.

Middleton, Lawrence; Harper, Andrew R; Nag, Abhishek; Wang, Quanli; Reznichenko, Anna; Vitsios, Dimitrios; Petrovski, Slavé.

Nucleic Acids Res ; 50(8): 4289-4301, 2022 05 06.

Article in English | MEDLINE | ID: mdl-35474393

ABSTRACT

Large-scale phenome-wide association studies performed using densely phenotyped cohorts such as the UK Biobank (UKB) reveal many statistically robust gene-phenotype relationships for both clinical and continuous traits. Here, we present Gene-SCOUT, a tool used to identify genes with similar continuous trait fingerprints to a gene of interest. A fingerprint reflects the continuous traits identified to be statistically associated with a gene of interest based on multiple underlying rare variant genetic architectures. Similarities between genes are evaluated by the cosine similarity measure, to capture concordant effect directionality, elucidating clusters of genes in a high dimensional space. The underlying gene-biomarker population-scale association statistics were obtained from a gene-level rare variant collapsing analysis performed on over 1500 continuous traits using 394 692 UKB participant exomes, with additional metabolomic trait associations provided through Nightingale Health's recent study of 121 394 of these participants. We demonstrate that gene similarity estimates from Gene-SCOUT provide stronger enrichments for clinical traits compared to existing methods. Furthermore, we provide a fully interactive web-resource (http://genescout.public.cgr.astrazeneca.com) to explore the pre-calculated exome-wide similarities. This resource enables a user to examine the biological relevance of the most similar genes for Gene Ontology (GO) enrichment and UKB clinical trait enrichment statistics, as well as a detailed breakdown of the traits underpinning a given fingerprint.
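The cosine similarity measure named in this abstract can be illustrated with a minimal sketch. The fingerprint encoding below (one signed value per continuous trait, 0 for no association) and the gene vectors are hypothetical and chosen only for illustration; the abstract does not specify how Gene-SCOUT encodes its fingerprints.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A gene with no associated traits has no direction; treat as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical fingerprints: signed effect direction per trait (assumption).
gene_a = [1.0, -1.0, 0.0, 1.0]
gene_b = [1.0, -1.0, 0.0, 0.0]    # shares two concordant effects with gene_a
gene_c = [-1.0, 1.0, 0.0, -1.0]   # same traits as gene_a, opposite directions

print(cosine_similarity(gene_a, gene_b))
print(cosine_similarity(gene_a, gene_c))
```

Because the measure depends on the sign of each component, two genes associated with the same traits in opposite directions score close to -1 rather than +1, which is one way to read the abstract's phrase "concordant effect directionality".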

Subject(s)

Genome-Wide Association Study , Phenomics , Humans , Genome-Wide Association Study/methods , Phenotype , Exome Sequencing , Exome , Polymorphism, Single Nucleotide

10.

Rare variant contribution to human disease in 281,104 UK Biobank exomes.

Wang, Quanli; Dhindsa, Ryan S; Carss, Keren; Harper, Andrew R; Nag, Abhishek; Tachmazidou, Ioanna; Vitsios, Dimitrios; Deevi, Sri V V; Mackay, Alex; Muthas, Daniel; Hühn, Michael; Monkley, Susan; Olsson, Henric; Wasilewski, Sebastian; Smith, Katherine R; March, Ruth; Platt, Adam; Haefliger, Carolina; Petrovski, Slavé.

Nature ; 597(7877): 527-532, 2021 09.

Article in English | MEDLINE | ID: mdl-34375979

ABSTRACT

Genome-wide association studies have uncovered thousands of common variants associated with human disease, but the contribution of rare variants to common disease remains relatively unexplored. The UK Biobank contains detailed phenotypic data linked to medical records for approximately 500,000 participants, offering an unprecedented opportunity to evaluate the effect of rare variation on a broad collection of traits [1,2]. Here we study the relationships between rare protein-coding variants and 17,361 binary and 1,419 quantitative phenotypes using exome sequencing data from 269,171 UK Biobank participants of European ancestry. Gene-based collapsing analyses revealed 1,703 statistically significant gene-phenotype associations for binary traits, with a median odds ratio of 12.4. Furthermore, 83% of these associations were undetectable via single-variant association tests, emphasizing the power of gene-based collapsing analysis in the setting of high allelic heterogeneity. Gene-phenotype associations were also significantly enriched for loss-of-function-mediated traits and approved drug targets. Finally, we performed ancestry-specific and pan-ancestry collapsing analyses using exome sequencing data from 11,933 UK Biobank participants of African, East Asian or South Asian ancestry. Our results highlight a significant contribution of rare variants to common disease. Summary statistics are publicly available through an interactive portal (http://azphewas.com/).

Subject(s)

Biological Specimen Banks , Databases, Genetic , Disease/genetics , Exome/genetics , Genetic Variation/genetics , Adult , Aged , Female , Genome-Wide Association Study , Humans , Male , Middle Aged , Phenotype , Proteins/chemistry , Proteins/genetics , United Kingdom , Exome Sequencing

11.

Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning.

Vitsios, Dimitrios; Dhindsa, Ryan S; Middleton, Lawrence; Gussow, Ayal B; Petrovski, Slavé.

Nat Commun ; 12(1): 1504, 2021 03 08.

Article in English | MEDLINE | ID: mdl-33686085

ABSTRACT

Elucidating functionality in non-coding regions is a key challenge in human genomics. It has been shown that intolerance to variation of coding and proximal non-coding sequence is a strong predictor of human disease relevance. Here, we integrate intolerance to variation, functional genomic annotations and primary genomic sequence to build JARVIS: a comprehensive deep learning model to prioritize non-coding regions, outperforming other human lineage-specific scores. Despite being agnostic to evolutionary conservation, JARVIS performs comparably or outperforms conservation-based scores in classifying pathogenic single-nucleotide and structural variants. In constructing JARVIS, we introduce the genome-wide residual variation intolerance score (gwRVIS), applying a sliding-window approach to whole genome sequencing data from 62,784 individuals. gwRVIS distinguishes Mendelian disease genes from more tolerant CCDS regions and highlights ultra-conserved non-coding elements as the most intolerant regions in the human genome. Both JARVIS and gwRVIS capture previously inaccessible human-lineage constraint information and will enhance our understanding of the non-coding genome.

Subject(s)

Deep Learning , Genome, Human , Genomics , DNA, Intergenic , Genetic Variation , Humans , Sequence Analysis, DNA , Whole Genome Sequencing

12.

Identification of a missense variant in SPDL1 associated with idiopathic pulmonary fibrosis.

Dhindsa, Ryan S; Mattsson, Johan; Nag, Abhishek; Wang, Quanli; Wain, Louise V; Allen, Richard; Wigmore, Eleanor M; Ibanez, Kristina; Vitsios, Dimitrios; Deevi, Sri V V; Wasilewski, Sebastian; Karlsson, Maria; Lassi, Glenda; Olsson, Henric; Muthas, Daniel; Monkley, Susan; Mackay, Alex; Murray, Lynne; Young, Simon; Haefliger, Carolina; Maher, Toby M; Belvisi, Maria G; Jenkins, Gisli; Molyneaux, Philip L; Platt, Adam; Petrovski, Slavé.

Commun Biol ; 4(1): 392, 2021 03 23.

Article in English | MEDLINE | ID: mdl-33758299

ABSTRACT

Idiopathic pulmonary fibrosis (IPF) is a fatal disorder characterised by progressive, destructive lung scarring. Despite substantial progress, the genetic determinants of this disease remain incompletely defined. Using whole genome and whole exome sequencing data from 752 individuals with sporadic IPF and 119,055 UK Biobank controls, we performed a variant-level exome-wide association study (ExWAS) and gene-level collapsing analyses. Our variant-level analysis revealed a novel association between a rare missense variant in SPDL1 and IPF (NM_017785.5:g.169588475 G > A p.Arg20Gln; p = 2.4 × 10⁻⁷, odds ratio = 2.87, 95% confidence interval: 2.03-4.07). This signal was independently replicated in the FinnGen cohort, which contains 1028 cases and 196,986 controls (combined p = 2.2 × 10⁻²⁰), firmly associating this variant as an IPF risk allele. SPDL1 encodes Spindly, a protein involved in mitotic checkpoint signalling during cell division that has not been previously described in fibrosis. To the best of our knowledge, these results highlight a novel mechanism underlying IPF, providing the potential for new therapeutic discoveries in a disease of great unmet need.

Subject(s)

Cell Cycle Proteins/genetics , Idiopathic Pulmonary Fibrosis/genetics , Mutation, Missense , Aged , Case-Control Studies , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Idiopathic Pulmonary Fibrosis/diagnosis , Male , Phenotype , Exome Sequencing

13.

Spontaneous Coronary Artery Dissection: Insights on Rare Genetic Variation From Genome Sequencing.

Carss, Keren J; Baranowska, Anna A; Armisen, Javier; Webb, Tom R; Hamby, Stephen E; Premawardhana, Diluka; Al-Hussaini, Abtehale; Wood, Alice; Wang, Quanli; Deevi, Sri V V; Vitsios, Dimitrios; Lewis, Samuel H; Kotecha, Deevia; Bouatia-Naji, Nabila; Hesselson, Stephanie; Iismaa, Siiri E; Tarr, Ingrid; McGrath-Cadell, Lucy; Muller, David W; Dunwoodie, Sally L; Fatkin, Diane; Graham, Robert M; Giannoulatou, Eleni; Samani, Nilesh J; Petrovski, Slavé; Haefliger, Carolina; Adlam, David.

Circ Genom Precis Med ; 13(6): e003030, 2020 12.

Article in English | MEDLINE | ID: mdl-33125268

ABSTRACT

BACKGROUND: Spontaneous coronary artery dissection (SCAD) occurs when an epicardial coronary artery is narrowed or occluded by an intramural hematoma. SCAD mainly affects women and is associated with pregnancy and systemic arteriopathies, particularly fibromuscular dysplasia. Variants in several genes, such as those causing connective tissue disorders, have been implicated; however, the genetic architecture is poorly understood. Here, we aim to better understand the diagnostic yield of rare variant genetic testing among a cohort of SCAD survivors and to identify genes or gene sets that have a significant enrichment of rare variants. METHODS: We sequenced a cohort of 384 SCAD survivors from the United Kingdom, alongside 13 722 UK Biobank controls and a validation cohort of 92 SCAD survivors. We performed a research diagnostic screen for pathogenic variants and exome-wide and gene-set rare variant collapsing analyses. RESULTS: The majority of patients within both cohorts are female; 29% of the study cohort and 14% of the validation cohort have a remote arteriopathy. Four cases across the 2 cohorts had a diagnosed connective tissue disorder. We identified pathogenic or likely pathogenic variants in 7 genes (PKD1, COL3A1, SMAD3, TGFB2, LOX, MYLK, and YY1AP1) in 14/384 cases in the study cohort and in 1/92 cases in the validation cohort. In our rare variant collapsing analysis, PKD1 was the highest-ranked gene, and several functionally plausible genes were enriched for rare variants, although no gene achieved study-wide statistical significance. Gene-set enrichment analysis suggested a role for additional genes involved in renal function. CONCLUSIONS: By studying the largest sequenced cohort of SCAD survivors, we demonstrate that, based on current knowledge, only a small proportion have a pathogenic variant that could explain their disease. Our findings strengthen the overlap between SCAD and renal and connective tissue disorders, and we highlight several new genes for future validation.

Subject(s)

Coronary Vessel Anomalies/genetics , Exome Sequencing , Genetic Variation , Genome, Human , Vascular Diseases/congenital , Adult , Aged , Cohort Studies , Female , Humans , Machine Learning , Male , Middle Aged , Models, Genetic , United Kingdom , Vascular Diseases/genetics , Young Adult

14.

Mantis-ml: Disease-Agnostic Gene Prioritization from High-Throughput Genomic Screens by Stochastic Semi-supervised Learning.

Vitsios, Dimitrios; Petrovski, Slavé.

Am J Hum Genet ; 106(5): 659-678, 2020 05 07.

Article in English | MEDLINE | ID: mdl-32386536

ABSTRACT

Access to large-scale genomics datasets has increased the utility of hypothesis-free genome-wide analyses. However, gene signals are often insufficiently powered to reach experiment-wide significance, triggering a process of laborious triaging of genomic-association-study results. We introduce mantis-ml, a multi-dimensional, multi-step machine-learning framework that allows objective assessment of the biological relevance of genes to disease studies. Mantis-ml is an automated machine-learning framework that follows a multi-model approach of stochastic semi-supervised learning to rank disease-associated genes through iterative learning sessions on random balanced datasets across the protein-coding exome. When applied to a range of human diseases, including chronic kidney disease (CKD), epilepsy, and amyotrophic lateral sclerosis (ALS), mantis-ml achieved an average area under curve (AUC) prediction performance of 0.81-0.89. Critically, to prove its value as a tool that can be used to interpret exome-wide association studies, we overlapped mantis-ml predictions with data from published cohort-level association studies. We found a statistically significant enrichment of high mantis-ml predictions among the highest-ranked genes from hypothesis-free cohort-level statistics, indicating a substantial improvement over the performance of current state-of-the-art methods and pointing to the capture of true prioritization signals for disease-associated genes. Finally, we introduce a generic mantis-ml score (GMS) trained with over 1,200 features as a generic-disease-likelihood estimator, outperforming published gene-level scores. In addition to our tool, we provide a gene prioritization atlas that includes mantis-ml's predictions across ten disease areas and empowers researchers to interactively navigate through the gene-triaging framework. Mantis-ml is an intuitive tool that supports the objective triaging of large-scale genomic discovery studies and enhances our understanding of complex genotype-phenotype associations.
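The phrase "iterative learning sessions on random balanced datasets" can be pictured with a small sketch of the sampling step alone. This is not the mantis-ml implementation: the function name, the gene identifiers, and the choice to draw as many unlabelled genes as there are known disease genes in each iteration are all assumptions made for illustration, and no classifier is trained here.

```python
import random


def balanced_datasets(positive_genes, unlabelled_genes, n_iterations, seed=0):
    """Yield (positives, sampled_negatives) pairs of equal size.

    Hypothetical helper: each iteration pairs the fixed set of known
    disease genes with a fresh random draw of unlabelled genes.
    """
    rng = random.Random(seed)
    positives = list(positive_genes)
    positive_set = set(positives)
    # Keep known disease genes out of the pool used as provisional negatives.
    pool = [g for g in unlabelled_genes if g not in positive_set]
    if len(pool) < len(positives):
        raise ValueError("not enough unlabelled genes to balance the positives")
    for _ in range(n_iterations):
        negatives = rng.sample(pool, len(positives))
        yield positives, negatives


# Hypothetical gene identifiers, for illustration only.
known = ["GENE_A", "GENE_B", "GENE_C"]
exome = ["GENE_%d" % i for i in range(20)]

for pos, neg in balanced_datasets(known, exome, n_iterations=3):
    print(len(pos), len(neg), neg)
```

In a scheme of this kind, a model would be fitted on each balanced set and the per-gene predictions aggregated across iterations; how mantis-ml does this in detail is described in the paper rather than in this abstract.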

Subject(s)

Amyotrophic Lateral Sclerosis/genetics , Epilepsy/genetics , Genomics/methods , Renal Insufficiency, Chronic/genetics , Supervised Machine Learning , Animals , Area Under Curve , Deep Learning , Disease Models, Animal , Exome/genetics , Genetic Association Studies , Humans , Mice , Neural Networks, Computer , ROC Curve , Reproducibility of Results , Stochastic Processes

15.

Re-annotation of 191 developmental and epileptic encephalopathy-associated genes unmasks de novo variants in SCN1A.

Steward, Charles A; Roovers, Jolien; Suner, Marie-Marthe; Gonzalez, Jose M; Uszczynska-Ratajczak, Barbara; Pervouchine, Dmitri; Fitzgerald, Stephen; Viola, Margarida; Stamberger, Hannah; Hamdan, Fadi F; Ceulemans, Berten; Leroy, Patricia; Nava, Caroline; Lepine, Anne; Tapanari, Electra; Keiller, Don; Abbs, Stephen; Sanchis-Juan, Alba; Grozeva, Detelina; Rogers, Anthony S; Diekhans, Mark; Guigó, Roderic; Petryszak, Robert; Minassian, Berge A; Cavalleri, Gianpiero; Vitsios, Dimitrios; Petrovski, Slavé; Harrow, Jennifer; Flicek, Paul; Lucy Raymond, F; Lench, Nicholas J; Jonghe, Peter De; Mudge, Jonathan M; Weckhuysen, Sarah; Sisodiya, Sanjay M; Frankish, Adam.

NPJ Genom Med ; 4: 31, 2019.

Article in English | MEDLINE | ID: mdl-31814998

ABSTRACT

The developmental and epileptic encephalopathies (DEE) are a group of rare, severe neurodevelopmental disorders, where even the most thorough sequencing studies leave 60-65% of patients without a molecular diagnosis. Here, we explore the incompleteness of transcript models used for exome and genome analysis as one potential explanation for a lack of current diagnoses. Therefore, we have updated the GENCODE gene annotation for 191 epilepsy-associated genes, using human brain-derived transcriptomic libraries and other data to build 3,550 putative transcript models. Our annotations increase the transcriptional 'footprint' of these genes by over 674 kb. Using SCN1A as a case study, due to its close phenotype/genotype correlation with Dravet syndrome, we screened 122 people with Dravet syndrome or a similar phenotype with a panel of exon sequences representing eight established genes and identified two de novo SCN1A variants that now - through improved gene annotation - are ascribed to residing among our exons. These two (from 122 screened people, 1.6%) molecular diagnoses carry significant clinical implications. Furthermore, we identified a previously classified SCN1A intronic Dravet syndrome-associated variant that now lies within a deeply conserved exon. Our findings illustrate the potential gains of thorough gene annotation in improving diagnostic yields for genetic disorders.

16.

Exome-Based Rare-Variant Analyses in CKD.

Cameron-Christie, Sophia; Wolock, Charles J; Groopman, Emily; Petrovski, Slavé; Kamalakaran, Sitharthan; Povysil, Gundula; Vitsios, Dimitrios; Zhang, Mengqi; Fleckner, Jan; March, Ruth E; Gelfman, Sahar; Marasa, Maddalena; Li, Yifu; Sanna-Cherchi, Simone; Kiryluk, Krzysztof; Allen, Andrew S; Fellström, Bengt C; Haefliger, Carolina; Platt, Adam; Goldstein, David B; Gharavi, Ali G.

J Am Soc Nephrol ; 30(6): 1109-1122, 2019 06.

Article in English | MEDLINE | ID: mdl-31085678

ABSTRACT

BACKGROUND: Studies have identified many common genetic associations that influence renal function and all-cause CKD, but these explain only a small fraction of variance in these traits. The contribution of rare variants has not been systematically examined. METHODS: We performed exome sequencing of 3150 individuals, who collectively encompassed diverse CKD subtypes, and 9563 controls. To detect causal genes and evaluate the contribution of rare variants we used collapsing analysis, in which we compared the proportion of cases and controls carrying rare variants per gene. RESULTS: The analyses captured five established monogenic causes of CKD: variants in PKD1, PKD2, and COL4A5 achieved study-wide significance, and we observed suggestive case enrichment for COL4A4 and COL4A3. Beyond known disease-associated genes, collapsing analyses incorporating regional variant intolerance identified suggestive dominant signals in CPT2 and several other candidate genes. Biallelic mutations in CPT2 cause carnitine palmitoyltransferase II deficiency, sometimes associated with rhabdomyolysis and acute renal injury. Genetic modifier analysis among cases with APOL1 risk genotypes identified a suggestive signal in AHDC1, implicated in Xia-Gibbs syndrome, which involves intellectual disability and other features. On the basis of the observed distribution of rare variants, we estimate that a two- to three-fold larger cohort would provide 80% power to implicate new genes for all-cause CKD. CONCLUSIONS: This study demonstrates that rare-variant collapsing analyses can validate known genes and identify candidate genes and modifiers for kidney disease. In so doing, these findings provide a motivation for larger-scale investigation of rare-variant risk contributions across major clinical CKD categories.
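The per-gene comparison described in METHODS (the proportion of cases versus controls carrying rare variants) can be sketched as a 2×2 enrichment test. The abstract does not name the statistical test used; a one-sided Fisher's exact test is assumed here as one common choice, and the carrier counts in the usage line are hypothetical. Only the cohort sizes (3150 cases, 9563 controls) come from the abstract.

```python
from math import comb


def fisher_exact_greater(case_carriers, case_total, control_carriers, control_total):
    """One-sided Fisher's exact p-value for carrier enrichment in cases.

    Sums the hypergeometric probabilities of observing at least
    `case_carriers` carriers among the cases, given the table margins.
    """
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    denominator = comb(total, case_total)
    upper = min(carriers, case_total)
    numerator = 0
    for k in range(case_carriers, upper + 1):
        # comb returns 0 when the second argument exceeds the first,
        # so infeasible tables contribute nothing.
        numerator += comb(carriers, k) * comb(total - carriers, case_total - k)
    return numerator / denominator


# Hypothetical carrier counts for a single gene (assumption, not from the study).
p_value = fisher_exact_greater(12, 3150, 5, 9563)
print(p_value)
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for cohorts of this size; a study-wide analysis would additionally correct for the number of genes tested.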

Subject(s)

Collagen Type IV/genetics , Exome Sequencing , Genetic Variation/genetics , Protein Kinases/genetics , Renal Insufficiency, Chronic/genetics , TRPP Cation Channels/genetics , Case-Control Studies , Female , Humans , Male , Prognosis , Protein Kinase D2 , Reference Values , Renal Insufficiency, Chronic/diagnosis

17.

A programmed wave of uridylation-primed mRNA degradation is essential for meiotic progression and mammalian spermatogenesis.

Morgan, Marcos; Kabayama, Yuka; Much, Christian; Ivanova, Ivayla; Di Giacomo, Monica; Auchynnikava, Tatsiana; Monahan, Jack Michael; Vitsios, Dimitrios Michael; Vasiliauskaite, Lina; Comazzetto, Stefano; Rappsilber, Juri; Allshire, Robin Campbell; Porse, Bo Torben; Enright, Anton James; O'Carroll, Dónal.

Cell Res ; 29(3): 221-232, 2019 03.

Article in English | MEDLINE | ID: mdl-30617251

ABSTRACT

Several developmental stages of spermatogenesis are transcriptionally quiescent, which presents major challenges associated with the regulation of gene expression. Here we identify that the zygotene to pachytene transition is not only associated with the resumption of transcription but also with a wave of programmed mRNA degradation that is essential for meiotic progression. We explored whether terminal uridylyl transferase 4- (TUT4-) or TUT7-mediated 3' mRNA uridylation contributes to this wave of mRNA degradation during pachynema. Indeed, both TUT4 and TUT7 are expressed throughout most of spermatogenesis; however, loss of either TUT4 or TUT7 does not have any major impact upon spermatogenesis. Combined TUT4 and TUT7 (TUT4/7) deficiency results in embryonic growth defects, while conditional gene targeting revealed an essential role for TUT4/7 in pachytene progression. Loss of TUT4/7 results in the reduction of miRNA, piRNA and mRNA 3' uridylation. Although this reduction does not greatly alter miRNA or piRNA expression, TUT4/7-mediated uridylation is required for the clearance of many zygotene-expressed transcripts in pachytene cells. We find that TUT4/7-regulated transcripts in pachytene spermatocytes are characterized by having long 3' UTRs with length-adjusted enrichment for AU-rich elements. We also observed these features in TUT4/7-regulated maternal transcripts whose dosage was recently shown to be essential for sculpting a functional maternal transcriptome and meiosis. Therefore, mRNA 3' uridylation is a critical determinant of both male and female germline transcriptomes. In conclusion, we have identified a novel requirement for 3' uridylation-programmed zygotene mRNA clearance in pachytene spermatocytes that is essential for male meiotic progression.

Subject(s)

Meiotic Prophase I/genetics , Pachytene Stage/genetics , RNA Processing, Post-Transcriptional/physiology , Spermatogenesis/genetics , Animals , Female , Male , Mice , Mice, Inbred C57BL , RNA Stability/genetics , RNA, Messenger/genetics , UDPglucose-Hexose-1-Phosphate Uridylyltransferase/metabolism

18.

RNA-sequencing analysis of umbilical cord plasma microRNAs from healthy newborns.

Brennan, Gary P; Vitsios, Dimitrios M; Casey, Sophie; Looney, Ann-Marie; Hallberg, Boubou; Henshall, David C; Boylan, Geraldine B; Murray, Deirdre M; Mooney, Catherine.

PLoS One ; 13(12): e0207952, 2018.

Article in English | MEDLINE | ID: mdl-30507953

ABSTRACT

MicroRNAs are a class of small non-coding RNA that regulate gene expression at a post-transcriptional level. MicroRNAs have been identified in various body fluids under normal conditions and their stability as well as their dysregulation in disease has led to ongoing interest in their diagnostic and prognostic potential. Circulating microRNAs may be valuable predictors of early-life complications such as birth asphyxia or neonatal seizures but there are relatively few data on microRNA content in plasma from healthy babies. Here we performed small RNA-sequencing analysis of plasma processed from umbilical cord blood in a set of healthy newborns. MicroRNA levels in umbilical cord plasma of four male and four female healthy babies from two different centres were profiled. A total of 1,004 individual microRNAs were identified, which ranged from 426 to 659 per sample, of which 269 microRNAs were common to all eight samples. Many of these microRNAs are highly expressed and consistent with previous studies using other high throughput platforms. While overall microRNA expression did not differ between male and female cord blood plasma, we did detect differentially edited microRNAs in female plasma compared to male. Of note, and consistent with other studies of this type, adenylation and uridylation were the two most prominent forms of editing. Six microRNAs, miR-128-3p, miR-29a-3p, miR-9-5p, miR-218-5p, miR-204-5p and miR-132-3p, were consistently both uridylated and adenylated in female cord blood plasma. These results provide a benchmark for microRNA profiling and biomarker discovery using umbilical cord plasma and can be used as comparative data for future biomarker profiles from complicated births or those with early-life developmental disorders.

Subject(s)

Circulating MicroRNA/blood , Fetal Blood/chemistry , Infant, Newborn/blood , Adenosine Monophosphate/chemistry , Biomarkers/blood , Biomarkers/chemistry , Circulating MicroRNA/chemistry , Female , Gene Expression Profiling , Gene Expression Regulation, Developmental , Humans , Male , RNA Editing , Sex Factors , Uridine Monophosphate/chemistry

19.

In situ functional dissection of RNA cis-regulatory elements by multiplex CRISPR-Cas9 genome engineering.

Wu, Qianxin; Ferry, Quentin R V; Baeumler, Toni A; Michaels, Yale S; Vitsios, Dimitrios M; Habib, Omer; Arnold, Roland; Jiang, Xiaowei; Maio, Stefano; Steinkraus, Bruno R; Tapia, Marta; Piazza, Paolo; Xu, Ni; Holländer, Georg A; Milne, Thomas A; Kim, Jin-Soo; Enright, Anton J; Bassett, Andrew R; Fulga, Tudor A.

Nat Commun ; 8(1): 2109, 2017 12 13.

Article in English | MEDLINE | ID: mdl-29235467

ABSTRACT

RNA regulatory elements (RREs) are an important yet relatively under-explored facet of gene regulation. Deciphering the prevalence and functional impact of this post-transcriptional control layer requires technologies for disrupting RREs without perturbing cellular homeostasis. Here we describe genome-engineering based evaluation of RNA regulatory element activity (GenERA), a clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9 platform for in situ high-content functional analysis of RREs. We use GenERA to survey the entire regulatory landscape of a 3'UTR, and apply it in a multiplex fashion to analyse combinatorial interactions between sets of miRNA response elements (MREs), providing strong evidence for cooperative activity. We also employ this technology to probe the functionality of an entire MRE network under cellular homeostasis, and show that high-resolution analysis of the GenERA dataset can be used to extract functional features of MREs. This study provides a genome editing-based multiplex strategy for direct functional interrogation of RNA cis-regulatory elements in a native cellular environment.

Subject(s)

CRISPR-Cas Systems/genetics , Gene Editing/methods , RNA/genetics , Regulatory Sequences, Nucleic Acid/genetics , 3' Untranslated Regions/genetics , Animals , Clustered Regularly Interspaced Short Palindromic Repeats/genetics , Genome/genetics , Humans , MicroRNAs/genetics , Response Elements/genetics

20.

Mirnovo: genome-free prediction of microRNAs from small RNA sequencing data and single-cells using decision forests.

Vitsios, Dimitrios M; Kentepozidou, Elissavet; Quintais, Leonor; Benito-Gutiérrez, Elia; van Dongen, Stijn; Davis, Matthew P; Enright, Anton J.

Nucleic Acids Res ; 45(21): e177, 2017 Dec 01.

Article in English | MEDLINE | ID: mdl-29036314

ABSTRACT

The discovery of microRNAs (miRNAs) remains an important problem, particularly given the growth of high-throughput sequencing, cell sorting and single cell biology. While a large number of miRNAs have already been annotated, there may well be large numbers of miRNAs that are expressed in very particular cell types and remain elusive. Sequencing allows us to quickly and accurately identify the expression of known miRNAs from small RNA-Seq data. The biogenesis of miRNAs leads to very specific characteristics observed in their sequences. In brief, miRNAs usually have a well-defined 5' end and a more flexible 3' end with the possibility of 3' tailing events, such as uridylation. Previous approaches to the prediction of novel miRNAs usually involve the analysis of structural features of miRNA precursor hairpin sequences obtained from genome sequence. We surmised that it may be possible to identify miRNAs by using these biogenesis features observed directly from sequenced reads, solely or in addition to structural analysis from genome data. To this end, we have developed mirnovo, a machine learning based algorithm, which is able to identify known and novel miRNAs in animals and plants directly from small RNA-Seq data, with or without a reference genome. This method performs comparably to existing tools but is simpler to use and has a reduced run time. Its performance and accuracy have been tested on multiple datasets, including species with poorly assembled genomes, RNaseIII (Drosha and/or Dicer) deficient samples and single cells (at both embryonic and adult stage).

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Machine Learning , MicroRNAs/chemistry , Sequence Analysis, RNA/methods , Software , Algorithms , Animals , Gene Expression Profiling , Genomics , Humans , Mice , MicroRNAs/metabolism , RNA, Plant/chemistry , RNA, Small Untranslated/chemistry , Ribonuclease III/genetics , Single-Cell Analysis
