Search | VHL Regional Portal

Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning.

Althagafi, Azza; Zhapa-Camacho, Fernando; Hoehndorf, Robert.

Bioinformatics ; 40(5), 2024 May 02.

Article in English | MEDLINE | ID: mdl-38696757

ABSTRACT

MOTIVATION: Whole-exome and genome sequencing have become common tools in diagnosing patients with rare diseases. Despite their success, these approaches leave many patients undiagnosed. A common argument is that more disease variants still await discovery, or that the novelty of disease phenotypes results from a combination of variants in multiple disease-related genes. Interpreting the phenotypic consequences of genomic variants relies on information about gene functions, gene expression, physiology, and other genomic features. Phenotype-based methods to identify variants involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been successfully applied to prioritizing variants, such methods are based on known gene-disease or gene-phenotype associations as training data and are applicable only to genes that have associated phenotypes, thereby limiting their scope. In addition, phenotypes are not assigned uniformly by different clinicians, and phenotype-based methods need to account for this variability. RESULTS: We developed an Embedding-based Phenotype Variant Predictor (EmbedPVP), a computational method to prioritize variants involved in genetic diseases by combining genomic information and clinical phenotypes. EmbedPVP leverages a large amount of background knowledge from human and model organisms about molecular mechanisms through which abnormal phenotypes may arise. Specifically, EmbedPVP incorporates phenotypes linked to genes, functions of gene products, and the anatomical site of gene expression, and systematically relates them to their phenotypic effects through neuro-symbolic, knowledge-enhanced machine learning. We demonstrate EmbedPVP's efficacy on a large set of synthetic genomes and genomes matched with clinical information.
AVAILABILITY AND IMPLEMENTATION: EmbedPVP and all evaluation experiments are freely available at https://github.com/bio-ontology-research-group/EmbedPVP.

Subject(s)

Genomics , Humans , Genomics/methods , Phenotype , Genetic Variation , Computational Biology/methods , Machine Learning

Critical assessment of variant prioritization methods for rare disease diagnosis within the rare genomes project.

Stenton, Sarah L; O'Leary, Melanie C; Lemire, Gabrielle; VanNoy, Grace E; DiTroia, Stephanie; Ganesh, Vijay S; Groopman, Emily; O'Heir, Emily; Mangilog, Brian; Osei-Owusu, Ikeoluwa; Pais, Lynn S; Serrano, Jillian; Singer-Berk, Moriel; Weisburd, Ben; Wilson, Michael W; Austin-Tse, Christina; Abdelhakim, Marwa; Althagafi, Azza; Babbi, Giulia; Bellazzi, Riccardo; Bovo, Samuele; Carta, Maria Giulia; Casadio, Rita; Coenen, Pieter-Jan; De Paoli, Federica; Floris, Matteo; Gajapathy, Manavalan; Hoehndorf, Robert; Jacobsen, Julius O B; Joseph, Thomas; Kamandula, Akash; Katsonis, Panagiotis; Kint, Cyrielle; Lichtarge, Olivier; Limongelli, Ivan; Lu, Yulan; Magni, Paolo; Mamidi, Tarun Karthik Kumar; Martelli, Pier Luigi; Mulargia, Marta; Nicora, Giovanna; Nykamp, Keith; Pejaver, Vikas; Peng, Yisu; Pham, Thi Hong Cam; Podda, Maurizio S; Rao, Aditya; Rizzo, Ettore; Saipradeep, Vangala G; Savojardo, Castrense.

Hum Genomics ; 18(1): 44, 2024 Apr 29.

Article in English | MEDLINE | ID: mdl-38685113

ABSTRACT

BACKGROUND: A major obstacle faced by families with rare diseases is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years and causal variants are identified in under 50%, even when capturing variants genome-wide. To aid in the interpretation and prioritization of the vast number of variants detected, computational methods are proliferating. Which tools are most effective remains unclear. To evaluate the performance of computational methods, and to encourage innovation in method development, we designed a Critical Assessment of Genome Interpretation (CAGI) community challenge to place variant prioritization models head-to-head in a real-life clinical diagnostic setting. METHODS: We utilized genome sequencing (GS) data from families sequenced in the Rare Genomes Project (RGP), a direct-to-participant research study on the utility of GS for rare disease diagnosis and gene discovery. Challenge predictors were provided with a dataset of variant calls and phenotype terms from 175 RGP individuals (65 families), including 35 solved training set families with causal variants specified, and 30 unlabeled test set families (14 solved, 16 unsolved). We tasked teams to identify causal variants in as many families as possible. Predictors submitted variant predictions with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics: a weighted score based on the rank position of causal variants, and the maximum F-measure, based on precision and recall of causal variants across all EPCR values. RESULTS: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performers recalled causal variants in up to 13 of 14 solved families within the top 5 ranked variants.
Newly discovered diagnostic variants were returned to two previously unsolved families following confirmatory RNA sequencing, and two novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant in an unsolved proband with phenotypes consistent with asparagine synthetase deficiency. CONCLUSIONS: Model methodology and performance were highly variable. Models weighing call quality, allele frequency, predicted deleteriousness, segregation, and phenotype were effective in identifying causal variants, and models open to phenotype expansion and non-coding variants were able to capture more difficult diagnoses and discover new diagnoses. Overall, computational models can significantly aid variant prioritization. For use in diagnostics, detailed review and conservative assessment of prioritized variants against established criteria are needed.
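
The abstract above names two kinds of evaluation: the maximum F-measure across EPCR values, and recall of causal variants within the top 5 ranked variants. The following is a minimal sketch of how such metrics could be computed, not the challenge's actual scoring code. The function names, the toy data, and the rule that a family counts as recalled when any one of its causal variants appears in the top k are all assumptions made for illustration. The weighted rank-based score is omitted because the abstract does not give its weights.

```python
from typing import Dict, List, Set, Tuple


def max_f_measure(predictions: List[Tuple[float, bool]], n_causal: int) -> float:
    """Best F-measure over all EPCR cut-offs (hypothetical helper).

    predictions: (epcr, is_causal) pairs for every submitted variant.
    n_causal: total number of known causal variants (assumed > 0).
    """
    best = 0.0
    for threshold in sorted({epcr for epcr, _ in predictions}):
        # Keep every variant whose EPCR is at or above the cut-off.
        kept = [is_causal for epcr, is_causal in predictions if epcr >= threshold]
        true_pos = sum(kept)
        if true_pos == 0:
            continue
        precision = true_pos / len(kept)
        recall = true_pos / n_causal
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def families_recalled_in_top_k(
    ranked: Dict[str, List[str]], causal: Dict[str, Set[str]], k: int = 5
) -> int:
    """Count families with at least one causal variant among the top k.

    Assumption: one causal variant in the top k is enough for a family
    to count as recalled.
    """
    return sum(
        1
        for family, variants in ranked.items()
        if causal.get(family, set()) & set(variants[:k])
    )


# Toy data, invented for illustration only.
toy_predictions = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
toy_ranked = {
    "FAM1": ["v3", "v1", "v2"],
    "FAM2": ["v9", "v8", "v7", "v6", "v5", "v4"],
}
toy_causal = {"FAM1": {"v1"}, "FAM2": {"v4"}}

print(round(max_f_measure(toy_predictions, n_causal=2), 3))  # 0.8
print(families_recalled_in_top_k(toy_ranked, toy_causal, k=5))  # 1
```

In the toy data, the best cut-off is an EPCR of 0.7, which keeps both causal variants and one false positive. FAM2 is not recalled at k=5 because its causal variant is ranked sixth.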

Subject(s)

Rare Diseases , Humans , Rare Diseases/genetics , Rare Diseases/diagnosis , Genome, Human/genetics , Genetic Variation/genetics , Computational Biology/methods , Phenotype

CAGI6 ID-Challenge: Assessment of phenotype and variant predictions in 415 children with Neurodevelopmental Disorders (NDDs).

Aspromonte, Maria Cristina; Conte, Alessio Del; Zhu, Shaowen; Tan, Wuwei; Shen, Yang; Zhang, Yexian; Li, Qi; Wang, Maggie Haitian; Babbi, Giulia; Bovo, Samuele; Martelli, Pier Luigi; Casadio, Rita; Althagafi, Azza; Toonsi, Sumyyah; Kulmanov, Maxat; Hoehndorf, Robert; Katsonis, Panagiotis; Williams, Amanda; Lichtarge, Olivier; Xian, Su; Surento, Wesley; Pejaver, Vikas; Mooney, Sean D; Sunderam, Uma; Srinivasan, Rajgopal; Murgia, Alessandra; Piovesan, Damiano; Tosatto, Silvio C E; Leonardi, Emanuela.

Res Sq ; 2023 Aug 02.

Article in English | MEDLINE | ID: mdl-37577579

ABSTRACT

In the context of the Critical Assessment of Genome Interpretation, 6th edition (CAGI6), the Genetics of Neurodevelopmental Disorders Lab in Padua proposed a new ID-challenge to provide an opportunity to develop computational methods for predicting patients' phenotypes and causal variants. Eight research teams, with 30 models, had access to phenotype details and real genetic data, based on the sequences of 74 genes (VCF format) in 415 pediatric patients affected by Neurodevelopmental Disorders (NDDs). NDDs are clinically and genetically heterogeneous conditions with onset in infancy. In this study, we evaluate the ability and accuracy of computational methods to predict comorbid phenotypes, based on the clinical features described for each patient, and causal variants. Finally, we asked participants to develop a method to find new possible genetic causes for patients without a genetic diagnosis. As in CAGI5, seven clinical features (ID, ASD, ataxia, epilepsy, microcephaly, macrocephaly, hypotonia) and variants (causative, putative pathogenic and contributing factors) were provided. Considering the overall clinical manifestation of our cohort, we provided the variant data and phenotypic traits of the 150 patients from the CAGI5 ID-Challenge as training and validation data for the development of prediction methods.

Critical assessment of variant prioritization methods for rare disease diagnosis within the Rare Genomes Project.

Stenton, Sarah L; O'Leary, Melanie; Lemire, Gabrielle; VanNoy, Grace E; DiTroia, Stephanie; Ganesh, Vijay S; Groopman, Emily; O'Heir, Emily; Mangilog, Brian; Osei-Owusu, Ikeoluwa; Pais, Lynn S; Serrano, Jillian; Singer-Berk, Moriel; Weisburd, Ben; Wilson, Michael; Austin-Tse, Christina; Abdelhakim, Marwa; Althagafi, Azza; Babbi, Giulia; Bellazzi, Riccardo; Bovo, Samuele; Carta, Maria Giulia; Casadio, Rita; Coenen, Pieter-Jan; De Paoli, Federica; Floris, Matteo; Gajapathy, Manavalan; Hoehndorf, Robert; Jacobsen, Julius O B; Joseph, Thomas; Kamandula, Akash; Katsonis, Panagiotis; Kint, Cyrielle; Lichtarge, Olivier; Limongelli, Ivan; Lu, Yulan; Magni, Paolo; Mamidi, Tarun Karthik Kumar; Martelli, Pier Luigi; Mulargia, Marta; Nicora, Giovanna; Nykamp, Keith; Pejaver, Vikas; Peng, Yisu; Pham, Thi Hong Cam; Podda, Maurizio S; Rao, Aditya; Rizzo, Ettore; Saipradeep, Vangala G; Savojardo, Castrense.

medRxiv ; 2023 Aug 04.

Article in English | MEDLINE | ID: mdl-37577678

ABSTRACT

Background: A major obstacle faced by rare disease families is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years, and causal variants are identified in under 50%. The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing (GS) for diagnosis and gene discovery. Families are consented for sharing of sequence and phenotype data with researchers, allowing development of a Critical Assessment of Genome Interpretation (CAGI) community challenge, placing variant prioritization models head-to-head in a real-life clinical diagnostic setting. Methods: Predictors were provided a dataset of phenotype terms and variant calls from GS of 175 RGP individuals (65 families), including 35 solved training set families, with causal variants specified, and 30 test set families (14 solved, 16 unsolved). The challenge tasked teams with identifying the causal variants in as many test set families as possible. Ranked variant predictions were submitted with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics, a weighted score based on rank position of true positive causal variants and maximum F-measure, based on precision and recall of causal variants across EPCR thresholds. Results: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performing teams recalled the causal variants in up to 13 of 14 solved families by prioritizing high quality variant calls that were rare, predicted deleterious, segregating correctly, and consistent with reported phenotype. In unsolved families, newly discovered diagnostic variants were returned to two families following confirmatory RNA sequencing, and two prioritized novel disease gene candidates were entered into Matchmaker Exchange. 
In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant, in an unsolved proband with phenotype overlap with asparagine synthetase deficiency. Conclusions: By objective assessment of variant predictions, we provide insights into current state-of-the-art algorithms and platforms for genome sequencing analysis for rare disease diagnosis and explore areas for future optimization. Identification of diagnostic variants in unsolved families promotes synergy between researchers with clinical and computational expertise as a means of advancing the field of clinical genome interpretation.

STARVar: symptom-based tool for automatic ranking of variants using evidence from literature and genomes.

Kafkas, Şenay; Abdelhakim, Marwa; Uludag, Mahmut; Althagafi, Azza; Alghamdi, Malak; Hoehndorf, Robert.

BMC Bioinformatics ; 24(1): 294, 2023 Jul 21.

Article in English | MEDLINE | ID: mdl-37479972

ABSTRACT

BACKGROUND: Identifying variants associated with diseases is a challenging task in medical genetics research. Current studies that prioritize variants within individual genomes generally rely on known variants, evidence from literature and genomes, and patient symptoms and clinical signs. The functionalities of the existing tools, which rank variants based on given patient symptoms and clinical signs, are restricted to the coverage of ontologies such as the Human Phenotype Ontology (HPO). However, most clinicians do not limit themselves to HPO while describing patient symptoms/signs and their associated variants/genes. There is thus a need for an automated tool that can prioritize variants based on freely expressed patient symptoms and clinical signs. RESULTS: STARVar is a Symptom-based Tool for Automatic Ranking of Variants using evidence from literature and genomes. STARVar uses patient symptoms and clinical signs, either linked to HPO or expressed in free text format. It returns a ranked list of variants based on a combined score from two classifiers utilizing evidence from genomics and literature. STARVar improves over related tools on a set of synthetic patients. In addition, we demonstrated its distinct contribution to the domain on another synthetic dataset covering publicly available clinical genotype-phenotype associations by using symptoms and clinical signs expressed in free text format. CONCLUSIONS: STARVar stands as a unique and efficient tool that has the advantage of ranking variants with flexibly expressed patient symptoms in free-form text. Therefore, STARVar can be easily integrated into bioinformatics workflows designed to analyze disease-associated genomes. AVAILABILITY: STARVar is freely available from https://github.com/bio-ontology-research-group/STARVar .

Subject(s)

Genomics , Software , Humans , Phenotype , Computational Biology , Genetic Association Studies

DeepSVP: integration of genotype and phenotype for structural variant prioritization using deep learning.

Althagafi, Azza; Alsubaie, Lamia; Kathiresan, Nagarajan; Mineta, Katsuhiko; Aloraini, Taghrid; Al Mutairi, Fuad; Alfadhel, Majid; Gojobori, Takashi; Alfares, Ahmad; Hoehndorf, Robert.

Bioinformatics ; 38(6): 1677-1684, 2022 Mar 04.

Article in English | MEDLINE | ID: mdl-34951628

ABSTRACT

MOTIVATION: Structural genomic variants account for much of human variability and are involved in several diseases. Structural variants are complex and may affect coding regions of multiple genes, or affect the functions of genomic regions in different ways from single nucleotide variants. Interpreting the phenotypic consequences of structural variants relies on information about gene functions, haploinsufficiency or triplosensitivity, and other genomic features. Phenotype-based methods for identifying variants that are involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been applied successfully to single nucleotide variants as well as short insertions and deletions, the complexity of structural variants makes it more challenging to link them to phenotypes. Furthermore, structural variants can affect a large number of coding regions, and phenotype information may not be available for all of them. RESULTS: We developed DeepSVP, a computational method to prioritize structural variants involved in genetic diseases by combining genomic and gene function information. We incorporate phenotypes linked to genes, functions of gene products, gene expression in individual cell types and anatomical sites of expression, and systematically relate them to their phenotypic consequences through ontologies and machine learning. DeepSVP significantly improves the success rate of finding causative variants in several benchmarks and can identify novel pathogenic structural variants in consanguineous families. AVAILABILITY AND IMPLEMENTATION: https://github.com/bio-ontology-research-group/DeepSVP. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Deep Learning , Humans , Genotype , Phenotype , Genomics , Nucleotides

Predicting candidate genes from phenotypes, functions and anatomical site of expression.

Chen, Jun; Althagafi, Azza; Hoehndorf, Robert.

Bioinformatics ; 37(6): 853-860, 2021 May 05.

Article in English | MEDLINE | ID: mdl-33051643

ABSTRACT

MOTIVATION: Over the past years, many computational methods have been developed to incorporate information about phenotypes for the disease-gene prioritization task. These methods generally compute the similarity between a patient's phenotypes and a database of gene-phenotype associations to find the most phenotypically similar match. The main limitation of these methods is their reliance on knowledge about phenotypes associated with particular genes, which is incomplete in humans as well as in many model organisms, such as the mouse and fish. Information about functions of gene products and anatomical site of gene expression is available for more genes and can also be related to phenotypes through ontologies and machine-learning models. RESULTS: We developed a novel graph-based machine-learning method for biomedical ontologies, which is able to exploit axioms in ontologies and other graph-structured data. Using our machine-learning method, we embed genes based on their associated phenotypes, functions of the gene products and anatomical location of gene expression. We then develop a machine-learning model to predict gene-disease associations based on the associations between genes and multiple biomedical ontologies, and this model significantly improves over state-of-the-art methods. Furthermore, we extend phenotype-based gene prioritization methods significantly to all genes that are associated with phenotypes, functions or site of expression. AVAILABILITY AND IMPLEMENTATION: Software and data are available at https://github.com/bio-ontology-research-group/DL2Vec. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
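
The MOTIVATION above describes the baseline that phenotype-based prioritization methods share: score each gene by how similar its known phenotypes are to the patient's. The sketch below illustrates that baseline only. It is not the graph-based embedding method the abstract introduces, and it substitutes plain Jaccard overlap of Human Phenotype Ontology (HPO) term sets for the ontology-aware semantic similarity measures such tools typically use. The gene names and gene-phenotype annotations are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two term sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_genes(
    patient_terms: Set[str], gene_phenotypes: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank genes by phenotype overlap with the patient, best first.

    Ties are broken alphabetically by gene name so the order is stable.
    """
    scores = [
        (gene, jaccard(patient_terms, terms))
        for gene, terms in gene_phenotypes.items()
    ]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Patient phenotypes as HPO identifiers:
# seizure, intellectual disability, hypotonia.
patient = {"HP:0001250", "HP:0001249", "HP:0001252"}

# Hypothetical gene-phenotype annotations, for illustration only.
gene_db = {
    "GENE_A": {"HP:0001250", "HP:0001249"},
    "GENE_B": {"HP:0000252"},
    "GENE_C": {"HP:0001252", "HP:0000252"},
}

for gene, score in rank_genes(patient, gene_db):
    print(gene, round(score, 3))
```

A gene with no phenotype annotations always scores zero here, which is the limitation the abstract points out: genes without known phenotypes cannot be prioritized by similarity alone.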

Subject(s)

Biological Ontologies , Animals , Data Management , Machine Learning , Mice , Phenotype , Software

EMC10 homozygous variant identified in a family with global developmental delay, mild intellectual disability, and speech delay.

Umair, Muhammad; Ballow, Mariam; Asiri, Abdulaziz; Alyafee, Yusra; Al Tuwaijri, Abeer; Alhamoudi, Kheloud M; Aloraini, Taghrid; Abdelhakim, Marwa; Althagafi, Azza Thamer; Kafkas, Senay; Alsubaie, Lamia; Alrifai, Muhammad Talal; Hoehndorf, Robert; Alfares, Ahmed; Alfadhel, Majid.

Clin Genet ; 98(6): 555-561, 2020 Dec.

Article in English | MEDLINE | ID: mdl-32869858

ABSTRACT

In recent years, several genes have been implicated in the variable disease presentation of global developmental delay (GDD) and intellectual disability (ID). The endoplasmic reticulum membrane protein complex (EMC) family is known to be involved in GDD and ID. Homozygous variants of EMC1 are associated with GDD, scoliosis, and cerebellar atrophy, indicating the relevance of this pathway for neurogenetic disorders. EMC10 is a bone marrow-derived angiogenic growth factor that plays an important role in infarct vascularization and promoting tissue repair. However, this gene has not been previously associated with human disease. Herein, we describe a Saudi family with two individuals segregating a recessive neurodevelopmental disorder. Both of the affected individuals showed mild ID, speech delay, and GDD. Whole-exome sequencing (WES) and Sanger sequencing were performed to identify candidate genes. Further, to elucidate the functional effects of the variant, quantitative real-time PCR (RT-qPCR)-based expression analysis was performed. WES revealed a homozygous splice acceptor site variant (c.679-1G>A) in EMC10 (chromosome 19q13.33) that segregated perfectly within the family. RT-qPCR showed a substantial decrease in the relative EMC10 gene expression in the patients, indicating the pathogenicity of the identified variant. For the first time in the literature, the EMC10 gene variant was associated with mild ID, speech delay, and GDD. Thus, this gene plays a key role in developmental milestones, with the potential to cause neurodevelopmental disorders in humans.

Subject(s)

Developmental Disabilities/genetics , Intellectual Disability/genetics , Language Development Disorders/genetics , Membrane Proteins/genetics , Adolescent , Child , Consanguinity , Developmental Disabilities/physiopathology , Genetic Predisposition to Disease , Homozygote , Humans , Intellectual Disability/physiopathology , Language Development Disorders/physiopathology , Male , Mutation/genetics , Pedigree , RNA Splice Sites/genetics , Saudi Arabia/epidemiology , Exome Sequencing

What is the right sequencing approach? Solo vs extended family analysis in consanguineous populations.

Alfares, Ahmed; Alsubaie, Lamia; Aloraini, Taghrid; Alaskar, Aljoharah; Althagafi, Azza; Alahmad, Ahmed; Rashid, Mamoon; Alswaid, Abdulrahman; Alothaim, Ali; Eyaid, Wafaa; Ababneh, Faroug; Albalwi, Mohammed; Alotaibi, Raniah; Almutairi, Mashael; Altharawi, Nouf; Alsamer, Alhanouf; Abdelhakim, Marwa; Kafkas, Senay; Mineta, Katsuhiko; Cheung, Nicole; Abdallah, Abdallah M; Büchmann-Møller, Stine; Fukasawa, Yoshinori; Zhao, Xiang; Rajan, Issaac; Hoehndorf, Robert; Al Mutairi, Fuad; Gojobori, Takashi; Alfadhel, Majid.

BMC Med Genomics ; 13(1): 103, 2020 Jul 17.

Article in English | MEDLINE | ID: mdl-32680510

ABSTRACT

BACKGROUND: Testing strategies are crucial for genetics clinics and testing laboratories. In this study, we compared the hit rate between solo, trio and trio plus testing, and between trio and sibship testing. Finally, we studied the impact of extended family analysis, mainly in complex and unsolved cases. METHODS: Three cohorts were used for this analysis: one cohort to assess the hit rate between solo, trio and trio plus testing, another cohort to examine the impact of the testing strategy of sibship genome vs trio-based analysis, and a third cohort to test the impact of an extended family analysis of up to eight family members to lower the number of candidate variants. RESULTS: The hit rates in solo, trio and trio plus testing were 39, 40, and 41%, respectively. The total number of candidate variants in the sibship testing strategy was 117 variants compared to 59 variants in the trio-based analysis. In trio-based analysis, the average number of candidate variants was 1192 coding and 26,454 noncoding variants, and this number was lowered by 50-75% after adding additional family members, with up to two coding and 66 noncoding homozygous variants only, in families with eight family members. CONCLUSION: There was no difference in the hit rate between solo testing and testing with extended family members. Trio-based analysis was a better approach than sibship testing, even in a consanguineous population. Finally, each additional family member helped to narrow down the number of variants by 50-75%. Our findings could help clinicians, researchers and testing laboratories select the most cost-effective and appropriate sequencing approach for their patients. Furthermore, using extended family analysis is a very useful tool for complex cases with novel genes.

Subject(s)

Consanguinity , Exome , Family , Genetic Markers , Genetic Predisposition to Disease , Genetic Testing , Genetic Variation , Adult , Child , Female , Humans , Male , Retrospective Studies , Exome Sequencing
