Search | VHL Regional Portal

Transfer learning with false negative control improves polygenic risk prediction.

Jeng, Xinge Jessie; Hu, Yifei; Venkat, Vaishnavi; Lu, Tzu-Pin; Tzeng, Jung-Ying.

PLoS Genet ; 19(11): e1010597, 2023 Nov.

Article in English | MEDLINE | ID: mdl-38011285

ABSTRACT

A polygenic risk score (PRS) aggregates the effects of variants across the genome to estimate an individual's genetic predisposition for a given trait. PRS analysis typically takes two input data sets: base data for effect size estimation and target data for individual-level prediction. As large-scale base data become more widely available, it is increasingly common for the ancestral backgrounds of the base and target data not to match perfectly. In this paper, we treat the GWAS summary information obtained from the base data as knowledge learned from a pre-trained model, and adopt a transfer learning framework that leverages this knowledge, whether or not the base data share a similar ancestral background with the target samples, to build prediction models for target individuals. Our proposed transfer learning framework consists of two main steps: (1) conducting false negative control (FNC) marginal screening to extract useful knowledge from the base data; and (2) performing joint model training to integrate the knowledge extracted from the base data with the target training data for accurate trans-data prediction. This new approach can significantly enhance the computational and statistical efficiency of joint-model training, alleviate over-fitting, and facilitate more accurate trans-data prediction whether the level of heterogeneity between the target and base data sets is low or high.
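The two-step structure described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the FNC screening rule or the joint model, so the sketch assumes a simple "keep the top-ranked variants by base-data p-value" rule for step 1 and a ridge regression for step 2. The function names (`screen_variants`, `fit_ridge`, `predict`), the parameters `n_keep` and `penalty`, and the simulated data are all hypothetical.

```python
import numpy as np


def screen_variants(base_pvalues, n_keep):
    """Step 1 stand-in: keep the n_keep variants with the smallest base-data p-values.

    ASSUMPTION: a fixed n_keep replaces the paper's FNC rule, which the
    abstract does not spell out.
    """
    order = np.argsort(np.asarray(base_pvalues, dtype=float), kind="stable")
    return np.sort(order[:n_keep])


def fit_ridge(X, y, penalty):
    """Step 2 stand-in: joint ridge fit on target training data (screened variants only)."""
    X_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - X_mean
    yc = y - y_mean
    n_features = X.shape[1]
    # Closed-form ridge solution on centered data; the intercept is not penalized.
    beta = np.linalg.solve(Xc.T @ Xc + penalty * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - X_mean @ beta
    return intercept, beta


def predict(X, intercept, beta):
    """Risk score for each individual: intercept plus weighted genotype sum."""
    return intercept + X @ beta


# Simulated target data (hypothetical): 200 individuals, 50 variants, 5 causal.
rng = np.random.default_rng(0)
n_individuals, n_variants = 200, 50
X = rng.binomial(2, 0.3, size=(n_individuals, n_variants)).astype(float)
true_beta = np.zeros(n_variants)
true_beta[:5] = 0.5
y = X @ true_beta + rng.normal(size=n_individuals)

# Simulated base-data summary statistics: the causal variants have tiny p-values.
base_pvalues = rng.uniform(size=n_variants)
base_pvalues[:5] = 1e-8

kept = screen_variants(base_pvalues, n_keep=10)
intercept, beta = fit_ridge(X[:, kept], y, penalty=1.0)
scores = predict(X[:, kept], intercept, beta)
```

Screening before the joint fit shrinks the design matrix from 50 columns to 10 here, which is the kind of computational saving the abstract attributes to the first step; how many variants to retain is exactly what the FNC procedure is meant to decide.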

Subject(s)

Genome-Wide Association Study; Polymorphism, Single Nucleotide; Humans; Polymorphism, Single Nucleotide/genetics; Genetic Predisposition to Disease; Phenotype; Multifactorial Inheritance/genetics; Machine Learning; Risk Factors

Rare Variants Association Analysis in Large-Scale Sequencing Studies at the Single Locus Level.

Jeng, Xinge Jessie; Daye, Zhongyin John; Lu, Wenbin; Tzeng, Jung-Ying.

PLoS Comput Biol ; 12(6): e1004993, 2016 06.

Article in English | MEDLINE | ID: mdl-27355347

ABSTRACT

Genetic association analyses of rare variants in next-generation sequencing (NGS) studies are fundamentally challenging because of the very large number of candidate variants at extremely low minor allele frequencies. Recent developments often focus on pooling multiple variants to provide association analysis at the gene level instead of the locus level. Nonetheless, pinpointing individual variants is a critical goal for genomic research, as such information can facilitate the precise delineation of the molecular mechanisms and functions of genetic factors in diseases. Because of the extreme rarity of mutations and the high dimensionality, the significance of causal variants cannot easily stand out from that of noncausal ones. Consequently, standard false-positive control procedures, such as the Bonferroni and false discovery rate (FDR) procedures, are often impractical to apply, as a majority of the causal variants can only be identified along with a small but unknown number of noncausal variants. To provide informative analysis of individual variants in large-scale sequencing studies, we propose the Adaptive False-Negative Control (AFNC) procedure, which can include a large proportion of causal variants with high confidence by introducing a novel statistical inquiry to determine those variants that can be confidently dismissed as noncausal. The AFNC provides a general framework that can accommodate a variety of models and significance tests. The procedure is computationally efficient and can adapt to the underlying proportion of causal variants and the quality of the significance rankings. Extensive simulation studies across a wide range of scenarios demonstrate that the AFNC is advantageous for identifying individual rare variants, whereas the Bonferroni and FDR procedures are exceedingly over-conservative for rare variant association studies. In the analyses of the CoLaus dataset, the AFNC identified the individual variants most responsible for gene-level significance. Moreover, single-variant results from the AFNC have been successfully applied to infer related genes using annotation information.
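For readers unfamiliar with the two false-positive control baselines that this abstract compares against, the following Python sketch shows the standard Bonferroni correction and the Benjamini-Hochberg FDR step-up rule. It does not implement the AFNC procedure, whose details the abstract does not give. The function names and the example p-values are hypothetical.

```python
import numpy as np


def bonferroni_select(pvalues, alpha=0.05):
    """Indices of tests rejected by Bonferroni: p <= alpha / m."""
    p = np.asarray(pvalues, dtype=float)
    return np.flatnonzero(p <= alpha / p.size)


def benjamini_hochberg_select(pvalues, alpha=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p, kind="stable")
    # The i-th smallest p-value is compared with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.flatnonzero(p[order] <= thresholds)
    if passed.size == 0:
        return np.array([], dtype=int)
    # Reject every test up to the largest rank that passes its threshold.
    n_reject = passed[-1] + 1
    return np.sort(order[:n_reject])


# Hypothetical single-variant p-values for ten variants.
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
                   0.060, 0.074, 0.205, 0.212, 0.216]

bonferroni_hits = bonferroni_select(example_pvalues)
bh_hits = benjamini_hochberg_select(example_pvalues)
```

On these ten p-values, Bonferroni retains only the first variant and Benjamini-Hochberg retains the first two; variants with moderately small p-values are excluded by both rules, which is the conservativeness the abstract refers to.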

Subject(s)

Gene Frequency/genetics; Genetic Association Studies; Genetic Predisposition to Disease/genetics; Genomics; High-Throughput Nucleotide Sequencing; Cardiovascular Diseases/genetics; Computer Simulation; Databases, Factual; Drug Delivery Systems; Humans; Models, Genetic
