Results 1 - 15 of 15
1.
Sci Rep ; 13(1): 11662, 2023 07 19.
Article in English | MEDLINE | ID: mdl-37468507

ABSTRACT

In this paper we characterize the performance of linear models trained via widely used sparse machine learning algorithms. We build polygenic scores and examine performance as a function of training set size, genetic ancestral background, and training method. We show that predictor performance is most strongly dependent on size of training data, with smaller gains from algorithmic improvements. We find that LASSO generally performs as well as the best methods, judged by a variety of metrics. We also investigate performance characteristics of predictors trained on one genetic ancestry group when applied to another. Using LASSO, we develop a novel method for projecting AUC and correlation as a function of data size (i.e., for new biobanks) and characterize the asymptotic limit of performance. Additionally, for LASSO (compressed sensing) we show that performance metrics and predictor sparsity are in agreement with theoretical predictions from the Donoho-Tanner phase transition. Specifically, a future predictor trained in the Taiwan Precision Medicine Initiative for asthma can achieve an AUC of [Formula: see text] and for height a correlation of [Formula: see text] for a Taiwanese population. This is above the measured values of [Formula: see text] and [Formula: see text], respectively, for UK Biobank trained predictors applied to a European population.
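
To illustrate the kind of sparse linear predictor this abstract refers to, the following is a minimal sketch of L1-penalized regression (LASSO) solved by cyclic coordinate descent on simulated data. It is not the authors' pipeline: the toy genotype coding, the sample sizes, the penalty `lam`, and all function names are assumptions chosen for illustration.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator: shrinks z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                beta[j] = new_b
    return beta


def simulate_sparse_trait(n, p, true_beta, noise_sd, seed=0):
    """Toy data (assumption): centered genotype codes in {-1, 0, 1}, sparse linear trait."""
    rng = random.Random(seed)
    X = [[float(rng.choice((-1, 0, 1))) for _ in range(p)] for _ in range(n)]
    y = [sum(b * row[j] for j, b in true_beta.items()) + rng.gauss(0.0, noise_sd)
         for row in X]
    return X, y


if __name__ == "__main__":
    # Hypothetical sparse architecture: 5 causal features out of 50.
    true_beta = {3: 1.0, 11: -1.0, 20: 1.0, 34: -1.0, 47: 1.0}
    X, y = simulate_sparse_trait(n=200, p=50, true_beta=true_beta, noise_sd=0.5)
    beta = lasso_coordinate_descent(X, y, lam=0.1)
    print("selected features:", [j for j, b in enumerate(beta) if b != 0.0])
```

The penalty sets most coefficients exactly to zero, which is the sparsity property the abstract discusses; rerunning the sketch with different `n` gives a rough feel for how recovery depends on training set size.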


Subject(s)
Asthma , Biological Specimen Banks , Humans , Machine Learning , Forecasting , Algorithms
2.
Sci Rep ; 12(1): 18173, 2022 10 28.
Article in English | MEDLINE | ID: mdl-36307513

ABSTRACT

We construct a polygenic health index as a weighted sum of polygenic risk scores for 20 major disease conditions, including, e.g., coronary artery disease, type 1 and 2 diabetes, schizophrenia, etc. Individual weights are determined by population-level estimates of impact on life expectancy. We validate this index in odds ratios and selection experiments using unrelated individuals and siblings (pairs and trios) from the UK Biobank. Individuals with higher index scores have decreased disease risk across almost all 20 diseases (no significant risk increases), and longer calculated life expectancy. When estimated Disability Adjusted Life Years (DALYs) are used as the performance metric, the gain from selection among ten individuals (highest index score vs average) is found to be roughly 4 DALYs. We find no statistical evidence for antagonistic trade-offs in risk reduction across these diseases. Correlations between genetic disease risks are found to be mostly positive and generally mild. These results have important implications for public health and also for fundamental issues such as pleiotropy and genetic architecture of human disease conditions.
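
The index construction described here, a weighted sum of per-disease polygenic risk scores, can be sketched as follows. The disease names, weights, and sign convention (negated so that a higher index means lower weighted risk) are placeholders assumed for illustration, not values from the paper.

```python
def health_index(prs_by_disease, weights):
    """Weighted sum of polygenic risk scores.

    Assumption: weights are positive impact estimates, and the sum is negated
    so that a higher index corresponds to lower weighted disease risk.
    """
    return -sum(weights[d] * prs for d, prs in prs_by_disease.items())


def select_highest_index(candidates, weights):
    """Return the candidate (a dict with a 'prs' entry) with the highest index."""
    return max(candidates, key=lambda c: health_index(c["prs"], weights))


if __name__ == "__main__":
    # Hypothetical weights, e.g. standing in for life-expectancy impact per disease.
    weights = {"disease_a": 3.0, "disease_b": 2.0, "disease_c": 1.0}
    candidates = [
        {"id": 1, "prs": {"disease_a": 1.0, "disease_b": -1.0, "disease_c": 0.0}},
        {"id": 2, "prs": {"disease_a": -0.5, "disease_b": 0.2, "disease_c": 0.1}},
    ]
    best = select_highest_index(candidates, weights)
    print(best["id"], health_index(best["prs"], weights))
```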


Subject(s)
Diabetes Mellitus, Type 1 , Diabetes Mellitus, Type 2 , Humans , Siblings , Multifactorial Inheritance , Life Expectancy , Risk Reduction Behavior , Risk Factors
3.
Methods Mol Biol ; 2467: 421-446, 2022.
Article in English | MEDLINE | ID: mdl-35451785

ABSTRACT

Decoding the genome confers the capability to predict characteristics of the organism (phenotype) from DNA (genotype). We describe the present status and future prospects of genomic prediction of complex traits in humans. Some highly heritable complex phenotypes such as height and other quantitative traits can already be predicted with reasonable accuracy from DNA alone. For many diseases, including important common conditions such as coronary artery disease, breast cancer, type I and II diabetes, individuals with outlier polygenic scores (e.g., top few percent) have been shown to have 5 or even 10 times higher risk than average. Several psychiatric conditions such as schizophrenia and autism also fall into this category. We discuss related topics such as the genetic architecture of complex traits, sibling validation of polygenic scores, and applications to adult health, in vitro fertilization (embryo selection), and genetic engineering.


Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance , Genomics , Genotype , Humans , Models, Genetic , Phenotype , Polymorphism, Single Nucleotide
4.
Phys Rev Lett ; 128(11): 111301, 2022 Mar 18.
Article in English | MEDLINE | ID: mdl-35362995

ABSTRACT

We explore the relationship between the quantum state of a compact matter source and of its asymptotic graviton field. For a matter source in an energy eigenstate, the graviton state is determined at leading order by the energy eigenvalue. Insofar as there are no accidental energy degeneracies there is a one to one map between graviton states on the boundary of spacetime and the matter source states. Effective field theory allows us to compute a purely quantum gravitational effect which causes the subleading asymptotic behavior of the graviton state to depend on the internal structure of the source. This establishes the existence of ubiquitous quantum hair due to gravitational effects.

6.
Genes (Basel) ; 12(7), 2021 06 29.
Article in English | MEDLINE | ID: mdl-34209487

ABSTRACT

We use UK Biobank data to train predictors for 65 blood and urine markers such as HDL, LDL, lipoprotein A, glycated haemoglobin, etc. from SNP genotype. For example, our Polygenic Score (PGS) predictor correlates ∼0.76 with lipoprotein A level, which is highly heritable and an independent risk factor for heart disease. This may be the most accurate genomic prediction of a quantitative trait that has yet been produced (specifically, for European ancestry groups). We also train predictors of common disease risk using blood and urine biomarkers alone (no DNA information); we call these predictors biomarker risk scores, BMRS. Individuals who are at high risk (e.g., odds ratio of >5× population average) can be identified for conditions such as coronary artery disease (AUC∼0.75), diabetes (AUC∼0.95), hypertension, liver and kidney problems, and cancer using biomarkers alone. Our atherosclerotic cardiovascular disease (ASCVD) predictor uses ∼10 biomarkers and performs in UKB evaluation as well as or better than the American College of Cardiology ASCVD Risk Estimator, which uses quite different inputs (age, diagnostic history, BMI, smoking status, statin usage, etc.). We compare polygenic risk scores (risk conditional on genotype: PRS) for common diseases to the risk predictors which result from the concatenation of learned functions BMRS and PGS, i.e., applying the BMRS predictors to the PGS output.


Subject(s)
Atherosclerosis/epidemiology , Biomarkers/blood , Biomarkers/urine , Cardiovascular Diseases/epidemiology , Lipoprotein(a)/blood , Adult , Atherosclerosis/blood , Atherosclerosis/urine , Biological Specimen Banks , Calcium/blood , Calcium/urine , Cardiovascular Diseases/blood , Female , Heart Disease Risk Factors , Hemoglobins/genetics , Humans , Lipoproteins, HDL/blood , Lipoproteins, LDL/blood , Machine Learning , Male , Middle Aged , Multifactorial Inheritance/genetics , Risk Assessment , United Kingdom/epidemiology , United States/epidemiology
7.
Sci Rep ; 10(1): 13190, 2020 08 06.
Article in English | MEDLINE | ID: mdl-32764582

ABSTRACT

We test 26 polygenic predictors using tens of thousands of genetic siblings from the UK Biobank (UKB), for whom we have SNP genotypes, health status, and phenotype information in late adulthood. Siblings have typically experienced similar environments during childhood, and exhibit negligible population stratification relative to each other. Therefore, the ability to predict differences in disease risk or complex trait values between siblings is a strong test of genomic prediction in humans. We compare validation results obtained using non-sibling subjects to those obtained among siblings and find that typically most of the predictive power persists in between-sibling designs. In the case of disease risk we test the extent to which higher polygenic risk score (PRS) identifies the affected sibling, and also compute Relative Risk Reduction as a function of risk score threshold. For quantitative traits we examine between-sibling differences in trait values as a function of predicted differences, and compare to performance in non-sibling pairs. Example results: Given 1 sibling with normal-range PRS (< 84th percentile, < +1 SD) and 1 sibling with high PRS (top few percentiles, i.e., > +2 SD), the predictors identify the affected sibling about 70-90% of the time across a variety of disease conditions, including Breast Cancer, Heart Attack, Diabetes, etc. 55-65% of the time the higher PRS sibling is the case. For quantitative traits such as height, the predictor correctly identifies the taller sibling roughly 80 percent of the time when the (male) height difference is 2 inches or more.
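
The between-sibling test described in this abstract can be sketched as a simple counting exercise over sibling pairs in which exactly one sibling is affected. The function names, the z-score thresholds used as defaults, and the toy numbers below are assumptions for illustration only.

```python
def fraction_higher_prs_is_case(pairs):
    """Fraction of pairs in which the affected sibling has the higher PRS.

    pairs: list of (prs_affected, prs_unaffected), in SD units.
    Pairs with identical scores carry no information and are skipped.
    """
    informative = [(a, u) for a, u in pairs if a != u]
    if not informative:
        raise ValueError("no informative sibling pairs")
    hits = sum(1 for a, u in informative if a > u)
    return hits / len(informative)


def normal_vs_high_pairs(pairs, normal_max=1.0, high_min=2.0):
    """Keep pairs with one sibling below +1 SD and the other above +2 SD."""
    return [(a, u) for a, u in pairs
            if min(a, u) < normal_max and max(a, u) > high_min]


if __name__ == "__main__":
    # Made-up PRS z-scores: (affected sibling, unaffected sibling).
    toy_pairs = [(2.5, 0.3), (0.1, 2.2), (1.5, 0.2), (2.1, -0.5), (0.4, 0.9)]
    print(fraction_higher_prs_is_case(toy_pairs))
    print(fraction_higher_prs_is_case(normal_vs_high_pairs(toy_pairs)))
```

Restricting to pairs with one normal-range and one high-score sibling isolates the outlier comparison, while the unrestricted fraction corresponds to the all-pairs comparison.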


Subject(s)
Computational Biology , Disease/genetics , Genetic Predisposition to Disease/genetics , Phenotype , Siblings , Biological Specimen Banks , Female , Humans , Male , Polymorphism, Single Nucleotide
8.
Sci Rep ; 10(1): 12055, 2020 07 21.
Article in English | MEDLINE | ID: mdl-32694572

ABSTRACT

Genomic prediction of complex human traits (e.g., height, cognitive ability, bone density) and disease risks (e.g., breast cancer, diabetes, heart disease, atrial fibrillation) has advanced considerably in recent years. Using data from the UK Biobank, predictors have been constructed using penalized algorithms that favor sparsity: i.e., which use as few genetic variants as possible. We analyze the specific genetic variants (SNPs) utilized in these predictors, whose number can vary from dozens to as many as thirty thousand. We find that the fraction of SNPs in or near genic regions varies widely by phenotype. For the majority of disease conditions studied, a large amount of the variance is accounted for by SNPs outside of coding regions. The state of these SNPs cannot be determined from exome-sequencing data. This suggests that exome data alone will miss much of the heritability for these traits; i.e., existing PRS cannot be computed from exome data alone. We also study the fraction of SNPs and of variance that is in common between pairs of predictors. The DNA regions used in disease risk predictors so far constructed seem to be largely disjoint (with a few interesting exceptions), suggesting that individual genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.


Subject(s)
Genetic Association Studies , Genetic Predisposition to Disease , Models, Genetic , Multifactorial Inheritance , Quantitative Trait, Heritable , Algorithms , Cluster Analysis , Humans , Polymorphism, Single Nucleotide , Exome Sequencing
9.
Sci Rep ; 9(1): 17515, 2019 Nov 20.
Article in English | MEDLINE | ID: mdl-31748697

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

10.
Sci Rep ; 9(1): 15286, 2019 10 25.
Article in English | MEDLINE | ID: mdl-31653892

ABSTRACT

We construct risk predictors using polygenic scores (PGS) computed from common Single Nucleotide Polymorphisms (SNPs) for a number of complex disease conditions, using L1-penalized regression (also known as LASSO) on case-control data from UK Biobank. Among the disease conditions studied are Hypothyroidism, (Resistant) Hypertension, Type 1 and 2 Diabetes, Breast Cancer, Prostate Cancer, Testicular Cancer, Gallstones, Glaucoma, Gout, Atrial Fibrillation, High Cholesterol, Asthma, Basal Cell Carcinoma, Malignant Melanoma, and Heart Attack. We obtain values for the area under the receiver operating characteristic curves (AUC) in the range ~0.58-0.71 using SNP data alone. Substantially higher predictor AUCs are obtained when incorporating additional variables such as age and sex. Some SNP predictors alone are sufficient to identify outliers (e.g., in the 99th percentile of polygenic score, or PGS) with 3-8 times higher risk than typical individuals. We validate predictors out-of-sample using the eMERGE dataset, and also with different ancestry subgroups within the UK Biobank population. Our results indicate that substantial improvements in predictive power are attainable using training sets with larger case populations. We anticipate rapid improvement in genomic prediction as more case-control data become available for analysis.
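
For readers unfamiliar with the AUC metric quoted in this abstract, it can be read as the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy scores are illustrative assumptions, not data from the paper.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Made-up polygenic scores for two cases and two controls.
    print(auc([0.6, 0.4], [0.5, 0.3]))
```

An AUC of 0.5 corresponds to a score that carries no information about case status, and 1.0 to perfect separation.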


Subject(s)
Breast Neoplasms/genetics , Diabetes Mellitus, Type 1/genetics , Diabetes Mellitus, Type 2/genetics , Genomics/methods , Myocardial Infarction/genetics , Prostatic Neoplasms/genetics , Algorithms , Breast Neoplasms/diagnosis , Case-Control Studies , Diabetes Mellitus, Type 1/diagnosis , Diabetes Mellitus, Type 2/diagnosis , Female , Genetic Predisposition to Disease/genetics , Humans , Male , Models, Genetic , Multifactorial Inheritance , Myocardial Infarction/diagnosis , Polymorphism, Single Nucleotide , Prognosis , Prostatic Neoplasms/diagnosis , ROC Curve , Risk Assessment/methods , Risk Assessment/statistics & numerical data , Risk Factors
11.
Genetics ; 210(2): 477-497, 2018 10.
Article in English | MEDLINE | ID: mdl-30150289

ABSTRACT

We construct genomic predictors for heritable but extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). The constructed predictors explain, respectively, ∼40, 20, and 9% of total variance for the three traits, in data not used for training. For example, predicted heights correlate ∼0.65 with actual height; actual heights of most individuals in validation samples are within a few centimeters of the prediction. The proportion of variance explained for height is comparable to the estimated common SNP heritability from genome-wide complex trait analysis (GCTA), and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for SNPs. Thus, our results close the gap between prediction R-squared and common SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common variants. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier genome-wide association studies (GWAS) for out-of-sample validation of our results.
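
The two height figures quoted in this abstract are consistent with each other if, as is usual for a linear predictor evaluated out of sample, the fraction of variance explained is taken to be the squared correlation between predicted and actual values (an assumption here, since the listing does not state the definition used):

```latex
\[
R^2 = r^2 \approx (0.65)^2 \approx 0.42 ,
\]
```

which matches the stated variance explained of roughly 40% for height.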


Subject(s)
Body Height/genetics , Models, Genetic , Genome, Human , Humans , Multifactorial Inheritance , Polymorphism, Single Nucleotide , Quantitative Trait, Heritable
12.
Gigascience ; 4: 44, 2015.
Article in English | MEDLINE | ID: mdl-26380078

ABSTRACT

BACKGROUND: One of the fundamental problems of modern genomics is to extract the genetic architecture of a complex trait from a data set of individual genotypes and trait values. Establishing this important connection between genotype and phenotype is complicated by the large number of candidate genes, the potentially large number of causal loci, and the likely presence of some nonlinear interactions between different genes. Compressed Sensing methods obtain solutions to under-constrained systems of linear equations. These methods can be applied to the problem of determining the best model relating genotype to phenotype, and generally deliver better performance than simply regressing the phenotype against each genetic variant, one at a time. We introduce a Compressed Sensing method that can reconstruct nonlinear genetic models (i.e., including epistasis, or gene-gene interactions) from phenotype-genotype (GWAS) data. Our method uses L1-penalized regression applied to nonlinear functions of the sensing matrix. RESULTS: The computational and data resource requirements for our method are similar to those necessary for reconstruction of linear genetic models (or identification of gene-trait associations), assuming a condition of generalized sparsity, which limits the total number of gene-gene interactions. An example of a sparse nonlinear model is one in which a typical locus interacts with several or even many others, but only a small subset of all possible interactions exist. It seems plausible that most genetic architectures fall in this category. We give theoretical arguments suggesting that the method is nearly optimal in performance, and demonstrate its effectiveness on broad classes of nonlinear genetic models using simulated human genomes and the small amount of currently available real data. A phase transition (i.e., dramatic and qualitative change) in the behavior of the algorithm indicates when sufficient data is available for its successful application. 
CONCLUSION: Our results indicate that predictive models for many complex traits, including a variety of human disease susceptibilities (e.g., with additive heritability h² ∼ 0.5), can be extracted from data sets comprised of n* ∼ 100s individuals, where s is the number of distinct causal variants influencing the trait. For example, given a trait controlled by ∼10k loci, roughly a million individuals would be sufficient for application of the method.
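
The closing example follows directly from the stated scaling, reading the sample-size rule as one hundred times the number of causal variants s:

```latex
\[
n_\star \sim 100\,s, \qquad s \sim 10^{4}
\;\Longrightarrow\;
n_\star \sim 100 \times 10^{4} = 10^{6} .
\]
```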


Subject(s)
Models, Genetic , Algorithms , Genome
13.
Gigascience ; 3: 10, 2014.
Article in English | MEDLINE | ID: mdl-25002967

ABSTRACT

BACKGROUND: The aim of a genome-wide association study (GWAS) is to isolate DNA markers for variants affecting phenotypes of interest. This is constrained by the fact that the number of markers often far exceeds the number of samples. Compressed sensing (CS) is a body of theory regarding signal recovery when the number of predictor variables (i.e., genotyped markers) exceeds the sample size. Its applicability to GWAS has not been investigated. RESULTS: Using CS theory, we show that all markers with nonzero coefficients can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability equal to one (h² = 1), there is a sharp phase transition from poor performance to complete selection as the sample size is increased. For heritability below one, complete selection still occurs, but the transition is smoothed. We find for h² ∼ 0.5 that a sample size of approximately thirty times the number of markers with nonzero coefficients is sufficient for full selection. This boundary is only weakly dependent on the number of genotyped markers. CONCLUSION: Practical measures of signal recovery are robust to linkage disequilibrium between a true causal variant and markers residing in the same genomic region. Given a limited sample size, it is possible to discover a phase transition by increasing the penalization; in this case a subset of the support may be recovered. Applying this approach to the GWAS analysis of height, we show that 70-100% of the selected markers are strongly correlated with height-associated markers identified by the GIANT Consortium.

14.
Phys Rev Lett ; 101(17): 171802, 2008 Oct 24.
Article in English | MEDLINE | ID: mdl-18999739

ABSTRACT

In grand unified theories with large numbers of fields, renormalization effects significantly modify the scale at which quantum gravity becomes strong. This in turn can modify the boundary conditions for coupling constant unification, if higher dimensional operators induced by gravity are taken into consideration. We show that the generic size of, and the uncertainty in, these effects from gravity can be larger than the two-loop corrections typically considered in renormalization group analyses of unification. In some cases, gravitational effects of modest size can render unification impossible.

15.
Phys Rev Lett ; 93(21): 211101, 2004 Nov 19.
Article in English | MEDLINE | ID: mdl-15600988

ABSTRACT

We derive fundamental limits on measurements of position, arising from quantum mechanics and classical general relativity. First, we show that any primitive probe or target used in an experiment must be larger than the Planck length l_P. This suggests a Planck-size minimum ball of uncertainty in any measurement. Next, we study interferometers (such as LIGO) whose precision is much finer than the size of any individual components and hence are not obviously limited by the minimum ball. Nevertheless, we deduce a fundamental limit on their accuracy of order l_P. Our results imply a device independent limit on possible position measurements.
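
To give a sense of scale for the Planck length referred to in this abstract, the short sketch below evaluates the standard definition l_P = sqrt(ħG/c³). The definition and the constant values are not given in the abstract; they are supplied here from the CODATA 2018 recommended values, and the function name is illustrative.

```python
import math

# CODATA 2018 values in SI units (supplied here; not taken from the abstract).
HBAR = 1.054571817e-34  # reduced Planck constant, J s
G = 6.67430e-11         # Newtonian constant of gravitation, m^3 kg^-1 s^-2
C = 299792458.0         # speed of light in vacuum, m/s


def planck_length():
    """Planck length l_P = sqrt(hbar * G / c^3), in meters."""
    return math.sqrt(HBAR * G / C ** 3)


if __name__ == "__main__":
    print(f"l_P = {planck_length():.4e} m")
```

The result is of order 10^-35 m, far below the length scales resolved by any existing instrument.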
