1.
Brain ; 147(6): 1996-2008, 2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38804604

ABSTRACT

The LRRK2 G2019S variant is the most common cause of monogenic Parkinson's disease (PD); however, questions remain regarding the penetrance, clinical phenotype and natural history of carriers. We performed a 3.5-year prospective longitudinal online study in a large cohort of 1286 genotyped LRRK2 G2019S carriers and 109 154 controls, with and without PD, recruited from the 23andMe Research Cohort. We collected self-reported motor and non-motor symptoms every 6 months, as well as demographics, family histories and environmental risk factors. Incident cases of PD (phenoconverters) were identified at follow-up. We determined lifetime risk of PD using accelerated failure time modelling and explored the impact of polygenic risk on penetrance. We also computed the genetic ancestry of all LRRK2 G2019S carriers in the 23andMe database and identified regions of the world where carrier frequencies are highest. We observed that despite a 1 year longer disease duration (P = 0.016), LRRK2 G2019S carriers with PD had a similar burden of motor symptoms, yet significantly fewer non-motor symptoms including cognitive difficulties, REM sleep behaviour disorder (RBD) and hyposmia (all P-values ≤ 0.0002). The cumulative incidence of PD in G2019S carriers by age 80 was 49%. G2019S carriers had a 10-fold risk of developing PD versus non-carriers. This rose to a 27-fold risk in G2019S carriers with a PD polygenic risk score in the top 25% versus non-carriers in the bottom 25%. In addition to identifying ancient founding events in people of North African and Ashkenazi descent, our genetic ancestry analyses infer that the G2019S variant was later introduced to Spanish colonial territories in the Americas. Our results suggest that LRRK2 G2019S PD is a slowly progressive, predominantly motor subtype of PD with a lower prevalence of hyposmia, RBD and cognitive impairment.
This suggests that the current prodromal criteria, which are based on idiopathic PD, may lack sensitivity to detect the early phases of LRRK2 PD in G2019S carriers. We show that polygenic burden may contribute to the development of PD in the LRRK2 G2019S carrier population. Collectively, the results should help support screening programmes and candidate enrichment strategies for upcoming trials of LRRK2 inhibitors in early-stage disease.


Subject(s)
Leucine-Rich Repeat Serine-Threonine Protein Kinase-2 , Parkinson Disease , Humans , Leucine-Rich Repeat Serine-Threonine Protein Kinase-2/genetics , Parkinson Disease/genetics , Female , Male , Middle Aged , Aged , Longitudinal Studies , Genetic Predisposition to Disease/genetics , Adult , Prospective Studies , Heterozygote , Penetrance , Aged, 80 and over , REM Sleep Behavior Disorder/genetics , Mutation
2.
Nat Commun ; 14(1): 6172, 2023 10 04.
Article in English | MEDLINE | ID: mdl-37794016

ABSTRACT

Atopic dermatitis (AD) is a common inflammatory skin condition and prior genome-wide association studies (GWAS) have identified 71 associated loci. In the current study we conducted the largest AD GWAS to date (discovery N = 1,086,394, replication N = 3,604,027), combining previously reported cohorts with additional available data. We identified 81 loci (29 novel) in the European-only analysis (which all replicated in a separate European analysis) and 10 additional loci in the multi-ancestry analysis (3 novel). Eight variants from the multi-ancestry analysis replicated in at least one of the populations tested (European, Latino or African), while two may be specific to individuals of Japanese ancestry. AD loci showed enrichment for DNase I hypersensitivity and eQTL associations in blood. At each locus we prioritised candidate genes by integrating multi-omic data. The implicated genes are predominantly in immune pathways of relevance to atopic inflammation and some offer drug repurposing opportunities.


Subject(s)
Dermatitis, Atopic , Genome-Wide Association Study , Humans , Dermatitis, Atopic/genetics , Genetic Predisposition to Disease/genetics , Hispanic or Latino/genetics , Black People , Polymorphism, Single Nucleotide
3.
Front Genet ; 13: 871260, 2022.
Article in English | MEDLINE | ID: mdl-35559025

ABSTRACT

A substantial proportion of the adult United States population with type 2 diabetes (T2D) are undiagnosed, calling into question the comprehensiveness of current screening practices, which primarily rely on age, family history, and body mass index (BMI). We hypothesized that a polygenic score (PGS) may serve as a complementary tool to identify high-risk individuals. The T2D polygenic score maintained predictive utility after adjusting for family history, and combining genetics with family history further improved disease risk prediction. We observed that the PGS was meaningfully related to age of onset, with implications for screening practices: there was a linear and statistically significant relationship between the PGS and T2D onset (-1.3 years per standard deviation of the PGS). Evaluation of U.S. Preventive Services Task Force guidelines and a simplified version of American Diabetes Association screening guidelines showed that adding a screening criterion for those above the 90th percentile of the PGS provided a small increase in the sensitivity of the screening algorithm. Among T2D-negative individuals, the T2D PGS was associated with prediabetes, where each standard deviation increase of the PGS was associated with a 23% increase in the odds of prediabetes diagnosis. Additionally, each standard deviation increase in the PGS corresponded to a 43% increase in the odds of incident T2D at one-year follow-up. Using complications and forms of clinical intervention (i.e., lifestyle modification, metformin treatment, or insulin treatment) as proxies for advanced illness, we also found statistically significant associations between the T2D PGS and insulin treatment and diabetic neuropathy. Importantly, we were able to replicate many findings in a Hispanic/Latino cohort from our database, highlighting the value of the T2D PGS as a clinical tool for individuals with ancestry other than European.
In this group, the T2D PGS provided additional disease risk information beyond that offered by traditional screening methodologies. The T2D PGS also had predictive value for the age of onset and for prediabetes among T2D-negative Hispanic/Latino participants. These findings strengthen the notion that a T2D PGS could play a role in the clinical setting across multiple ancestries, potentially improving T2D screening practices, risk stratification, and disease management.
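The per-standard-deviation odds figures quoted in the abstract above (a 23% increase for prediabetes, 43% for incident T2D) compound multiplicatively if one assumes the usual logistic (log-linear in odds) model. The following is a minimal sketch of that arithmetic only; the function names and the baseline probability are hypothetical and are not taken from the study.

```python
def odds_multiplier(odds_ratio_per_sd, sd_units):
    """Odds multiplier for a PGS shift of `sd_units` standard deviations.

    Assumes odds scale log-linearly with the score (logistic model).
    """
    return odds_ratio_per_sd ** sd_units


def shifted_probability(baseline_prob, odds_ratio_per_sd, sd_units):
    """Convert a baseline probability to the probability after a PGS shift.

    `baseline_prob` is a hypothetical input, not a value from the abstract.
    """
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_multiplier(odds_ratio_per_sd, sd_units)
    return new_odds / (1.0 + new_odds)


# Two SDs above the mean at 1.23 odds per SD: 1.23 ** 2 = 1.5129 times the odds.
two_sd = odds_multiplier(1.23, 2)
```

Note that an odds multiplier is not a risk multiplier: the change in probability depends on the assumed baseline, which is why the second helper takes it explicitly.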

4.
Commun Biol ; 4(1): 1269, 2021 11 05.
Article in English | MEDLINE | ID: mdl-34741098

ABSTRACT

There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.


Subject(s)
Genome, Human , Genotype , Adult , Black or African American , Aged , Aged, 80 and over , Humans , Middle Aged , United States , Whole Genome Sequencing , Young Adult
5.
Nat Genet ; 53(11): 1543-1552, 2021 11.
Article in English | MEDLINE | ID: mdl-34741163

ABSTRACT

Irritable bowel syndrome (IBS) results from disordered brain-gut interactions. Identifying susceptibility genes could highlight the underlying pathophysiological mechanisms. We designed a digestive health questionnaire for UK Biobank and combined identified cases with IBS with independent cohorts. We conducted a genome-wide association study with 53,400 cases and 433,201 controls and replicated significant associations in a 23andMe panel (205,252 cases and 1,384,055 controls). Our study identified and confirmed six genetic susceptibility loci for IBS. Implicated genes included NCAM1, CADM2, PHF2/FAM120A, DOCK9, CKAP2/TPTE2P3 and BAG6. The first four are associated with mood and anxiety disorders, expressed in the nervous system, or both. Mirroring this, we also found strong genome-wide correlation between the risk of IBS and anxiety, neuroticism and depression (rg > 0.5). Additional analyses suggested this arises due to shared pathogenic pathways rather than, for example, anxiety causing abdominal symptoms. Implicated mechanisms require further exploration to help understand the altered brain-gut interactions underlying IBS.


Subject(s)
Anxiety Disorders/genetics , Irritable Bowel Syndrome/genetics , Mood Disorders/genetics , Aged , CD56 Antigen/genetics , Cell Adhesion Molecules/genetics , Cytoskeletal Proteins/genetics , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Guanine Nucleotide Exchange Factors/genetics , Homeodomain Proteins/genetics , Humans , Irritable Bowel Syndrome/epidemiology , Male , Middle Aged , Molecular Chaperones/genetics , Polymorphism, Single Nucleotide , United Kingdom/epidemiology
6.
Nat Neurosci ; 24(7): 954-963, 2021 07.
Article in English | MEDLINE | ID: mdl-34045744

ABSTRACT

Major depressive disorder is the most common neuropsychiatric disorder, affecting 11% of veterans. Here we report results of a large meta-analysis of depression using data from the Million Veteran Program, 23andMe, UK Biobank and FinnGen, including individuals of European ancestry (n = 1,154,267; 340,591 cases) and African ancestry (n = 59,600; 25,843 cases). Transcriptome-wide association study analyses revealed significant associations with expression of NEGR1 in the hypothalamus and DRD2 in the nucleus accumbens, among others. We fine-mapped 178 genomic risk loci, and we identified likely pathogenicity in these variants and overlapping gene expression for 17 genes from our transcriptome-wide association study, including TRAF3. Finally, we were able to show substantial replications of our findings in a large independent cohort (n = 1,342,778) provided by 23andMe. This study sheds light on the genetic architecture of depression and provides new insight into the interrelatedness of complex psychiatric traits.


Subject(s)
Depressive Disorder, Major/genetics , Genetic Predisposition to Disease/genetics , Female , Genome-Wide Association Study , Humans , Male , Veterans
7.
Nat Hum Behav ; 5(10): 1432-1442, 2021 10.
Article in English | MEDLINE | ID: mdl-33859377

ABSTRACT

Depression and anxiety are highly prevalent and comorbid psychiatric traits that cause considerable burden worldwide. Here we use factor analysis and genomic structural equation modelling to investigate the genetic factor structure underlying 28 items assessing depression, anxiety and neuroticism, a closely related personality trait. Symptoms of depression and anxiety loaded on two distinct, although highly genetically correlated factors, and neuroticism items were partitioned between them. We used this factor structure to conduct genome-wide association analyses on latent factors of depressive symptoms (89 independent variants, 61 genomic loci) and anxiety symptoms (102 variants, 73 loci) in the UK Biobank. Of these associated variants, 72% and 78%, respectively, replicated in an independent cohort of approximately 1.9 million individuals with self-reported diagnosis of depression and anxiety. We use these results to characterize shared and trait-specific genetic associations. Our findings provide insight into the genetic architecture of depression and anxiety and comorbidity between them.


Subject(s)
Anxiety , Behavioral Symptoms , Depression , Neuroticism/physiology , Anxiety/diagnosis , Anxiety/epidemiology , Anxiety/genetics , Behavioral Symptoms/diagnosis , Behavioral Symptoms/psychology , Comorbidity , Depression/diagnosis , Depression/epidemiology , Depression/genetics , Factor Analysis, Statistical , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Latent Class Analysis , Symptom Assessment/methods , Symptom Assessment/statistics & numerical data
8.
Mol Biol Evol ; 38(5): 2131-2151, 2021 05 04.
Article in English | MEDLINE | ID: mdl-33355662

ABSTRACT

Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale data sets with millions of samples. Furthermore, we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computes against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis, exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for noncommercial use in the code repository (https://github.com/23andMe/phasedibd, last accessed January 11, 2021).


Subject(s)
Genome, Human , Haplotypes , Software , Algorithms , False Negative Reactions , False Positive Reactions , Humans , Mexico , Phylogeography
9.
Methods Mol Biol ; 2090: 67-86, 2020.
Article in English | MEDLINE | ID: mdl-31975164

ABSTRACT

Population structure is a commonplace feature of genetic variation data, and it has importance in numerous application areas, including evolutionary genetics, conservation genetics, and human genetics. Understanding the structure in a sample is necessary before more sophisticated analyses are undertaken. Here we provide a protocol for running principal component analysis (PCA) and admixture proportion inference, two of the most commonly used approaches in describing population structure. Along with hands-on examples with CEPH-Human Genome Diversity Panel and pragmatic caveats, readers will learn to analyze and visualize population structure on their own data.


Subject(s)
Genetics, Population/methods , Polymorphism, Single Nucleotide , Computational Biology , Genome, Human , Humans , Models, Genetic , Principal Component Analysis
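The PCA step named in the record above can be illustrated with a small sketch. It is not the chapter's protocol (which the abstract does not reproduce); it assumes a complete genotype matrix of 0/1/2 allele counts, standardizes each SNP by its allele frequency, and finds the first principal component by power iteration. The function name and inputs are hypothetical.

```python
import math


def top_principal_component(genotypes, iterations=200):
    """Return the first-PC coordinate of each individual.

    genotypes: list of rows (one per individual) of 0/1/2 allele counts,
    with no missing data (an assumption of this sketch).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    columns = []
    for j in range(m):
        col = [row[j] for row in genotypes]
        p = sum(col) / (2.0 * n)          # sample allele frequency
        if p <= 0.0 or p >= 1.0:
            continue                      # monomorphic SNPs carry no signal
        scale = math.sqrt(2.0 * p * (1.0 - p))
        columns.append([(g - 2.0 * p) / scale for g in col])
    # Individual-by-individual covariance of the standardized genotypes.
    cov = [[sum(c[a] * c[b] for c in columns) / len(columns)
            for b in range(n)] for a in range(n)]
    # Power iteration converges to the leading eigenvector of `cov`.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

In practice dedicated tools handle missing data, linkage-disequilibrium pruning and many more components; the sketch only shows why individuals from different groups separate along the leading axis.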
10.
PLoS Genet ; 15(9): e1008293, 2019 09.
Article in English | MEDLINE | ID: mdl-31539367

ABSTRACT

Sex-biased demographic events ("sex-bias") involve unequal numbers of females and males. These events are typically inferred from the relative amount of X-chromosomal to autosomal genetic variation and have led to conflicting conclusions about human demographic history. Though population size changes alter the relative amount of X-chromosomal to autosomal genetic diversity even in the absence of sex-bias, this has generally not been accounted for in sex-bias estimators to date. Here, we present a novel method to identify sex-bias from genetic sequence data that models population size changes and estimates the female fraction of the effective population size during each time epoch. Compared to recent sex-bias inference methods, our approach can detect sex-bias that changes on a single population branch without requiring data from an outgroup or knowledge of divergence events. When applied to simulated data, conventional sex-bias estimators are biased by population size changes, especially recent growth or bottlenecks, while our estimator is unbiased. We next apply our method to high-coverage exome data from the 1000 Genomes Project and estimate a male bias in Yorubans (47% female) and Europeans (44%), possibly due to stronger background selection on the X chromosome than on the autosomes. Finally, we apply our method to the 1000 Genomes Project Phase 3 high-coverage Complete Genomics whole-genome data and estimate a female bias in Yorubans (63% female), Europeans (84%), Punjabis (82%), as well as Peruvians (56%), and a male bias in the Southern Han Chinese (45%). Our method additionally identifies a male-biased migration out of Africa based on data from Europeans (20% female). Our results demonstrate that modeling population size change is necessary to estimate sex-bias parameters accurately. Our approach gives insight into signatures of sex-bias in sexual species, and the demographic models it produces can serve as more accurate null models for tests of selection.


Subject(s)
Demography/methods , Genetics, Population/methods , Sequence Analysis, DNA/methods , Bias , Chromosomes, Human, X/genetics , Female , Genetic Variation/genetics , Genome/genetics , Humans , Male , Models, Genetic , Population Density , Selection, Genetic/genetics , Whole Genome Sequencing/methods
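As background for the record above, the textbook expectation linking the female fraction of the effective population to the ratio of X-chromosomal to autosomal effective size can be written down directly. This sketch assumes a constant population size, which is exactly the simplification the abstract argues is inadequate, so it illustrates the conventional estimator's logic rather than the authors' method. The function name is hypothetical.

```python
def x_to_autosome_ratio(female_fraction):
    """Expected N_e(X) / N_e(autosome) under constant population size.

    With N_f breeding females and N_m breeding males:
      N_e(A) = 4 N_m N_f / (N_m + N_f)
      N_e(X) = 9 N_m N_f / (4 N_m + 2 N_f)
    so with p = N_f / (N_m + N_f) the ratio is 9 / (8 (2 - p)).
    """
    if not 0.0 < female_fraction < 1.0:
        raise ValueError("female_fraction must lie strictly between 0 and 1")
    return 9.0 / (8.0 * (2.0 - female_fraction))


# An equal sex ratio gives the familiar value of 3/4.
equal_ratio = x_to_autosome_ratio(0.5)
```

Values above 0.75 point to a female bias and values below it to a male bias, provided nothing else (such as population size change or selection) has shifted the ratio.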
11.
G3 (Bethesda) ; 8(10): 3255-3267, 2018 10 03.
Article in English | MEDLINE | ID: mdl-30131328

ABSTRACT

The emergence of very large cohorts in genomic research has facilitated a focus on genotype-imputation strategies to power rare variant association. These strategies have benefited from improvements in imputation methods and association tests; however, little attention has been paid to ways in which array design can increase rare variant association power. Therefore, we developed a novel framework to select tag SNPs using the reference panel of 26 populations from Phase 3 of the 1000 Genomes Project. We evaluate tag SNP performance via mean imputed r² at untyped sites using leave-one-out internal validation and standard imputation methods, rather than pairwise linkage disequilibrium. Moving beyond pairwise metrics allows us to account for haplotype diversity across the genome to improve imputation accuracy and demonstrates population-specific biases from pairwise estimates. We also examine array design strategies that contrast multi-ethnic cohorts vs. single populations, and show that a boost in performance for the former can be obtained by prioritizing tag SNPs that contribute information across multiple populations simultaneously. Using our framework, we demonstrate increased imputation accuracy for rare variants (frequency < 1%) by 0.5-3.1% for an array of one million sites and 0.7-7.1% for an array of 500,000 sites, depending on the population. Finally, we show how recent explosive growth in non-African populations means tag SNPs capture on average 30% fewer other variants than in African populations. The unified framework presented here will enable investigators to make informed decisions for the design of new arrays, and help empower the next phase of rare variant association for global health.


Subject(s)
Ethnicity/genetics , Genetic Association Studies , Genetics, Population , Polymorphism, Single Nucleotide , Selection, Genetic , Computational Biology/methods , Databases, Nucleic Acid , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Models, Genetic , Reproducibility of Results
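The accuracy metric named in the record above, mean imputed r² at untyped sites, is commonly computed as the squared Pearson correlation between true genotypes and imputed dosages, averaged over sites. The sketch below assumes that definition; the function names and the handling of zero-variance sites are choices made for illustration and may differ from the paper's pipeline.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true 0/1/2 genotypes and dosages."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return 0.0  # assumption: score an uninformative site as zero
    return cov * cov / (var_t * var_d)


def mean_imputed_r2(sites):
    """Average r² over (true_genotypes, imputed_dosages) pairs, one per site."""
    return sum(imputation_r2(t, d) for t, d in sites) / len(sites)
```

Because rare variants have little genotype variance, a handful of mis-imputed carriers lowers r² sharply, which is why gains of a few percent at frequency below 1% are meaningful.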
12.
Lancet Oncol ; 19(6): 785-798, 2018 06.
Article in English | MEDLINE | ID: mdl-29753700

ABSTRACT

BACKGROUND: Medulloblastoma is associated with rare hereditary cancer predisposition syndromes; however, consensus medulloblastoma predisposition genes have not been defined and screening guidelines for genetic counselling and testing for paediatric patients are not available. We aimed to assess and define these genes to provide evidence for future screening guidelines. METHODS: In this international, multicentre study, we analysed patients with medulloblastoma from retrospective cohorts (International Cancer Genome Consortium [ICGC] PedBrain, Medulloblastoma Advanced Genomics International Consortium [MAGIC], and the CEFALO series) and from prospective cohorts from four clinical studies (SJMB03, SJMB12, SJYC07, and I-HIT-MED). Whole-genome sequences and exome sequences from blood and tumour samples were analysed for rare damaging germline mutations in cancer predisposition genes. DNA methylation profiling was done to determine consensus molecular subgroups: WNT (MBWNT), SHH (MBSHH), group 3 (MBGroup3), and group 4 (MBGroup4). Medulloblastoma predisposition genes were predicted on the basis of rare variant burden tests against controls without a cancer diagnosis from the Exome Aggregation Consortium (ExAC). Previously defined somatic mutational signatures were used to further classify medulloblastoma genomes into two groups, a clock-like group (signatures 1 and 5) and a homologous recombination repair deficiency-like group (signatures 3 and 8), and chromothripsis was investigated using previously established criteria. Progression-free survival and overall survival were modelled for patients with a genetic predisposition to medulloblastoma. FINDINGS: We included a total of 1022 patients with medulloblastoma from the retrospective cohorts (n=673) and the four prospective studies (n=349), from whom blood samples (n=1022) and tumour samples (n=800) were analysed for germline mutations in 110 cancer predisposition genes. 
In our rare variant burden analysis, we compared these against 53 105 sequenced controls from ExAC and identified APC, BRCA2, PALB2, PTCH1, SUFU, and TP53 as consensus medulloblastoma predisposition genes according to our rare variant burden analysis and estimated that germline mutations accounted for 6% of medulloblastoma diagnoses in the retrospective cohort. The prevalence of genetic predispositions differed between molecular subgroups in the retrospective cohort and was highest for patients in the MBSHH subgroup (20% in the retrospective cohort). These estimates were replicated in the prospective clinical cohort (germline mutations accounted for 5% of medulloblastoma diagnoses, with the highest prevalence [14%] in the MBSHH subgroup). Patients with germline APC mutations developed MBWNT and accounted for most (five [71%] of seven) cases of MBWNT that had no somatic CTNNB1 exon 3 mutations. Patients with germline mutations in SUFU and PTCH1 mostly developed infant MBSHH. Germline TP53 mutations presented only in childhood patients in the MBSHH subgroup and explained more than half (eight [57%] of 14) of all chromothripsis events in this subgroup. Germline mutations in PALB2 and BRCA2 were observed across the MBSHH, MBGroup3, and MBGroup4 molecular subgroups and were associated with mutational signatures typical of homologous recombination repair deficiency. In patients with a genetic predisposition to medulloblastoma, 5-year progression-free survival was 52% (95% CI 40-69) and 5-year overall survival was 65% (95% CI 52-81); these survival estimates differed significantly across patients with germline mutations in different medulloblastoma predisposition genes. INTERPRETATION: Genetic counselling and testing should be used as a standard-of-care procedure in patients with MBWNT and MBSHH because these patients have the highest prevalence of damaging germline mutations in known cancer predisposition genes. 
We propose criteria for routine genetic screening for patients with medulloblastoma based on clinical and molecular tumour characteristics. FUNDING: German Cancer Aid; German Federal Ministry of Education and Research; German Childhood Cancer Foundation (Deutsche Kinderkrebsstiftung); European Research Council; National Institutes of Health; Canadian Institutes for Health Research; German Cancer Research Center; St Jude Comprehensive Cancer Center; American Lebanese Syrian Associated Charities; Swiss National Science Foundation; European Molecular Biology Organization; Cancer Research UK; Hertie Foundation; Alexander and Margaret Stewart Trust; V Foundation for Cancer Research; Sontag Foundation; Musicians Against Childhood Cancer; BC Cancer Foundation; Swedish Council for Health, Working Life and Welfare; Swedish Research Council; Swedish Cancer Society; the Swedish Radiation Protection Authority; Danish Strategic Research Council; Swiss Federal Office of Public Health; Swiss Research Foundation on Mobile Communication; Masaryk University; Ministry of Health of the Czech Republic; Research Council of Norway; Genome Canada; Genome BC; Terry Fox Research Institute; Ontario Institute for Cancer Research; Pediatric Oncology Group of Ontario; The Family of Kathleen Lorette and the Clark H Smith Brain Tumour Centre; Montreal Children's Hospital Foundation; The Hospital for Sick Children: Sonia and Arthur Labatt Brain Tumour Research Centre, Chief of Research Fund, Cancer Genetics Program, Garron Family Cancer Centre, MDT's Garron Family Endowment; BC Childhood Cancer Parents Association; Cure Search Foundation; Pediatric Brain Tumor Foundation; Brainchild; and the Government of Ontario.


Subject(s)
Biomarkers, Tumor/genetics , Cerebellar Neoplasms/genetics , DNA Methylation , Genetic Testing/methods , Germ-Line Mutation , Medulloblastoma/genetics , Models, Genetic , Adolescent , Adult , Cerebellar Neoplasms/mortality , Cerebellar Neoplasms/pathology , Cerebellar Neoplasms/therapy , Child , Child, Preschool , DNA Mutational Analysis , Female , Gene Expression Profiling , Genetic Predisposition to Disease , Heredity , Humans , Infant , Male , Medulloblastoma/mortality , Medulloblastoma/pathology , Medulloblastoma/therapy , Pedigree , Phenotype , Predictive Value of Tests , Progression-Free Survival , Prospective Studies , Reproducibility of Results , Retrospective Studies , Risk Factors , Transcriptome , Exome Sequencing , Young Adult
13.
J Am Med Inform Assoc ; 24(4): 799-805, 2017 Jul 01.
Article in English | MEDLINE | ID: mdl-28339683

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context: a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or "beacon") is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertains and in accordance with the GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable. However, recent work demonstrated that, given a beacon with specific characteristics (including relatively small sample size and an adversary who possesses an individual's whole genome sequence), the individual's membership in a beacon can be inferred through repeated queries for variants present in the individual's genome. In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets.


Subject(s)
Data Anonymization , Genetic Privacy , Information Dissemination , Genomics , Humans
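The third strategy in the record above, budgeting accesses per user for each individual genome, can be sketched as follows. This is a deliberately simplified illustration: it charges one unit of budget per matching query, whereas the paper's actual accounting is not specified in the abstract. The class name, data layout and flat per-query cost are all assumptions.

```python
class BudgetedBeacon:
    """Toy beacon answering yes/no allele-presence queries under a budget."""

    def __init__(self, genomes, budget_per_genome):
        # genomes: dict mapping individual id -> set of (chrom, pos, allele)
        self.genomes = genomes
        self.budget = budget_per_genome
        self.spent = {}  # (user, individual) -> number of charged queries

    def query(self, user, variant):
        """Return True if any genome still visible to `user` has `variant`."""
        answer = False
        for individual, variants in self.genomes.items():
            if variant not in variants:
                continue
            used = self.spent.get((user, individual), 0)
            if used >= self.budget:
                continue  # this genome no longer contributes for this user
            self.spent[(user, individual)] = used + 1
            answer = True
        return answer
```

Once a user has exhausted the budget for a genome, that genome stops contributing to the user's answers, which caps how many of one person's variants a single adversary can confirm by repeated querying.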
14.
Bioinformatics ; 33(8): 1147-1153, 2017 04 15.
Article in English | MEDLINE | ID: mdl-28035032

ABSTRACT

Motivation: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X). Results: We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, the Real Time Genomics multisample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS sequencing data may be accompanied by genotype data for the same samples, either collected concurrently with sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies. Availability and Implementation: Code is available on GitHub at: https://github.com/suyashss/variant_validation. Contacts: suyashs@stanford.edu or mtaub@jhsph.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide , Whole Genome Sequencing/methods , Data Accuracy , Genome, Human , Genomics/methods , Genomics/standards , Genotype , Genotyping Techniques/methods , Genotyping Techniques/standards , High-Throughput Nucleotide Sequencing/standards , Humans , Whole Genome Sequencing/standards
15.
Nat Commun ; 7: 12522, 2016 10 11.
Article in English | MEDLINE | ID: mdl-27725671

ABSTRACT

The African Diaspora in the Western Hemisphere represents one of the largest forced migrations in history and had a profound impact on genetic diversity in modern populations. To date, the fine-scale population structure of descendants of the African Diaspora remains largely uncharacterized. Here we present genetic variation from deeply sequenced genomes of 642 individuals from North and South American, Caribbean and West African populations, substantially increasing the lexicon of human genomic variation and suggesting much variation remains to be discovered in African-admixed populations in the Americas. We summarize genetic variation in these populations, quantifying the postcolonial sex-biased European gene flow across multiple regions. Moreover, we refine estimates on the burden of deleterious variants carried across populations and how this varies with African ancestry. Our data are an important resource for empowering disease mapping studies in African-admixed individuals and will facilitate gene discovery for diseases disproportionately affecting individuals of African ancestry.


Subject(s)
Black People/genetics , Gene Flow , Genome, Human , Human Migration , Base Sequence , DNA, Intergenic/genetics , Female , Genetic Heterogeneity , Geography , Humans , Male , Phylogeny , Polymorphism, Single Nucleotide/genetics , Sexism
16.
BMC Bioinformatics ; 17: 218, 2016 May 23.
Article in English | MEDLINE | ID: mdl-27216439

ABSTRACT

BACKGROUND: A number of large genomic datasets are being generated for studies of human ancestry and diseases. The ADMIXTURE program is commonly used to infer individual ancestry from genomic data. RESULTS: We describe two improvements to the ADMIXTURE software. The first enables ADMIXTURE to infer ancestry for a new set of individuals using cluster allele frequencies from a reference set of individuals. Using data from the 1000 Genomes Project, we show that this allows ADMIXTURE to infer ancestry for 10,920 individuals in a few hours (a 5× speedup). This mode also allows ADMIXTURE to correctly estimate individual ancestry and allele frequencies from a set of related individuals. The second modification allows ADMIXTURE to correctly handle X-chromosome (and other haploid) data from both males and females. We demonstrate increased power to detect sex-biased admixture in African-American individuals from the 1000 Genomes Project using this extension. CONCLUSIONS: These modifications make ADMIXTURE more efficient and versatile, allowing users to extract more information from large genomic datasets.
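The first improvement, inferring ancestry for new individuals from fixed reference allele frequencies, can be illustrated with a small sketch. This is not the ADMIXTURE implementation: it swaps in a plain EM update for the ancestry proportions of one diploid individual while the cluster allele frequencies are held fixed. The function name, the data and the iteration count are hypothetical, and frequencies are assumed to lie strictly between 0 and 1.

```python
def project_ancestry(genotypes, freqs, n_iter=200):
    """Estimate ancestry proportions q for one diploid individual.

    genotypes: alternate-allele counts per SNP (0, 1 or 2).
    freqs: per-SNP lists of alternate-allele frequencies, one per
           reference cluster, held fixed (assumed strictly in (0, 1)).
    Uses a simple EM update, not ADMIXTURE's own optimizer.
    """
    k = len(freqs[0])
    q = [1.0 / k] * k  # start from equal ancestry proportions
    for _ in range(n_iter):
        acc = [0.0] * k
        for g, f in zip(genotypes, freqs):
            # Probability that a random allele copy is alt / ref under q.
            p_alt = sum(qm * fm for qm, fm in zip(q, f))
            p_ref = sum(qm * (1.0 - fm) for qm, fm in zip(q, f))
            for i in range(k):
                # Expected number of this individual's allele copies
                # at this SNP that derive from cluster i.
                acc[i] += g * q[i] * f[i] / p_alt
                acc[i] += (2 - g) * q[i] * (1.0 - f[i]) / p_ref
        q = [a / (2 * len(genotypes)) for a in acc]
    return q


# Toy data (hypothetical): two clusters, five SNPs where the alternate
# allele is common in cluster 0 and rare in cluster 1.
reference_freqs = [[0.9, 0.1]] * 5
new_individual = [2, 2, 2, 2, 2]  # homozygous alternate everywhere

q_hat = project_ancestry(new_individual, reference_freqs)
print([round(v, 3) for v in q_hat])
```

Because the allele frequencies are not re-estimated, each new individual can be processed independently, which is consistent with the abstract's point that this mode also suits sets of related individuals.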


Subject(s)
Genetics, Population , Genomics/methods , Software , Black or African American/genetics , Female , Gene Frequency , HapMap Project , Humans , Male , Southwestern United States
17.
PLoS Genet ; 12(5): e1006059, 2016 05.
Article in English | MEDLINE | ID: mdl-27232753

ABSTRACT

We present a comprehensive assessment of genomic diversity in the African-American population by studying three genotyped cohorts comprising 3,726 African-Americans from across the United States that provide a representative description of the population across all US states and socioeconomic status. An estimated 82.1% of ancestors to African-Americans lived in Africa prior to the advent of transatlantic travel, 16.7% in Europe, and 1.2% in the Americas, with increased African ancestry in the southern United States compared to the North and West. Combining demographic models of ancestry and those of relatedness suggests that admixture occurred predominantly in the South prior to the Civil War and that ancestry-biased migration is responsible for regional differences in ancestry. We find that recent migrations also caused a strong increase in genetic relatedness among geographically distant African-Americans. Long-range relatedness among African-Americans and between African-Americans and European-Americans thus tracks north- and west-bound migration routes followed during the Great Migration of the twentieth century. By contrast, short-range relatedness patterns suggest comparable mobility of ∼15-16 km per generation for African-Americans and European-Americans, as estimated using a novel analytical model of isolation-by-distance.


Subject(s)
Black or African American/genetics , Genetics, Population , Genomics , Black People/genetics , Demography , Europe , Gene Frequency , Genotype , Human Migration , Humans , Polymorphism, Single Nucleotide/genetics , United States
18.
Am J Hum Genet ; 97(5): 631-46, 2015 Nov 05.
Article in English | MEDLINE | ID: mdl-26522470

ABSTRACT

The human genetics community needs robust protocols that enable secure sharing of genomic data from participants in genetic research. Beacons are web servers that answer allele-presence queries--such as "Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?"--with either "yes" or "no." Here, we show that individuals in a beacon are susceptible to re-identification even if the only data shared include presence or absence information about alleles in a beacon. Specifically, we propose a likelihood-ratio test of whether a given individual is present in a given genetic beacon. Our test is not dependent on allele frequencies and is the most powerful test for a specified false-positive rate. Through simulations, we showed that in a beacon with 1,000 individuals, re-identification is possible with just 5,000 queries. Relatives can also be identified in the beacon. Re-identification is possible even in the presence of sequencing errors and variant-calling differences. In a beacon constructed with 65 European individuals from the 1000 Genomes Project, we demonstrated that it is possible to detect membership in the beacon with just 250 SNPs. With just 1,000 SNP queries, we were able to detect the presence of an individual genome from the Personal Genome Project in an existing beacon. Our results show that beacons can disclose membership and implied phenotypic information about participants and do not protect privacy a priori. We discuss risk mitigation through policies and standards such as not allowing anonymous pings of genetic beacons and requiring minimum beacon sizes.
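A likelihood-ratio test on yes/no beacon responses can be sketched in a few lines. The version below is a simplified illustration, not the authors' exact formulation: it treats the responses as independent and assumes two hypothetical parameters, the probability of a "yes" when the queried person is absent from the beacon and the probability of a "yes" when the person is present (less than 1 to allow for sequencing and variant-calling differences).

```python
import math


def beacon_log_likelihood_ratio(responses, p_yes_absent, p_yes_present):
    """Return log L(present) - log L(absent) for beacon responses.

    responses: booleans, True for "yes", one per queried allele that
               the target individual carries.
    p_yes_absent, p_yes_present: assumed per-query probabilities of a
               "yes" under each hypothesis (hypothetical values,
               strictly between 0 and 1).
    Large positive values favour membership in the beacon.
    """
    n_yes = sum(1 for r in responses if r)
    n_no = len(responses) - n_yes
    return (
        n_yes * math.log(p_yes_present / p_yes_absent)
        + n_no * math.log((1.0 - p_yes_present) / (1.0 - p_yes_absent))
    )


# Toy example (hypothetical numbers): 250 queried SNPs.
all_yes = [True] * 250
some_no = [True] * 240 + [False] * 10

print(beacon_log_likelihood_ratio(all_yes, 0.9, 0.999))   # positive
print(beacon_log_likelihood_ratio(some_no, 0.9, 0.999))   # negative
```

In this toy model a single "no" is strong evidence against membership, while each "yes" contributes only a little evidence for it. That asymmetry is why many queries are needed before membership can be called with confidence.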


Subject(s)
Genetic Privacy , Genetic Variation , Genome, Human , Information Dissemination/methods , Task Performance and Analysis , Haplotypes , High-Throughput Nucleotide Sequencing , Humans
19.
Science ; 349(6250): aab3884, 2015 Aug 21.
Article in English | MEDLINE | ID: mdl-26198033

ABSTRACT

How and when the Americas were populated remains contentious. Using ancient and modern genome-wide data, we found that the ancestors of all present-day Native Americans, including Athabascans and Amerindians, entered the Americas as a single migration wave from Siberia no earlier than 23 thousand years ago (ka) and after no more than an 8000-year isolation period in Beringia. After their arrival to the Americas, ancestral Native Americans diversified into two basal genetic branches around 13 ka, one that is now dispersed across North and South America and the other restricted to North America. Subsequent gene flow resulted in some Native Americans sharing ancestry with present-day East Asians (including Siberians) and, more distantly, Australo-Melanesians. Putative "Paleoamerican" relict populations, including the historical Mexican Pericúes and South American Fuego-Patagonians, are not directly related to modern Australo-Melanesians as suggested by the Paleoamerican Model.


Subject(s)
Human Migration/history , Indians, North American/history , Americas , Gene Flow , Genomics , History, Ancient , Humans , Indians, North American/genetics , Models, Genetic , Siberia
20.
PLoS One ; 10(6): e0129277, 2015.
Article in English | MEDLINE | ID: mdl-26110529

ABSTRACT

Population-scale sequencing of whole human genomes is becoming economically feasible; however, data management and analysis remain a formidable challenge for many research groups. Large sequencing studies, like the 1000 Genomes Project, have improved our understanding of human demography and the effect of rare genetic variation in disease. Variant calling on datasets of hundreds or thousands of genomes is time-consuming, expensive, and not easily reproducible given the myriad components of a variant calling pipeline. Here, we describe a cloud-based pipeline for joint variant calling in large samples using the Real Time Genomics population caller. We deployed the population caller on the Amazon cloud with the DNAnexus platform in order to achieve low-cost variant calling. Using our pipeline, we were able to identify 68.3 million variants in 2,535 samples from Phase 3 of the 1000 Genomes Project. By performing the variant calling in a parallel manner, the data were processed within 5 days at a compute cost of $7.33 per sample (a total cost of $18,590 for completed jobs and $21,805 for all jobs). Analysis of cost dependence and running time on the data size suggests that, given near linear scalability, cloud computing can be a cheap and efficient platform for analyzing even larger sequencing studies in the future.
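The per-sample figure follows directly from the totals quoted in the abstract. The short check below uses only those three numbers. The per-sample cost for all jobs and the overhead share are derived here for illustration and are not stated in the abstract; the function name is hypothetical.

```python
def cost_per_sample(total_cost_usd, n_samples):
    """Average compute cost per sample in US dollars."""
    return total_cost_usd / n_samples


N_SAMPLES = 2_535          # samples from Phase 3 of the 1000 Genomes Project
COMPLETED_COST = 18_590    # USD, completed jobs only
ALL_JOBS_COST = 21_805     # USD, all jobs

completed = cost_per_sample(COMPLETED_COST, N_SAMPLES)
all_jobs = cost_per_sample(ALL_JOBS_COST, N_SAMPLES)
overhead = ALL_JOBS_COST - COMPLETED_COST  # spend on jobs that did not complete

print(f"completed jobs: ${completed:.2f} per sample")
print(f"all jobs:       ${all_jobs:.2f} per sample")
print(f"overhead:       ${overhead} ({overhead / ALL_JOBS_COST:.1%} of total spend)")
```

The quoted $7.33 per sample therefore corresponds to completed jobs only; counting every job raises the average by a little over a dollar per sample.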


Subject(s)
Genetic Variation , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Cloud Computing/economics , Databases, Genetic , High-Throughput Nucleotide Sequencing/economics , Humans , Software