Results 1 - 20 of 56
1.
Nat Biotechnol ; 40(6): 932-937, 2022 06.
Article in English | MEDLINE | ID: mdl-35190689

ABSTRACT

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.


Subject(s)
Deep Learning , Amino Acid Sequence , Databases, Protein , Humans , Molecular Sequence Annotation , Proteome/metabolism , Proteomics
2.
Nature ; 601(7893): 422-427, 2022 01.
Article in English | MEDLINE | ID: mdl-34987224

ABSTRACT

Maternal morbidity and mortality continue to rise, and pre-eclampsia is a major driver of this burden (ref. 1). Yet the ability to assess underlying pathophysiology before clinical presentation to enable identification of pregnancies at risk remains elusive. Here we demonstrate the ability of plasma cell-free RNA (cfRNA) to reveal patterns of normal pregnancy progression and determine the risk of developing pre-eclampsia months before clinical presentation. Our results centre on comprehensive transcriptome data from eight independent prospectively collected cohorts comprising 1,840 racially diverse pregnancies and retrospective analysis of 2,539 banked plasma samples. The pre-eclampsia data include 524 samples (72 cases and 452 non-cases) from two diverse independent cohorts collected 14.5 weeks (s.d., 4.5 weeks) before delivery. We show that cfRNA signatures from a single blood draw can track pregnancy progression at the placental, maternal and fetal levels and can robustly predict pre-eclampsia, with a sensitivity of 75% and a positive predictive value of 32.3% (s.d., 3%), which is superior to the state-of-the-art method (ref. 2). cfRNA signatures of normal pregnancy progression and pre-eclampsia are independent of clinical factors, such as maternal age, body mass index and race, which cumulatively account for less than 1% of model variance. Further, the cfRNA signature for pre-eclampsia contains gene features linked to biological processes implicated in the underlying pathophysiology of pre-eclampsia.


Subject(s)
Cell-Free Nucleic Acids , Pre-Eclampsia , RNA , Cell-Free Nucleic Acids/blood , Female , Humans , Pre-Eclampsia/diagnosis , Pre-Eclampsia/genetics , Predictive Value of Tests , Pregnancy , RNA/blood , Retrospective Studies , Sensitivity and Specificity
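
The abstract above reports a sensitivity and a positive predictive value (PPV) together. As a reading aid, the following minimal Python sketch shows the standard Bayes relation that links PPV to sensitivity, specificity and prevalence. It is not the study's analysis: the function name and the input values in the usage lines are hypothetical, and the abstract does not state the specificity.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(condition | positive test), from Bayes' rule.

    All three arguments are proportions in [0, 1].
    """
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    denominator = true_positive_rate + false_positive_rate
    if denominator == 0:
        raise ValueError("no positive tests are possible with these inputs")
    return true_positive_rate / denominator


# Hypothetical inputs, chosen only to illustrate the formula.
example_ppv = positive_predictive_value(
    sensitivity=0.75, specificity=0.90, prevalence=0.10
)
print(f"PPV = {example_ppv:.3f}")
```

Because PPV depends on prevalence, the same test can show a very different PPV in a cohort with a different case fraction.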
4.
Nat Biotechnol ; 37(10): 1155-1162, 2019 10.
Article in English | MEDLINE | ID: mdl-31406327

ABSTRACT

The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the 'genome in a bottle' (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.


Subject(s)
DNA, Circular/genetics , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Base Sequence , Genetic Variation , Haplotypes , Humans
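
The abstract above summarizes assembly contiguity with a contig N50. A minimal Python sketch of the usual N50 definition follows: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the contig lengths are hypothetical and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an iterable of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative length
        # reaches half of the assembly.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in arbitrary units).
print(n50([2, 3, 4, 5, 6]))
```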
5.
Pac Symp Biocomput ; 24: 224-235, 2019.
Article in English | MEDLINE | ID: mdl-30864325

ABSTRACT

Copy number variants (CNVs) are an important type of genetic variation that play a causal role in many diseases. The ability to identify high quality CNVs is of substantial clinical relevance. However, CNVs are notoriously difficult to identify accurately from array-based methods and next-generation sequencing (NGS) data, particularly for small (< 10kbp) CNVs. Manual curation by experts widely remains the gold standard but cannot scale with the pace of sequencing, particularly in fast-growing clinical applications. We present the first proof-of-principle study demonstrating high throughput manual curation of putative CNVs by non-experts. We developed a crowdsourcing framework, called CrowdVariant, that leverages Google's high-throughput crowdsourcing platform to create a high confidence set of deletions for NA24385 (NIST HG002/RM 8391), an Ashkenazim reference sample developed in partnership with the Genome In A Bottle (GIAB) Consortium. We show that non-experts tend to agree both with each other and with experts on putative CNVs. We show that crowdsourced non-expert classifications can be used to accurately assign copy number status to putative CNV calls and identify 1,781 high confidence deletions in a reference sample. Multiple lines of evidence suggest these calls are a substantial improvement over existing CNV callsets and can also be useful in benchmarking and improving CNV calling algorithms. Our crowdsourcing methodology takes the first step toward showing the clinical potential for manual curation of CNVs at scale and can further guide other crowdsourcing genomics applications.


Subject(s)
Crowdsourcing/methods , DNA Copy Number Variations , Algorithms , Computational Biology/methods , Data Curation , Genome, Human , Genomics/methods , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Sequence Analysis, DNA/statistics & numerical data
6.
Bioinformatics ; 35(21): 4389-4391, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30916319

ABSTRACT

SUMMARY: Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome. AVAILABILITY AND IMPLEMENTATION: GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Software , Genome, Human , Humans
7.
Nat Med ; 25(1): 24-29, 2019 01.
Article in English | MEDLINE | ID: mdl-30617335

ABSTRACT

Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of computer vision focuses largely on medical imaging, and we describe the application of natural language processing to domains such as electronic health record data. Similarly, reinforcement learning is discussed in the context of robotic-assisted surgery, and generalized deep-learning methods for genomics are reviewed.


Subject(s)
Deep Learning , Delivery of Health Care , Diagnostic Imaging , Electronic Health Records , Humans , Natural Language Processing
8.
Nat Biotechnol ; 36(10): 983-987, 2018 11.
Article in English | MEDLINE | ID: mdl-30247488

ABSTRACT

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.


Subject(s)
Genome, Human , Mammals/genetics , Neural Networks, Computer , Polymorphism, Single Nucleotide , Animals , DNA Mutational Analysis , Genomics , Genotype , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation , Sequence Analysis, DNA , Software
9.
Hum Mol Genet ; 27(R1): R63-R71, 2018 05 01.
Article in English | MEDLINE | ID: mdl-29648622

ABSTRACT

The human genome is now investigated through high-throughput functional assays, and through the generation of population genomic data. These advances support the identification of functional genetic variants and the prediction of traits (e.g. deleterious variants and disease). This review summarizes lessons learned from the large-scale analyses of genome and exome data sets, modeling of population data and machine-learning strategies to solve complex genomic sequence regions. The review also portrays the rapid adoption of artificial intelligence/deep neural networks in genomics; in particular, deep learning approaches are well suited to model the complex dependencies in the regulatory landscape of the genome, and to provide predictors for genetic variant calling and interpretation.


Subject(s)
Deep Learning/trends , Gene Regulatory Networks/genetics , Genome, Human/genetics , Genomics/trends , Exome/genetics , High-Throughput Nucleotide Sequencing/trends , Humans , Sequence Analysis, DNA , Software
10.
Proc Natl Acad Sci U S A ; 115(2): 379-384, 2018 01 09.
Article in English | MEDLINE | ID: mdl-29279374

ABSTRACT

A major challenge in evaluating the contribution of rare variants to complex disease is identifying enough copies of the rare alleles to permit informative statistical analysis. To investigate the contribution of rare variants to the risk of type 2 diabetes (T2D) and related traits, we performed deep whole-genome analysis of 1,034 members of 20 large Mexican-American families with high prevalence of T2D. If rare variants of large effect accounted for much of the diabetes risk in these families, our experiment was powered to detect association. Using gene expression data on 21,677 transcripts for 643 pedigree members, we identified evidence for large-effect rare-variant cis-expression quantitative trait loci that could not be detected in population studies, validating our approach. However, we did not identify any rare variants of large effect associated with T2D, or the related traits of fasting glucose and insulin, suggesting that large-effect rare variants account for only a modest fraction of the genetic risk of these traits in this sample of families. Reliable identification of large-effect rare variants will require larger samples of extended pedigrees or different study designs that further enrich for such variants.


Subject(s)
Diabetes Mellitus, Type 2/genetics , Genetic Predisposition to Disease/genetics , Genetic Variation , Mexican Americans/genetics , Diabetes Mellitus, Type 2/ethnology , Diabetes Mellitus, Type 2/pathology , Family Health , Female , Gene Frequency , Genetic Predisposition to Disease/ethnology , Genome-Wide Association Study/methods , Genotype , Humans , Male , Pedigree , Phenotype , Quantitative Trait Loci/genetics , Whole Genome Sequencing/methods
11.
Eur J Hum Genet ; 25(2): 227-233, 2017 02.
Article in English | MEDLINE | ID: mdl-27876817

ABSTRACT

Germline mutation detection from human DNA sequence data is challenging due to the rarity of such events relative to the intrinsic error rates of sequencing technologies and the uneven coverage across the genome. We developed PhaseByTransmission (PBT) to identify de novo single nucleotide variants and short insertions and deletions (indels) from sequence data collected in parent-offspring trios. We compute the joint probability of the data given the genotype likelihoods in the individual family members, the known familial relationships and a prior probability for the mutation rate. Candidate de novo mutations (DNMs) are reported along with their posterior probability, providing a systematic way to prioritize them for validation. Our tool is integrated in the Genome Analysis Toolkit and can be used together with the ReadBackedPhasing module to infer the parental origin of DNMs based on phase-informative reads. Using simulated data, we show that PBT outperforms existing tools, especially in low coverage data and on the X chromosome. We further show that PBT displays high validation rates on empirical parent-offspring sequencing data for whole-exome data from 104 trios and X-chromosome data from 249 parent-offspring families. Finally, we demonstrate an association between father's age at conception and the number of DNMs in female offspring's X chromosome, consistent with previous literature reports.


Subject(s)
Genome-Wide Association Study/methods , Germ-Line Mutation , Pedigree , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Software , Adult , Child , Chromosomes, Human, X/genetics , Exome , Female , Genotype , Humans , Male , Models, Genetic
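
The abstract above describes combining genotype likelihoods for each family member, the familial relationships and a mutation-rate prior into a posterior probability for a candidate de novo mutation. The Python sketch below illustrates that general idea for one biallelic autosomal site in a single trio. It is a simplified illustration under stated assumptions, not the PhaseByTransmission implementation: the function names, the symmetric per-allele mutation model, the parental genotype prior and all numeric inputs are hypothetical.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # number of alternate alleles carried


def alt_allele_prob(parent_genotype, mutation_rate):
    """P(transmitted allele is alternate), allowing a mutation in either direction."""
    p = parent_genotype / 2.0
    return p * (1.0 - mutation_rate) + (1.0 - p) * mutation_rate


def child_given_parents(child, mother, father, mutation_rate):
    """Mendelian transmission probability P(child genotype | parental genotypes)."""
    pm = alt_allele_prob(mother, mutation_rate)
    pf = alt_allele_prob(father, mutation_rate)
    if child == 0:
        return (1.0 - pm) * (1.0 - pf)
    if child == 1:
        return pm * (1.0 - pf) + (1.0 - pm) * pf
    return pm * pf


def de_novo_posterior(lik_mother, lik_father, lik_child, genotype_prior,
                      mutation_rate=1e-8):
    """Posterior that both parents are hom-ref and the child is heterozygous.

    Each lik_* is a sequence of P(reads | genotype) indexed by alt-allele count.
    genotype_prior is an assumed prior over parental genotypes.
    """
    joint = {}
    for m, f, c in product(GENOTYPES, repeat=3):
        joint[(m, f, c)] = (
            genotype_prior[m] * genotype_prior[f]
            * child_given_parents(c, m, f, mutation_rate)
            * lik_mother[m] * lik_father[f] * lik_child[c]
        )
    return joint[(0, 0, 1)] / sum(joint.values())


# Hypothetical inputs: confident hom-ref parents, confident heterozygous child.
prior = (0.98, 0.0198, 0.0002)
hom_ref = (1.0, 1e-10, 1e-10)
het = (1e-10, 1.0, 1e-10)
print(de_novo_posterior(hom_ref, hom_ref, het, prior))
```

Ranking candidate sites by this posterior is one way to prioritize them for validation, which is the use the abstract describes.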
12.
Nature ; 536(7616): 285-91, 2016 08 18.
Article in English | MEDLINE | ID: mdl-27535533

ABSTRACT

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.


Subject(s)
Exome/genetics , Genetic Variation/genetics , DNA Mutational Analysis , Datasets as Topic , Humans , Phenotype , Proteome/genetics , Rare Diseases/genetics , Sample Size
13.
Nature ; 518(7537): 102-6, 2015 Feb 05.
Article in English | MEDLINE | ID: mdl-25487149

ABSTRACT

Myocardial infarction (MI), a leading cause of death around the world, displays a complex pattern of inheritance. When MI occurs early in life, genetic inheritance is a major component to risk. Previously, rare mutations in low-density lipoprotein (LDL) genes have been shown to contribute to MI risk in individual families, whereas common variants at more than 45 loci have been associated with MI risk in the population. Here we evaluate how rare mutations contribute to early-onset MI risk in the population. We sequenced the protein-coding regions of 9,793 genomes from patients with MI at an early age (≤50 years in males and ≤60 years in females) along with MI-free controls. We identified two genes in which rare coding-sequence mutations were more frequent in MI cases versus controls at exome-wide significance. At low-density lipoprotein receptor (LDLR), carriers of rare non-synonymous mutations were at 4.2-fold increased risk for MI; carriers of null alleles at LDLR were at even higher risk (13-fold difference). Approximately 2% of early MI cases harbour a rare, damaging mutation in LDLR; this estimate is similar to one made more than 40 years ago using an analysis of total cholesterol. Among controls, about 1 in 217 carried an LDLR coding-sequence mutation and had plasma LDL cholesterol > 190 mg dl⁻¹. At apolipoprotein A-V (APOA5), carriers of rare non-synonymous mutations were at 2.2-fold increased risk for MI. When compared with non-carriers, LDLR mutation carriers had higher plasma LDL cholesterol, whereas APOA5 mutation carriers had higher plasma triglycerides. Recent evidence has connected MI risk with coding-sequence mutations at two genes functionally related to APOA5, namely lipoprotein lipase and apolipoprotein C-III (refs 18, 19). Combined, these observations suggest that, as well as LDL cholesterol, disordered metabolism of triglyceride-rich lipoproteins contributes to MI risk.


Subject(s)
Alleles , Apolipoproteins A/genetics , Exome/genetics , Genetic Predisposition to Disease/genetics , Myocardial Infarction/genetics , Receptors, LDL/genetics , Age Factors , Age of Onset , Apolipoprotein A-V , Case-Control Studies , Cholesterol, LDL/blood , Coronary Artery Disease/genetics , Female , Genetics, Population , Heterozygote , Humans , Male , Middle Aged , Mutation/genetics , Myocardial Infarction/blood , National Heart, Lung, and Blood Institute (U.S.) , Triglycerides/blood , United States
14.
Nat Genet ; 46(9): 944-50, 2014 Sep.
Article in English | MEDLINE | ID: mdl-25086666

ABSTRACT

Spontaneously arising (de novo) mutations have an important role in medical genetics. For diseases with extensive locus heterogeneity, such as autism spectrum disorders (ASDs), the signal from de novo mutations is distributed across many genes, making it difficult to distinguish disease-relevant mutations from background variation. Here we provide a statistical framework for the analysis of excesses in de novo mutation per gene and gene set by calibrating a model of de novo mutation. We applied this framework to de novo mutations collected from 1,078 ASD family trios, and, whereas we affirmed a significant role for loss-of-function mutations, we found no excess of de novo loss-of-function mutations in cases with IQ above 100, suggesting that the role of de novo mutations in ASDs might reside in fundamental neurodevelopmental processes. We also used our model to identify ∼1,000 genes that are significantly lacking in functional coding variation in non-ASD samples and are enriched for de novo loss-of-function mutations identified in ASD cases.


Subject(s)
Child Development Disorders, Pervasive/genetics , Mutation , Exome , Female , Genetic Code , Genetic Predisposition to Disease , Genetics, Medical/methods , Humans , Male
15.
Genome Biol ; 15(6): R88, 2014 Jun 30.
Article in English | MEDLINE | ID: mdl-24980144

ABSTRACT

BACKGROUND: Population differentiation has proved to be effective for identifying loci under geographically localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes. RESULTS: We demonstrate that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively. CONCLUSIONS: We identify known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.


Subject(s)
Genome, Human , INDEL Mutation , Polymorphism, Single Nucleotide , Gene Frequency , Genetic Drift , Humans , Selection, Genetic , Sequence Analysis, DNA
16.
N Engl J Med ; 371(1): 22-31, 2014 Jul 03.
Article in English | MEDLINE | ID: mdl-24941081

ABSTRACT

BACKGROUND: Plasma triglyceride levels are heritable and are correlated with the risk of coronary heart disease. Sequencing of the protein-coding regions of the human genome (the exome) has the potential to identify rare mutations that have a large effect on phenotype. METHODS: We sequenced the protein-coding regions of 18,666 genes in each of 3734 participants of European or African ancestry in the Exome Sequencing Project. We conducted tests to determine whether rare mutations in coding sequence, individually or in aggregate within a gene, were associated with plasma triglyceride levels. For mutations associated with triglyceride levels, we subsequently evaluated their association with the risk of coronary heart disease in 110,970 persons. RESULTS: An aggregate of rare mutations in the gene encoding apolipoprotein C3 (APOC3) was associated with lower plasma triglyceride levels. Among the four mutations that drove this result, three were loss-of-function mutations: a nonsense mutation (R19X) and two splice-site mutations (IVS2+1G→A and IVS3+1G→T). The fourth was a missense mutation (A43T). Approximately 1 in 150 persons in the study was a heterozygous carrier of at least one of these four mutations. Triglyceride levels in the carriers were 39% lower than levels in noncarriers (P<1×10⁻²⁰), and circulating levels of APOC3 in carriers were 46% lower than levels in noncarriers (P=8×10⁻¹⁰). The risk of coronary heart disease among 498 carriers of any rare APOC3 mutation was 40% lower than the risk among 110,472 noncarriers (odds ratio, 0.60; 95% confidence interval, 0.47 to 0.75; P=4×10⁻⁶). CONCLUSIONS: Rare mutations that disrupt APOC3 function were associated with lower levels of plasma triglycerides and APOC3. Carriers of these mutations were found to have a reduced risk of coronary heart disease. (Funded by the National Heart, Lung, and Blood Institute and others.).


Subject(s)
Apolipoprotein C-III/genetics , Coronary Disease/genetics , Mutation , Triglycerides/blood , Apolipoprotein C-III/blood , Black People/genetics , Coronary Disease/blood , Exome , Genotype , Heterozygote , Humans , Liver/pathology , Risk Factors , Sequence Analysis, DNA , White People/genetics
17.
Nature ; 506(7487): 185-90, 2014 Feb 13.
Article in English | MEDLINE | ID: mdl-24463508

ABSTRACT

Schizophrenia is a common disease with a complex aetiology, probably involving multiple and heterogeneous genetic factors. Here, by analysing the exome sequences of 2,536 schizophrenia cases and 2,543 controls, we demonstrate a polygenic burden primarily arising from rare (less than 1 in 10,000), disruptive mutations distributed across many genes. Particularly enriched gene sets include the voltage-gated calcium ion channel and the signalling complex formed by the activity-regulated cytoskeleton-associated scaffold protein (ARC) of the postsynaptic density, sets previously implicated by genome-wide association and copy-number variation studies. Similar to reports in autism, targets of the fragile X mental retardation protein (FMRP, product of FMR1) are enriched for case mutations. No individual gene-based test achieves significance after correction for multiple testing and we do not detect any alleles of moderately low frequency (approximately 0.5 to 1 per cent) and moderately large effect. Taken together, these data suggest that population-based exome sequencing can discover risk alleles and complements established gene-mapping paradigms in neuropsychiatric disease.


Subject(s)
Multifactorial Inheritance/genetics , Mutation/genetics , Schizophrenia/genetics , Autistic Disorder/genetics , Calcium Channels/genetics , Cytoskeletal Proteins/genetics , DNA Copy Number Variations/genetics , Disks Large Homolog 4 Protein , Female , Fragile X Mental Retardation Protein/metabolism , Genome-Wide Association Study , Humans , Intellectual Disability/genetics , Intracellular Signaling Peptides and Proteins/genetics , Male , Membrane Proteins/genetics , Nerve Tissue Proteins/genetics , Receptors, N-Methyl-D-Aspartate/genetics
18.
PLoS Genet ; 9(4): e1003443, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23593035

ABSTRACT

We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure the distribution of rare variation was similar for data from different centers. This proved straightforward by filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analyses of association was an obvious choice, but should statistics for association be combined across centers (meta-analysis) or should data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies by the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.


Subject(s)
Child Development Disorders, Pervasive/genetics , Exome , Genome-Wide Association Study , Case-Control Studies , Child , Child Development Disorders, Pervasive/physiopathology , Genetic Predisposition to Disease , Genetic Variation , Humans , Population Control , Sequence Analysis, DNA , Software
19.
Neuron ; 77(2): 235-42, 2013 Jan 23.
Article in English | MEDLINE | ID: mdl-23352160

ABSTRACT

To characterize the role of rare complete human knockouts in autism spectrum disorders (ASDs), we identify genes with homozygous or compound heterozygous loss-of-function (LoF) variants (defined as nonsense and essential splice sites) from exome sequencing of 933 cases and 869 controls. We identify a 2-fold increase in complete knockouts of autosomal genes with low rates of LoF variation (≤ 5% frequency) in cases and estimate a 3% contribution to ASD risk by these events, confirming this observation in an independent set of 563 probands and 4,605 controls. Outside the pseudoautosomal regions on the X chromosome, we similarly observe a significant 1.5-fold increase in rare hemizygous knockouts in males, contributing to another 2% of ASDs in males. Taken together, these results provide compelling evidence that rare autosomal and X chromosome complete gene knockouts are important inherited risk factors for ASD.


Subject(s)
Child Development Disorders, Pervasive/diagnosis , Child Development Disorders, Pervasive/genetics , Demography/methods , Gene Deletion , Loss of Heterozygosity/genetics , Case-Control Studies , Child Development Disorders, Pervasive/epidemiology , Child, Preschool , Chromosomes, Human, X/genetics , Female , Genetic Variation/genetics , Homozygote , Humans , Linkage Disequilibrium/genetics , Male , Risk Factors
20.
Curr Protoc Bioinformatics ; 43: 11.10.1-11.10.33, 2013.
Article in English | MEDLINE | ID: mdl-25431634

ABSTRACT

This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls that can be used in downstream analyses. The complete workflow includes the core NGS data processing steps that are necessary to make the raw data suitable for analysis by the GATK, as well as the key methods involved in variant discovery using the GATK.


Subject(s)
Genetic Variation , Genome, Human , Software , Calibration , Databases, Genetic , Haploidy , Haplotypes/genetics , Humans , Molecular Sequence Annotation , Polymorphism, Single Nucleotide/genetics , Sequence Alignment