Results 1 - 15 of 15
1.
Sci Rep ; 7: 46148, 2017 04 07.
Article in English | MEDLINE | ID: mdl-28387241

ABSTRACT

The Personal Genome Project (PGP) is an effort to enroll many participants to create an open-access repository of genome, health, and trait data for research. However, PGP participants are not enrolled to study any specific traits, and each participant chooses which phenotypes to disclose. To measure the extent of and willingness for phenotype reporting, and to encourage and guide participants to contribute phenotypes, we developed an algorithm to score and rank the phenotypes and participants of the PGP. The scoring algorithm calculates a participation index (P-index) for every participant, where 0 indicates no reported phenotypes and 100 indicates complete phenotype reporting. We calculated the P-index for all 5,015 participants in the PGP; values ranged from 0 to 96.7. We found that participants mainly have either high scores (P-index > 90, 29.5%) or low scores (P-index < 10, 57.8%). Although there are significantly more male than female participants (1,793 versus 1,271), females tend to have higher P-indexes on average (P = 0.015). We also report P-indexes by demographics; states such as Missouri and Massachusetts have higher P-indexes than states such as Utah and Minnesota. The P-index can therefore be used as an unbiased way to measure and rank participants' phenotypic contributions to the PGP.
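
The abstract does not give the exact scoring formula, so the sketch below only illustrates the general idea under a stated assumption: a P-index computed as the percentage of available phenotype fields a participant has completed. The function name and field counts are hypothetical.

```python
def p_index(reported_fields, total_fields):
    """Hypothetical P-index: share of available phenotype fields a
    participant has reported, scaled to 0-100. The published
    algorithm may weight or group fields differently."""
    if total_fields == 0:
        return 0.0
    return 100.0 * len(reported_fields) / total_fields

# Illustrative only: 29 of 30 phenotype fields completed
print(round(p_index(set(range(29)), 30), 1))  # 96.7
```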


Subject(s)
Phenotype , Algorithms , Cohort Studies , Disease , Female , Genome, Human , Geography , Humans , Male , Quantitative Trait, Heritable , Surveys and Questionnaires , United States
2.
J Mol Diagn ; 19(3): 417-426, 2017 05.
Article in English | MEDLINE | ID: mdl-28315672

ABSTRACT

A national workgroup convened by the Centers for Disease Control and Prevention identified principles and made recommendations for standardizing the description of sequence data contained within the variant file generated during clinical next-generation sequence analysis for diagnosing human heritable conditions. The specifications for variant files were initially developed to be flexible with regard to content representation in order to support a variety of research applications. This flexibility permits variation in how sequence findings are described, which depends, in part, on the conventions used. For clinical laboratory testing, this poses a problem because these differences can compromise the ability to compare sequence findings among laboratories to confirm results and to query databases to identify clinically relevant variants. To provide a more consistent representation of sequence findings described within variant files, the workgroup made several recommendations concerning alignment to a common reference sequence, variant caller settings, use of genomic coordinates, and gene and variant naming conventions. These recommendations were considered with regard to the existing variant file specifications presently used in the clinical setting. Adoption of these recommendations is anticipated to reduce the potential for ambiguity in describing sequence findings and to facilitate the sharing of genomic data among clinical laboratories and other entities.
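
One representation problem the workgroup addresses is that the same variant can be written in several equivalent ways in a variant file. As a rough illustration (not the workgroup's specification), the sketch below trims a VCF-style record to a minimal representation; full normalization would also left-align indels against the reference sequence.

```python
def minimal_representation(pos, ref, alt):
    """Trim shared trailing and leading bases from a VCF-style
    variant so equivalent records compare equal. Full normalization
    also requires left-alignment against the reference sequence."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]              # drop shared suffix
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1  # drop shared prefix
    return pos, ref, alt

# A padded substitution reduces to a simple SNP (coordinates hypothetical)
print(minimal_representation(500, "TAG", "TCG"))  # (501, 'A', 'C')
```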


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Databases, Genetic , Genetic Variation/genetics , Humans , Software
3.
Gigascience ; 5(1): 42, 2016 10 11.
Article in English | MEDLINE | ID: mdl-27724973

ABSTRACT

BACKGROUND: Since the completion of the Human Genome Project in 2003, it is estimated that more than 200,000 individual whole human genomes have been sequenced, a stunning accomplishment in such a short period of time. However, most of these were sequenced without experimental haplotype data and are therefore missing an important aspect of genome biology. In addition, much of the genomic data is not available to the public and lacks phenotypic information. FINDINGS: As part of the Personal Genome Project, blood samples from 184 participants were collected and processed using Complete Genomics' Long Fragment Read (LFR) technology. Here, we present the experimental whole-genome haplotyping and sequencing of these samples to an average read coverage depth of 100X. This is approximately three-fold higher than the read coverage applied to most whole human genome assemblies and ensures high-quality results. Currently, 114 genomes from this dataset are freely available in the GigaDB repository and are associated with rich phenotypic data; the remaining 70 should be added in the near future as they are approved through the PGP data release process. For reproducibility analyses, 20 genomes were sequenced at least twice using independent LFR barcoded libraries. Seven genomes were also sequenced using Complete Genomics' standard non-barcoded library process. In addition, we report 2.6 million high-quality, rare variants not previously identified in the Single Nucleotide Polymorphism database (dbSNP) or the 1000 Genomes Project Phase 3 data. CONCLUSIONS: These genomes represent a unique source of haplotype and phenotype data for the scientific community and should help to expand our understanding of human genome evolution and function.
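
For readers unfamiliar with the "100X" figure, average read coverage is conventionally estimated as total sequenced bases divided by genome size. The sketch below uses purely illustrative numbers, not values from the study.

```python
def mean_coverage(num_reads, read_length_bp, genome_size_bp):
    """Lander-Waterman style estimate of average read coverage:
    total sequenced bases divided by the genome size."""
    return num_reads * read_length_bp / genome_size_bp

# Illustrative only: ~3.1 Gb genome, 100 bp reads
print(mean_coverage(3_100_000_000, 100, 3_100_000_000))  # 100.0
```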


Subject(s)
Genome, Human , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , DNA/blood , Haplotypes , Humans , Reproducibility of Results
4.
Sci Data ; 3: 160025, 2016 Jun 07.
Article in English | MEDLINE | ID: mdl-27271295

ABSTRACT

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazi Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and for improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
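
Reference materials like these are typically used to benchmark variant-calling pipelines. The sketch below shows the usual precision/recall comparison against a truth set in highly simplified form; dedicated comparison tools also normalize variants and restrict evaluation to high-confidence regions. The variant tuples are hypothetical.

```python
def benchmark(called, truth):
    """Compare called variants against a benchmark truth set.
    Variants are (chrom, pos, ref, alt) tuples; real comparisons
    additionally normalize calls and use confident-region filters."""
    tp = len(called & truth)
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

truth = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}
called = {("chr1", 1000, "A", "G"), ("chr1", 3000, "G", "A")}
print(benchmark(called, truth))  # (0.5, 0.5)
```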


Subject(s)
Benchmarking , Genome, Human , Exome , Genomics , Humans , INDEL Mutation
5.
Genome Med ; 6(2): 10, 2014 Feb 28.
Article in English | MEDLINE | ID: mdl-24713084

ABSTRACT

BACKGROUND: Since its initiation in 2005, the Harvard Personal Genome Project has enrolled thousands of volunteers interested in publicly sharing their genome, health, and trait data. Because these data are highly identifiable, we use an 'open consent' framework that purposefully excludes promises about privacy and requires participants to demonstrate comprehension prior to enrollment. DISCUSSION: Our model of non-anonymous, public genomes has led us to a highly participatory model of researcher-participant communication and interaction. The participants, who are highly committed volunteers, pursue and donate research-relevant datasets on their own initiative and are actively engaged in conversations with both our staff and other Personal Genome Project participants. We have quantitatively assessed these communications and donations, and we report our experiences with returning research-grade whole-genome data to participants. We also describe some of the community growth and discussion that has occurred around our project. SUMMARY: We find that public, non-anonymous data are valuable and lead to a participatory research model, which we encourage others to consider. The implementation of this model is greatly facilitated by web-based tools and methods and by participant education. The results are long-term, proactive participant involvement and the growth of a community that benefits both researchers and participants.

6.
Nature ; 487(7406): 190-5, 2012 Jul 11.
Article in English | MEDLINE | ID: mdl-22785314

ABSTRACT

Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, long fragment read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ∼100 picograms of human DNA per sample. Up to 97% of the heterozygous single nucleotide variants were assembled into long haplotype contigs. Removal of false positive single nucleotide variants not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 megabases. Cost-effective and accurate genome sequencing and haplotyping from 10-20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications.


Subject(s)
Genome, Human , Genomics/methods , Sequence Analysis, DNA/methods , Alleles , Cell Line , Female , Gene Silencing , Genetic Variation , Haplotypes , Humans , Mutation , Reproducibility of Results , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/standards
7.
Proc Natl Acad Sci U S A ; 109(30): 11920-7, 2012 Jul 24.
Article in English | MEDLINE | ID: mdl-22797899

ABSTRACT

Rapid advances in DNA sequencing promise to enable new diagnostics and individualized therapies. Achieving personalized medicine, however, will require extensive research on highly reidentifiable, integrated datasets of genomic and health information. To assist with this, participants in the Personal Genome Project choose to forgo privacy via our institutional review board-approved "open consent" process. The contribution of public data and samples facilitates both scientific discovery and standardization of methods. We present our findings after enrollment of more than 1,800 participants, including whole-genome sequencing of 10 pilot participant genomes (the PGP-10). We introduce the Genome-Environment-Trait Evidence (GET-Evidence) system. This tool automatically processes genomes and prioritizes both published and novel variants for interpretation. In the process of reviewing the presumed healthy PGP-10 genomes, we find numerous literature references implying serious disease. Although it is sometimes impossible to rule out a late-onset effect, stringent evidence requirements can address the high rate of incidental findings. To that end, we develop a peer-production system for recording and organizing variant evaluations according to standard evidence guidelines, creating a public forum for reaching consensus on the interpretation of clinically relevant variants. Genome analysis becomes a two-step process: using a prioritized list to record variant evaluations, then automatically sorting reviewed variants using these annotations. Genome data, health and trait information, participant samples, and variant interpretations are all shared in the public domain; we invite others to review our results using our participant samples and to contribute to our interpretations. We offer our public resource and methods to further personalized medical research.
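
The abstract describes a two-step workflow: prioritize variants for human review, then sort reviewed variants by their recorded annotations. The sketch below illustrates that kind of tiered prioritization with hypothetical evidence categories; it is not GET-Evidence's actual scoring scheme.

```python
# Hypothetical evidence tiers for triage; GET-Evidence's real
# criteria and weights may differ.
PRIORITY = {"published_pathogenic": 0, "novel_damaging_predicted": 1,
            "published_benign": 2, "unreviewed": 3}

def prioritize(variants):
    """Order variants so those most in need of expert evaluation
    appear first in the review queue."""
    return sorted(variants, key=lambda v: PRIORITY.get(v["evidence"], 3))

queue = [{"id": "varA", "evidence": "unreviewed"},
         {"id": "varB", "evidence": "published_pathogenic"}]
print([v["id"] for v in prioritize(queue)])  # ['varB', 'varA']
```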


Subject(s)
Databases, Genetic , Genetic Variation , Genome, Human/genetics , Phenotype , Precision Medicine/methods , Software , Cell Line , Data Collection , Humans , Precision Medicine/trends , Sequence Analysis, DNA
8.
Hum Mutat ; 33(5): 809-12, 2012 May.
Article in English | MEDLINE | ID: mdl-22431014

ABSTRACT

In the traditional medical genetics setting, metabolic disorders, identified either clinically or through biochemical screening, undergo subsequent single-gene testing to molecularly confirm the diagnosis, provide further insight into natural disease history, and inform disease management, treatment, familial testing, and reproductive options. For decades now, this process has been responsible for saving many lives worldwide. Only recently, though, has it become possible to move in the opposite direction by starting with an individual's whole genome or exome and, guided by these data, studying more minor perturbations in the absolute values and substrate ratios of clinically important biochemical analytes. Genomic individuality can also be used to guide more detailed phenotyping aimed at uncovering milder manifestations of known metabolic diseases. Metabolomic phenotyping in the Personal Genome Project for our first 200+ participants, all of whom are scheduled to have full genome sequences at more than 40× coverage available by May 2012, is aimed at uncovering potential subclinical and preclinical disease states in carriers of known pathogenic mutations and in lesser-known rare variants predicted at the protein level to be pathogenic. Our initial focus targets 88 genes involved in 68 metabolic disturbances with established evidence-based nutritional and/or pharmacological therapy as part of standard medical care.


Subject(s)
Metabolic Diseases/genetics , Metabolome/genetics , Databases, Genetic , Genetic Association Studies , Genetic Testing , Genome, Human , Genome-Wide Association Study , Humans , Phenotype , Sequence Analysis, DNA
9.
PLoS Genet ; 7(9): e1002280, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21935354

ABSTRACT

Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with a history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (<1,000 base pairs). We use family inheritance-state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to the interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.
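
The "ethnically concordant" reference described here substitutes the population major allele wherever the standard reference carries a minor allele. A minimal sketch of that idea, with hypothetical positions and frequencies, follows; the paper's construction from real allele-frequency data is more involved.

```python
def major_allele_reference(ref_seq, allele_freqs):
    """Replace the reference base with the population major allele
    wherever the two differ. allele_freqs maps a 0-based position to
    (major_allele_base, frequency). Illustrative sketch only."""
    seq = list(ref_seq)
    for pos, (base, freq) in allele_freqs.items():
        if freq > 0.5:
            seq[pos] = base
    return "".join(seq)

# Hypothetical: the reference carries the minor allele at position 3
print(major_allele_reference("ACGTACGT", {3: ("C", 0.82)}))  # ACGCACGT
```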


Subject(s)
DNA Mutational Analysis/methods , Genes, Synthetic , Genetic Variation , Genome-Wide Association Study/methods , Thrombophilia/genetics , Alleles , Base Sequence , Female , Genetic Predisposition to Disease , Genome, Human , Genotype , Haplotypes , Humans , Male , Pedigree , Reference Standards , Risk Assessment , Sequence Alignment , Sequence Analysis, DNA
10.
PLoS Genet ; 6(5): e1000954, 2010 May 20.
Article in English | MEDLINE | ID: mdl-20531933

ABSTRACT

While it is widely held that an organism's genomic information should remain constant, several protein families are known to modify it. Members of the AID/APOBEC protein family can deaminate DNA. Similarly, members of the ADAR family can deaminate RNA. Characterizing the scope of these events is challenging. Here we use large genomic data sets, such as the two billion sequences in the NCBI Trace Archive, to look for clusters of mismatches of the same type, which are a hallmark of editing events caused by APOBEC3 and ADAR. We align 603,249,815 traces from the NCBI Trace Archive to their reference genomes. In clusters of mismatches of increasing size, at least one systematic sequencing error dominates the results (G-to-A). It is still present in mismatches at 99% accuracy and only vanishes in mismatches at 99.99% accuracy or higher. The error appears to have entered into about 1% of the HapMap, possibly affecting other users who rely on this resource. Further investigation, using stringent quality thresholds, uncovers thousands of mismatch clusters with no apparent defects in their chromatograms. These traces provide the first reported candidates of endogenous DNA editing in humans, further elucidate RNA editing in humans and mice, and also reveal, for the first time, extensive RNA editing in Xenopus tropicalis. We show that the NCBI Trace Archive provides a valuable resource for the investigation of DNA and RNA editing, and sets the stage for a comprehensive mapping of editing events in large-scale genomic datasets.
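
The core signal used here is a run of mismatches of the same type (for example G-to-A) packed closely together in one trace. A simplified clustering sketch follows; the window size and minimum cluster size are hypothetical, not the thresholds used in the study.

```python
def mismatch_clusters(mismatches, min_size=4, max_span=100):
    """Group consecutive same-type mismatches that fall within
    max_span bases of the cluster start; report clusters with at
    least min_size members. mismatches is a position-sorted list
    of (position, ref_base, read_base) tuples."""
    clusters, current = [], []
    for pos, ref, alt in mismatches:
        if current and ((ref, alt) != current[0][1:]
                        or pos - current[0][0] > max_span):
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append((pos, ref, alt))
    if len(current) >= min_size:
        clusters.append(current)
    return clusters

hits = mismatch_clusters([(10, "G", "A"), (15, "G", "A"),
                          (22, "G", "A"), (30, "G", "A")])
print(len(hits))  # 1
```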


Subject(s)
DNA/genetics , Genomics , RNA Editing , APOBEC Deaminases , Adenosine Deaminase/genetics , Base Pair Mismatch , Cytidine Deaminase , Cytosine Deaminase/genetics , Humans , Multigene Family , RNA-Binding Proteins
11.
Lancet ; 375(9725): 1525-35, 2010 May 01.
Article in English | MEDLINE | ID: mdl-20435227

ABSTRACT

BACKGROUND: The cost of genomic information has fallen steeply, but the clinical translation of genetic risk estimates remains unclear. We aimed to undertake an integrated analysis of a complete human genome in a clinical context. METHODS: We assessed a patient with a family history of vascular disease and early sudden death. Clinical assessment included analysis of this patient's full genome sequence, risk prediction for coronary artery disease, screening for causes of sudden cardiac death, and genetic counselling. Genetic analysis included the development of novel methods for the integration of whole genome and clinical risk. Disease and risk analysis focused on prediction of genetic risk of variants associated with mendelian disease, recognised drug responses, and pathogenicity for novel variants. We queried disease-specific mutation databases and pharmacogenomics databases to identify genes and mutations with known associations with disease and drug response. We estimated post-test probabilities of disease by applying likelihood ratios derived from integration of multiple common variants to age-appropriate and sex-appropriate pre-test probabilities. We also accounted for gene-environment interactions and conditionally dependent risks. FINDINGS: Analysis of 2.6 million single nucleotide polymorphisms and 752 copy number variations showed increased genetic risk for myocardial infarction, type 2 diabetes, and some cancers. We discovered rare variants in three genes that are clinically associated with sudden cardiac death: TMEM43, DSP, and MYBPC3. A variant in LPA was consistent with a family history of coronary artery disease. The patient had a heterozygous null mutation in CYP2C19 suggesting probable clopidogrel resistance, several variants associated with a positive response to lipid-lowering therapy, and variants in CYP4F2 and VKORC1 that suggest he might have a low initial dosing requirement for warfarin. Many variants of uncertain importance were reported. INTERPRETATION: Although challenges remain, our results suggest that whole-genome sequencing can yield useful and clinically relevant information for individual patients. FUNDING: National Institute of General Medical Sciences; National Heart, Lung, and Blood Institute; National Human Genome Research Institute; Howard Hughes Medical Institute; National Library of Medicine; Lucile Packard Foundation for Children's Health; Hewlett Packard Foundation; Breetwor Family Foundation.
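
The risk-integration step described in the methods is standard Bayesian updating in odds form: multiply pre-test odds by a likelihood ratio derived from the genotype, then convert back to a probability. A worked sketch with purely illustrative numbers:

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Post-test odds = pre-test odds x likelihood ratio,
    converted back to a probability."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# Illustrative only: 10% age- and sex-based pre-test risk combined
# with a composite genotype likelihood ratio of 2.0
print(round(post_test_probability(0.10, 2.0), 3))  # 0.182
```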


Subject(s)
Genetic Predisposition to Disease/genetics , Genetic Testing , Genome, Human , Sequence Analysis, DNA , Vascular Diseases/genetics , Adult , Aryl Hydrocarbon Hydroxylases/genetics , Carrier Proteins/genetics , Cytochrome P-450 CYP2C19 , Cytochrome P-450 Enzyme System/genetics , Cytochrome P450 Family 4 , Death, Sudden, Cardiac , Desmoplakins/genetics , Environment , Family Health , Genetic Counseling , Humans , Lipoprotein(a)/genetics , Male , Membrane Proteins/genetics , Mixed Function Oxygenases/genetics , Mutation , Osteoarthritis/genetics , Pedigree , Pharmacogenetics , Polymorphism, Single Nucleotide , Risk Assessment , Vitamin K Epoxide Reductases
12.
Nature ; 460(7258): 1011-5, 2009 Aug 20.
Article in English | MEDLINE | ID: mdl-19587683

ABSTRACT

Recent advances in sequencing technologies have initiated an era of personal genome sequences. To date, human genome sequences have been reported for individuals with ancestry in three distinct geographical regions: a Yoruba African, two individuals of northwest European origin, and a person from China. Here we provide a highly annotated, whole-genome sequence for a Korean individual, known as AK1. The genome of AK1 was determined by an exacting, combined approach that included whole-genome shotgun sequencing (27.8x coverage), targeted bacterial artificial chromosome sequencing, and high-resolution comparative genomic hybridization using custom microarrays featuring more than 24 million probes. Alignment to the NCBI reference, a composite of several ethnic clades, disclosed nearly 3.45 million single nucleotide polymorphisms (SNPs), including 10,162 non-synonymous SNPs, and 170,202 deletion or insertion polymorphisms (indels). SNP and indel densities were strongly correlated genome-wide. Applying very conservative criteria yielded highly reliable copy number variants for clinical considerations. Potential medical phenotypes were annotated for non-synonymous SNPs, coding domain indels, and structural variants. The integration of several human whole-genome sequences derived from several ethnic groups will assist in understanding genetic ancestry, migration patterns and population bottlenecks.


Subject(s)
Asian People/genetics , Genome, Human/genetics , Chromosomes, Artificial, Bacterial/genetics , Comparative Genomic Hybridization , Computational Biology , Humans , INDEL Mutation/genetics , Korea , Oligonucleotide Array Sequence Analysis , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA
13.
Genome Res ; 19(9): 1606-15, 2009 Sep.
Article in English | MEDLINE | ID: mdl-19525355

ABSTRACT

Utilizing the full power of next-generation sequencing often requires the ability to perform large-scale multiplex enrichment of many specific genomic loci in multiple samples. Several technologies have recently been developed but await substantial improvement. We report a 10,000-fold improvement of a previously developed padlock-based approach and apply the assay to identifying genetic variation in hypermutable CpG regions across human chromosome 21. From approximately 3 million reads derived from a single Illumina Genome Analyzer lane, approximately 94% (approximately 50,500) of target sites can be observed with at least one read. The uniformity of coverage was also greatly improved; up to 93% and 57% of all targets fell within a 100-fold and 10-fold coverage range, respectively. Alleles at >400,000 target base positions were determined across six subjects and examined for single nucleotide polymorphisms (SNPs), and the concordance with independently obtained genotypes was 98.4%-100%. We detected >500 SNPs not currently in dbSNP, 362 of which were in targeted CpG locations. Transitions in CpG sites were at least 13.7 times more abundant than non-CpG transitions. Fractions of polymorphic CpG sites are lower in CpG-rich regions and show a higher correlation with human-chimpanzee divergence within CpG sites than within non-CpG sites. This is consistent with the hypothesis that methylation rate heterogeneity along chromosomes contributes to mutation rate variation in humans. Our success suggests that targeted CpG resequencing is an efficient way to identify common and rare genetic variation. In addition, the significantly improved padlock capture technology can be readily applied to other projects that require multiplex sample preparation.
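
Counting CpG versus non-CpG transitions requires classifying each substitution in its dinucleotide context. The sketch below shows a simplified classifier; strand handling and edge cases are reduced to a minimum, and the example sequence is hypothetical.

```python
def classify_snp(ref_seq, pos, alt):
    """Label a substitution as a CpG transition, non-CpG transition,
    or transversion, given a 0-based position in the reference."""
    transitions = {("C", "T"), ("T", "C"), ("G", "A"), ("A", "G")}
    ref = ref_seq[pos]
    if (ref, alt) not in transitions:
        return "transversion"
    in_cpg = (ref == "C" and ref_seq[pos + 1:pos + 2] == "G") or \
             (ref == "G" and pos > 0 and ref_seq[pos - 1] == "C")
    return "CpG transition" if in_cpg else "non-CpG transition"

print(classify_snp("AACGTT", 2, "T"))  # CpG transition (C of a CpG -> T)
```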


Subject(s)
Chromosomes, Human, Pair 21/genetics , CpG Islands/genetics , DNA Probes/genetics , Genetic Variation , Genome, Human/genetics , Mutation , Sequence Analysis, DNA/methods , Animals , Computational Biology/methods , Genotype , Humans , Polymorphism, Single Nucleotide , Reproducibility of Results , Sensitivity and Specificity
14.
Bioinformatics ; 25(17): 2194-9, 2009 Sep 01.
Article in English | MEDLINE | ID: mdl-19549630

ABSTRACT

MOTIVATION: Primary data analysis methods are of critical importance in second-generation DNA sequencing. Improved methods have the potential to increase yield and reduce error rates. Openly documented analysis tools enable users to understand the primary data; this is important for the optimization and validity of their scientific work. RESULTS: In this article, we describe Swift, a new tool for performing primary data analysis on the Illumina Solexa Sequencing Platform. Swift is the first tool, outside of the vendor's own software, to complete the full analysis process, from raw images through to base calls. As such, it provides an alternative to, and independent validation of, the vendor-supplied tool. Our results show that Swift is able to increase yield by 13.8% at a comparable error rate.


Subject(s)
Sequence Analysis, DNA/methods , Software , Base Sequence , Computational Biology , Molecular Sequence Data
15.
Proc USENIX Annu Tech Conf ; 2008: 391-404, 2008 May 01.
Article in English | MEDLINE | ID: mdl-20514356

ABSTRACT

We introduce the Free Factory, a platform for deploying data-intensive web services using small clusters of commodity hardware and free software. Independently administered virtual machines called Freegols give application developers the flexibility of a general-purpose web server, along with access to distributed batch processing, cache, and storage services. Each cluster exploits idle RAM and disk space for caching, and reserves disks in each node for high-bandwidth storage. The batch processing service uses a variation of the MapReduce model. Virtualization allows every CPU in the cluster to participate in batch jobs. Each 48-node cluster can achieve 4-8 gigabytes per second of disk I/O. Our intent is to use multiple clusters to process hundreds of simultaneous requests on multi-hundred-terabyte data sets. Currently, our applications achieve 1 gigabyte per second of I/O with 123 disks by scheduling batch jobs on two clusters, one of which is located in a remote data center.
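
The batch service is described as a variation of the MapReduce model. The sketch below is a minimal, single-process illustration of that pattern (map, group by key, reduce); it is not Free Factory's actual API.

```python
from collections import defaultdict

def mapreduce(records, map_fn, reduce_fn):
    """Minimal illustration of the MapReduce pattern: map each record
    to (key, value) pairs, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical example
docs = ["free factory", "free software"]
print(mapreduce(docs,
                lambda doc: [(w, 1) for w in doc.split()],
                lambda key, values: sum(values)))
# {'free': 2, 'factory': 1, 'software': 1}
```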
