Results 1 - 5 of 5
1.
Nature ; 487(7406): 190-5, 2012 Jul 11.
Article in English | MEDLINE | ID: mdl-22785314

ABSTRACT

Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, long fragment read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ∼100 picograms of human DNA per sample. Up to 97% of the heterozygous single nucleotide variants were assembled into long haplotype contigs. Removal of false positive single nucleotide variants not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 megabases. Cost-effective and accurate genome sequencing and haplotyping from 10-20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications.


Subject(s)
Genome, Human; Genomics/methods; Sequence Analysis, DNA/methods; Alleles; Cell Line; Female; Gene Silencing; Genetic Variation; Haplotypes; Humans; Mutation; Reproducibility of Results; Sequence Analysis, DNA/economics; Sequence Analysis, DNA/standards
2.
J Comput Biol ; 19(3): 279-92, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22175250

ABSTRACT

Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (<100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.


Subject(s)
Contig Mapping/methods; Genome, Human; Sequence Analysis, DNA/methods; Algorithms; Alleles; Base Sequence; Bayes Theorem; Chromosome Mapping; Computer Simulation; Data Interpretation, Statistical; Humans; Models, Genetic
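
The iterative optimization described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a haploid toy sequence, a uniform read-start model, an independent per-base error rate, and a fixed prior penalty per edit away from the reference. The names `read_log_likelihood`, `log_posterior`, `neighbors`, and `optimize`, and the values of `error_rate` and `edit_penalty`, are hypothetical. The diploid model, the De Bruijn graph seeding, and the correlation-based filter mentioned in the abstract are not modeled.

```python
import math

BASES = "ACGT"

def read_log_likelihood(read, seq, error_rate=0.01):
    """Log-likelihood of one read under a toy model: the read starts at a
    uniformly random position of seq and each base is miscalled independently."""
    n_starts = len(seq) - len(read) + 1
    if n_starts <= 0:
        return float("-inf")
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for i, base in enumerate(read):
            p *= (1 - error_rate) if seq[start + i] == base else error_rate / 3
        total += p
    return math.log(total / n_starts)

def log_posterior(seq, reads, n_edits, edit_penalty=3.0):
    """Unnormalized log posterior: summed read log-likelihoods plus a prior
    that penalizes each edit away from the reference (assumed penalty)."""
    return sum(read_log_likelihood(r, seq) for r in reads) - edit_penalty * n_edits

def neighbors(seq):
    """Yield every sequence one substitution, deletion, or insertion from seq."""
    for i in range(len(seq)):
        for b in BASES:
            if b != seq[i]:
                yield seq[:i] + b + seq[i + 1:]   # substitution
        yield seq[:i] + seq[i + 1:]               # deletion
    for i in range(len(seq) + 1):
        for b in BASES:
            yield seq[:i] + b + seq[i:]           # insertion

def optimize(reference, reads, max_iters=20):
    """Greedy hill climbing: apply the best single-base edit until none helps."""
    seq, n_edits = reference, 0
    best = log_posterior(seq, reads, n_edits)
    for _ in range(max_iters):
        scored = [(log_posterior(c, reads, n_edits + 1), c) for c in neighbors(seq)]
        top_score, top_seq = max(scored)
        if top_score <= best:
            break  # converged to a (possibly local) optimum
        seq, best, n_edits = top_seq, top_score, n_edits + 1
    return seq
```

Because the search is greedy, it can stop at a local optimum, which is the problem the abstract says the local de novo assembly seeding is meant to reduce.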
3.
Science ; 327(5961): 78-81, 2010 Jan 01.
Article in English | MEDLINE | ID: mdl-19892942

ABSTRACT

Genome sequencing of large numbers of individuals promises to advance the understanding, treatment, and prevention of human diseases, among other applications. We describe a genome sequencing platform that achieves efficient imaging and low reagent consumption with combinatorial probe anchor ligation chemistry to independently assay each base from patterned nanoarrays of self-assembling DNA nanoballs. We sequenced three human genomes with this platform, generating an average of 45- to 87-fold coverage per genome and identifying 3.2 to 4.5 million sequence variants per genome. Validation of one genome data set demonstrates a sequence accuracy of about 1 false variant per 100 kilobases. The high accuracy, affordable cost of $4400 for sequencing consumables, and scalability of this platform enable complete human genome sequencing for the detection of rare variants in large-scale genetic studies.


Subject(s)
DNA/chemistry; Genome, Human; Microarray Analysis; Sequence Analysis, DNA/methods; Base Sequence; Computational Biology; Costs and Cost Analysis; DNA/genetics; Databases, Nucleic Acid; Genomic Library; Genotype; Haplotypes; Human Genome Project; Humans; Male; Nanostructures; Nanotechnology; Nucleic Acid Amplification Techniques; Polymorphism, Single Nucleotide; Sequence Analysis, DNA/economics; Sequence Analysis, DNA/instrumentation; Sequence Analysis, DNA/standards; Software
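
To put the accuracy figure in the abstract above in perspective, the short calculation below converts "about 1 false variant per 100 kilobases" into an expected count per genome. The genome length of 3.1 billion bases is an assumption made here for illustration, not a number from the abstract, and the function name `expected_false_variants` is hypothetical.

```python
def expected_false_variants(genome_length_bp, bases_per_false_variant):
    """Expected number of false variant calls across a genome of the given length."""
    return genome_length_bp / bases_per_false_variant

# Assumed haploid human genome length (illustrative value, not from the abstract).
ASSUMED_GENOME_BP = 3_100_000_000

false_calls = expected_false_variants(ASSUMED_GENOME_BP, 100_000)

# Share of all calls that would be false at the low end of the reported
# 3.2 to 4.5 million variants per genome.
false_fraction = false_calls / 3_200_000
```

Under that assumption the rate corresponds to roughly 31,000 false calls per genome, or about 1% of the 3.2 million variants at the low end of the reported range.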
4.
Protein Sci ; 17(1): 54-65, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18042678

ABSTRACT

Metals play a variety of roles in biological processes, and hence their presence in a protein structure can yield vital functional information. Because the residues that coordinate a metal often undergo conformational changes upon binding, detection of binding sites based on simple geometric criteria in proteins without bound metal is difficult. However, aspects of the physicochemical environment around a metal binding site are often conserved even when this structural rearrangement occurs. We have developed a Bayesian classifier using known zinc binding sites as positive training examples and nonmetal binding regions that nonetheless contain residues frequently observed in zinc sites as negative training examples. In order to allow variation in the exact positions of atoms, we average a variety of biochemical and biophysical properties in six concentric spherical shells around the site of interest. At a specificity of 99.8%, this method achieves 75.5% sensitivity in unbound proteins at a positive predictive value of 73.6%. We also test its accuracy on predicted protein structures obtained by homology modeling using templates with 30%-50% sequence identity to the target sequences. At a specificity of 99.8%, we correctly identify at least one zinc binding site in 65.5% of modeled proteins. Thus, in many cases, our model is accurate enough to identify metal binding sites in proteins of unknown structure for which no high sequence identity homologs of known structure exist. Both the source code and a Web interface are available to the public at http://feature.stanford.edu/metals.


Subject(s)
Carrier Proteins/chemistry; Carrier Proteins/metabolism; Zinc/chemistry; Zinc/metabolism; Binding Sites; Carrier Proteins/genetics; Genomics; Models, Biological; Models, Molecular; Protein Conformation; Sensitivity and Specificity
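
The shell-averaging idea in the abstract above can be sketched as follows. This is an illustration under stated assumptions rather than the published method: the shell width of 1.25 Å, the atom representation, and the Gaussian naive Bayes scoring are all choices made here, and the names `shell_features`, `fit_gaussian`, and `log_odds` are hypothetical. The abstract specifies only six concentric spherical shells and a Bayesian classifier.

```python
import math

def shell_features(center, atoms, n_shells=6, shell_width=1.25):
    """Average each atom property within concentric spherical shells.

    atoms: list of ((x, y, z), [property values]). Returns a flat list of
    n_shells * n_properties averages (0.0 for an empty shell)."""
    n_props = len(atoms[0][1])
    sums = [[0.0] * n_props for _ in range(n_shells)]
    counts = [0] * n_shells
    for coords, props in atoms:
        shell = int(math.dist(center, coords) // shell_width)
        if shell < n_shells:  # atoms beyond the outermost shell are ignored
            counts[shell] += 1
            for j, value in enumerate(props):
                sums[shell][j] += value
    return [s / c if c else 0.0 for row, c in zip(sums, counts) for s in row]

def fit_gaussian(rows):
    """Per-feature mean and variance, with a small floor to avoid zero variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-6)
                 for col, m in zip(columns, means)]
    return means, variances

def log_odds(x, pos_model, neg_model, prior_pos=0.5):
    """Naive Bayes log-odds that feature vector x belongs to the positive class."""
    def log_lik(model):
        means, variances = model
        return sum(-0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                   for v, m, var in zip(x, means, variances))
    return math.log(prior_pos / (1 - prior_pos)) + log_lik(pos_model) - log_lik(neg_model)
```

Averaging over a shell, rather than recording exact atom positions, is what lets the description tolerate the conformational changes the abstract mentions: an atom can move within its shell without changing the feature vector.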
5.
BMC Bioinformatics ; 8 Suppl 4: S10, 2007 May 22.
Article in English | MEDLINE | ID: mdl-17570144

ABSTRACT

BACKGROUND: Structural genomics initiatives are producing increasing numbers of three-dimensional (3D) structures for which there is little functional information. Structure-based annotation of molecular function is therefore becoming critical. We previously presented FEATURE, a method for describing microenvironments around functional sites in proteins. However, FEATURE uses supervised machine learning and so is limited to building models for sites of known importance and location. We hypothesized that there are a large number of sites in proteins that are associated with function that have not yet been recognized. Toward that end, we have developed a method for clustering protein microenvironments in order to evaluate the potential for discovering novel sites that have not been previously identified.

RESULTS: We have prototyped a computational method for rapid clustering of millions of microenvironments in order to discover residues whose surrounding environments are similar and which may therefore share a functional or structural role. We clustered nearly 2,000,000 environments from 9,600 protein chains and defined 4,550 clusters. As a preliminary validation, we asked whether known 3D environments associated with PROSITE motifs were "rediscovered". We found examples of clusters highly enriched for residues that share PROSITE sequence motifs.

CONCLUSION: Our results demonstrate that we can cluster protein environments successfully using a simplified representation and K-means clustering algorithm. The rediscovery of known 3D motifs allows us to calibrate the size and intercluster distances that characterize useful clusters. This information will then allow us to find new clusters with similar characteristics that represent novel structural or functional sites.


Subject(s)
Algorithms; Models, Chemical; Models, Molecular; Proteins/chemistry; Proteins/ultrastructure; Sequence Analysis, Protein/methods; Amino Acid Motifs; Binding Sites; Computer Simulation; Imaging, Three-Dimensional/methods; Ligands; Protein Binding; Protein Conformation
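
The K-means clustering named in the abstract above can be sketched in a few lines. This is a generic version of Lloyd's algorithm, not the authors' code: it assumes each microenvironment has already been reduced to a numeric feature vector, and the names `sq_dist` and `kmeans`, the random initialization, and the iteration cap are choices made here for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the clustering has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the two groups of three points. At the scale reported in the abstract (nearly 2,000,000 environments), a vectorized or library implementation would be needed in practice.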