An open resource for accurately benchmarking small variant and reference calls.

Zook, Justin M; McDaniel, Jennifer; Olson, Nathan D; Wagner, Justin; Parikh, Hemang; Heaton, Haynes; Irvine, Sean A; Trigg, Len; Truty, Rebecca; McLean, Cory Y; De La Vega, Francisco M; Xiao, Chunlin; Sherry, Stephen; Salit, Marc.

Nat Biotechnol ; 37(5): 561-566, 2019 May.

Article in English | MEDLINE | ID: mdl-30936564

ABSTRACT

Benchmark small variant calls are required for developing, optimizing and assessing the performance of sequencing and bioinformatics methods. Here, as part of the Genome in a Bottle (GIAB) Consortium, we apply a reproducible, cloud-based pipeline to integrate multiple short- and linked-read sequencing datasets and provide benchmark calls for human genomes. We generate benchmark calls for one previously analyzed GIAB sample, as well as six genomes from the Personal Genome Project. These new genomes have broad, open consent, making this a 'first of its kind' resource that is available to the community for multiple downstream applications. We produce 17% more benchmark single nucleotide variations, 176% more indels and 12% larger benchmark regions than previously published GIAB benchmarks. We demonstrate that this benchmark reliably identifies errors in existing callsets and highlight challenges in interpreting performance metrics when using benchmarks that are not perfect or comprehensive. Finally, we identify strengths and weaknesses of callsets by stratifying performance according to variant type and genome context.
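The stratified assessment mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a comparison tool has already counted true positives, false positives and false negatives within the benchmark regions for each stratum. The `Counts` class, the function names and all numbers below are hypothetical and are not taken from the paper or from any GIAB tool.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Counts:
    tp: int  # query calls that match a benchmark call
    fp: int  # query calls inside benchmark regions with no benchmark match
    fn: int  # benchmark calls that the query callset missed


def precision(c: Counts) -> float:
    total = c.tp + c.fp
    return c.tp / total if total else float("nan")


def recall(c: Counts) -> float:
    total = c.tp + c.fn
    return c.tp / total if total else float("nan")


def stratified_metrics(strata: Dict[str, Counts]) -> Dict[str, Tuple[float, float]]:
    """Return (precision, recall) per stratum, e.g. per variant type."""
    return {name: (precision(c), recall(c)) for name, c in strata.items()}


# Hypothetical counts, for illustration only.
example = {
    "SNV": Counts(tp=990, fp=10, fn=10),
    "indel": Counts(tp=90, fp=10, fn=30),
}
metrics = stratified_metrics(example)
for name, (p, r) in metrics.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

Because these metrics are computed only inside the benchmark regions and against benchmark calls that may themselves contain errors, they are best read as estimates. This is the interpretive caveat the abstract raises for benchmarks that are not perfect or comprehensive.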

Subject(s)

Benchmarking; Computational Biology/trends; Genome, Human/genetics; Genomics/trends; Genetic Variation/genetics; High-Throughput Nucleotide Sequencing; Humans; INDEL Mutation/genetics; Polymorphism, Single Nucleotide; Software/trends

SureChEMBL: a large-scale, chemically annotated patent document database.

Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P.

Nucleic Acids Res ; 44(D1): D1220-8, 2016 Jan 04.

Article in English | MEDLINE | ID: mdl-26582922

ABSTRACT

SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/.

Subject(s)

Databases, Chemical; Patents as Topic; Data Mining; Pharmaceutical Preparations/chemistry

Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data.

Cleary, John G; Braithwaite, Ross; Gaastra, Kurt; Hilbush, Brian S; Inglis, Stuart; Irvine, Sean A; Jackson, Alan; Littin, Richard; Nohzadeh-Malakshah, Sahar; Rathod, Mehul; Ware, David; Trigg, Len; De La Vega, Francisco M.

J Comput Biol ; 21(6): 405-19, 2014 Jun.

Article in English | MEDLINE | ID: mdl-24874280

ABSTRACT

The analysis of whole-genome or exome sequencing data from trios and pedigrees has been successfully applied to the identification of disease-causing mutations. However, most methods used to identify and genotype genetic variants from next-generation sequencing data ignore the relationships between samples, resulting in significant Mendelian errors, false positives and negatives. Here we present a Bayesian network framework that jointly analyzes data from all members of a pedigree using Mendelian segregation priors, yet provides the ability to detect de novo mutations in offspring and is scalable to large pedigrees. We evaluated our method by simulations and analysis of whole-genome sequencing (WGS) data from a 17-individual, 3-generation CEPH pedigree sequenced to 50× average depth. Compared with singleton calling, our family caller produced more high-quality variants and eliminated spurious calls as judged by common quality metrics such as Ti/Tv, Het/Hom ratios, and dbSNP/SNP array data concordance, and by comparing to ground truth variant sets available for this sample. We identify all previously validated de novo mutations in NA12878, concurrent with a 7× precision improvement. Our results show that our method is scalable to large genomics and human disease studies.
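To make the notion of a Mendelian error concrete, here is a minimal sketch of a consistency check on a single trio. It is not the Bayesian network method described in the abstract. It assumes diploid autosomal genotypes encoded as pairs of allele indices (0 for reference, 1 for the first alternate allele), and the function name is hypothetical.

```python
from itertools import product
from typing import Tuple

Genotype = Tuple[int, int]  # unphased pair of allele indices, e.g. (0, 1)


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child genotype can be formed from one maternal and one paternal allele."""
    target = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == target
        for m, f in product(mother, father)
    )


# Heterozygous child with one heterozygous parent: consistent.
print(is_mendelian_consistent((0, 1), (0, 0), (0, 1)))
# Heterozygous child with two homozygous-reference parents: inconsistent.
print(is_mendelian_consistent((0, 1), (0, 0), (0, 0)))
```

An inconsistent site is either a genotyping error in one of the three samples or a candidate de novo mutation. A hard filter like this one cannot tell the two apart, which is the kind of ambiguity that joint calling with segregation priors is meant to address.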

Subject(s)

Genome, Human; High-Throughput Nucleotide Sequencing; Mutation; Pedigree; DNA Mutational Analysis/methods; Humans
