Benchmark datasets for SARS-CoV-2 surveillance bioinformatics.

Xiaoli, Lingzi; Hagey, Jill V; Park, Daniel J; Gulvik, Christopher A; Young, Erin L; Alikhan, Nabil-Fareed; Lawsin, Adrian; Hassell, Norman; Knipe, Kristen; Oakeson, Kelly F; Retchless, Adam C; Shakya, Migun; Lo, Chien-Chi; Chain, Patrick; Page, Andrew J; Metcalf, Benjamin J; Su, Michelle; Rowell, Jessica; Vidyaprakash, Eshaw; Paden, Clinton R; Huang, Andrew D; Roellig, Dawn; Patel, Ketan; Winglee, Kathryn; Weigand, Michael R; Katz, Lee S

Xiaoli L; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Hagey JV; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Park DJ; Broad Institute of MIT and Harvard, Cambridge, MA, United States of America.
Gulvik CA; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Young EL; Utah Public Health Laboratory, Salt Lake City, UT, United States of America.
Alikhan NF; Quadram Institute Bioscience, Norwich Research Park, Norwich, United Kingdom.
Lawsin A; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Hassell N; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Knipe K; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Oakeson KF; Utah Public Health Laboratory, Salt Lake City, UT, United States of America.
Retchless AC; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Shakya M; Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
Lo CC; Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
Chain P; Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
Page AJ; Quadram Institute Bioscience, Norwich Research Park, Norwich, United Kingdom.
Metcalf BJ; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Su M; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Rowell J; SARS-CoV-2 Emerging Variant Sequencing Project Dry Lab Group Laboratory and Testing Task Force COVID-19 Emergency Response, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Vidyaprakash E; SARS-CoV-2 Emerging Variant Sequencing Project Dry Lab Group Laboratory and Testing Task Force COVID-19 Emergency Response, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Paden CR; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Huang AD; SARS-CoV-2 Emerging Variant Sequencing Project Dry Lab Group Laboratory and Testing Task Force COVID-19 Emergency Response, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Roellig D; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Patel K; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Winglee K; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Weigand MR; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
Katz LS; Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.

PeerJ ; 10: e13821, 2022.

Article in English | MEDLINE | ID: covidwho-2010486

ABSTRACT

Background:

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled with an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These bioinformatics tools are the means to major actionable results: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable and actionable identification and classification. Additionally, the pandemic has required public health laboratories to rapidly reach high-throughput proficiency in sequencing library preparation and downstream data analysis. However, both processes can be limited by the lack of a standardized sequence dataset.

Methods:

We identified six SARS-CoV-2 sequence datasets from recent publications, public databases and internal resources. In addition, we created a method to mine public databases to identify representative genomes for these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we utilized a previously published datasets format, which describes accession information and whole dataset information. Additionally, a script from the same publication has been enhanced to download and verify all data from this study.
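The download-and-verify step mentioned above is not detailed in this abstract, but the keyword "sha256" suggests checksum verification. As a minimal sketch of that idea only (not the authors' script), the following Python computes a file's SHA-256 digest and compares it with an expected value. The function names `sha256_of_file` and `verify_download` are hypothetical, and the expected checksum is assumed to come from the dataset description.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large sequence files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's digest matches the expected checksum.

    The comparison ignores case and surrounding whitespace in the
    expected value (hypothetical helper, not part of the published script).
    """
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A caller would typically treat a `False` result as a corrupted or incomplete download and fetch the file again before running any analysis.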

Results:

The benchmark datasets focus on the two most widely used sequencing platforms: long read sequencing data from the Oxford Nanopore Technologies platform and short read sequencing data from the Illumina platform. There are six datasets: three were derived from recent publications; two were derived from data mining public databases to answer common questions not covered by published datasets; one unique dataset representing common sequence failures was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2.

Discussion:

The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.
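To illustrate what calibrating QC thresholds against a benchmark dataset might look like in practice, here is a minimal Python sketch that flags which metrics of a sample fall below configurable cutoffs. The metric names (`percent_covered`, `mean_depth`) and the default cutoff values are assumptions chosen for illustration; they are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QCThresholds:
    # Hypothetical cutoffs; a laboratory would tune these against benchmark data.
    min_percent_covered: float = 90.0
    min_mean_depth: float = 100.0


def qc_failures(metrics, thresholds=QCThresholds()):
    """Return the names of the metrics that fall below their thresholds.

    `metrics` is a dict with the (assumed) keys 'percent_covered'
    and 'mean_depth'. An empty list means the sample passes.
    """
    failures = []
    if metrics["percent_covered"] < thresholds.min_percent_covered:
        failures.append("percent_covered")
    if metrics["mean_depth"] < thresholds.min_mean_depth:
        failures.append("mean_depth")
    return failures
```

Running such a check over a dataset of known failures and a dataset of known good samples is one way to see whether a given set of cutoffs separates the two.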

Keywords

Benchmarking; COVID-19; Standardization; WGS; sha256

Full text: Available | Collection: International databases | Database: MEDLINE | Type of study: Reviews | Topics: Variants | Language: English | Journal: PeerJ | Year: 2022 | Document Type: Article | Affiliation country: Peerj.13821
