Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format.
Patterns (N Y); 3(9): 100562, 2022 Sep 09.
Article in English | MEDLINE | ID: covidwho-1914886
ABSTRACT
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome data are essential for epidemiology, vaccine development, and tracking emerging variants. Millions of SARS-CoV-2 genomes have been sequenced during the pandemic. However, downloading SARS-CoV-2 genomes from databases is slow and unreliable, largely due to suboptimal choice of compression method. We evaluated the available compressors and found that Nucleotide Archival Format (NAF) would provide a drastic improvement compared with current methods. For Global Initiative on Sharing Avian Flu Data's (GISAID) pre-compressed datasets, NAF would increase efficiency 52.2 times for gzip-compressed data and 3.7 times for xz-compressed data. For DNA DataBank of Japan (DDBJ), NAF would improve throughput 40 times for gzip-compressed data. For GenBank and European Nucleotide Archive (ENA), NAF would accelerate data distribution by a factor of 29.3 compared with uncompressed FASTA. This article provides a tutorial for installing and using NAF. Offering a NAF download option in sequence databases would provide a significant saving of time, bandwidth, and disk space and accelerate biological and medical research worldwide.
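The kind of compressor comparison the abstract describes can be illustrated with a minimal sketch. It does not run NAF and does not reproduce the article's benchmark or figures. It only compares the general-purpose gzip and xz codecs from the Python standard library on synthetic data. The helper names (`make_fasta`, `benchmark`), the genome count and length, and the mutation rate are all hypothetical choices made for illustration.

```python
import gzip
import lzma
import random
import time


def make_fasta(n_genomes=50, genome_len=3000, mutation_rate=0.001, seed=0):
    """Build a synthetic FASTA of near-identical sequences (stand-in data, not real genomes)."""
    rng = random.Random(seed)
    reference = [rng.choice("ACGT") for _ in range(genome_len)]
    records = []
    for i in range(n_genomes):
        # Each record copies the reference and re-draws a small fraction of positions.
        genome = [
            rng.choice("ACGT") if rng.random() < mutation_rate else base
            for base in reference
        ]
        records.append(f">synthetic_{i}\n{''.join(genome)}\n")
    return "".join(records).encode("ascii")


def benchmark(name, compress, decompress, data):
    """Return the compression ratio and decompression time for one codec."""
    packed = compress(data)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    if restored != data:
        raise ValueError(f"{name} round trip did not restore the input")
    return {
        "name": name,
        "ratio": len(data) / len(packed),
        "decompress_s": elapsed,
    }


data = make_fasta()
results = [
    benchmark("gzip", gzip.compress, gzip.decompress, data),
    benchmark("xz", lzma.compress, lzma.decompress, data),
]
for r in results:
    print(f"{r['name']}: ratio {r['ratio']:.1f}x, decompression {r['decompress_s']:.4f} s")
```

Ratios and timings from this sketch depend on the synthetic input and the machine, so they should not be compared with the figures reported in the abstract.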
Full text: Available
Collection: International databases
Database: MEDLINE
Type of study: Experimental Studies
Topics: Vaccines / Variants
Language: English
Journal: Patterns (N Y)
Year: 2022
Document Type: Article
Affiliation country: J.patter.2022.100562