Catwalk: identifying closely related sequences in large microbial sequence databases.
Volk, Denis; Yang-Turner, Fan; Didelot, Xavier; Crook, Derrick W; Wyllie, David.
  • Volk D; Nuffield Department of Medicine, University of Oxford, Oxford, UK.
  • Yang-Turner F; Nuffield Department of Medicine, University of Oxford, Oxford, UK.
  • Didelot X; Present address: UKRI Science and Technologies Facilities Council, Harwell, UK.
  • Crook DW; School of Life Sciences, University of Warwick, Coventry CV4 7AL, UK.
  • Wyllie D; Department of Statistics, University of Warwick, Coventry CV4 7AL, UK.
Microb Genom ; 8(6)2022 06.
Article in English | MEDLINE | ID: covidwho-1909085
ABSTRACT
There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other bacterial or viral sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software for detecting highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance. It can be accessed via command-line, REST API or web server interfaces. We tested Catwalk using both SARS-CoV-2 and Mycobacterium tuberculosis genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 µs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8% of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or M. tuberculosis genome amidst millions of samples.
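The idea behind reference-based compression can be illustrated with a minimal Python sketch. This is not Catwalk's implementation (which the abstract says is written in Nim); the function names `compress` and `snp_distance`, the per-base set layout, and the choice to ignore positions that are unknown (`N`) in either sequence are all assumptions made for illustration. Each sequence is stored only as the positions where it differs from a shared reference, and two sequences are compared using set operations on those positions.

```python
from typing import Dict, Optional, Set

BASES = "ACGT"


def compress(reference: str, sequence: str) -> Dict[str, Set[int]]:
    """Keep only the positions where `sequence` differs from `reference`.

    Positions are grouped by the base observed; anything other than
    A, C, G or T is recorded under "N" (unknown). Hypothetical layout.
    """
    if len(reference) != len(sequence):
        raise ValueError("sequence must be aligned to the reference")
    diffs: Dict[str, Set[int]] = {key: set() for key in BASES + "N"}
    for pos, (ref_base, seq_base) in enumerate(zip(reference, sequence)):
        if seq_base == ref_base:
            continue  # identical to the reference: nothing to store
        diffs[seq_base if seq_base in BASES else "N"].add(pos)
    return diffs


def snp_distance(a: Dict[str, Set[int]],
                 b: Dict[str, Set[int]],
                 cutoff: int) -> Optional[int]:
    """Count differing positions between two compressed sequences.

    Positions unknown in either sequence are skipped (an assumption).
    Returns None when the distance exceeds `cutoff`.
    """
    unknown = a["N"] | b["N"]
    differing: Set[int] = set()
    for base in BASES:
        # A position differs if only one sequence carries this variant base.
        differing |= a[base] ^ b[base]
    differing -= unknown
    distance = len(differing)
    return distance if distance <= cutoff else None


reference = "ACGTACGTAC"
seq_a = compress(reference, "ACGAACGTAC")  # one substitution at position 3
seq_b = compress(reference, "ACGCACNTAC")  # different base at 3, unknown at 6
seq_c = compress(reference, "TCGTACGTAT")  # substitutions at positions 0 and 9

print(snp_distance(seq_a, seq_b, cutoff=5))  # 1
print(snp_distance(seq_a, seq_c, cutoff=2))  # None (distance 3 exceeds cutoff)
```

Because closely related genomes differ from the reference at few positions, the stored sets stay small, which is what makes both storage and comparison cheap in this scheme.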

Full text: Available Collection: International databases Database: MEDLINE Main subject: COVID-19 / Mycobacterium tuberculosis Type of study: Cohort study / Observational study / Prognostic study Limits: Humans Language: English Year: 2022 Document Type: Article Affiliation country: Mgen.0.000850
