GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.

Max T. Zvyagin; Alexander Brace; Kyle Hippe; Yuntian Deng; Bin Zhang; Cindy Orozco Bohorquez; Austin Clyde; Bharat Kale; Danilo Perez-Rivera; Heng Ma; Carla M. Mann; Michael Irvin; J. Gregory Pauloski; Logan Ward; Valerie Hayot; Murali Emani; Sam Foreman; Zhen Xie; Diangen Lin; Maulik Shukla; Weili Nie; Josh Romero; Christian Dallago; Arash Vahdat; Chaowei Xiao; Thomas Gibbs; Ian Foster; James J. Davis; Michael E. Papka; Thomas Brettin; Anima Anandkumar; Venkatram Vishwanath; Arvind Ramanathan

This article is a Preprint

Affiliation

Max T. Zvyagin; Argonne National Laboratory
Alexander Brace; University of Chicago
Kyle Hippe; Argonne National Laboratory
Yuntian Deng; NVIDIA Inc
Bin Zhang; Cerebras Systems
Cindy Orozco Bohorquez; Cerebras Systems
Austin Clyde; University of Chicago
Bharat Kale; Northern Illinois University
Danilo Perez-Rivera; Argonne National Laboratory
Heng Ma; Argonne National Laboratory
Carla M. Mann; Argonne National Laboratory
Michael Irvin; Argonne National Laboratory
J. Gregory Pauloski; University of Chicago
Logan Ward; Argonne National Laboratory
Valerie Hayot; Argonne National Laboratory
Murali Emani; Argonne National Laboratory
Sam Foreman; Argonne National Laboratory
Zhen Xie; Argonne National Laboratory
Diangen Lin; University of Chicago
Maulik Shukla; Argonne National Laboratory
Weili Nie; NVIDIA Inc
Josh Romero; NVIDIA Inc
Christian Dallago; NVIDIA Inc
Arash Vahdat; NVIDIA Inc
Chaowei Xiao; NVIDIA Inc
Thomas Gibbs; NVIDIA Inc
Ian Foster; Argonne National Laboratory
James J. Davis; Argonne National Laboratory
Michael E. Papka; Argonne National Laboratory
Thomas Brettin; Argonne National Laboratory
Anima Anandkumar; NVIDIA Inc
Venkatram Vishwanath; Argonne National Laboratory
Arvind Ramanathan; Argonne National Laboratory

Preprint in English | bioRxiv | ID: ppbiorxiv-511571

ABSTRACT

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators, utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and a peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.
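
As a rough illustration of what "adapting LLMs for genomic data" can involve, the sketch below turns a nucleotide sequence into integer token IDs. The abstract does not describe the tokenization, so everything here is an assumption made for illustration: non-overlapping 3-nucleotide (codon) tokens, a 64-entry vocabulary over A, C, G, T, and the hypothetical names `codon_tokenize`, `encode`, `VOCAB`, and `UNK_ID`. It is not the authors' code.

```python
from itertools import product

# Assumed vocabulary: all 64 codons over A, C, G, T in lexicographic order.
# One extra ID is reserved for codons containing any other symbol (e.g. N).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
VOCAB = {codon: idx for idx, codon in enumerate(CODONS)}
UNK_ID = len(VOCAB)


def codon_tokenize(sequence):
    """Split a nucleotide string into non-overlapping 3-mers.

    The input is upper-cased, and a trailing partial codon (1 or 2
    leftover nucleotides) is dropped.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def encode(sequence):
    """Map each codon token to its integer ID, using UNK_ID for unknown codons."""
    return [VOCAB.get(token, UNK_ID) for token in codon_tokenize(sequence)]


if __name__ == "__main__":
    print(codon_tokenize("ATGGCCTAA"))
    print(encode("ATGGCCTAA"))
```

A token-ID list like this is the kind of input a transformer-style language model consumes; a real pipeline would also need special tokens (padding, start and end of sequence) and a policy for ambiguous bases, neither of which is modeled here.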

License

CC BY-NC-ND

Full text: Available
Collection: Preprints
Database: bioRxiv
Type of study: Prognostic study
Language: English
Year: 2022
Document type: Preprint
