GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.
Max T. Zvyagin; Alexander Brace; Kyle Hippe; Yuntian Deng; Bin Zhang; Cindy Orozco Bohorquez; Austin Clyde; Bharat Kale; Danilo Perez-Rivera; Heng Ma; Carla M. Mann; Michael Irvin; J. Gregory Pauloski; Logan Ward; Valerie Hayot; Murali Emani; Sam Foreman; Zhen Xie; Diangen Lin; Maulik Shukla; Weili Nie; Josh Romero; Christian Dallago; Arash Vahdat; Chaowei Xiao; Thomas Gibbs; Ian Foster; James J. Davis; Michael E. Papka; Thomas Brettin; Anima Anandkumar; Venkatram Vishwanath; Arvind Ramanathan.
Affiliation
  • Max T. Zvyagin; Argonne National Laboratory
  • Alexander Brace; University of Chicago
  • Kyle Hippe; Argonne National Laboratory
  • Yuntian Deng; NVIDIA Inc
  • Bin Zhang; Cerebras Systems
  • Cindy Orozco Bohorquez; Cerebras Systems
  • Austin Clyde; University of Chicago
  • Bharat Kale; Northern Illinois University
  • Danilo Perez-Rivera; Argonne National Laboratory
  • Heng Ma; Argonne National Laboratory
  • Carla M. Mann; Argonne National Laboratory
  • Michael Irvin; Argonne National Laboratory
  • J. Gregory Pauloski; University of Chicago
  • Logan Ward; Argonne National Laboratory
  • Valerie Hayot; Argonne National Laboratory
  • Murali Emani; Argonne National Laboratory
  • Sam Foreman; Argonne National Laboratory
  • Zhen Xie; Argonne National Laboratory
  • Diangen Lin; University of Chicago
  • Maulik Shukla; Argonne National Laboratory
  • Weili Nie; NVIDIA Inc
  • Josh Romero; NVIDIA Inc
  • Christian Dallago; NVIDIA Inc
  • Arash Vahdat; NVIDIA Inc
  • Chaowei Xiao; NVIDIA Inc
  • Thomas Gibbs; NVIDIA Inc
  • Ian Foster; Argonne National Laboratory
  • James J. Davis; Argonne National Laboratory
  • Michael E. Papka; Argonne National Laboratory
  • Thomas Brettin; Argonne National Laboratory
  • Anima Anandkumar; NVIDIA Inc
  • Venkatram Vishwanath; Argonne National Laboratory
  • Arvind Ramanathan; Argonne National Laboratory
Preprint in English | bioRxiv | ID: ppbiorxiv-511571
ABSTRACT
We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs), which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences and finetuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.
License
cc_by_nc_nd
Full text: Available Collection: Preprints Database: bioRxiv Type of study: Prognostic study Language: English Year: 2022 Document type: Preprint