GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. (preprint)

Max T. Zvyagin; Alexander Brace; Kyle Hippe; Yuntian Deng; Bin Zhang; Cindy Orozco Bohorquez; Austin Clyde; Bharat Kale; Danilo Perez-Rivera; Heng Ma; Carla M. Mann; Michael Irvin; J. Gregory Pauloski; Logan Ward; Valerie Hayot; Murali Emani; Sam Foreman; Zhen Xie; Diangen Lin; Maulik Shukla; Weili Nie; Josh Romero; Christian Dallago; Arash Vahdat; Chaowei Xiao; Thomas Gibbs; Ian Foster; James J. Davis; Michael E. Papka; Thomas Brettin; Anima Anandkumar; Venkatram Vishwanath; Arvind Ramanathan

This article is a preprint

Preprints are preliminary research reports that have not been certified by peer review. They should not be relied on to guide clinical practice or health-related behavior and should not be reported in the news media as established information.

Preprints posted online allow authors to receive feedback quickly, and the entire scientific community can independently evaluate the work and respond accordingly. These comments are published alongside the preprints for anyone to read and serve as post-publication review.

bioRxiv; 2022.

Preprint in English | bioRxiv | ID: ppzbmed-10.1101.2022.10.10.511571

ABSTRACT

Our work seeks to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and then fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate the scaling of GenSLMs on both GPU-based supercomputers and AI-hardware accelerators, achieving over 1.54 zettaflops in training runs. We present initial scientific insights gleaned from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, noting that their full potential on large biological data is yet to be realized.

Full text: Available | Collection: Preprints | Database: bioRxiv | Language: English | Year: 2022 | Document type: Preprint
