ABSTRACT
Our work seeks to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and then fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate the scaling of GenSLMs on both GPU-based supercomputers and AI-hardware accelerators, achieving over 1.54 zettaflops in training runs. We present initial scientific insights gleaned from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, noting that their full potential on large biological data is yet to be realized.

Full text: Available Collection: Preprints Database: bioRxiv Language: English Year: 2022 Document type: Preprint
