Evaluating large language models for annotating proteins.

Vitale, Rosario; Bugnon, Leandro A; Fenoy, Emilio Luis; Milone, Diego H; Stegmayer, Georgina

Vitale, Rosario; Bugnon, Leandro A; Fenoy, Emilio Luis; Milone, Diego H; Stegmayer, Georgina.

Afiliação

Vitale R; Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina.
Bugnon LA; Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina.
Fenoy EL; Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina.
Milone DH; Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina.
Stegmayer G; Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina.

Brief Bioinform ; 25(3)2024 Mar 27.

Article em En | MEDLINE | ID: mdl-38706315

ABSTRACT

ABSTRACT

In UniProtKB, up to date, there are more than 251 million proteins deposited. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, however at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learningThis requires the use of protein large language models (LLMs), trained with self-supervision on big unnanotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the actual prediction of protein domain annotations. Results are significatively better than state-of-the-art for protein families classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLMs embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a github repo. Full source code and data are available at https//github.com/sinc-lab/llm4pfam.

Assuntos

Bases de Dados de Proteínas; Proteínas; Proteínas/química; Anotação de Sequência Molecular/métodos; Biologia Computacional/métodos; Aprendizado de Máquina

Palavras-chave

large language models; protein annotations; protein families; transfer learning

Texto completo

Adicionar na Minha BVS

Imprimir

XML

PubMed Links

Buscar no Google

Texto completo: 1 Coleções: 01-internacional Base de dados: MEDLINE Assunto principal: Proteínas / Bases de Dados de Proteínas Idioma: En Revista: Brief Bioinform Assunto da revista: BIOLOGIA / INFORMATICA MEDICA Ano de publicação: 2024 Tipo de documento: Article País de afiliação: Argentina País de publicação: Reino Unido

Texto completo

Adicionar na Minha BVS

Imprimir

XML

PubMed Links

Buscar no Google