Machine learning for the identification of respiratory viral attachment machinery from sequence data.

Walker, Kenji C; Shwarts, Maïa; Demidikin, Stepan; Chakravarty, Arijit; Joseph-McCarthy, Diane

Walker KC; Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America.
Shwarts M; Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America.
Demidikin S; Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America.
Chakravarty A; Fractal Therapeutics, Cambridge, Massachusetts, United States of America.
Joseph-McCarthy D; Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America.

PLoS One ; 18(3): e0281642, 2023.

Article in English | MEDLINE | ID: covidwho-2285870

ABSTRACT

At the outset of an emergent viral respiratory pandemic, sequence data is among the first molecular information available. Because viral attachment machinery is a key target for therapeutic and prophylactic interventions, rapid identification of viral "spike" proteins from sequence can significantly accelerate the development of medical countermeasures. For six families of respiratory viruses, covering the vast majority of airborne and droplet-transmitted diseases, host cell entry is mediated by the binding of viral surface glycoproteins to a host cell receptor. This report shows that sequence data for an unknown virus belonging to one of these six families provides sufficient information to identify the protein(s) responsible for viral attachment. Random forest models that take as input a set of respiratory viral sequences can classify each protein as "spike" vs. non-spike based on predicted secondary structure elements alone (with 97.3% correctly classified) or in combination with N-glycosylation related features (with 97.0% correctly classified). Models were validated through 10-fold cross-validation, bootstrapping on a class-balanced set, and an out-of-sample extra-familial validation set. Surprisingly, secondary structural elements and N-glycosylation features were sufficient for model generation. The ability to rapidly identify viral attachment machinery directly from sequence data holds the potential to accelerate the design of medical countermeasures for future pandemics. Furthermore, this approach may be extendable to the identification of other potential viral targets and to viral sequence annotation in general.
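As an illustration of what an "N-glycosylation related feature" computed from sequence alone might look like, the sketch below counts canonical N-linked glycosylation sequons (N-X-S/T, where X is any residue except proline) in a protein sequence. The abstract does not specify which features the authors used, so the function names, the choice of features (sequon count and density per 100 residues), and the example sequence are assumptions made for illustration only, not the paper's method.

```python
import re

# Canonical N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping sequons (e.g. "NNTS") each be counted.
SEQUON = re.compile(r"N(?=[^P][ST])")


def sequon_positions(sequence: str) -> list[int]:
    """Return 0-based indices of the Asn residue in every N-X-S/T sequon."""
    return [m.start() for m in SEQUON.finditer(sequence.upper())]


def sequon_features(sequence: str) -> dict[str, float]:
    """Hypothetical feature vector: sequon count and density per 100 residues."""
    count = len(sequon_positions(sequence))
    length = len(sequence)
    density = 100.0 * count / length if length else 0.0
    return {"sequon_count": count, "sequons_per_100_residues": density}


if __name__ == "__main__":
    # Toy sequence, not a real viral protein.
    toy = "NGSANPTNNTS"
    print(sequon_positions(toy))
    print(sequon_features(toy))
```

Features of this kind could, in principle, be combined with predicted secondary structure fractions and passed to a random forest classifier; that downstream step is not shown here because the abstract gives no detail on its configuration.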

Subject(s)

Medical Countermeasures; Viruses; Virus Attachment; Machine Learning; Glycosylation

Full text: Available
Collection: International databases
Database: MEDLINE
Main subject: Viruses / Medical Countermeasures
Type of study: Prognostic study / Randomized controlled trials
Language: English
Journal: PLoS One
Journal subject: Science / Medicine
Year: 2023
Document Type: Article
Affiliation country: Journal.pone.0281642
