
Large language models improve annotation of prokaryotic viral proteins.

Flamholz, Zachary N; Biller, Steven J; Kelly, Libusha.

Nat Microbiol ; 9(2): 537-549, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38287147

ABSTRACT

Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.

Subject(s)

Prokaryotic Cells , Viral Proteins , Viral Proteins/genetics , Genomics , Capsid Proteins/genetics , Metagenomics

H/ACA snRNP-dependent ribosome biogenesis regulates translation of polyglutamine proteins.

Breznak, Shane M; Peng, Yingshi; Deng, Limin; Kotb, Noor M; Flamholz, Zachary; Rapisarda, Ian T; Martin, Elliot T; LaBarge, Kara A; Fabris, Dan; Gavis, Elizabeth R; Rangan, Prashanth.

Sci Adv ; 9(25): eade5492, 2023 06 23.

Article in English | MEDLINE | ID: mdl-37343092

ABSTRACT

Stem cells in many systems, including Drosophila germline stem cells (GSCs), increase ribosome biogenesis and translation during terminal differentiation. Here, we show that the H/ACA small nuclear ribonucleoprotein (snRNP) complex that promotes pseudouridylation of ribosomal RNA (rRNA) and ribosome biogenesis is required for oocyte specification. Reducing ribosome levels during differentiation decreased the translation of a subset of messenger RNAs that are enriched for CAG trinucleotide repeats and encode polyglutamine-containing proteins, including differentiation factors such as RNA-binding Fox protein 1. Moreover, ribosomes were enriched at CAG repeats within transcripts during oogenesis. Increasing target of rapamycin (TOR) activity to elevate ribosome levels in H/ACA snRNP complex-depleted germlines suppressed the GSC differentiation defects, whereas germlines treated with the TOR inhibitor rapamycin had reduced levels of polyglutamine-containing proteins. Thus, ribosome biogenesis and ribosome levels can control stem cell differentiation via selective translation of CAG repeat-containing transcripts.

Subject(s)

Ribonucleoproteins, Small Nuclear , Ribosomes , Ribonucleoproteins, Small Nuclear/metabolism , Ribosomes/metabolism , RNA, Ribosomal , Proteins/metabolism , Sirolimus

Large language models improve annotation of viral proteins.

Flamholz, Zachary N; Biller, Steve J; Kelly, Libusha.

Res Sq ; 2023 May 02.

Article in English | MEDLINE | ID: mdl-37205395

ABSTRACT

Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence divergence in viral proteins. Here, we show that protein language model representations capture viral protein function beyond the limits of remote sequence homology by targeting two axes of viral sequence annotation: systematic labeling of protein families and function identification for biologic discovery. Protein language model representations capture protein functional properties specific to viruses and expand the annotated fraction of ocean virome viral protein sequences by 37%. Among unannotated viral protein families, we identify a novel DNA editing protein family that defines a new mobile element in marine picocyanobacteria. Protein language models thus significantly enhance remote homology detection of viral proteins and can be utilized to enable new biological discovery across diverse functional categories.

Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information.

Flamholz, Zachary N; Crane-Droesch, Andrew; Ungar, Lyle H; Weissman, Gary E.

J Biomed Inform ; 125: 103971, 2022 01.

Article in English | MEDLINE | ID: mdl-34920127

ABSTRACT

OBJECTIVE: Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings. MATERIALS AND METHODS: We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested embeddings in six clinically relevant tasks including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively. RESULTS: Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45) while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%) de-identified notes. Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS). DISCUSSION: Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks and so the optimal training corpus depends on intended use. CONCLUSION: Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.
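
NOTE: The abstract above reports mortality prediction with the scaled Brier score (SBS) but does not define it. The sketch below uses the common definition, SBS = 1 - BS / BS_ref, where BS is the mean squared error of the predicted probabilities and BS_ref is the Brier score of a reference model that always predicts the observed event rate. This is an assumption about the metric, not the authors' code, and the function names are hypothetical.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def scaled_brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Assumed definition: 1 - BS / BS_ref, with BS_ref from the event rate."""
    prevalence = sum(y_true) / len(y_true)
    # Reference model: predict the observed event rate for every case.
    reference = brier_score(y_true, [prevalence] * len(y_true))
    if reference == 0:
        raise ValueError("SBS is undefined when all outcomes are identical")
    return 1.0 - brier_score(y_true, y_prob) / reference
```

Under this definition a score of 1 means perfect predictions, 0 means no better than predicting the event rate, and negative values mean worse than that reference, which is consistent with a confidence interval such as -0.05 to 0.28 crossing zero.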

Subject(s)

Electronic Health Records , Publications , Humans , Natural Language Processing , PubMed , Reproducibility of Results

FAIRshake: Toolkit to Evaluate the FAIRness of Research Digital Resources.

Clarke, Daniel J B; Wang, Lily; Jones, Alex; Wojciechowicz, Megan L; Torre, Denis; Jagodnik, Kathleen M; Jenkins, Sherry L; McQuilton, Peter; Flamholz, Zachary; Silverstein, Moshe C; Schilder, Brian M; Robasky, Kimberly; Castillo, Claris; Idaszak, Ray; Ahalt, Stanley C; Williams, Jason; Schurer, Stephan; Cooper, Daniel J; de Miranda Azevedo, Ricardo; Klenk, Juergen A; Haendel, Melissa A; Nedzel, Jared; Avillach, Paul; Shimoyama, Mary E; Harris, Rayna M; Gamble, Meredith; Poten, Rudy; Charbonneau, Amanda L; Larkin, Jennie; Brown, C Titus; Bonazzi, Vivien R; Dumontier, Michel J; Sansone, Susanna-Assunta; Ma'ayan, Avi.

Cell Syst ; 9(5): 417-421, 2019 11 27.

Article in English | MEDLINE | ID: mdl-31677972

ABSTRACT

As more digital resources are produced by the research community, it is becoming increasingly important to harmonize and organize them for synergistic utilization. The findable, accessible, interoperable, and reusable (FAIR) guiding principles have prompted many stakeholders to consider strategies for tackling this challenge. The FAIRshake toolkit was developed to enable the establishment of community-driven FAIR metrics and rubrics paired with manual and automated FAIR assessments. FAIR assessments are visualized as an insignia that can be embedded within digital-resources-hosting websites. Using FAIRshake, a variety of biomedical digital resources were manually and automatically evaluated for their level of FAIRness.

Subject(s)

Information Dissemination/methods , Internet/trends , Online Systems/standards , Health Resources/standards , Humans

modEnrichr: a suite of gene set enrichment analysis tools for model organisms.

Kuleshov, Maxim V; Diaz, Jennifer E L; Flamholz, Zachary N; Keenan, Alexandra B; Lachmann, Alexander; Wojciechowicz, Megan L; Cagan, Ross L; Ma'ayan, Avi.

Nucleic Acids Res ; 47(W1): W183-W190, 2019 07 02.

Article in English | MEDLINE | ID: mdl-31069376

ABSTRACT

High-throughput experiments produce increasingly large datasets that are difficult to analyze and integrate. While most data integration approaches focus on aligning metadata, data integration can be achieved by abstracting experimental results into gene sets. Such gene sets can be made available for reuse through gene set enrichment analysis tools such as Enrichr. Enrichr currently only supports gene sets compiled from human and mouse, limiting accessibility for investigators who study other model organisms. modEnrichr is an expansion of Enrichr for four model organisms: fish, fly, worm and yeast. The gene set libraries within FishEnrichr, FlyEnrichr, WormEnrichr and YeastEnrichr are created from the Gene Ontology, mRNA expression profiles, GeneRIF, pathway databases, protein domain databases and other organism-specific resources. Additionally, libraries were created by predicting gene function from RNA-seq co-expression data processed uniformly from the Gene Expression Omnibus for each organism. The modEnrichr suite of tools provides the ability to convert gene lists across species using an ortholog conversion tool that automatically detects the species. For complex analyses, modEnrichr provides API access that enables submitting batch queries. In summary, modEnrichr leverages existing model organism databases and other resources to facilitate comprehensive hypothesis generation through data integration.

Subject(s)

Databases, Genetic , Gene Expression/genetics , Gene Library , Genomic Library , Software , Animals , Computational Biology , Gene Ontology , Humans , Metadata
