Search | VHL Regional Portal

Effective training of nanopore callers for epigenetic marks with limited labelled data.

Yao, Brian; Hsu, Chloe; Goldner, Gal; Michaeli, Yael; Ebenstein, Yuval; Listgarten, Jennifer.

Open Biol ; 14(6): 230449, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38862018

ABSTRACT

Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5-methylcytosine (5mC) and N6-methyladenine (6mA). These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with hidden Markov models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. This type of approach is expected to be useful for calling other base modifications such as 5-hydroxymethylcytosine and for the simultaneous calling of different modifications, settings in which complete training data are not likely to be available.

Subject(s)

5-Methylcytosine , DNA Methylation , Epigenesis, Genetic , Neural Networks, Computer , 5-Methylcytosine/analogs & derivatives , 5-Methylcytosine/chemistry , 5-Methylcytosine/metabolism , Nanopore Sequencing/methods , Nanopores , Humans , Markov Chains , DNA/chemistry , DNA/genetics
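
The abstract of this record argues that independent per-k-mer emission distributions cannot cover contexts missing from training, whereas shared parameters can. The toy sketch below illustrates only that reasoning. It is not Nanopolish, DeepSignal, or the paper's amortized-HMM: the data are synthetic, and the additive signal model, the 3-mer length, the held-out set, and every function and variable name are assumptions made for illustration.

```python
import itertools
import numpy as np

BASES = "ACGT"
K = 3  # toy k-mer length (assumption; real callers use longer contexts)


def one_hot_kmer(kmer):
    """Encode a k-mer as one indicator per (position, base)."""
    x = np.zeros(len(kmer) * len(BASES))
    for pos, base in enumerate(kmer):
        x[pos * len(BASES) + BASES.index(base)] = 1.0
    return x


rng = np.random.default_rng(0)
all_kmers = ["".join(p) for p in itertools.product(BASES, repeat=K)]

# Synthetic ground truth (assumption): each position adds to the mean signal.
true_weights = rng.normal(size=K * len(BASES))
true_mean = {k: float(one_hot_kmer(k) @ true_weights) for k in all_kmers}

# Incomplete training data: four k-mer contexts are never observed.
held_out = {"ACG", "TTA", "GAC", "CGT"}
train_kmers = [k for k in all_kmers if k not in held_out]

# Twenty noisy signal observations per training k-mer.
observations = {
    k: true_mean[k] + rng.normal(scale=0.05, size=20) for k in train_kmers
}

# Independent emissions: one separately estimated mean per seen k-mer.
emission_table = {k: float(obs.mean()) for k, obs in observations.items()}

# Shared parameters: a single weight vector fitted across all seen k-mers.
X_train = np.stack([one_hot_kmer(k) for k in train_kmers])
y_train = np.array([emission_table[k] for k in train_kmers])
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
shared_pred = {k: float(one_hot_kmer(k) @ weights) for k in held_out}

for k in sorted(held_out):
    # The table has no entry for an unseen k-mer; the shared model still predicts.
    print(k, "table:", emission_table.get(k),
          "shared:", round(shared_pred[k], 3),
          "true:", round(true_mean[k], 3))
```

The shared model recovers the held-out means here only because the synthetic signal was built to be additive across positions. Real nanopore signals need not be, which is one reason a more expressive model such as a DNN would be used in practice.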

Generative models for protein structures and sequences.

Hsu, Chloe; Fannjiang, Clara; Listgarten, Jennifer.

Nat Biotechnol ; 42(2): 196-199, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38361069

Subject(s)

Computational Biology , Models, Statistical

Learning protein fitness models from evolutionary and assay-labeled data.

Hsu, Chloe; Nisonoff, Hunter; Fannjiang, Clara; Listgarten, Jennifer.

Nat Biotechnol ; 40(7): 1114-1122, 2022 Jul.

Article in English | MEDLINE | ID: mdl-35039677

ABSTRACT

Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.

Subject(s)

Machine Learning , Proteins , Proteins/chemistry , Proteins/genetics
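
The combination approach described in the abstract of this record (ridge regression on site-specific amino acid features plus one probability density feature) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sequences, density scores, and fitness labels are made up, the same penalty is applied to every feature, and all function names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Site-specific features: one indicator per (position, amino acid)."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()


def build_features(seqs, density_scores):
    """Append a single density feature to each sequence's one-hot features."""
    onehots = np.stack([one_hot(s) for s in seqs])
    density = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([onehots, density])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    return X @ w + b


# Made-up toy data (assumption): four variants of a length-3 protein.
seqs = ["ACD", "ACE", "GCD", "GCE"]
# Hypothetical log-density scores from some evolutionary model.
density_scores = [-1.0, -2.0, -1.5, -4.0]
y = np.array([1.0, 0.5, 0.8, 0.1])  # made-up measured fitness labels

X = build_features(seqs, density_scores)
w, b = fit_ridge(X, y, alpha=1e-6)
print(predict(X, w, b))
```

Per the abstract, the density feature can come from any evolutionary density model, so `density_scores` stands in for whatever such a model would output. In practice `alpha` would be chosen by cross-validation on the labeled data rather than fixed as here.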
