Search | VHL Regional Portal

1.

Author Correction: Predicting the efficiency of prime editing guide RNAs in human cells.

Kim, Hui Kwon; Yu, Goosang; Park, Jinman; Min, Seonwoo; Lee, Sungtae; Yoon, Sungroh; Kim, Hyongbum Henry.

Nat Biotechnol ; 42(3): 529, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38332117

2.

Mass spectra prediction with structural motif-based graph neural networks.

Park, Jiwon; Jo, Jeonghee; Yoon, Sungroh.

Sci Rep ; 14(1): 1400, 2024 Jan 16.

Article in English | MEDLINE | ID: mdl-38228685

ABSTRACT

Mass spectra, which are agglomerations of ionized fragments from targeted molecules, play a crucial role across various fields for the identification of molecular structures. A prevalent analysis method involves spectral library searches, where unknown spectra are cross-referenced with a database. The effectiveness of such search-based approaches, however, is restricted by the scope of the existing mass spectra database, underscoring the need to expand the database via mass spectra prediction. In this research, we propose the Motif-based Mass Spectrum prediction Network (MoMS-Net), a GNN-based architecture to predict the mass spectra pattern utilizing the structural motif information of the molecule. MoMS-Net considers both a molecule and its substructures as a graph form, which facilitates the incorporation of long-range dependencies while using less memory compared to the graph transformer model. We evaluated our model over various types of mass spectra and showed the validity and superiority over the conventional models.

3.

GeoT: A Geometry-Aware Transformer for Reliable Molecular Property Prediction and Chemically Interpretable Representation Learning.

Kwak, Bumju; Park, Jiwon; Kang, Taewon; Jo, Jeonghee; Lee, Byunghan; Yoon, Sungroh.

ACS Omega ; 8(42): 39759-39769, 2023 Oct 24.

Article in English | MEDLINE | ID: mdl-37901490

ABSTRACT

In recent years, molecular representation learning has emerged as a key area of focus in various chemical tasks. However, many existing models fail to fully consider the geometric information on molecular structures, resulting in less intuitive representations. Moreover, the widely used message passing mechanism is limited to providing the interpretation of experimental results from a chemical perspective. To address these challenges, we introduce a novel transformer-based framework for molecular representation learning, named the geometry-aware transformer (GeoT). The GeoT learns molecular graph structures through attention-based mechanisms specifically designed to offer reliable interpretability as well as molecular property prediction. Consequently, the GeoT can generate attention maps of the interatomic relationships associated with training objectives. In addition, the GeoT demonstrates performance comparable to that of MPNN-based models while achieving reduced computational complexity. Our comprehensive experiments, including an empirical simulation, reveal that the GeoT effectively learns chemical insights into molecular structures, bridging the gap between artificial intelligence and molecular sciences.

4.

Real-world prediction of preclinical Alzheimer's disease with a deep generative model.

Hwang, Uiwon; Kim, Sung-Woo; Jung, Dahuin; Kim, SeungWook; Lee, Hyejoo; Seo, Sang Won; Seong, Joon-Kyung; Yoon, Sungroh.

Artif Intell Med ; 144: 102654, 2023 10.

Article in English | MEDLINE | ID: mdl-37783547

ABSTRACT

Amyloid positivity is an early indicator of Alzheimer's disease and is necessary to determine the disease. In this study, a deep generative model is utilized to predict the amyloid positivity of cognitively normal individuals using proxy measures, such as structural MRI scans, demographic variables, and cognitive scores, instead of invasive direct measurements. Through its remarkable efficacy in handling imperfect datasets caused by missing data or labels, and imbalanced classes, the model outperforms previous studies and widely used machine learning approaches with an AUROC of 0.8609. Furthermore, this study illuminates the model's adaptability to diverse clinical scenarios, even when feature sets or diagnostic criteria differ from the training data. We identify the brain regions and variables that contribute most to classification, including the lateral occipital lobes, posterior temporal lobe, and APOE ε4 allele. Taking advantage of deep generative models, our approach can not only provide inexpensive, non-invasive, and accurate diagnostics for preclinical Alzheimer's disease, but also meet real-world requirements for clinical translation of a deep learning model, including transferability and interpretability.

Subject(s)

Alzheimer Disease , Cognitive Dysfunction , Humans , Alzheimer Disease/diagnostic imaging , Alzheimer Disease/genetics , Cognitive Dysfunction/diagnosis , Brain/diagnostic imaging , Magnetic Resonance Imaging , Machine Learning

5.

AIVariant: a deep learning-based somatic variant detector for highly contaminated tumor samples.

Jeon, Hyeonseong; Ahn, Junhak; Na, Byunggook; Hong, Soona; Sael, Lee; Kim, Sun; Yoon, Sungroh; Baek, Daehyun.

Exp Mol Med ; 55(8): 1734-1742, 2023 08.

Article in English | MEDLINE | ID: mdl-37524869

ABSTRACT

The detection of somatic DNA variants in tumor samples with low tumor purity or sequencing depth remains a daunting challenge despite numerous attempts to address this problem. In this study, we constructed a substantially extended set of actual positive variants originating from a wide range of tumor purities and sequencing depths, as well as actual negative variants derived from sequencer-specific sequencing errors. A deep learning model named AIVariant, trained on this extended dataset, outperforms previously reported methods when tested under various tumor purities and sequencing depths, especially low tumor purity and sequencing depth.

Subject(s)

Deep Learning , Neoplasms , Humans , Gene Frequency , Computational Biology/methods , Algorithms , Neoplasms/genetics , Neoplasms/diagnosis , Mutation

6.

Deep learning-based prediction for significant coronary artery stenosis on coronary computed tomography angiography in asymptomatic populations.

Lee, Heesun; Kang, Bong Gyun; Jo, Jeonghee; Park, Hyo Eun; Yoon, Sungroh; Choi, Su-Yeon; Kim, Min Joo.

Front Cardiovasc Med ; 10: 1167468, 2023.

Article in English | MEDLINE | ID: mdl-37416918

ABSTRACT

Background: Although coronary computed tomography angiography (CCTA) is currently utilized as the frontline test to accurately diagnose coronary artery disease (CAD) in clinical practice, there are still debates regarding its use as a screening tool for the asymptomatic population. Using deep learning (DL), we sought to develop a prediction model for significant coronary artery stenosis on CCTA and identify the individuals who would benefit from undergoing CCTA among apparently healthy asymptomatic adults. Methods: We retrospectively reviewed 11,180 individuals who underwent CCTA as part of routine health check-ups between 2012 and 2019. The main outcome was the presence of coronary artery stenosis of ≥70% on CCTA. We developed a prediction model using machine learning (ML), including DL. Its performance was compared with pretest probabilities, including the pooled cohort equation (PCE), CAD consortium, and updated Diamond-Forrester (UDF) scores. Results: In the cohort of 11,180 apparently healthy asymptomatic individuals (mean age 56.1 years; men 69.8%), 516 (4.6%) presented with significant coronary artery stenosis on CCTA. Among the ML methods employed, a neural network with multi-task learning (19 selected features), one of the DL methods, was selected due to its superior performance, with an area under the curve (AUC) of 0.782 and a high diagnostic accuracy of 71.6%. Our DL-based model demonstrated a better prediction than the PCE (AUC, 0.719), CAD consortium score (AUC, 0.696), and UDF score (AUC, 0.705). Age, sex, HbA1c, and HDL cholesterol were highly ranked features. Personal education and monthly income levels were also included as important features of the model. Conclusion: We successfully developed the neural network with multi-task learning for the detection of CCTA-derived stenosis of ≥70% in asymptomatic populations. Our findings suggest that this model may provide more precise indications for the use of CCTA as a screening tool to identify individuals at a higher risk, even in asymptomatic populations, in clinical practice.

7.

Improving generalization performance of electrocardiogram classification models.

Han, Hyeongrok; Park, Seongjae; Min, Seonwoo; Kim, Eunji; Kim, HyunGi; Park, Sangha; Kim, Jin-Kook; Park, Junsang; An, Junho; Lee, Kwanglo; Jeong, Wonsun; Chon, Sangil; Ha, Kwon-Woo; Han, Myungkyu; Choi, Hyun-Soo; Yoon, Sungroh.

Physiol Meas ; 44(5), 2023 05 10.

Article in English | MEDLINE | ID: mdl-36638544

ABSTRACT

Objective. Recently, many electrocardiogram (ECG) classification algorithms using deep learning have been proposed. Because the ECG characteristics vary across datasets owing to variations in factors such as recorded hospitals and the race of participants, the model needs to have a consistently high generalization performance across datasets. In this study, as part of the PhysioNet/Computing in Cardiology Challenge (PhysioNet Challenge) 2021, we present a model to classify cardiac abnormalities from the 12- and the reduced-lead ECGs. Approach. To improve the generalization performance of our earlier proposed model, we adopted a practical suite of techniques, i.e. constant-weighted cross-entropy loss, additional features, mixup augmentation, squeeze/excitation block, and OneCycle learning rate scheduler. We evaluated its generalization performance using the leave-one-dataset-out cross-validation setting. Furthermore, we demonstrate that the knowledge distillation from the 12-lead and large-teacher models improved the performance of the reduced-lead and small-student models. Main results. With the proposed model, our DSAIL SNU team has received Challenge scores of 0.55, 0.58, 0.58, 0.57, and 0.57 (ranked 2nd, 1st, 1st, 2nd, and 2nd of 39 teams) for the 12-, 6-, 4-, 3-, and 2-lead versions of the hidden test set, respectively. Significance. The proposed model achieved a higher generalization performance over six different hidden test datasets than the one we submitted to the PhysioNet Challenge 2020.
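Of the techniques named in this abstract, mixup augmentation is the simplest to illustrate. The following is a minimal sketch of the general mixup idea only, not the authors' implementation: the function name `mixup`, the `alpha` default, and the toy inputs in the test are all assumptions made for illustration. Two training examples and their one-hot labels are blended with a weight drawn from a Beta(alpha, alpha) distribution.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (signal, one-hot label) pairs with one random weight.

    alpha is a hypothetical default; the abstract does not state the value used.
    """
    # Mixing weight in [0, 1], drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the two input signals, sample by sample.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    # The labels are mixed with the same weight, giving a soft target.
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Because the same weight is applied to inputs and labels, the mixed label remains a valid probability vector whenever both original labels are.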

Subject(s)

Atrial Fibrillation , Humans , Algorithms , Electrocardiography/methods , Entropy

8.

Deep contrastive learning of molecular conformation for efficient property prediction.

Park, Yang Jeong; Kim, HyunGi; Jo, Jeonghee; Yoon, Sungroh.

Nat Comput Sci ; 3(12): 1015-1022, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38177719

ABSTRACT

Data-driven deep learning algorithms provide accurate prediction of high-level quantum-chemical molecular properties. However, their inputs must be constrained to the same quantum-chemical level of geometric relaxation as the training dataset, limiting their flexibility. Adopting alternative cost-effective conformation generative methods introduces domain-shift problems, deteriorating prediction accuracy. Here we propose a deep contrastive learning-based domain-adaptation method called Local Atomic environment Contrastive Learning (LACL). LACL learns to alleviate the disparities in distribution between the two geometric conformations by comparing different conformation-generation methods. We found that LACL forms a domain-agnostic latent space that encapsulates the semantics of an atom's local atomic environment. LACL achieves quantum-chemical accuracy while circumventing the geometric relaxation bottleneck and could enable future application scenarios such as inverse molecular engineering and large-scale screening. Our approach is also generalizable from small organic molecules to long chains of biological and pharmacological molecules.

Subject(s)

Algorithms , Engineering , Molecular Conformation , Relaxation , Semantics

9.

Brain-inspired Predictive Coding Improves the Performance of Machine Challenging Tasks.

Lee, Jangho; Jo, Jeonghee; Lee, Byounghwa; Lee, Jung-Hoon; Yoon, Sungroh.

Front Comput Neurosci ; 16: 1062678, 2022.

Article in English | MEDLINE | ID: mdl-36465966

ABSTRACT

Backpropagation has been regarded as the most favorable algorithm for training artificial neural networks. However, it has been criticized for its biological implausibility because its learning mechanism contradicts the human brain. Although backpropagation has achieved super-human performance in various machine learning applications, it often shows limited performance in specific tasks. We collectively referred to such tasks as machine-challenging tasks (MCTs) and aimed to investigate methods to enhance machine learning for MCTs. Specifically, we start with a natural question: Can a learning mechanism that mimics the human brain lead to the improvement of MCT performances? We hypothesized that a learning mechanism replicating the human brain is effective for tasks where machine intelligence is difficult. Multiple experiments corresponding to specific types of MCTs where machine intelligence has room to improve performance were performed using predictive coding, a more biologically plausible learning algorithm than backpropagation. This study regarded incremental learning, long-tailed, and few-shot recognition as representative MCTs. With extensive experiments, we examined the effectiveness of predictive coding that robustly outperformed backpropagation-trained networks for the MCTs. We demonstrated that predictive coding-based incremental learning alleviates the effect of catastrophic forgetting. Next, predictive coding-based learning mitigates the classification bias in long-tailed recognition. Finally, we verified that the network trained with predictive coding could correctly predict corresponding targets with few samples. We analyzed the experimental result by drawing analogies between the properties of predictive coding networks and those of the human brain and discussing the potential of predictive coding networks in general machine learning.

10.

Silent Speech Recognition with Strain Sensors and Deep Learning Analysis of Directional Facial Muscle Movement.

Yoo, Hyunjun; Kim, Eunji; Chung, Jong Won; Cho, Hyeon; Jeong, Sujin; Kim, Heeseung; Jang, Dongju; Kim, Hayun; Yoon, Jinsu; Lee, Gae Hwang; Kang, Hyunbum; Kim, Joo-Young; Yun, Youngjun; Yoon, Sungroh; Hong, Yongtaek.

ACS Appl Mater Interfaces ; 14(48): 54157-54169, 2022 Dec 07.

Article in English | MEDLINE | ID: mdl-36413961

ABSTRACT

Silent communication based on biosignals from facial muscle requires accurate detection of its directional movement and thus optimally positioning minimum numbers of sensors for higher accuracy of speech recognition with a minimal person-to-person variation. So far, previous approaches based on electromyogram or pressure sensors are ineffective in detecting the directional movement of facial muscles. Therefore, in this study, high-performance strain sensors are used for separately detecting x- and y-axis strain. Directional strain distribution data of facial muscle is obtained by applying three-dimensional digital image correlation. Deep learning analysis is utilized for identifying optimal positions of directional strain sensors. The recognition system with four directional strain sensors conformably attached to the face shows silent vowel recognition with 85.24% accuracy and even 76.95% for completely nonobserved subjects. These results show that detection of the directional strain distribution at the optimal facial points will be the key enabling technology for highly accurate silent speech recognition.

Subject(s)

Deep Learning , Speech Perception , Humans , Facial Muscles

11.

Anti-Adversarially Manipulated Attributions for Weakly Supervised Semantic Segmentation and Object Localization.

Lee, Jungbeom; Kim, Eunji; Mok, Jisoo; Yoon, Sungroh.

IEEE Trans Pattern Anal Mach Intell ; PP, 2022 Apr 12.

Article in English | MEDLINE | ID: mdl-35412975

ABSTRACT

Obtaining accurate pixel-level localization from class labels is a crucial process in weakly supervised semantic segmentation and object localization. Attribution maps from a trained classifier are widely used to provide pixel-level localization, but their focus tends to be restricted to a small discriminative region of the target object. AdvCAM is an attribution map of an image that is manipulated to increase the classification score produced by a classifier. This manipulation is realized in an anti-adversarial manner, so that the original image is perturbed along pixel gradients in the opposite directions from those used in an adversarial attack. This process enhances non-discriminative yet class-relevant features, which used to make an insufficient contribution to previous attribution maps, so that the resulting AdvCAM identifies more regions of the target object. In addition, we introduce a new regularization procedure that inhibits the incorrect attribution of regions unrelated to the target object and the excessive concentration of attributions on a small region of the target object. In weakly and semi-supervised semantic segmentation, our method achieved a new state-of-the-art performance on both the PASCAL VOC and MS COCO datasets. In weakly supervised object localization, it achieved a new state-of-the-art performance on the CUB-200-2011 and ImageNet-1K datasets.

12.

Flexible Dual-Branched Message-Passing Neural Network for a Molecular Property Prediction.

Jo, Jeonghee; Kwak, Bumju; Lee, Byunghan; Yoon, Sungroh.

ACS Omega ; 7(5): 4234-4244, 2022 Feb 08.

Article in English | MEDLINE | ID: mdl-35155916

ABSTRACT

A molecule is a complex of heterogeneous components, and the spatial arrangements of these components determine the whole molecular properties and characteristics. With the advent of deep learning in computational chemistry, several studies have focused on how to predict molecular properties based on molecular configurations. A message-passing neural network provides an effective framework for capturing molecular geometric features with the perspective of a molecule as a graph. However, most of these studies assumed that all heterogeneous molecular features, such as atomic charge, bond length, or other geometric features, always contribute equivalently to the target prediction, regardless of the task type. In this study, we propose a dual-branched neural network for molecular property prediction based on both the message-passing framework and standard multilayer perceptron neural networks. Our model learns heterogeneous molecular features with different scales, which are trained flexibly according to each prediction target. In addition, we introduce a discrete branch to learn single-atom features without local aggregation, apart from message-passing steps. We verify that this novel structure can improve the model performance. The proposed model outperforms other recent models with sparser representations. Our experimental results indicate that, in the chemical property prediction tasks, the diverse chemical nature of targets should be carefully considered for both model performance and generalizability. Finally, we provide the intuitive analysis between the experimental results and the chemical meaning of the target.

13.

Prediction of clinically significant prostate cancer using polygenic risk models in Asians.

Song, Sang Hun; Kim, Eunae; Woo, Eunjin; Kwon, Eunkyung; Yoon, Sungroh; Kim, Jung Kwon; Lee, Hakmin; Oh, Jong Jin; Lee, Sangchul; Hong, Sung Kyu; Byun, Seok-Soo.

Investig Clin Urol ; 63(1): 42-52, 2022 01.

Article in English | MEDLINE | ID: mdl-34983122

ABSTRACT

PURPOSE: To develop and evaluate the performance of a polygenic risk score (PRS) constructed in a Korean male population to predict clinically significant prostate cancer (csPCa). MATERIALS AND METHODS: Total 2,702 PCa samples and 7,485 controls were used to discover csPCa susceptible single nucleotide polymorphisms (SNPs). Males with biopsy-proven or post-radical prostatectomy Gleason score 7 or higher were included for analysis. After genotype imputation for quality control, logistic regression models were applied to test association and calculate effect size. Extracted candidate SNPs were further tested to compare predictive performance according to number of SNPs included in the PRS. The best-fit model was validated in an independent cohort of 311 cases and 822 controls. RESULTS: Of the 83 candidate SNPs with significant PCa association reported in previous literature, rs72725879 located in PRNCR1 showed the highest significance for PCa risk (odds ratio, 0.597; 95% confidence interval [CI], 0.555-0.641; p=4.3×10^-45). Thirty-two SNPs within 26 distinct loci were further selected for PRS construction. Best performance was found with the top 29 SNPs, with AUC found to be 0.700 (95% CI, 0.667-0.734). Males with very-high PRS (above the 95th percentile) had a 4.92-fold increased risk for csPCa. CONCLUSIONS: Ethnic-specific PRS was developed and validated in Korean males to predict csPCa susceptibility using the largest csPCa sample size in Asia. PRS can be a potential biomarker to predict individual risk. Future multi-ethnic trials are required to further validate our results.
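The abstract does not spell out how the PRS is computed. A common convention, assumed here rather than taken from the paper, is a sum of each person's risk-allele counts weighted by the log odds ratio of the corresponding SNP. The sketch below shows that convention; the function name `polygenic_risk_score` and all numeric inputs are hypothetical.

```python
import math

def polygenic_risk_score(dosages, odds_ratios):
    """Weighted sum of allele counts (0, 1, or 2 per SNP).

    Each weight is the natural log of the SNP's odds ratio, so an
    odds ratio below 1 lowers the score and one above 1 raises it.
    """
    if len(dosages) != len(odds_ratios):
        raise ValueError("one odds ratio is required per SNP")
    return sum(d * math.log(o) for d, o in zip(dosages, odds_ratios))
```

Under this convention, an individual with no copies of any scored allele has a score of exactly zero, and individuals are usually compared by percentile within a reference population rather than by raw score.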

Subject(s)

Multifactorial Inheritance , Prostatic Neoplasms/genetics , Aged , Asian People , Cohort Studies , Genetic Predisposition to Disease , Humans , Male , Middle Aged , Polymorphism, Single Nucleotide , Risk Factors

14.

DNA Privacy: Analyzing Malicious DNA Sequences Using Deep Neural Networks.

Bae, Ho; Min, Seonwoo; Choi, Hyun-Soo; Yoon, Sungroh.

IEEE/ACM Trans Comput Biol Bioinform ; 19(2): 888-898, 2022.

Article in English | MEDLINE | ID: mdl-32809941

ABSTRACT

Recent advances in next-generation sequencing technologies have led to the successful insertion of video information into DNA using synthesized oligonucleotides. Several attempts have been made to embed larger data into living organisms. This process of embedding messages is called steganography and it is used for hiding and watermarking data to protect intellectual property. In contrast, steganalysis is a group of algorithms that serves to detect hidden information from covert media. Various methods have been developed to detect messages embedded in conventional covert channels. However, conventional steganalysis algorithms are mostly limited to common covert media. Most common detection approaches, such as frequency analysis-based methods, often overlook important signals when directly applied to DNA steganography and are easily bypassed by recently developed steganography techniques. To address the limitations of conventional approaches, a sequence-learning-based malicious DNA sequence analysis method based on neural networks has been proposed. The proposed method learns intrinsic distributions and identifies distribution variations using a classification score to predict whether a sequence is to be a coding or non-coding sequence. Based on our experiments and results, we have developed a framework to safeguard security against DNA steganography.

Subject(s)

Neural Networks, Computer , Privacy , Algorithms , Base Sequence , DNA/genetics

15.

Imbalanced Data Classification via Cooperative Interaction Between Classifier and Generator.

Choi, Hyun-Soo; Jung, Dahuin; Kim, Siwon; Yoon, Sungroh.

IEEE Trans Neural Netw Learn Syst ; 33(8): 3343-3356, 2022 Aug.

Article in English | MEDLINE | ID: mdl-33531305

ABSTRACT

Learning classifiers with imbalanced data can be strongly biased toward the majority class. To address this issue, several methods have been proposed using generative adversarial networks (GANs). Existing GAN-based methods, however, do not effectively utilize the relationship between a classifier and a generator. This article proposes a novel three-player structure consisting of a discriminator, a generator, and a classifier, along with decision boundary regularization. Our method is distinctive in which the generator is trained in cooperation with the classifier to provide minority samples that gradually expand the minority decision region, improving performance for imbalanced data classification. The proposed method outperforms the existing methods on real data sets as well as synthetic imbalanced data sets.

16.

TargetNet: functional microRNA target prediction with deep neural networks.

Min, Seonwoo; Lee, Byunghan; Yoon, Sungroh.

Bioinformatics ; 38(3): 671-677, 2022 01 12.

Article in English | MEDLINE | ID: mdl-34677573

ABSTRACT

MOTIVATION: MicroRNAs (miRNAs) play pivotal roles in gene expression regulation by binding to target sites of messenger RNAs (mRNAs). While identifying functional targets of miRNAs is of utmost importance, their prediction remains a great challenge. Previous computational algorithms have major limitations. They use conservative candidate target site (CTS) selection criteria mainly focusing on canonical site types, rely on laborious and time-consuming manual feature extraction, and do not fully capitalize on the information underlying miRNA-CTS interactions. RESULTS: In this article, we introduce TargetNet, a novel deep learning-based algorithm for functional miRNA target prediction. To address the limitations of previous approaches, TargetNet has three key components: (i) relaxed CTS selection criteria accommodating irregularities in the seed region, (ii) a novel miRNA-CTS sequence encoding scheme incorporating extended seed region alignments and (iii) a deep residual network-based prediction model. The proposed model was trained with miRNA-CTS pair datasets and evaluated with miRNA-mRNA pair datasets. TargetNet advances the previous state-of-the-art algorithms used in functional miRNA target classification. Furthermore, it demonstrates great potential for distinguishing high-functional miRNA targets. AVAILABILITY AND IMPLEMENTATION: The codes and pre-trained models are available at https://github.com/mswzeus/TargetNet.

Subject(s)

MicroRNAs , MicroRNAs/genetics , MicroRNAs/metabolism , Neural Networks, Computer , Algorithms , RNA, Messenger/genetics , Gene Expression Regulation , Computational Biology

17.

Generation of a more efficient prime editor 2 by addition of the Rad51 DNA-binding domain.

Song, Myungjae; Lim, Jung Min; Min, Seonwoo; Oh, Jeong-Seok; Kim, Dong Young; Woo, Jae-Sung; Nishimasu, Hiroshi; Cho, Sung-Rae; Yoon, Sungroh; Kim, Hyongbum Henry.

Nat Commun ; 12(1): 5617, 2021 09 23.

Article in English | MEDLINE | ID: mdl-34556671

ABSTRACT

Although prime editing is a promising genome editing method, the efficiency of prime editor 2 (PE2) is often insufficient. Here we generate a more efficient variant of PE2, named hyPE2, by adding the Rad51 DNA-binding domain. When tested at endogenous sites, hyPE2 shows a median of 1.5- or 1.4-fold (range, 0.99- to 2.6-fold) higher efficiencies than PE2; furthermore, at sites where PE2-induced prime editing is very inefficient (efficiency < 1%), hyPE2 enables prime editing with efficiencies ranging from 1.1% to 2.9% at up to 34% of target sequences, potentially facilitating prime editing applications.

Subject(s)

Algorithms , CRISPR-Cas Systems , DNA/metabolism , Gene Editing/methods , Models, Genetic , Rad51 Recombinase/metabolism , Amino Acid Sequence , Binding Sites/genetics , DNA/genetics , HCT116 Cells , HEK293 Cells , Humans , Rad51 Recombinase/genetics , Reproducibility of Results

18.

Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets.

Dallago, Christian; Schütze, Konstantin; Heinzinger, Michael; Olenyi, Tobias; Littmann, Maria; Lu, Amy X; Yang, Kevin K; Min, Seonwoo; Yoon, Sungroh; Morton, James T; Rost, Burkhard.

Curr Protoc ; 1(5): e113, 2021 May.

Article in English | MEDLINE | ID: mdl-33961736

ABSTRACT

Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time with respect to previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC. 
The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline.

Subject(s)

Artificial Intelligence , Deep Learning , Machine Learning , Natural Language Processing , Proteins

19.

Protein transfer learning improves identification of heat shock protein families.

Min, Seonwoo; Kim, HyunGi; Lee, Byunghan; Yoon, Sungroh.

PLoS One ; 16(5): e0251865, 2021.

Article in English | MEDLINE | ID: mdl-34003870

ABSTRACT

Heat shock proteins (HSPs) play a pivotal role as molecular chaperones against unfavorable conditions. Although HSPs are of great importance, their computational identification remains a significant challenge. Previous studies have two major limitations. First, they relied heavily on amino acid composition features, which inevitably limited their prediction performance. Second, their prediction performance was overestimated because of the independent two-stage evaluations and train-test data redundancy. To overcome these limitations, we introduce two novel deep learning algorithms: (1) time-efficient DeepHSP and (2) high-performance DeeperHSP. We propose a convolutional neural network (CNN)-based DeepHSP that classifies both non-HSPs and six HSP families simultaneously. It outperforms state-of-the-art algorithms, despite taking 14-15 times less time for both training and inference. We further improve the performance of DeepHSP by taking advantage of protein transfer learning. While DeepHSP is trained on raw protein sequences, DeeperHSP is trained on top of pre-trained protein representations. Therefore, DeeperHSP remarkably outperforms state-of-the-art algorithms increasing F1 scores in both cross-validation and independent test experiments by 20% and 10%, respectively. We envision that the proposed algorithms can provide a proteome-wide prediction of HSPs and help in various downstream analyses for pathology and clinical research.

Subject(s)

Heat-Shock Proteins/genetics , Machine Learning , Molecular Chaperones/genetics , Neural Networks, Computer , Algorithms , Amino Acid Sequence/genetics , Computational Biology/trends , Deep Learning , Heat-Shock Proteins/isolation & purification , Humans , Protein Transport/genetics

20.

Recording of elapsed time and temporal information about biological events using Cas9.

Park, Jihye; Lim, Jung Min; Jung, Inkyung; Heo, Seok-Jae; Park, Jinman; Chang, Yoojin; Kim, Hui Kwon; Jung, Dongmin; Yu, Ji Hea; Min, Seonwoo; Yoon, Sungroh; Cho, Sung-Rae; Park, Taeyoung; Kim, Hyongbum Henry.

Cell ; 184(4): 1047-1063.e23, 2021 02 18.

Article in English | MEDLINE | ID: mdl-33539780

ABSTRACT

DNA has not been utilized to record temporal information, although DNA has been used to record biological information and to compute mathematical problems. Here, we found that indel generation by Cas9 and guide RNA can occur at steady rates, in contrast to typical dynamic biological reactions, and the accumulated indel frequency can be a function of time. By measuring indel frequencies, we developed a method for recording and measuring absolute time periods over hours to weeks in mammalian cells. These time-recordings were conducted in several cell types, with different promoters and delivery vectors for Cas9, and in both cultured cells and cells of living mice. As applications, we recorded the duration of chemical exposure and the lengths of elapsed time since the onset of biological events (e.g., heat exposure and inflammation). We propose that our systems could serve as synthetic "DNA clocks."

Subject(s)

CRISPR-Associated Protein 9/metabolism , Animals , Base Sequence , Cellular Microenvironment , Computer Simulation , HEK293 Cells , Half-Life , Humans , INDEL Mutation/genetics , Inflammation/pathology , Integrases/metabolism , Male , Mice, Nude , Promoter Regions, Genetic/genetics , RNA, Guide, Kinetoplastida/genetics , Reproducibility of Results , Time Factors
