1.
Comput Struct Biotechnol J ; 23: 1786-1795, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38707535

ABSTRACT

The rapid growth of spatially resolved transcriptomics technology provides new perspectives on spatial tissue architecture. Deep learning has been widely applied to derive useful representations for spatial transcriptome analysis. However, effectively integrating spatial multi-modal data remains challenging. Here, we present ConGcR, a contrastive learning-based model for integrating gene expression, spatial location, and tissue morphology for data representation and spatial tissue architecture identification. Graph convolution and ResNet were used as encoders for gene expression with spatial location and histological image inputs, respectively. We further enhanced ConGcR with a graph auto-encoder as ConGaR to better model spatially embedded representations. We validated our models using 16 human brains, four chicken hearts, eight breast tumors, and 30 human lung spatial transcriptomics samples. The results showed that our models generated more effective embeddings for obtaining tissue architectures closer to the ground truth than other methods. Overall, our models can not only improve the accuracy of tissue architecture identification but may also provide valuable insights and effective data representation for other tasks in spatial transcriptome analyses.

2.
Nat Rev Bioeng ; 2(2): 136-154, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38576453

ABSTRACT

Denoising diffusion models embody a type of generative artificial intelligence that can be applied in computer vision, natural language processing and bioinformatics. In this Review, we introduce the key concepts and theoretical foundations of three diffusion modelling frameworks (denoising diffusion probabilistic models, noise-conditioned scoring networks and score stochastic differential equations). We then explore their applications in bioinformatics and computational biology, including protein design and generation, drug and small-molecule design, protein-ligand interaction modelling, cryo-electron microscopy image data analysis and single-cell data analysis. Finally, we highlight open-source diffusion model tools and consider the future applications of diffusion models in bioinformatics.

3.
Res Sq ; 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38559017

ABSTRACT

Peptide design, with the goal of identifying peptides possessing unique biological properties, stands as a crucial challenge in peptide-based drug discovery. While traditional and computational methods have made significant strides, they often encounter hurdles due to the complexities and costs of laboratory experiments. Recent advancements in deep learning and Bayesian Optimization have paved the way for innovative research in this domain. In this context, our study presents a novel approach that effectively combines protein structure prediction with Bayesian Optimization for peptide design. By applying carefully designed objective functions, we guide and enhance the optimization trajectory for new peptide sequences. Benchmarked against multiple native structures, our methodology is tailored to generate new peptides to their optimal potential biological properties.

4.
bioRxiv ; 2024 Jan 28.
Article in English | MEDLINE | ID: mdl-37609352

ABSTRACT

Large protein language models (PLMs) present excellent potential to reshape protein research by encoding amino acid sequences into mathematically and biologically meaningful embeddings. However, the lack of crucial 3D structure information in most PLMs restricts their prediction capacity in various applications, especially those heavily depending on 3D structures. To address this issue, we introduce S-PLM, a 3D structure-aware PLM utilizing multi-view contrastive learning to align the sequence and 3D structure of a protein in a coordinate space. S-PLM applies Swin-Transformer on AlphaFold-predicted protein structures to embed the structural information and fuses it into sequence-based embedding from ESM2. Additionally, we provide a library of lightweight tuning tools to adapt S-PLM for diverse protein property prediction tasks. Our results demonstrate S-PLM's superior performance over sequence-only PLMs, achieving competitiveness in protein function prediction compared to state-of-the-art methods employing both sequence and structure inputs.
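
The idea of aligning two views of the same protein with contrastive learning can be illustrated with a generic symmetric InfoNCE-style loss. This is only a minimal sketch under stated assumptions: it is not the S-PLM training objective, the function names and the temperature value are hypothetical, and the embeddings are assumed to be non-zero vectors supplied as plain Python lists.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def diagonal_cross_entropy(sim: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy where row i should pick column i as its match."""
    total = 0.0
    for i, row in enumerate(sim):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[i]
    return total / len(sim)


def contrastive_alignment_loss(seq_emb, struct_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss; pair i of each view is the positive."""
    sim = [[cosine(s, t) / temperature for t in struct_emb] for s in seq_emb]
    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(sim_transposed))


# Hypothetical toy embeddings: two proteins, two dimensions.
seq_view = [[1.0, 0.0], [0.0, 1.0]]
aligned_struct_view = [[1.0, 0.0], [0.0, 1.0]]
shuffled_struct_view = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_alignment_loss(seq_view, aligned_struct_view)
loss_shuffled = contrastive_alignment_loss(seq_view, shuffled_struct_view)
print(loss_aligned < loss_shuffled)
```

In this toy case the loss is small when matching pairs point in the same direction and large when the pairing is shuffled, which is the behaviour a contrastive alignment objective is meant to reward.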

5.
Nucleic Acids Res ; 52(D1): D426-D433, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37933852

ABSTRACT

The DescribePROT database of amino acid-level descriptors of protein structures and functions has been substantially expanded since its release in 2020. This expansion includes a substantial increase in the size, scope, and quality of the underlying data, the addition of experimental structural information, the inclusion of new data download options, and an upgraded graphical interface. DescribePROT currently covers 19 structural and functional descriptors for proteins in 273 reference proteomes generated by 11 accurate and complementary predictive tools. Users can search our resource in multiple ways, interact with the data using the graphical interface, and download data at various scales including individual proteins, entire proteomes, and the whole database. The annotations in DescribePROT are useful for a broad spectrum of studies, including investigations of protein structure and function, development and validation of predictive tools, and efforts to understand the molecular underpinnings of diseases and to develop therapeutics. DescribePROT can be freely accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.


Subject(s)
Amino Acids , Proteome , Proteome/chemistry , Databases, Factual
6.
Molecules ; 28(19)2023 Sep 25.
Article in English | MEDLINE | ID: mdl-37836636

ABSTRACT

Interactions between proteins and ions are essential for various biological functions like structural stability, metabolism, and signal transport. Given that more than half of all proteins bind to ions, it is becoming crucial to identify ion-binding sites. The accurate identification of protein-ion binding sites helps us to understand proteins' biological functions and plays a significant role in drug discovery. While several computational approaches have been proposed, this remains a challenging problem due to the small size and high versatility of metals and acid radicals. In this study, we propose IonPred, a sequence-based approach that employs ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) to predict ion-binding sites using only raw protein sequences. We successfully fine-tuned our pretrained model to predict the binding sites for nine metal ions (Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Ca²⁺, Mg²⁺, Mn²⁺, Na⁺, and K⁺) and four acid radical ion ligands (CO₃²⁻, SO₄²⁻, PO₄³⁻, NO₂⁻). IonPred surpassed six current state-of-the-art tools by over 44.65% and 28.46%, respectively, in the F1 score and MCC when compared on an independent test dataset. Our method is more computationally efficient than existing tools, producing prediction results for a hundred sequences for a specific ion in under ten minutes.


Subject(s)
Metals , Proteins , Ligands , Proteins/chemistry , Binding Sites , Protein Binding , Metals/chemistry , Ions/chemistry
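
The two metrics named in this abstract, the F1 score and the Matthews correlation coefficient (MCC), follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the function names and the per-residue counts are hypothetical and are not taken from the IonPred evaluation.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical per-residue counts for one ion type (binding = positive class).
tp, tn, fp, fn = 40, 900, 20, 40
print(f"F1  = {f1_score(tp, fp, fn):.3f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Because binding residues are usually far rarer than non-binding ones, MCC is often reported alongside F1: it also accounts for true negatives and stays near zero for a predictor that ignores the minority class.
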
7.
Am J Cardiol ; 204: 207-214, 2023 10 01.
Article in English | MEDLINE | ID: mdl-37556889

ABSTRACT

Because the 6-minute walking test (6MWT) is a self-paced submaximal test, the 6-minute walking distance (6MWD) is substantially influenced by individual effort level and physical condition, which is difficult to quantify. We aimed to explore the optimal indicator reflecting the perceived effort level during the 6MWT. We prospectively enrolled 76 patients with pulmonary arterial hypertension and 152 healthy participants; they performed two 6MWTs at two different speeds: (1) at a leisurely speed, as performed in daily life without extra effort (leisure 6MWT) and (2) at an increased walking speed, walking as the guideline indicated (standard 6MWT). The factors associated with 6MWD during the standard 6MWT were investigated using a multiple linear regression analysis. The heart rate (HR) and Borg score increased and oxygen saturation (SpO₂) decreased after walking in both 6MWTs in both groups (all p < 0.001). The ratio of the difference in HR before and after each test (ΔHR) to HR before walking (HR at rest, HR_rest), and the differences in SpO₂ (ΔSpO₂) and Borg score (ΔBorg) before and after each test, were all significantly higher in both groups after the standard 6MWT than after the leisure 6MWT (all p < 0.001). Multiple linear regression analysis revealed that ΔHR/HR_rest was an independent predictor of 6MWD during the standard 6MWT in both groups (both p < 0.001, adjusted R² = 0.737 and 0.49, respectively). 6MWD and ΔHR/HR_rest were significantly lower in patients than in healthy participants (both p < 0.001) and in patients with cardiac functional class III than in patients with class I/II (both p < 0.001). In conclusion, ΔHR/HR_rest is a good reflector of combined physical and effort factors. HR response should be incorporated into 6MWD to better assess a participant's exercise capacity.


Subject(s)
Pulmonary Arterial Hypertension , Humans , Heart Rate , Walk Test , Walking/physiology , Regression Analysis , Exercise Test , Exercise Tolerance
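
The effort indicator described in this abstract, the change in heart rate divided by the resting heart rate, is simple arithmetic. A minimal sketch follows; the function name and the example heart-rate values are hypothetical and are not data from the study.

```python
def relative_hr_response(hr_rest: float, hr_post: float) -> float:
    """Return (HR after walking - HR at rest) / HR at rest."""
    if hr_rest <= 0:
        raise ValueError("resting heart rate must be positive")
    return (hr_post - hr_rest) / hr_rest


# Hypothetical values in beats per minute, for illustration only.
ratio = relative_hr_response(hr_rest=70.0, hr_post=105.0)
print(f"{ratio:.2f}")  # prints 0.50
```

Normalizing by the resting value makes the response comparable between participants whose baseline heart rates differ.
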
8.
Nucleic Acids Res ; 51(W1): W343-W349, 2023 07 05.
Article in English | MEDLINE | ID: mdl-37178004

ABSTRACT

Predicting protein localization and understanding its mechanisms are critical in biology and pathology. In this context, we propose a new web application of MULocDeep with improved performance, result interpretation, and visualization. By transferring the original model into species-specific models, MULocDeep achieved competitive prediction performance at the subcellular level against other state-of-the-art methods. It uniquely provides a comprehensive localization prediction at the suborganellar level. Besides prediction, our web service quantifies the contribution of single amino acids to localization for individual proteins; for a group of proteins, common motifs or potential targeting-related regions can be derived. Furthermore, the visualizations of targeting mechanism analyses can be downloaded for publication-ready figures. The MULocDeep web service is available at https://www.mu-loc.org/.


Subject(s)
Proteins , Software , Amino Acids/metabolism , Computational Biology/methods , Protein Transport , Proteins/chemistry , Internet
9.
Nat Commun ; 14(1): 964, 2023 02 21.
Article in English | MEDLINE | ID: mdl-36810839

ABSTRACT

Single-cell multi-omics (scMulti-omics) allows the quantification of multiple modalities simultaneously to capture the intricacy of complex molecular mechanisms and cellular heterogeneity. Existing tools cannot effectively infer the active biological networks in diverse cell types and the response of these networks to external stimuli. Here we present DeepMAPS for biological network inference from scMulti-omics. It models scMulti-omics in a heterogeneous graph and learns relations among cells and genes within both local and global contexts in a robust manner using a multi-head graph transformer. Benchmarking results indicate DeepMAPS performs better than existing tools in cell clustering and biological network construction. It also showcases competitive capability in deriving cell-type-specific biological networks in lung tumor leukocyte CITE-seq data and matched diffuse small lymphocytic lymphoma scRNA-seq and scATAC-seq data. In addition, we deploy a DeepMAPS webserver equipped with multiple functionalities and visualizations to improve the usability and reproducibility of scMulti-omics data analysis.


Subject(s)
Benchmarking , Data Analysis , Reproducibility of Results , Cluster Analysis , Electric Power Supplies , Single-Cell Analysis
10.
Nat Commun ; 14(1): 812, 2023 02 13.
Article in English | MEDLINE | ID: mdl-36781861

ABSTRACT

Unlike PIWI-interacting RNA (piRNA) in other species that mostly target transposable elements (TEs), >80% of piRNAs in adult mammalian testes lack obvious targets. However, mammalian piRNA sequences and piRNA-producing loci evolve more rapidly than the rest of the genome for unknown reasons. Here, through comparative studies of chickens, ducks, mice, and humans, as well as long-read nanopore sequencing on diverse chicken breeds, we find that piRNA loci across amniotes experience: (1) a high local mutation rate of structural variations (SVs, mutations ≥ 50 bp in size); (2) positive selection to suppress young and actively mobilizing TEs commencing at the pachytene stage of meiosis during germ cell development; and (3) negative selection to purge deleterious SV hotspots. Our results indicate that genetic instability at pachytene piRNA loci, while producing certain pathogenic SVs, also protects genome integrity against TE mobilization by driving the formation of rapid-evolving piRNA sequences.


Subject(s)
Chickens , Germ Cells , Humans , Male , Animals , Mice , RNA, Small Interfering/genetics , RNA, Small Interfering/metabolism , Chickens/genetics , Chickens/metabolism , Germ Cells/metabolism , Testis/metabolism , DNA Transposable Elements/genetics , Piwi-Interacting RNA , Mammals/genetics
11.
Nat Mach Intell ; 5(4): 337-339, 2023 Apr.
Article in English | MEDLINE | ID: mdl-38260002

ABSTRACT

Predicting whether T-cell receptors bind to specific peptides is a challenging problem as the majority of binding examples in the training data involves only a few peptides. A new approach employs meta-learning to improve predictions for binding to peptides for which no or little binding data exists.

12.
Methods Mol Biol ; 2499: 105-124, 2022.
Article in English | MEDLINE | ID: mdl-35696076

ABSTRACT

Phosphorylation plays a vital role in signal transduction and the cell cycle. Identifying and understanding phosphorylation through machine-learning methods has a long history. However, existing methods only learn representations of a protein sequence segment from a labeled dataset itself, which could result in biased or incomplete features, especially for kinase-specific phosphorylation site prediction in which training data are typically sparse. To learn a comprehensive contextual representation of a protein sequence segment for kinase-specific phosphorylation site prediction, we pretrained our model from over 24 million unlabeled sequence fragments using ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). The pretrained model was applied to kinase-specific site prediction of kinases CDK, PKA, CK2, MAPK, and PKC. The pretrained ELECTRA model achieves a 9.02% improvement over BERT and an 11.10% improvement over MusiteDeep in the area under the precision-recall curve on the benchmark data.


Subject(s)
Machine Learning , Protein Kinases , Phosphorylation , Protein Kinases/metabolism
13.
Nucleic Acids Res ; 50(D1): D333-D339, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34551440

ABSTRACT

Resolving the spatial distribution of the transcriptome at a subcellular level can increase our understanding of biology and diseases. To facilitate studies of biological functions and molecular mechanisms in the transcriptome, we updated RNALocate, a resource for RNA subcellular localization analysis that is freely accessible at http://www.rnalocate.org/ or http://www.rna-society.org/rnalocate/. Compared to RNALocate v1.0, the new features in version 2.0 include (i) expansion of the data sources and the coverage of species; (ii) incorporation and integration of RNA-seq datasets containing information about subcellular localization; (iii) addition and reorganization of RNA information (RNA subcellular localization conditions and descriptive figures for method, RNA homology information, RNA interaction and ncRNA disease information) and (iv) three additional prediction tools: DM3Loc, iLoc-lncRNA and iLoc-mRNA. Overall, RNALocate v2.0 provides a comprehensive RNA subcellular localization resource for researchers to deconvolute the highly complex architecture of the cell.


Subject(s)
Databases, Nucleic Acid , RNA, Untranslated/genetics , Software , Transcriptome , Animals , Base Sequence , Cell Compartmentation , Datasets as Topic , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Eukaryotic Cells/cytology , Eukaryotic Cells/metabolism , Gene Expression Regulation , Gene Ontology , Humans , Internet , Mice , Molecular Sequence Annotation , RNA, Untranslated/classification , RNA, Untranslated/metabolism , Rats , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Sequence Alignment , Sequence Homology, Nucleic Acid , Subcellular Fractions/chemistry , Subcellular Fractions/metabolism , Zebrafish/genetics , Zebrafish/metabolism
14.
Comput Struct Biotechnol J ; 19: 5834-5844, 2021.
Article in English | MEDLINE | ID: mdl-34765098

ABSTRACT

The accurate annotation of protein localization is crucial in understanding protein function in tandem with a broad range of applications such as pathological analysis and drug design. Since most proteins do not have experimentally-determined localization information, the computational prediction of protein localization has been an active research area for more than two decades. In particular, recent machine-learning advancements have fueled the development of new methods in protein localization prediction. In this review paper, we first categorize the main features and algorithms used for protein localization prediction. Then, we summarize a list of protein localization prediction tools in terms of their coverage, characteristics, and accessibility to help users find suitable tools based on their needs. Next, we evaluate some of these tools on a benchmark dataset. Finally, we provide an outlook on the future exploration of protein localization methods.

15.
Comput Struct Biotechnol J ; 19: 4825-4839, 2021.
Article in English | MEDLINE | ID: mdl-34522290

ABSTRACT

Prediction of protein localization plays an important role in understanding protein function and mechanisms. In this paper, we propose a general deep learning-based localization prediction framework, MULocDeep, which can predict multiple localizations of a protein at both subcellular and suborganellar levels. We collected a dataset with 44 suborganellar localization annotations in 10 major subcellular compartments, the most comprehensive suborganelle localization dataset to date. We also experimentally generated an independent dataset of mitochondrial proteins in Arabidopsis thaliana cell cultures, Solanum tuberosum tubers, and Vicia faba roots and made this dataset publicly available. Evaluations using the above datasets show that overall, MULocDeep outperforms other major methods at both subcellular and suborganellar levels. Furthermore, MULocDeep assesses each amino acid's contribution to localization, which provides insights into the mechanism of protein sorting and localization motifs. A web server can be accessed at http://mu-loc.org.

16.
Nucleic Acids Res ; 49(W1): W228-W236, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34037802

ABSTRACT

G2PDeep is an open-access web server, which provides a deep-learning framework for quantitative phenotype prediction and discovery of genomic markers. It uses zygosity or single nucleotide polymorphism (SNP) information from plants and animals as the input to predict the quantitative phenotype of interest and genomic markers associated with the phenotype. It provides a one-stop-shop platform for researchers to create deep-learning models through an interactive web interface and train these models with uploaded data, using high-performance computing resources plugged in at the backend. G2PDeep also provides a series of informative interfaces to monitor the training process and compare the performance among the trained models. The trained models can then be deployed automatically. The quantitative phenotype and genomic markers are predicted using a user-selected trained model and the results are visualized. Our state-of-the-art model has been benchmarked by other researchers and has demonstrated competitive performance in quantitative phenotype predictions. In addition, the server integrates the soybean nested association mapping (SoyNAM) dataset with five phenotypes, including grain yield, height, moisture, oil, and protein. A publicly available dataset for seed protein and oil content has also been integrated into the server. The G2PDeep server is publicly available at http://g2pdeep.org. The Python-based deep-learning model is available at https://github.com/shuaizengMU/G2PDeep_model.


Subject(s)
Genetic Markers , Phenotype , Software , Deep Learning , Genomics , Internet , Polymorphism, Single Nucleotide , Glycine max/genetics
17.
Nucleic Acids Res ; 49(8): e46, 2021 05 07.
Article in English | MEDLINE | ID: mdl-33503258

ABSTRACT

Subcellular localization of messenger RNAs (mRNAs), as a prevalent mechanism, gives precise and efficient control for the translation process. There is mounting evidence for the important roles of this process in a variety of cellular events. Computational methods for mRNA subcellular localization prediction provide a useful approach for studying mRNA functions. However, few computational methods were designed for mRNA subcellular localization prediction and their performance has room for improvement. In particular, there is still no tool available to predict localization for mRNAs that have multiple localization annotations. In this paper, we propose a multi-head self-attention method, DM3Loc, for multi-label mRNA subcellular localization prediction. Evaluation results show that DM3Loc outperforms existing methods and tools in general. Furthermore, DM3Loc has the interpretation ability to analyze RNA-binding protein motifs and key signals on mRNAs for subcellular localization. Our analyses found hundreds of instances of mRNA isoform-specific subcellular localizations and many significantly enriched gene functions for mRNAs in different subcellular localizations.


Subject(s)
Computational Biology/methods , Neural Networks, Computer , RNA, Messenger/metabolism , Subcellular Fractions/metabolism , Cell Membrane/genetics , Cell Membrane/metabolism , Cell Nucleus/genetics , Cell Nucleus/metabolism , Cytosol/metabolism , Databases, Genetic , Databases, Protein , Endoplasmic Reticulum/genetics , Endoplasmic Reticulum/metabolism , Exosomes/genetics , Exosomes/metabolism , Gene Ontology , Humans , Proteomics , RNA, Messenger/genetics , Ribosomes/genetics , Ribosomes/metabolism , Transcriptome/genetics
18.
Comput Struct Biotechnol J ; 18: 1877-1883, 2020.
Article in English | MEDLINE | ID: mdl-32774783

ABSTRACT

Pseudouridine synthase binds to uridine sites and catalyzes the conversion of uridine to pseudouridine (Ψ). This binding takes place in a specific context and in the conformation of nucleotides. Most machine-learning methods for Ψ site classification use nucleotide frequency as a feature, which may not fully depict the relevant conformation around a Ψ site. Using the power of deep learning and raw sequence, as well as secondary structure features, our tool MU-PseUDeep is designed to capture both the sequence and secondary structure context, which inputs the raw RNA sequence and the predicted secondary structure to two sets of convolutional neural networks. It has shown considerable improvement in Ψ site prediction over existing tools, XG-PseU, PseUI, and iRNA-PseU for both balanced and imbalanced datasets. To the best of our knowledge, this is the most accurate tool for Ψ site prediction. We also used MU-PseUDeep to scan the human transcriptome, which shows that the genes with predicted Ψ sites are enriched in nucleotide and protein binding, as well as in neurodegeneration pathways. The tool is open source, available at https://github.com/smk5g5/MU-PseUDeep.

19.
Nucleic Acids Res ; 48(W1): W140-W146, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32324217

ABSTRACT

MusiteDeep is an online resource providing a deep-learning framework for protein post-translational modification (PTM) site prediction and visualization. The predictor only uses protein sequences as input and no complex features are needed, which results in a real-time prediction for a large number of proteins. It takes less than three minutes to predict for 1000 sequences per PTM type. The output is presented at the amino acid level for the user-selected PTM types. The framework has been benchmarked by other researchers and has demonstrated competitive performance in PTM site predictions. In this webserver, we updated the previous framework by utilizing more advanced ensemble techniques, and providing prediction and visualization for multiple PTMs simultaneously for users to analyze potential PTM cross-talks directly. Besides prediction, users can interactively review the predicted PTM sites in the context of known PTM annotations and protein 3D structures through homology-based search. In addition, the server maintains a local database providing pre-processed PTM annotations from UniProt/Swiss-Prot for users to download. This database will be updated every three months. The MusiteDeep server is available at https://www.musite.net. The stand-alone tools for locally using MusiteDeep are available at https://github.com/duolinwang/MusiteDeep_web.


Subject(s)
Deep Learning , Protein Processing, Post-Translational , Software , Computer Graphics , Internet , Models, Molecular , Protein Conformation , Proteins/chemistry , Sequence Analysis, Protein
20.
Methods ; 173: 16-23, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31220603

ABSTRACT

Nowadays, large amounts of omics data have been generated and contributed to increasing knowledge about associated biological mechanisms. A new challenge coming along is how to identify the active pathways and extract useful insights from these data with huge background information and noise. Although biologically meaningful modules can often be detected by many existing informatics tools, it is still hard to interpret or make use of the results towards in silico hypothesis generation and testing. To address this gap, we previously developed the IMPRes (Integrative MultiOmics Pathway Resolution) v1.0 algorithm, a new step-wise active pathway detection method using a dynamic programming approach. This approach enables the network detection one step at a time, making it easy for researchers to trace the pathways, and leading to more accurate drug design and more effective treatment strategies. In this paper, we present IMPRes-Pro, an enhancement to IMPRes v1.0 that integrates proteomics data along with transcriptomics data and constructs a heterogeneous background network. The evaluation experiment conducted on a human primary breast cancer dataset has shown the advantage over the original IMPRes v1.0 method. Furthermore, a case study on a human metastatic breast cancer dataset was performed and we have provided several insights regarding the selection of an optimal therapy strategy. The IMPRes-Pro algorithm and visualization tool are available as a web service at http://digbio.missouri.edu/impres.


Subject(s)
Breast Neoplasms/genetics , Computational Biology/methods , Proteomics/methods , Software , Algorithms , Breast Neoplasms/pathology , Computer Graphics , Computer Simulation , Female , Gene Expression Profiling/methods , Humans