Search | VHL Regional Portal

1.

Evaluating Representation Learning on the Protein Structure Universe.

Jamasb, Arian R; Morehead, Alex; Joshi, Chaitanya K; Zhang, Zuobai; Didi, Kieran; Mathis, Simon; Harris, Charles; Tang, Jian; Cheng, Jianlin; Liò, Pietro; Blundell, Tom L.

ArXiv ; 2024 Jun 19.

Article in English | MEDLINE | ID: mdl-38947934

ABSTRACT

We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representations and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.

2.

Bering: joint cell segmentation and annotation for spatial transcriptomics with transferred graph embeddings.

Jin, Kang; Zhang, Zuobai; Zhang, Ke; Viggiani, Francesca; Callahan, Claire; Tang, Jian; Aronow, Bruce J; Shu, Jian.

bioRxiv ; 2023 Sep 22.

Article in English | MEDLINE | ID: mdl-37786667

ABSTRACT

Single-cell spatial transcriptomics such as in-situ hybridization or sequencing technologies can provide subcellular resolution that enables the identification of individual cell identities, locations, and a deep understanding of subcellular mechanisms. However, accurate segmentation and annotation that allows individual cell boundaries to be determined remains a major challenge that limits all the above and downstream insights. Current machine learning methods heavily rely on nuclei or cell body staining, resulting in both a significant loss of transcriptome depth and a limited ability to learn latent representations of spatial colocalization relationships. Here, we propose Bering, a graph deep learning model that leverages transcript colocalization relationships for joint noise-aware cell segmentation and molecular annotation in 2D and 3D spatial transcriptomics data. Graph embeddings for the cell annotation are transferred as a component of multi-modal input for cell segmentation, which is employed to enrich gene relationships throughout the process. To evaluate performance, we benchmarked Bering with state-of-the-art methods and observed significant improvement in cell segmentation accuracies and numbers of detected transcripts across various spatial technologies and tissues. To streamline segmentation processes, we constructed expansive pre-trained models, which yield high segmentation accuracy in new data through transfer learning and self-distillation, demonstrating the generalizability of Bering.

3.

Fast Approximation of Coherence for Second-Order Noisy Consensus Networks.

Zhang, Zuobai; Xu, Wanyue; Yi, Yuhao; Zhang, Zhongzhi.

IEEE Trans Cybern ; 52(1): 677-686, 2022 Jan.

Article in English | MEDLINE | ID: mdl-32011280

ABSTRACT

It has been recently established that for second-order consensus dynamics with additive noise, the performance measures, including the vertex coherence and network coherence defined, respectively, as the steady-state variance of the deviation of each vertex state from the average and the average steady-state variance of the system, are closely related to the biharmonic distances. However, direct computation of biharmonic distances is computationally infeasible for huge networks with millions of vertices. In this article, leveraging the implicit fact that both vertex and network coherence can be expressed in terms of the diagonal entries of the pseudoinverse $L^{2\dagger}$ of the square of the graph Laplacian, we develop a nearly linear-time algorithm to approximate all diagonal entries of $L^{2\dagger}$, which has a theoretically guaranteed error for each diagonal entry. The key ingredient of our approximation algorithm is an integration of the Johnson-Lindenstrauss lemma and Laplacian solvers. Extensive numerical experiments on real-life and model networks are presented, which indicate that our approximation algorithm is both efficient and accurate and is scalable to large-scale networks with millions of vertices.

4.

Coherence Scaling of Noisy Second-Order Scale-Free Consensus Networks.

Xu, Wanyue; Wu, Bin; Zhang, Zuobai; Zhang, Zhongzhi; Kan, Haibin; Chen, Guanrong.

IEEE Trans Cybern ; 52(7): 5923-5934, 2022 Jul.

Article in English | MEDLINE | ID: mdl-33606650

ABSTRACT

A striking discovery in the field of network science is that the majority of real networked systems have some universal structural properties. In general, they are simultaneously sparse, scale-free, small-world, and loopy. In this article, we investigate the second-order consensus of dynamic networks with such universal structures subject to white noise at vertices. We focus on the network coherence $H_{\mathrm{SO}}$ characterized in terms of the $H_2$-norm of the vertex systems, which measures the mean deviation of vertex states from their average value. We first study numerically the coherence of some representative real-world networks. We find that their coherence $H_{\mathrm{SO}}$ scales sublinearly with the vertex number $N$. We then study analytically $H_{\mathrm{SO}}$ for a class of iteratively growing networks, pseudofractal scale-free webs (PSFWs), and obtain an exact solution to $H_{\mathrm{SO}}$, which also increases sublinearly in $N$, with an exponent much smaller than 1. To explain the reasons for this sublinear behavior, we finally study $H_{\mathrm{SO}}$ for Sierpinski gaskets, for which $H_{\mathrm{SO}}$ grows superlinearly in $N$, with a power exponent much larger than 1. Sierpinski gaskets have the same number of vertices and edges as the PSFWs but do not display the scale-free and small-world properties. We thus conclude that the scale-free, small-world, and loopy topologies are jointly responsible for the observed sublinear scaling of $H_{\mathrm{SO}}$.

5.

Publisher Correction: Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data.

Zhao, Yifan; Cai, Huiyu; Zhang, Zuobai; Tang, Jian; Li, Yue.

Nat Commun ; 12(1): 5860, 2021 Oct 01.

Article in English | MEDLINE | ID: mdl-34599193

6.

Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data.

Zhao, Yifan; Cai, Huiyu; Zhang, Zuobai; Tang, Jian; Li, Yue.

Nat Commun ; 12(1): 5261, 2021 Sep 06.

Article in English | MEDLINE | ID: mdl-34489404

ABSTRACT

The advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, large-scale integrative analysis of scRNA-seq data remains a challenge largely due to unwanted batch effects and the limited transferability, interpretability, and scalability of the existing computational methods. We present single-cell Embedded Topic Model (scETM). Our key contribution is the utilization of a transferable neural-network-based encoder while having an interpretable linear decoder via a matrix tri-factorization. In particular, scETM simultaneously learns an encoder network to infer cell type mixture and a set of highly interpretable gene embeddings, topic embeddings, and batch-effect linear intercepts from multiple scRNA-seq datasets. scETM is scalable to over $10^6$ cells and confers remarkable cross-tissue and cross-species zero-shot transfer-learning performance. Using gene set enrichment analysis, we find that scETM-learned topics are enriched in biologically meaningful and disease-related pathways. Lastly, scETM enables the incorporation of known gene sets into the gene embeddings, thereby directly learning the associations between pathways and topics via the topic embeddings.

Subject(s)

Databases, Genetic; Models, Genetic; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Analysis/methods; Alzheimer Disease/genetics; Alzheimer Disease/pathology; Animals; Depressive Disorder, Major/genetics; Depressive Disorder, Major/pathology; Gene Expression Profiling/methods; Gene Expression Profiling/statistics & numerical data; Genes, Mitochondrial; Humans; Mice; Neural Networks, Computer; RNA, Small Cytoplasmic; Retina/cytology; Retina/physiology; Sequence Analysis, RNA/methods
