Search | VHL Regional Portal

Systematic tissue annotations of genomics samples by modeling unstructured metadata.

Hawkins, Nathaniel T; Maldaver, Marc; Yannakopoulos, Anna; Guare, Lindsay A; Krishnan, Arjun.

Nat Commun ; 13(1): 6736, 2022 11 08.

Article in English | MEDLINE | ID: mdl-36347858

ABSTRACT

There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto .

Subject(s)

Metadata , Natural Language Processing , Humans , Machine Learning , Genomics , Language

Polled Digital Cell Sorter (p-DCS): Automatic identification of hematological cell types from single cell RNA-sequencing clusters.

Domanskyi, Sergii; Szedlak, Anthony; Hawkins, Nathaniel T; Wang, Jiayin; Paternostro, Giovanni; Piermarocchi, Carlo.

BMC Bioinformatics ; 20(1): 369, 2019 Jul 01.

Article in English | MEDLINE | ID: mdl-31262249

ABSTRACT

BACKGROUND: Single cell RNA sequencing (scRNA-seq) brings unprecedented opportunities for mapping the heterogeneity of complex cellular environments such as bone marrow, and provides insight into many cellular processes. Single cell RNA-seq has a far larger fraction of missing data reported as zeros (dropouts) than traditional bulk RNA-seq, and unsupervised clustering combined with Principal Component Analysis (PCA) can be used to overcome this limitation. After clustering, however, one has to interpret the average expression of markers on each cluster to identify the corresponding cell types, and this is normally done by hand by an expert curator. RESULTS: We present a computational tool for processing single cell RNA-seq data that uses a voting algorithm to automatically identify cells based on approval votes received by known molecular markers. Using a stochastic procedure that accounts for imbalances in the number of known molecular signatures for different cell types, the method computes the statistical significance of the final approval score and automatically assigns a cell type to clusters without an expert curator. We demonstrate the utility of the tool in the analysis of eight samples of bone marrow from the Human Cell Atlas. The tool provides a systematic identification of cell types in bone marrow based on a list of markers of immune cell types, and incorporates a suite of visualization tools that can be overlaid on a t-SNE representation. The software is freely available as a Python package at https://github.com/sdomanskyi/DigitalCellSorter . CONCLUSIONS: This methodology assures that extensive marker to cell type matching information is taken into account in a systematic way when assigning cell clusters to cell types. Moreover, the method allows for a high throughput processing of multiple scRNA-seq datasets, since it does not involve an expert curator, and it can be applied recursively to obtain cell sub-types. The software is designed to allow the user to substitute the marker to cell type matching information and apply the methodology to different cellular environments.

Subject(s)

Bone Marrow Cells/cytology , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Software , Algorithms , Bone Marrow Cells/metabolism , Cluster Analysis , Humans , Principal Component Analysis , Single-Cell Analysis

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL