Graggle: A Graph-based Approach to Document Clustering

King, I. J.; Huang, H. H.

2022 IEEE International Conference on Big Data, Big Data 2022: 748-755, 2022.

Article in English | Scopus | ID: covidwho-2266556

ABSTRACT

Document recommendation systems have traditionally relied upon high-dimensional vector representations that scale poorly in corpora with diverse vocabularies. Existing graph-based approaches focus on the metadata of documents and, unfortunately, ignore the content of the papers. In this work, we have designed and implemented a new system we call Graggle, which builds a graph to model a corpus. Nodes are papers, and edges represent significant words shared between them. We then leverage modern graph learning techniques to turn this graph into a highly efficient tool for dimensionality reduction. Documents are represented as low-dimensional vector embeddings generated with a graph autoencoder. Our experiments show that this approach outperforms traditional document vector-based and text autoencoding approaches on labeled data. Additionally, we have applied this technique to a repository of unlabeled research documents about the novel coronavirus to demonstrate its effectiveness as a real-world tool. © 2022 IEEE.
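
The graph construction described in the abstract (nodes are papers, edges are significant words shared between them) can be illustrated with a minimal sketch. The abstract does not say how a word is judged "significant", so the TF-IDF ranking, the `top_k` cutoff, the function names `significant_words` and `build_graph`, and the toy corpus below are all assumptions made for illustration, not the authors' method. The graph autoencoder that produces the final embeddings is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def significant_words(docs, top_k=3):
    """Return the top_k highest TF-IDF words for each tokenized document.

    Assumption: TF-IDF stands in for whatever significance test the
    original system uses; the abstract does not specify one.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by score descending, then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(set(ranked[:top_k]))
    return result


def build_graph(docs, top_k=3):
    """Map each document pair (i, j), i < j, to the significant words they share."""
    keywords = significant_words(docs, top_k)
    edges = {}
    for i, j in combinations(range(len(docs)), 2):
        shared = keywords[i] & keywords[j]
        if shared:
            edges[(i, j)] = shared
    return edges


# Hypothetical four-document corpus, already tokenized.
corpus = [
    "graph autoencoder embedding graph".split(),
    "graph clustering embedding documents".split(),
    "virus protein vaccine virus".split(),
    "vaccine trial protein dose".split(),
]
edges = build_graph(corpus, top_k=3)
print(edges)
```

In this sketch an edge exists only when two documents share at least one top-ranked word, so documents on unrelated topics stay disconnected. The resulting edge dictionary could then be handed to a graph learning library as an edge list; that step depends on the chosen library and is left out here.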

Keywords

Data mining, graph analytics, recommender systems, text mining, Graphic methods, Learning systems, Dimensional vectors, Document Clustering, Document recommendation, Graph-analytic, Graph-based, High-dimensional, Higher-dimensional, Learning techniques, Text-mining, Vector representations

Full text: Available
Collection: Databases of international organizations
Database: Scopus
Language: English
Journal: 2022 IEEE International Conference on Big Data, Big Data 2022
Year: 2022
Document Type: Article