Search | VHL Regional Portal (test)

Experimental study on short-text clustering using transformer-based semantic similarity measure.

Abdalgader, Khaled; Matroud, Atheer A; Hossin, Khaled.

PeerJ Comput Sci ; 10: e2078, 2024.

Article in English | MEDLINE | ID: mdl-38855231

ABSTRACT

Sentence clustering plays a central role in various text-processing activities and has received extensive attention for measuring semantic similarity between compared sentences. However, relatively little focus has been placed on evaluating clustering performance using available similarity measures that adopt low-dimensional continuous representations. Such representations are crucial in domains like sentence clustering, where traditional word co-occurrence representations often achieve poor results when clustering semantically similar sentences that share no common words. This article presents a new implementation that incorporates a sentence similarity measure based on the notion of embedding representation for evaluating the performance of three types of text clustering methods: partitional clustering, hierarchical clustering, and fuzzy clustering, on standard textual datasets. This measure derives its semantic information from pre-training models designed to simulate human knowledge about words in natural language. The article also compares the performance of the used similarity measure by training it on two state-of-the-art pre-training models to investigate which yields better results. We argue that the superior performance of the selected clustering methods stems from their more effective use of the semantic information offered by this embedding-based similarity measure. Furthermore, we use hierarchical clustering, the best-performing method, for a text summarization task and report the results. The implementation in this article demonstrates that incorporating the sentence embedding measure leads to significantly improved performance in both text clustering and text summarization tasks.
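
As a rough illustration of the idea in this abstract (clustering sentences by the similarity of their embedding vectors rather than by shared words), the sketch below runs average-linkage hierarchical clustering over cosine similarity. It is not the authors' implementation: the function names are hypothetical, and the three-dimensional vectors in `toy_embeddings` are made-up stand-ins for the sentence embeddings a pre-trained transformer model would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def agglomerative(vectors, k):
    """Average-linkage agglomerative clustering down to k clusters.

    Returns a list of clusters, each a sorted list of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None  # (average similarity, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        # Merge the most similar pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical embeddings: sentences 0 and 1 point one way, 2 and 3 another.
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
clusters = agglomerative(toy_embeddings, 2)
```

Because the similarity is computed on dense vectors, two sentences can land in the same cluster even when they share no words, which is the failure case of word co-occurrence representations that the abstract points to.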

NTRFinder: a software tool to find nested tandem repeats.

Matroud, Atheer A; Hendy, M D; Tuffley, C P.

Nucleic Acids Res ; 40(3): e17, 2012 Feb.

Article in English | MEDLINE | ID: mdl-22121222

ABSTRACT

We introduce the software tool NTRFinder to search for a complex repetitive structure in DNA we call a nested tandem repeat (NTR). An NTR is a recurrence of two or more distinct tandem motifs interspersed with each other. We propose that NTRs can be used as phylogenetic and population markers. We have tested our algorithm on both real and simulated data, and present some real NTRs of interest. NTRFinder can be downloaded from http://www.maths.otago.ac.nz/~aamatroud/.

Subjects

Software, Tandem Repeat Sequences, Algorithms, Human Y Chromosomes, Humans, DNA Sequence Analysis
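
To make the structure described in this abstract concrete, the sketch below builds an exact two-motif NTR and recovers its copy counts. It assumes the layout x^{s1} X x^{s2} X ... x^{sn} X, which is one reading of "tandem motifs interspersed with each other"; the function names and the example motifs are hypothetical, and this is not NTRFinder's search algorithm, which handles approximate copies.

```python
import re


def build_exact_ntr(x, X, counts):
    """Concatenate, for each s in counts, s copies of x followed by one X."""
    return "".join(x * s + X for s in counts)


def inner_copy_counts(seq, x, X):
    """Return the list of x copy counts if seq is an exact NTR, else None.

    Assumes the motifs are chosen so the split is unambiguous
    (for example, X does not begin with a copy of x).
    """
    block = re.compile(r"((?:%s)+)%s" % (re.escape(x), re.escape(X)))
    pos, counts = 0, []
    while True:
        m = block.match(seq, pos)
        if m is None:
            break
        counts.append(len(m.group(1)) // len(x))
        pos = m.end()
    # Valid only if the blocks cover the whole sequence.
    return counts if counts and pos == len(seq) else None


ntr = build_exact_ntr("AC", "GGT", [2, 1, 3])
recovered = inner_copy_counts(ntr, "AC", "GGT")
```

Real NTRs contain mutated copies, so exact matching like this would only serve as a template to align candidate regions against.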

An algorithm to solve the motif alignment problem for approximate nested tandem repeats in biological sequences.

Matroud, Atheer A; Tuffley, Christopher P; Hendy, Michael D.

J Comput Biol ; 18(9): 1211-8, 2011 Sep.

Article in English | MEDLINE | ID: mdl-21899426

ABSTRACT

An approximate nested tandem repeat (NTR) in a string T is a complex repetitive structure consisting of many approximate copies of two substrings x and X ("motifs") interspersed with one another. NTRs fall into a class of repetitive structures broadly known as subrepeats. NTRs have been found in real DNA sequences and are expected to be important in evolutionary biology, both in understanding evolution of the ribosomal DNA (where NTRs can occur), and as a potential marker in population genetic and phylogenetic studies. This article describes an alignment algorithm for the verification phase of the software tool NTRFinder developed for database searches for NTRs. When the search algorithm has located a subsequence containing a possible NTR, with motifs X and x, a verification step aligns this subsequence against an exact NTR built from the templates X and x, to determine whether the subsequence contains an approximate NTR and its extent. This article describes an algorithm to solve this alignment problem in O(|T|(|X| + |x|)) space and time. The algorithm is based on Fischetti et al.'s wrap-around dynamic programming.

Subjects

Algorithms, Sequence Alignment/methods, DNA Sequence Analysis/methods, Tandem Repeat Sequences, Data Mining/methods
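
The wrap-around dynamic programming mentioned in this abstract can be sketched for the simpler single-motif case: aligning a text against a whole number of tandem copies of one motif when the copy count is unknown. This is not the article's NTR algorithm, which aligns against two motifs X and x. The function name, the unit edit costs, and the requirement that the alignment end on a motif boundary are all assumptions made for illustration.

```python
def wraparound_distance(text, motif):
    """Edit distance between text and the closest string motif^k, k >= 0.

    State j (0 <= j < m) means "j motif characters consumed, modulo m".
    Each row depends on itself cyclically through deletions of motif
    characters, which is resolved by sweeping the row twice.
    """
    m = len(motif)
    if m == 0:
        raise ValueError("motif must be non-empty")
    # Row 0: reach state j by deleting j motif characters.
    prev = list(range(m))
    for ch in text:
        cur = []
        for j in range(m):
            k = (j - 1) % m  # state we came from; wraps from 0 back to m - 1
            diag = prev[k] + (ch != motif[k])  # match or substitute
            up = prev[j] + 1                   # extra character in text
            cur.append(min(diag, up))
        # Wrap-around step: propagate motif-character deletions around the
        # cycle. Two sweeps suffice because going all the way round costs m.
        for _ in range(2):
            for j in range(m):
                cur[j] = min(cur[j], cur[(j - 1) % m] + 1)
        prev = cur
    return prev[0]  # end on a motif boundary (whole copies only)


exact = wraparound_distance("ACGACGACG", "ACG")
one_substitution = wraparound_distance("ACGACTACG", "ACG")
one_deletion = wraparound_distance("ACGAGACG", "ACG")
```

Each row costs O(|motif|), so the whole table takes O(|text| · |motif|) time, which matches the shape of the O(|T|(|X| + |x|)) bound quoted in the abstract for the two-motif case.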
