Search | Portal Regional da BVS

Semi-supervised learning and bidirectional decoding for effective grammar correction in low-resource scenarios.

Mahmoud, Zeinab; Li, Chunlin; Zappatore, Marco; Solyman, Aiman; Alfatemi, Ali; Ibrahim, Ashraf Osman; Abdelmaboud, Abdelzahir.

PeerJ Comput Sci ; 9: e1639, 2023.

Article in English | MEDLINE | ID: mdl-38077556

ABSTRACT

The correction of grammatical errors in natural language processing is a crucial task as it aims to enhance the accuracy and intelligibility of written language. However, developing a grammatical error correction (GEC) framework for low-resource languages presents significant challenges due to the lack of available training data. This article proposes a novel GEC framework for low-resource languages, using Arabic as a case study. To generate more training data, we propose a semi-supervised confusion method called the equal distribution of synthetic errors (EDSE), which generates a wide range of parallel training data. Additionally, this article addresses two limitations of the classical seq2seq GEC model: unbalanced outputs due to the unidirectional decoder, and exposure bias during inference. To overcome these limitations, we apply a knowledge distillation technique from neural machine translation. This method uses two decoders, a forward (left-to-right) decoder and a backward (right-to-left) decoder, and measures their agreement using Kullback-Leibler divergence as a regularization term. The experimental results on two benchmarks demonstrate that our proposed framework outperforms the Transformer baseline and two widely used bidirectional decoding techniques, namely asynchronous and synchronous bidirectional decoding. Furthermore, the proposed framework achieved the highest F1 score, and generating synthetic data using the equal distribution technique for syntactic errors resulted in a significant improvement in performance. These findings demonstrate the effectiveness of the proposed framework for improving grammatical error correction for low-resource languages, particularly for the Arabic language.
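
The agreement term between two decoders that read in opposite directions can be illustrated with a small sketch. This is not the authors' implementation: the function names, the symmetric form of the divergence, the per-position averaging, the `weight` parameter, and the assumption that the backward decoder emits positions in reverse order are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Turn one position's raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def agreement_penalty(forward_logits, backward_logits):
    """Mean symmetric KL between the two decoders, position by position.

    Assumption: backward_logits are listed in the order the backward
    decoder generated them, so they are reversed to align with the
    forward decoder before comparison.
    """
    aligned_backward = list(reversed(backward_logits))
    total = 0.0
    for f, b in zip(forward_logits, aligned_backward):
        p, q = softmax(f), softmax(b)
        total += 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return total / len(forward_logits)

def regularized_loss(nll_forward, nll_backward, forward_logits,
                     backward_logits, weight=0.5):
    """Sum of both decoders' likelihood losses plus the weighted agreement term."""
    penalty = agreement_penalty(forward_logits, backward_logits)
    return nll_forward + nll_backward + weight * penalty
```

When both decoders assign the same distribution to every aligned position, the penalty is zero and the loss reduces to the two likelihood terms; disagreement increases the loss in proportion to `weight`.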

Patient subgrouping with distinct survival rates via integration of multiomics data on a Grassmann manifold.

Alfatemi, Ali; Peng, Hong; Rong, Wentao; Zhang, Bin; Cai, Hongmin.

BMC Med Inform Decis Mak ; 22(1): 190, 2022 07 23.

Article in English | MEDLINE | ID: mdl-35870923

ABSTRACT

BACKGROUND: Patient subgroups are important for easily understanding a disease and for providing precise yet personalized treatment through multiple omics dataset integration. Multiomics datasets are produced daily. Thus, the fusion of heterogeneous big data into intrinsic structures is an urgent problem. Novel mathematical methods are needed to process these data in a straightforward way. RESULTS: We developed a novel method for subgrouping patients with distinct survival rates via the integration of multiple omics datasets and by using principal component analysis to reduce the high data dimensionality. Then, we constructed similarity graphs for patients, merged the graphs in a subspace, and analyzed them on a Grassmann manifold. The proposed method could identify patient subgroups that had not been reported previously by selecting the most critical information during the merging at each level of the omics dataset. Our method was tested on empirical multiomics datasets from The Cancer Genome Atlas. CONCLUSION: Through the integration of microRNA, gene expression, and DNA methylation data, our method accurately identified patient subgroups and achieved superior performance compared with popular methods.
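
The pipeline outlined in the RESULTS (dimensionality reduction, one similarity graph per omics layer, merging in a subspace) can be sketched roughly as follows. The abstract does not give the exact merging rule, so this sketch substitutes a commonly used Grassmann-style combination of graph Laplacians; the Gaussian kernel, the `alpha` trade-off, and all function names are assumptions, not the authors' method.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project patients (rows) onto the top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def similarity_graph(X, sigma=1.0):
    """Gaussian-kernel patient similarity matrix (assumed kernel choice)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def normalized_laplacian(W):
    """Symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def bottom_eigenvectors(L, k):
    """Eigenvectors for the k smallest eigenvalues (eigh sorts ascending)."""
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k]

def merged_subspace(views, k, n_components=2, alpha=0.5):
    """Merge per-layer graphs into one k-dimensional patient embedding.

    Each layer contributes its Laplacian and its own spectral subspace U;
    subtracting alpha * U U^T pulls the merged subspace toward the
    per-layer subspaces (small projection distance on the Grassmann manifold).
    """
    laplacians = [
        normalized_laplacian(similarity_graph(pca_reduce(X, n_components)))
        for X in views
    ]
    subspaces = [bottom_eigenvectors(L, k) for L in laplacians]
    L_merged = sum(laplacians) - alpha * sum(U @ U.T for U in subspaces)
    return bottom_eigenvectors(L_merged, k)
```

The rows of the returned matrix are per-patient coordinates; clustering them (for example with k-means) would yield candidate subgroups, whose survival curves could then be compared.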

Subjects

MicroRNAs, Neoplasms, DNA Methylation, Genome, Humans, Neoplasms/genetics, Survival Rate
