Search | Portal Regional da BVS

A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application.

Mokoatle, Mpho; Marivate, Vukosi; Mapiye, Darlington; Bornman, Riana; Hayes, Vanessa M.

BMC Bioinformatics ; 24(1): 112, 2023 Mar 23.

Article in English | MEDLINE | ID: mdl-36959534

ABSTRACT

BACKGROUND: Pretrained convolutional neural networks and conventional machine learning methods have been heavily employed to identify various malignancies, using visual, biological, and electronic health record data as the sole input source. Typically, a series of preprocessing and image segmentation steps is first performed to separate region-of-interest features from noisy ones. The extracted features are then passed to several machine learning and deep learning methods for cancer detection.

METHODS: This work reviews the methods that have been applied to develop machine learning algorithms that detect cancer. Because there are more than 100 types of cancer, the study examines only research on the four most common cancers worldwide: lung, breast, prostate, and colorectal cancer. Next, using two state-of-the-art sentence transformers, SBERT (2019) and the unsupervised SimCSE (2021), the study proposes a new methodology for detecting cancer. The method requires raw DNA sequences of matched tumor/normal pairs as the only input. The learnt DNA representations retrieved from SBERT and SimCSE are then passed to machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification. As far as the authors are aware, SBERT and SimCSE transformers have not previously been applied to represent DNA sequences in cancer detection settings.

RESULTS: The XGBoost model was the best-performing classifier, with the highest overall accuracy of 73 ± 0.13 % using SBERT embeddings and 75 ± 0.12 % using SimCSE embeddings. These findings suggest that incorporating sentence representations from SimCSE's sentence transformer only marginally improved the performance of the machine learning models.
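The abstract does not say how raw DNA is turned into text that a sentence transformer can read. One common convention, assumed here and not confirmed by the abstract, is to split each sequence into overlapping k-mers and treat the result as a "sentence". The minimal sketch below shows only that preparation step for matched tumor/normal pairs; the function names, the choice of k = 6, and the toy sequences are hypothetical, and the embedding and classification stages are deliberately left out.

```python
from typing import List, Tuple


def kmer_sentence(sequence: str, k: int = 6, stride: int = 1) -> str:
    """Turn a DNA string into a space-separated 'sentence' of overlapping k-mers.

    Assumption: k-mer tokenization is used; the abstract does not specify it.
    """
    seq = sequence.upper()
    kmers = [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
    return " ".join(kmers)


def build_dataset(
    pairs: List[Tuple[str, str]], k: int = 6
) -> Tuple[List[str], List[int]]:
    """Flatten matched (tumor, normal) pairs into sentences and 0/1 labels."""
    sentences: List[str] = []
    labels: List[int] = []
    for tumor, normal in pairs:
        sentences.append(kmer_sentence(tumor, k))
        labels.append(1)  # 1 = tumor
        sentences.append(kmer_sentence(normal, k))
        labels.append(0)  # 0 = matched normal
    return sentences, labels


# Toy sequences invented purely for illustration.
pairs = [("ACGTTGCA", "ACGTAGCA")]
sentences, labels = build_dataset(pairs, k=6)
print(sentences[0])  # ACGTTG CGTTGC GTTGCA
print(labels)        # [1, 0]
```

In a full pipeline, the resulting sentences would be embedded by a sentence transformer and the embedding vectors passed, together with the labels, to a classifier such as XGBoost.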

Subjects

Neoplasms; Neural Networks, Computer; Male; Humans; Machine Learning; Algorithms; Neoplasms/diagnostic imaging; Random Forest

Discriminatory Gleason grade group signatures of prostate cancer: An application of machine learning methods.

Mokoatle, Mpho; Mapiye, Darlington; Marivate, Vukosi; Hayes, Vanessa M; Bornman, Riana.

PLoS One ; 17(6): e0267714, 2022.

Article in English | MEDLINE | ID: mdl-35679280

ABSTRACT

One of the most precise methods of detecting prostate cancer is evaluation of a stained biopsy by a pathologist under a microscope. Regions of the tissue are assessed and graded according to the observed histological pattern. However, this is not only laborious but also relies on the experience of the pathologist, and biopsy outcomes tend to be poorly reproducible across pathologists. As a result, computational approaches are being sought, and machine learning has been gaining momentum in the prediction of the Gleason grade group. To date, the machine learning literature has addressed this problem using features from magnetic resonance images, whole slide images, tissue microarrays, gene expression data, and clinical features. However, there is a gap with regard to predicting the Gleason grade group using DNA sequences as the only input source to the machine learning models. In this work, using whole genome sequence data from South African prostate cancer patients, machine learning and biological experiments were combined to understand the challenges associated with predicting the Gleason grade group. A series of binary machine learning classifiers (XGBoost, LSTM, GRU, LR, RF) was created, relying only on DNA sequence input features. None of the models was able to adequately discriminate between the DNA sequences of the studied Gleason grade groups (Gleason grade groups 1 and 5). The models were then evaluated on a different task: distinguishing tumor DNA sequences from matched-normal DNA sequences, again with DNA sequences as the only input source. On this new problem the models performed better than before, with the XGBoost model achieving the highest accuracy of 74 ± 01, F1 score of 79 ± 01, recall of 99 ± 0.0, and precision of 66 ± 0.1.
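The reported XGBoost metrics can be checked against one another, because the F1 score is the harmonic mean of precision and recall. The short sketch below assumes the reported figures are percentages (so 66 means 0.66) and ignores the ± terms; the function and variable names are illustrative only.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Point estimates from the abstract, read as fractions (assumption).
reported_precision = 0.66
reported_recall = 0.99

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}")  # F1 = 0.79
```

The result agrees with the reported F1 score of 79. A recall near 99 combined with a precision of 66 also means that almost every tumor sequence was labeled as tumor, while roughly one in three sequences labeled as tumor was actually a matched-normal sequence.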

Subjects

Prostatic Neoplasms; Biopsy; Humans; Machine Learning; Male; Neoplasm Grading; Prostatic Neoplasms/diagnosis; Prostatic Neoplasms/genetics; Prostatic Neoplasms/pathology; Reproducibility of Results
