ABSTRACT
Artificial intelligence is transforming many fields that affect people's lives and health. One of its most critical applications is the study of tumors such as glioblastoma (GBM), whose behavior must be understood in order to develop effective therapies. Advances in single-cell RNA sequencing (scRNA-seq) make it possible to characterize the cellular and molecular heterogeneity of GBM. Because these tumors contain different cell groups, Machine Learning (ML) algorithms are needed to extract information that helps explain how the cancer changes and that broadens the search for effective treatments. We propose multiple comparisons of ML algorithms for classifying cell groups from GBM scRNA-seq data. This broad comparison can show the scientific and medical community which models achieve the best performance on this task. The following cell groups are classified: Tumor Core (TC), Tumor Periphery (TP) and Normal Periphery (NP), in binary and multi-class scenarios. This work also presents the biomarker candidates found by the best-performing models. The analyses presented here allow us to verify these biomarker candidates in order to understand the genetic characteristics of GBM, which may be affected by a suitable identification of GBM heterogeneity. For the four scenarios covered, this work obtained cross-validation results of $93.03\% \pm 5.37\%$, $97.42\% \pm 3.94\%$, $98.27\% \pm 1.81\%$ and $93.04\% \pm 6.88\%$ for the classification of TP versus TC, TP versus NP, NP versus TP and TC (TPC), and NP versus TP versus TC, respectively.
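Cross-validation results such as those above are typically reported as the mean and spread of a score over folds. The following is a minimal sketch of that bookkeeping, not the authors' pipeline: the function names `kfold_indices` and `summarize_scores`, the contiguous fold layout, the use of the sample standard deviation, and the per-fold scores are all assumptions made for illustration.

```python
import statistics


def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def summarize_scores(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores, in percent."""
    mean = statistics.mean(fold_scores) * 100
    std = statistics.stdev(fold_scores) * 100
    return mean, std


if __name__ == "__main__":
    # Hypothetical per-fold accuracies, not values from the study.
    hypothetical_scores = [0.90, 0.95, 1.00, 0.85, 0.95]
    mean, std = summarize_scores(hypothetical_scores)
    print(f"{mean:.2f}% ± {std:.2f}%")
```

In practice each fold in turn would serve as the held-out test set while a model is trained on the remaining folds; the abstract does not state whether the folds were stratified by cell group or which standard deviation estimator was used.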
Subjects
Glioblastoma, Humans, Glioblastoma/genetics, Glioblastoma/pathology, Artificial Intelligence, Biomarkers, Machine Learning, RNA Sequence Analysis/methods, Single-Cell Analysis/methods
ABSTRACT
LTR-retrotransposons are the most abundant repeat sequences in plant genomes and play an important role in evolution and biodiversity. Characterizing them is essential to understanding their dynamics. However, the identification and classification of these elements remain a challenge. Moreover, current software can be relatively slow (from hours to days), sometimes involves a lot of manual work, and does not reach satisfactory levels of precision and sensitivity. Here we present Inpactor2, an accurate and fast application that creates LTR-retrotransposon reference libraries in a very short time. Inpactor2 takes an assembled genome as input and follows a hybrid approach (deep learning and structure-based) to detect elements, filter partial sequences and finally classify intact sequences into superfamilies and, as very few tools do, into lineages. The tool takes advantage of multi-core and GPU architectures to decrease execution times. Using the rice genome, Inpactor2 showed a run time of 5 minutes (faster than other tools) and has the best accuracy and F1-Score of the tools tested here, also having the second best accuracy and specificity only surpassed by EDTA, but achieving 28% higher sensitivity. For large genomes, Inpactor2 is up to seven times faster than other available bioinformatics tools.
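The metrics named above (accuracy, sensitivity, specificity, precision and F1-Score) all derive from the four counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are assumptions for illustration and are not taken from the Inpactor2 benchmark.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts: 8 elements correctly detected, 2 false detections,
    # 9 correct rejections and 1 missed element.
    for name, value in binary_metrics(tp=8, fp=2, tn=9, fn=1).items():
        print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity pull in different directions, a tool can rank second on specificity while leading on sensitivity, which is why several metrics are reported side by side.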
Subjects
Deep Learning, Retroelements, Retroelements/genetics, Terminal Repeat Sequences/genetics, Plant Genome, Software, Molecular Evolution, Phylogeny
ABSTRACT
Transposable elements (TEs) are mobile sequences that can move and insert themselves into chromosomes. They are activated by internal or external stimuli and give the organism the ability to adapt to its environment. Annotating transposable elements in genomic data is currently considered a crucial task for understanding key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA. To annotate these elements, a reference library is usually created and then curated to eliminate TE fragments and false positives, after which the elements are annotated in the genome using the homology method. However, the curation process can take weeks and requires extensive manual work and the execution of multiple time-consuming bioinformatics tools. Here, we propose a machine learning-based approach that performs this process automatically on plant genomes, obtaining an F1-score of up to 91.18%. The approach was tested on four plant species, obtaining an F1-score of up to 93.6% (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.
Subjects
Retroelements, Terminal Repeat Sequences, DNA Transposable Elements, Molecular Evolution, Plant Genome, Machine Learning, Plants/genetics, Retroelements/genetics
ABSTRACT
Every day more plant genomes become available in public databases, and additional massive sequencing projects (i.e., projects that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Although several bioinformatic tools follow different approaches to detect and classify them, none of these tools can obtain accurate results on its own. Here, we used Machine Learning algorithms based on k-mer counts to distinguish LTR retrotransposons from other genomic sequences and to classify them into lineages/families with an F1-Score of 95%, contributing to the development of an alignment-free and automatic method to analyze these sequences.
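A k-mer count representation turns a DNA sequence of any length into a fixed-length numeric vector that an ML classifier can consume without alignment. The sketch below illustrates the general idea only: the function names `kmer_counts` and `kmer_vector`, the choice of k and the handling of ambiguous bases are assumptions, and the abstract does not specify the feature settings actually used.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(sequence, k, alphabet="ACGT"):
    """Return counts for all len(alphabet)**k possible k-mers in a fixed order.

    k-mers containing characters outside the alphabet (for example N)
    have no slot in the vector and are therefore ignored.
    """
    counts = kmer_counts(sequence, k)
    return [counts.get("".join(p), 0) for p in product(alphabet, repeat=k)]


if __name__ == "__main__":
    # Toy sequence; real inputs would be candidate genomic sequences.
    vector = kmer_vector("ACGTAC", k=2)
    print(len(vector), sum(vector))
```

Each sequence yields a vector of the same length (4^k for the DNA alphabet), so sequences of different lengths can be stacked into one feature matrix; in practice the counts are often normalized by sequence length before training.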