Search | VHL Regional Portal

BioPrediction-RPI: Democratizing the prediction of interaction between non-coding RNA and protein with end-to-end machine learning.

Florentino, Bruno Rafael; Parmezan Bonidia, Robson; Sanches, Natan Henrique; da Rocha, Ulisses N; de Carvalho, André C P L F.

Comput Struct Biotechnol J ; 23: 2267-2276, 2024 Dec.

Article in English | MEDLINE | ID: mdl-38827228

ABSTRACT

Machine Learning (ML) algorithms have been important tools for the extraction of useful knowledge from biological sequences, particularly in healthcare, agriculture, and the environment. However, the categorical and unstructured nature of these sequences requiring usually additional feature engineering steps, before an ML algorithm can be efficiently applied. The addition of these steps to the ML algorithm creates a processing pipeline, known as end-to-end ML. Despite the excellent results obtained by applying end-to-end ML to biotechnology problems, the performance obtained depends on the expertise of the user in the components of the pipeline. In this work, we propose an end-to-end ML-based framework called BioPrediction-RPI, which can identify implicit interactions between sequences, such as pairs of non-coding RNA and proteins, without the need for specialized expertise in end-to-end ML. This framework applies feature engineering to represent each sequence by structural and topological features. These features are divided into feature groups and used to train partial models, whose partial decisions are combined into a final decision, which, provides insights to the user by giving an interpretability report. In our experiments, the developed framework was competitive when compared with various expert-created models. We assessed BioPrediction-RPI with 12 datasets when it presented equal or better performance than all tools in 40% to 100% of cases, depending on the experiment. Finally, BioPrediction-RPI can fine-tune models based on new data and perform at the same level as ML experts, democratizing end-to-end ML and increasing its access to those working in biological sciences.

BioDeepfuse: a hybrid deep learning approach with integrated feature extraction techniques for enhanced non-coding RNA classification.

Avila Santos, Anderson P; de Almeida, Breno L S; Bonidia, Robson P; Stadler, Peter F; Stefanic, Polonca; Mandic-Mulec, Ines; Rocha, Ulisses; Sanches, Danilo S; de Carvalho, André C P L F.

RNA Biol ; 21(1): 1-12, 2024 Jan.

Article in English | MEDLINE | ID: mdl-38528797

ABSTRACT

The accurate classification of non-coding RNA (ncRNA) sequences is pivotal for advanced non-coding genome annotation and analysis, a fundamental aspect of genomics that facilitates understanding of ncRNA functions and regulatory mechanisms in various biological processes. While traditional machine learning approaches have been employed for distinguishing ncRNA, these often necessitate extensive feature engineering. Recently, deep learning algorithms have provided advancements in ncRNA classification. This study presents BioDeepFuse, a hybrid deep learning framework integrating convolutional neural networks (CNN) or bidirectional long short-term memory (BiLSTM) networks with handcrafted features for enhanced accuracy. This framework employs a combination of k-mer one-hot, k-mer dictionary, and feature extraction techniques for input representation. Extracted features, when embedded into the deep network, enable optimal utilization of spatial and sequential nuances of ncRNA sequences. Using benchmark datasets and real-world RNA samples from bacterial organisms, we evaluated the performance of BioDeepFuse. Results exhibited high accuracy in ncRNA classification, underscoring the robustness of our tool in addressing complex ncRNA sequence data challenges. The effective melding of CNN or BiLSTM with external features heralds promising directions for future research, particularly in refining ncRNA classifiers and deepening insights into ncRNAs in cellular processes and disease manifestations. In addition to its original application in the context of bacterial organisms, the methodologies and techniques integrated into our framework can potentially render BioDeepFuse effective in various and broader domains.

Subject(s)

Deep Learning , RNA, Untranslated/genetics , Algorithms , RNA , Neural Networks, Computer

BioAutoML: automated feature engineering and metalearning to predict noncoding RNAs in bacteria.

Bonidia, Robson P; Santos, Anderson P Avila; de Almeida, Breno L S; Stadler, Peter F; da Rocha, Ulisses N; Sanches, Danilo S; de Carvalho, André C P L F.

Brief Bioinform ; 23(4)2022 07 18.

Article in English | MEDLINE | ID: mdl-35753697

ABSTRACT

Recent technological advances have led to an exponential expansion of biological sequence data and extraction of meaningful information through Machine Learning (ML) algorithms. This knowledge has improved the understanding of mechanisms related to several fatal diseases, e.g. Cancer and coronavirus disease 2019, helping to develop innovative solutions, such as CRISPR-based gene editing, coronavirus vaccine and precision medicine. These advances benefit our society and economy, directly impacting people's lives in various areas, such as health care, drug discovery, forensic analysis and food processing. Nevertheless, ML-based approaches to biological data require representative, quantitative and informative features. Many ML algorithms can handle only numerical data, and therefore sequences need to be translated into a numerical feature vector. This process, known as feature extraction, is a fundamental step for developing high-quality ML-based models in bioinformatics, by allowing the feature engineering stage, with design and selection of suitable features. Feature engineering, ML algorithm selection and hyperparameter tuning are often manual and time-consuming processes, requiring extensive domain knowledge. To deal with this problem, we present a new package: BioAutoML. BioAutoML automatically runs an end-to-end ML pipeline, extracting numerical and informative features from biological sequence databases, using the MathFeature package, and automating the feature selection, ML algorithm(s) recommendation and tuning of the selected algorithm(s) hyperparameters, using Automated ML (AutoML). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) Metalearning (algorithm recommendation and hyper-parameter tuning modules). We experimentally evaluate BioAutoML in two different scenarios: (i) prediction of the three main classes of noncoding RNAs (ncRNAs) and (ii) prediction of the eight categories of ncRNAs in bacteria, including housekeeping and regulatory types. To assess BioAutoML predictive performance, it is experimentally compared with two other AutoML tools (RECIPE and TPOT). According to the experimental results, BioAutoML can accelerate new studies, reducing the cost of feature engineering processing and either keeping or improving predictive performance. BioAutoML is freely available at https://github.com/Bonidia/BioAutoML.

Subject(s)

COVID-19 Vaccines , COVID-19 , Algorithms , Bacteria/genetics , Humans , Machine Learning

Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy.

Bonidia, Robson P; Avila Santos, Anderson P; de Almeida, Breno L S; Stadler, Peter F; Nunes da Rocha, Ulisses; Sanches, Danilo S; de Carvalho, André C P L F.

Entropy (Basel) ; 24(10)2022 Oct 01.

Article in English | MEDLINE | ID: mdl-37420418

ABSTRACT

In recent years, there has been an exponential growth in sequencing projects due to accelerated technological advances, leading to a significant increase in the amount of data and resulting in new challenges for biological sequence analysis. Consequently, the use of techniques capable of analyzing large amounts of data has been explored, such as machine learning (ML) algorithms. ML algorithms are being used to analyze and classify biological sequences, despite the intrinsic difficulty in extracting and finding representative biological sequence methods suitable for them. Thereby, extracting numerical features to represent sequences makes it statistically feasible to use universal concepts from Information Theory, such as Tsallis and Shannon entropy. In this study, we propose a novel Tsallis entropy-based feature extractor to provide useful information to classify biological sequences. To assess its relevance, we prepared five case studies: (1) an analysis of the entropic index q; (2) performance testing of the best entropic indices on new datasets; (3) a comparison made with Shannon entropy and (4) generalized entropies; (5) an investigation of the Tsallis entropy in the context of dimensionality reduction. As a result, our proposal proved to be effective, being superior to Shannon entropy and robust in terms of generalization, and also potentially representative for collecting information in fewer dimensions compared with methods such as Singular Value Decomposition and Uniform Manifold Approximation and Projection.

MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors.

Bonidia, Robson P; Domingues, Douglas S; Sanches, Danilo S; de Carvalho, André C P L F.

Brief Bioinform ; 23(1)2022 01 17.

Article in English | MEDLINE | ID: mdl-34750626

ABSTRACT

One of the main challenges in applying machine learning algorithms to biological sequence data is how to numerically represent a sequence in a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques are not available in existing packages, such as mathematical descriptors. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (0.6350-0.9897, accuracy), both applying only mathematical descriptors, but also hybridization with well-known descriptors in the literature. Finally, through MathFeature, we overcame several studies in eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature has advanced in the area by bringing descriptors not available in other packages, as well as allowing non-experts to use feature extraction techniques.

Subject(s)

Proteins , RNA , Algorithms , Amino Acid Sequence , DNA/genetics , Machine Learning , Proteins/chemistry , RNA/genetics

CRISPRloci: comprehensive and accurate annotation of CRISPR-Cas systems.

Alkhnbashi, Omer S; Mitrofanov, Alexander; Bonidia, Robson; Raden, Martin; Tran, Van Dinh; Eggenhofer, Florian; Shah, Shiraz A; Öztürk, Ekrem; Padilha, Victor A; Sanches, Danilo S; de Carvalho, André C P L F; Backofen, Rolf.

Nucleic Acids Res ; 49(W1): W125-W130, 2021 07 02.

Article in English | MEDLINE | ID: mdl-34133710

ABSTRACT

CRISPR-Cas systems are adaptive immune systems in prokaryotes, providing resistance against invading viruses and plasmids. The identification of CRISPR loci is currently a non-standardized, ambiguous process, requiring the manual combination of multiple tools, where existing tools detect only parts of the CRISPR-systems, and lack quality control, annotation and assessment capabilities of the detected CRISPR loci. Our CRISPRloci server provides the first resource for the prediction and assessment of all possible CRISPR loci. The server integrates a series of advanced Machine Learning tools within a seamless web interface featuring: (i) prediction of all CRISPR arrays in the correct orientation; (ii) definition of CRISPR leaders for each locus; and (iii) annotation of cas genes and their unambiguous classification. As a result, CRISPRloci is able to accurately determine the CRISPR array and associated information, such as: the Cas subtypes; cassette boundaries; accuracy of the repeat structure, orientation and leader sequence; virus-host interactions; self-targeting; as well as the annotation of cas genes, all of which have been missing from existing tools. This annotation is presented in an interactive interface, making it easy for scientists to gain an overview of the CRISPR system in their organism of interest. Predictions are also rendered in GFF format, enabling in-depth genome browser inspection. In summary, CRISPRloci constitutes a full suite for CRISPR-Cas system characterization that offers annotation quality previously available only after manual inspection.

Subject(s)

CRISPR-Cas Systems , Clustered Regularly Interspaced Short Palindromic Repeats , Molecular Sequence Annotation , Software , CRISPR-Associated Proteins/classification , CRISPR-Associated Proteins/genetics , Machine Learning

Feature extraction approaches for biological sequences: a comparative study of mathematical features.

Bonidia, Robson P; Sampaio, Lucas D H; Domingues, Douglas S; Paschoal, Alexandre R; Lopes, Fabrício M; de Carvalho, André C P L F; Sanches, Danilo S.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33585910

ABSTRACT

As consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA sequences. Moreover, we separated this work into three studies. First, we assessed our proposal with the most addressed problem in our review, e.g. lncRNA and mRNA; second, we also validate the mathematical features in different classification problems, to predict the class of lncRNA, e.g. circular RNAs sequences; third, we analyze its robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification. Availability:https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.

Subject(s)

Computational Biology/methods , Deep Learning , Models, Theoretical , RNA, Circular/genetics , RNA, Long Noncoding/genetics , RNA, Messenger/genetics , Base Sequence/genetics , Entropy , Fourier Analysis , Humans , Open Reading Frames , RNA, Circular/classification , RNA, Long Noncoding/classification , RNA, Messenger/classification

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL