Search | VHL Regional Portal

1.

sincFold: end-to-end learning of short- and long-range interactions in RNA secondary structure.

Bugnon, Leandro A; Di Persia, Leandro; Gerard, Matias; Raad, Jonathan; Prochetto, Santiago; Fenoy, Emilio; Chorostecki, Uciel; Ariel, Federico; Stegmayer, Georgina; Milone, Diego H.

Brief Bioinform ; 25(4)2024 May 23.

Article in English | MEDLINE | ID: mdl-38855913

ABSTRACT

MOTIVATION: Coding and noncoding RNA molecules participate in many important biological processes. Noncoding RNAs fold into well-defined secondary structures to exert their functions. However, the computational prediction of the secondary structure from a raw RNA sequence is a long-standing unsolved problem, which after decades of almost unchanged performance has now re-emerged due to deep learning. Traditional RNA secondary structure prediction algorithms have been mostly based on thermodynamic models and dynamic programming for free energy minimization. More recently deep learning methods have shown competitive performance compared with the classical ones, but there is still a wide margin for improvement. RESULTS: In this work we present sincFold, an end-to-end deep learning approach, that predicts the nucleotides contact matrix using only the RNA sequence as input. The model is based on 1D and 2D residual neural networks that can learn short- and long-range interaction patterns. We show that structures can be accurately predicted with minimal physical assumptions. Extensive experiments were conducted on several benchmark datasets, considering sequence homology and cross-family validation. sincFold was compared with classical methods and recent deep learning models, showing that it can outperform the state-of-the-art methods.
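
As an illustration of the "nucleotides contact matrix" mentioned in the abstract, the following minimal Python sketch builds such a matrix from a known secondary structure. It assumes the structure is given in dot-bracket notation without pseudoknots, a common textual encoding that the abstract itself does not mention. The function name and the example string are hypothetical, and this is not the sincFold code.

```python
def dot_bracket_to_contacts(structure):
    """Return an N x N symmetric 0/1 matrix; entry [i][j] is 1 when
    positions i and j form a base pair in the dot-bracket string."""
    n = len(structure)
    contacts = [[0] * n for _ in range(n)]
    open_positions = []  # stack of indices of unmatched "("
    for j, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: extra ')'")
            i = open_positions.pop()
            contacts[i][j] = 1
            contacts[j][i] = 1
    if open_positions:
        raise ValueError("unbalanced structure: extra '('")
    return contacts


# Hypothetical 7-nucleotide hairpin: two base pairs closing a 3-nt loop.
matrix = dot_bracket_to_contacts("((...))")
```

A model of the kind described would output a matrix of this shape directly from the sequence; the sketch only shows the target representation.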

Subject(s)

Computational Biology , Deep Learning , Nucleic Acid Conformation , RNA , RNA/chemistry , RNA/genetics , Computational Biology/methods , Algorithms , Neural Networks, Computer , Thermodynamics

2.

Evaluating large language models for annotating proteins.

Vitale, Rosario; Bugnon, Leandro A; Fenoy, Emilio Luis; Milone, Diego H; Stegmayer, Georgina.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

In UniProtKB, to date, there are more than 251 million proteins deposited. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, although at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein families classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.

Subject(s)

Databases, Protein , Proteins , Proteins/chemistry , Molecular Sequence Annotation/methods , Computational Biology/methods , Machine Learning

3.

Transfer learning: The key to functionally annotate the protein universe.

Bugnon, Leandro A; Fenoy, Emilio; Edera, Alejandro A; Raad, Jonathan; Stegmayer, Georgina; Milone, Diego H.

Patterns (N Y) ; 4(2): 100691, 2023 Feb 10.

Article in English | MEDLINE | ID: mdl-36873903

ABSTRACT

The automatic annotation of the protein universe is still an unresolved challenge. Today, there are 229,149,489 entries in the UniProtKB database, but only 0.25% of them have been functionally annotated. This manual process integrates knowledge from the protein families database Pfam, annotating family domains using sequence alignments and hidden Markov models. This approach has grown the Pfam annotations at a low rate in the last years. Recently, deep learning models appeared with the capability of learning evolutionary patterns from unaligned protein sequences. However, this requires large-scale data, while many families contain just a few sequences. Here, we contend this limitation can be overcome by transfer learning, exploiting the full potential of self-supervised learning on large unannotated data and then supervised learning on a small labeled dataset. We show results where errors in protein family prediction can be reduced by 55% with respect to standard methods.

4.

exp2GO: Improving Prediction of Functions in the Gene Ontology With Expression Data.

Di Persia, Leandro; Lopez, Tiago; Arce, Agustin; Milone, Diego H; Stegmayer, Georgina.

IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 999-1008, 2023.

Article in English | MEDLINE | ID: mdl-35417352

ABSTRACT

The computational methods for the prediction of gene function annotations aim to automatically find associations between a gene and a set of Gene Ontology (GO) terms describing its functions. Since the manual curation of novel annotations and the corresponding wet-lab experimental validations are very time-consuming and costly procedures, there is a need for computational tools that can reliably predict likely annotations and boost the discovery of new gene functions. This work proposes a novel method for predicting annotations based on the inference of GO similarities from expression similarities. The novel method was benchmarked against other methods on several public biological datasets, obtaining the best comparative results. exp2GO effectively improved the prediction of GO annotations in comparison to state-of-the-art methods. Furthermore, the proposal was validated with a full genome case where it was capable of predicting relevant and accurate biological functions. The repository of this project with full data and code is available at https://github.com/sinc-lab/exp2GO.

Subject(s)

Computational Biology , Gene Ontology , Computational Biology/methods , Molecular Sequence Annotation , Phenotype

5.

Hierarchical deep learning for predicting GO annotations by integrating protein knowledge.

Merino, Gabriela A; Saidi, Rabie; Milone, Diego H; Stegmayer, Georgina; Martin, Maria J.

Bioinformatics ; 38(19): 4488-4496, 2022 09 30.

Article in English | MEDLINE | ID: mdl-35929781

ABSTRACT

MOTIVATION: Experimental testing and manual curation are the most precise ways for assigning Gene Ontology (GO) terms describing protein functions. However, they are expensive, time-consuming and cannot cope with the exponential growth of data generated by high-throughput sequencing methods. Hence, researchers need reliable computational systems to help fill the gap with automatic function prediction. The results of the last Critical Assessment of Function Annotation challenge revealed that GO-terms prediction remains a very challenging task. Recent developments on deep learning are significantly breaking out the frontiers leading to new knowledge in protein research thanks to the integration of data from multiple sources. However, deep models hitherto developed for functional prediction are mainly focused on sequence data and have not achieved breakthrough performances yet. RESULTS: We propose DeeProtGO, a novel deep-learning model for predicting GO annotations by integrating protein knowledge. DeeProtGO was trained for solving 18 different prediction problems, defined by the three GO sub-ontologies, the type of proteins, and the taxonomic kingdom. Our experiments reported higher prediction quality when more protein knowledge is integrated. We also benchmarked DeeProtGO against state-of-the-art methods on public datasets, and showed it can effectively improve the prediction of GO annotations. AVAILABILITY AND IMPLEMENTATION: DeeProtGO and a case of use are available at https://github.com/gamerino/DeeProtGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Deep Learning , Gene Ontology , Computational Biology/methods , Molecular Sequence Annotation , Proteins/metabolism

6.

Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks.

Fenoy, Emilio; Edera, Alejandro A; Stegmayer, Georgina.

Brief Bioinform ; 23(4)2022 07 18.

Article in English | MEDLINE | ID: mdl-35758229

ABSTRACT

A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Very recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task that is of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered nowadays and are still totally uncharacterized. However, despite its importance, to date there is no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for very concrete and common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods for accelerating the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach over the benchmark results. We hope the results and the discussion of this study can help the community to select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.
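
Task (i) above, similarity in the embedding space, can be illustrated with a short Python sketch. It assumes cosine similarity as the measure, which is a common choice but is not stated in the abstract. The three-dimensional vectors are made-up toy values, not embeddings produced by any of the benchmarked methods.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings for three proteins.
protein_a = [1.0, 2.0, 3.0]
protein_b = [2.0, 4.0, 6.0]   # same direction as protein_a
protein_c = [3.0, 0.0, -1.0]  # orthogonal to protein_a

similar = cosine_similarity(protein_a, protein_b)
dissimilar = cosine_similarity(protein_a, protein_c)
```

Under this measure, proteins whose embeddings point in the same direction score close to 1, and unrelated ones score close to 0.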

Subject(s)

Computational Biology , Proteins , Algorithms , Amino Acid Sequence , Computational Biology/methods , Machine Learning

7.

Anc2vec: embedding gene ontology terms by preserving ancestors relationships.

Edera, Alejandro A; Milone, Diego H; Stegmayer, Georgina.

Brief Bioinform ; 23(2)2022 03 10.

Article in English | MEDLINE | ID: mdl-35136916

ABSTRACT

The gene ontology (GO) provides a hierarchical structure with a controlled vocabulary composed of terms describing functions and localization of gene products. Recent works propose vector representations, also known as embeddings, of GO terms that capture meaningful information about them. Significant performance improvements have been observed when these representations are used on diverse downstream tasks, such as the measurement of semantic similarity between GO terms and functional similarity between proteins. Despite the success shown by these approaches, existing embeddings of GO terms still fail to capture crucial structural features of the GO. Here, we present anc2vec, a novel protocol based on neural networks for constructing vector representations of GO terms by preserving three important ontological features: its ontological uniqueness, ancestors hierarchy and sub-ontology membership. The advantages of using anc2vec are demonstrated by systematic experiments on diverse tasks: visualization, sub-ontology prediction, inference of structurally related terms, retrieval of terms from aggregated embeddings, and prediction of protein-protein interactions. In these tasks, experimental results show that the performance of anc2vec representations is better than those of recent approaches. This demonstrates that higher performances on diverse tasks can be achieved by embeddings when the structure of the GO is better represented. Full source code and data are available at https://github.com/sinc-lab/anc2vec.

Subject(s)

Semantics , Software , Computational Biology/methods , Gene Ontology , Neural Networks, Computer , Proteins/genetics

8.

miRe2e: a full end-to-end deep model based on transformers for prediction of pre-miRNAs.

Raad, Jonathan; Bugnon, Leandro A; Milone, Diego H; Stegmayer, Georgina.

Bioinformatics ; 38(5): 1191-1197, 2022 02 07.

Article in English | MEDLINE | ID: mdl-34875006

ABSTRACT

MOTIVATION: MicroRNAs (miRNAs) are small RNA sequences with key roles in the regulation of gene expression at post-transcriptional level in different species. Accurate prediction of novel miRNAs is needed due to their importance in many biological processes and their associations with complicated diseases in humans. Many machine learning approaches were proposed in the last decade for this purpose, but they require handcrafted feature extraction to identify possible de novo miRNAs. More recently, the emergence of deep learning (DL) has allowed automatic feature extraction, with models learning relevant representations by themselves. However, the state-of-the-art deep models require complex pre-processing of the input sequences and prediction of their secondary structure to reach an acceptable performance. RESULTS: In this work, we present miRe2e, the first full end-to-end DL model for pre-miRNA prediction. This model is based on Transformers, a neural architecture that uses attention mechanisms to infer global dependencies between inputs and outputs. It is capable of receiving the raw genome-wide data as input, without any pre-processing or feature engineering. After a training stage with known pre-miRNAs, hairpin and non-hairpin sequences, it can identify all the pre-miRNA sequences within a genome. The model has been validated through several experimental setups using the human genome, and it was compared with state-of-the-art algorithms obtaining 10 times better performance. AVAILABILITY AND IMPLEMENTATION: Webdemo available at https://sinc.unl.edu.ar/web-demo/miRe2e/ and source code available for download at https://github.com/sinc-lab/miRe2e. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

MicroRNAs , Humans , MicroRNAs/genetics , MicroRNAs/chemistry , Algorithms , Machine Learning , Genome, Human , Computational Biology

9.

Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning.

Bugnon, Leandro A; Yones, Cristian; Milone, Diego H; Stegmayer, Georgina.

Brief Bioinform ; 22(3)2021 05 20.

Article in English | MEDLINE | ID: mdl-34020552

ABSTRACT

MOTIVATION: The genome-wide discovery of microRNAs (miRNAs) involves identifying sequences having the highest chance of being a novel miRNA precursor (pre-miRNA), within all the possible sequences in a complete genome. The known pre-miRNAs are usually just a few in comparison to the millions of candidates that have to be analyzed. This is of particular interest in non-model species and recently sequenced genomes, where the challenge is to find potential pre-miRNAs only from the sequenced genome. The task is unfeasible without the help of computational methods, such as deep learning. However, it is still very difficult to find an accurate predictor, with a low false positive rate in this genome-wide context. Although there are many available tools, these have not been tested in realistic conditions, with sequences from whole genomes and the high class imbalance inherent to such data. RESULTS: In this work, we review six recent methods for tackling this problem with machine learning. We compare the models on five genome-wide datasets: Arabidopsis thaliana, Caenorhabditis elegans, Anopheles gambiae, Drosophila melanogaster, Homo sapiens. The models have been designed for the pre-miRNA prediction task, where there is a class of interest that is significantly underrepresented (the known pre-miRNAs) with respect to a very large number of unlabeled samples. It was found that for the smaller genomes and smaller imbalances, all methods perform in a similar way. However, for larger datasets such as the H. sapiens genome, it was found that deep learning approaches using raw information from the sequences reached the best scores, achieving low numbers of false positives. AVAILABILITY: The source code to reproduce these results is at: http://sourceforge.net/projects/sourcesinc/files/gwmirna. Additionally, the datasets are freely available at: https://sourceforge.net/projects/sourcesinc/files/mirdata.

Subject(s)

Genome , Machine Learning , MicroRNAs/genetics , RNA Precursors/genetics , Animals , Arabidopsis/genetics , Computational Biology/methods , Humans

10.

Novel SARS-CoV-2 encoded small RNAs in the passage to humans.

Merino, Gabriela A; Raad, Jonathan; Bugnon, Leandro A; Yones, Cristian; Kamenetzky, Laura; Claus, Juan; Ariel, Federico; Milone, Diego H; Stegmayer, Georgina.

Bioinformatics ; 36(24): 5571-5581, 2021 04 05.

Article in English | MEDLINE | ID: mdl-33244583

ABSTRACT

MOTIVATION: The Severe Acute Respiratory Syndrome-Coronavirus 2 (SARS-CoV-2) has recently emerged as the agent responsible for the pandemic outbreak of the coronavirus disease 2019. This virus is closely related to coronaviruses infecting bats and Malayan pangolins, species suspected to be an intermediate host in the passage to humans. Several genomic mutations affecting viral proteins have been identified, contributing to the understanding of the recent animal-to-human transmission. However, the capacity of SARS-CoV-2 to encode functional putative microRNAs (miRNAs) remains largely unexplored. RESULTS: We have used deep learning to discover 12 candidate stem-loop structures hidden in the viral protein-coding genome. Among the precursors, the expression of eight mature miRNA-like sequences was confirmed in small RNA-seq data from SARS-CoV-2 infected human cells. Predicted miRNAs are likely to target a subset of human genes of which 109 are transcriptionally deregulated upon infection. Remarkably, 28 of those genes potentially targeted by SARS-CoV-2 miRNAs are down-regulated in infected human cells. Interestingly, most of them have been related to respiratory diseases and viral infection, including several afflictions previously associated with SARS-CoV-1 and SARS-CoV-2. The comparison of SARS-CoV-2 pre-miRNA sequences with those from bat and pangolin coronaviruses suggests that single nucleotide mutations could have helped its progenitors jump inter-species boundaries, allowing the gain of novel mature miRNAs targeting human mRNAs. Our results suggest that the recent acquisition of novel miRNA-like sequences in the SARS-CoV-2 genome may have contributed to modulating the transcriptional reprogramming of the new host upon infection. AVAILABILITY AND IMPLEMENTATION: https://github.com/sinc-lab/sarscov2-mirna-discovery. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

COVID-19 , Coronavirus , Animals , Betacoronavirus , Coronavirus/genetics , Genome, Viral , Humans , Pandemics , SARS-CoV-2

11.

Deep Neural Architectures for Highly Imbalanced Data in Bioinformatics.

Bugnon, Leandro A; Yones, Cristian; Milone, Diego H; Stegmayer, Georgina.

IEEE Trans Neural Netw Learn Syst ; 31(8): 2857-2867, 2020 08.

Article in English | MEDLINE | ID: mdl-31170082

ABSTRACT

In the postgenome era, many problems in bioinformatics have arisen due to the generation of large amounts of imbalanced data. In particular, the computational classification of precursor microRNA (pre-miRNA) involves a high imbalance in the classes. For this task, a classifier is trained to identify RNA sequences having the highest chance of being miRNA precursors. The big issue is that well-known pre-miRNAs are usually just a few in comparison to the hundreds of thousands of candidate sequences in a genome, which results in highly imbalanced data. This imbalance has a strong influence on most standard classifiers and, if not properly addressed, the classifier is not able to work properly in a real-life scenario. This work provides a comparative assessment of recent deep neural architectures for dealing with the large imbalanced data issue in the classification of pre-miRNAs. We present and analyze recent architectures in a benchmark framework with genomes of animals and plants, with increasing imbalance ratios up to 1:2000. We also propose a new graphical way of comparing classifier performance in the context of high class imbalance. The comparative results obtained show that, at a very high imbalance, deep belief neural networks can provide the best performance.
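
To make the effect of a 1:2000 imbalance ratio concrete, the Python sketch below works through the arithmetic for one positive per 2000 negatives. The sensitivity and specificity values are hypothetical, chosen only for illustration, and are not results from the paper. The function names are likewise assumptions.

```python
def accuracy_all_negative(negatives_per_positive):
    """Accuracy of a trivial classifier that labels every sample negative."""
    return negatives_per_positive / (negatives_per_positive + 1.0)


def precision_under_imbalance(sensitivity, specificity, negatives_per_positive):
    """Expected precision per positive sample, given class-wise rates."""
    true_positives = sensitivity  # per one positive sample
    false_positives = (1.0 - specificity) * negatives_per_positive
    return true_positives / (true_positives + false_positives)


# Imbalance ratio from the abstract: 1 positive per 2000 negatives.
trivial_accuracy = accuracy_all_negative(2000)

# Hypothetical classifier: 90% sensitivity, 99% specificity.
precision = precision_under_imbalance(0.90, 0.99, 2000)
```

With these assumed rates, the trivial classifier exceeds 99.9% accuracy while finding no pre-miRNAs, and the seemingly strong classifier has a precision below 5%. This is why accuracy alone is uninformative at such ratios.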

Subject(s)

Computational Biology/classification , Computational Biology/methods , Databases, Factual/classification , Deep Learning/classification , Neural Networks, Computer , Plants/classification , Animals , Elasticity , Humans

12.

Complexity measures of the mature miRNA for improving pre-miRNAs prediction.

Raad, Jonathan; Stegmayer, Georgina; Milone, Diego H.

Bioinformatics ; 36(8): 2319-2327, 2020 04 15.

Article in English | MEDLINE | ID: mdl-31860057

ABSTRACT

MOTIVATION: The discovery of microRNA (miRNA) in the last decade has certainly changed the understanding of gene regulation in the cell. Although a large number of algorithms with different features have been proposed, they still predict an impractical amount of false positives. Most of the proposed features are based on the structure of precursors of the miRNA only, not considering the important and relevant information contained in the mature miRNA. Such a new kind of feature could certainly improve the performance of the predictors of new miRNAs. RESULTS: This paper presents three new features that are based on the sequence information contained in the mature miRNA. We will show how these new features, when used by a classical supervised machine learning approach as well as by more recent proposals based on deep learning, improve the prediction performance in a significant way. Moreover, several experimental conditions were defined and tested to evaluate the impact of the novel features in situations close to genome-wide analysis. The results show that the incorporation of new features based on the mature miRNA improves the detection of new miRNAs independently of the classifier used. AVAILABILITY AND IMPLEMENTATION: https://sourceforge.net/projects/sourcesinc/files/cplxmirna/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

MicroRNAs , Algorithms , Computational Biology , Genome , MicroRNAs/genetics , Supervised Machine Learning

13.

Alternative use of miRNA-biogenesis co-factors in plants at low temperatures.

Ré, Delfina A; Lang, Patricia L M; Yones, Cristian; Arce, Agustin L; Stegmayer, Georgina; Milone, Diego; Manavella, Pablo A.

Development ; 146(5)2019 03 01.

Article in English | MEDLINE | ID: mdl-30760482

ABSTRACT

Plants use molecular mechanisms to sense temperatures, trigger quick adaptive responses and thereby cope with environmental changes. MicroRNAs (miRNAs) are key regulators of plant development under such conditions. The catalytic action of DICER LIKE 1 (DCL1), in conjunction with HYPONASTIC LEAVES 1 (HYL1) and SERRATE (SE), produces miRNAs from double-stranded RNAs. As plants lack a stable internal temperature to which enzymatic reactions could be optimized during evolution, reactions such as miRNA processing have to be adjusted to fluctuating environmental temperatures. Here, we report that with decreasing ambient temperature, the plant miRNA biogenesis machinery becomes more robust, producing miRNAs even in the absence of the key DCL1 co-factors HYL1 and SE. This reduces the morphological and reproductive defects of se and hyl1 mutants, restoring seed production. Using small RNA-sequencing and bioinformatics analyses, we have identified specific miRNAs that become HYL1/SE independent for their production in response to temperature decrease. We found that the secondary structure of primary miRNAs is key for this temperature recovery. This finding may have evolutionary implications as a potential adaptation-driving mechanism to a changing climate.

Subject(s)

Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Cell Cycle Proteins/metabolism , Gene Expression Regulation, Plant , MicroRNAs/metabolism , RNA-Binding Proteins/metabolism , Ribonuclease III/metabolism , Cold Temperature , Computational Biology , Genes, Plant , Mutation , Phenotype , Pollen/metabolism , Protein Structure, Secondary , Sequence Analysis, RNA

14.

Predicting novel microRNA: a comprehensive comparison of machine learning approaches.

Stegmayer, Georgina; Di Persia, Leandro E; Rubiolo, Mariano; Gerard, Matias; Pividori, Milton; Yones, Cristian; Bugnon, Leandro A; Rodriguez, Tadeo; Raad, Jonathan; Milone, Diego H.

Brief Bioinform ; 20(5): 1607-1620, 2019 09 27.

Article in English | MEDLINE | ID: mdl-29800232

ABSTRACT

MOTIVATION: The importance of microRNAs (miRNAs) is widely recognized in the community nowadays because these short segments of RNA can play several roles in almost all biological processes. The computational prediction of novel miRNAs involves training a classifier for identifying sequences having the highest chance of being precursors of miRNAs (pre-miRNAs). The big issue with this task is that well-known pre-miRNAs are usually few in comparison with the hundreds of thousands of candidate sequences in a genome, which results in high class imbalance. This imbalance has a strong influence on most standard classifiers, and if not properly addressed in the model and the experiments, not only performance reported can be completely unrealistic but also the classifier will not be able to work properly for pre-miRNA prediction. Besides, another important issue is that for most of the machine learning (ML) approaches already used (supervised methods), it is necessary to have both positive and negative examples. The selection of positive examples is straightforward (well-known pre-miRNAs). However, it is difficult to build a representative set of negative examples because they should be sequences with hairpin structure that do not contain a pre-miRNA. RESULTS: This review provides a comprehensive study and comparative assessment of methods from these two ML approaches for dealing with the prediction of novel pre-miRNAs: supervised and unsupervised training. We present and analyze the ML proposals that have appeared during the past 10 years in literature. They have been compared in several prediction tasks involving two model genomes and increasing imbalance levels. This work provides a review of existing ML approaches for pre-miRNA prediction and fair comparisons of the classifiers with same features and data sets, instead of just a revision of published software tools. The results and the discussion can help the community to select the most adequate bioinformatics approach according to the prediction task at hand. The comparative results obtained suggest that from low to mid-imbalance levels between classes, supervised methods can be the best. However, at very high imbalance levels, closer to real case scenarios, models including unsupervised and deep learning can provide better performance.

Subject(s)

Machine Learning , MicroRNAs/physiology , Animals , Computational Biology , Humans , MicroRNAs/chemistry , MicroRNAs/genetics

15.

Clustermatch: discovering hidden relations in highly diverse kinds of qualitative and quantitative data without standardization.

Pividori, Milton; Cernadas, Andres; de Haro, Luis A; Carrari, Fernando; Stegmayer, Georgina; Milone, Diego H.

Bioinformatics ; 35(11): 1931-1939, 2019 06 01.

Article in English | MEDLINE | ID: mdl-30357313

ABSTRACT

MOTIVATION: Heterogeneous and voluminous data sources are common in modern datasets, particularly in systems biology studies. For instance, in multi-holistic approaches in the fruit biology field, data sources can include a mix of measurements such as morpho-agronomic traits, different kinds of molecules (nucleic acids and metabolites) and consumer preferences. These sources not only have different types of data (quantitative and qualitative), but also large amounts of variables with possibly non-linear relationships among them. An integrative analysis is usually hard to conduct, since it requires several manual standardization steps, with a direct and critical impact on the results obtained. These are important issues in clustering applications, which highlight the need for new methods for uncovering complex relationships in such diverse repositories. RESULTS: We designed a new method named Clustermatch to easily and efficiently perform data-mining tasks on large and highly heterogeneous datasets. Our approach can derive a similarity measure between any quantitative or qualitative variables by looking at how they influence the clustering of the biological materials under study. Comparisons with other methods in both simulated and real datasets show that Clustermatch is better suited for finding meaningful relationships in complex datasets. AVAILABILITY AND IMPLEMENTATION: Files can be downloaded from https://sourceforge.net/projects/sourcesinc/files/clustermatch/ and https://bitbucket.org/sinc-lab/clustermatch/. In addition, a web-demo is available at http://sinc.unl.edu.ar/web-demo/clustermatch/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Data Mining , Cluster Analysis , Reference Standards

16.

Metabolic pathways synthesis based on ant colony optimization.

Gerard, Matias F; Stegmayer, Georgina; Milone, Diego H.

Sci Rep ; 8(1): 16398, 2018 11 06.

Article in English | MEDLINE | ID: mdl-30401873

ABSTRACT

One of the current challenges in bioinformatics is to discover new ways to transform a set of compounds into specific products. The usual approach is finding the reactions to synthesize a particular product, from a given substrate, by means of classical searching algorithms. However, they have three main limitations: difficulty in handling large amounts of reactions and compounds; absence of a step that verifies the availability of substrates; and inability to find branched pathways. We present here a novel bio-inspired algorithm for synthesizing linear and branched metabolic pathways. It allows relating several compounds simultaneously, ensuring the availability of substrates for every reaction in the solution. Comparisons with classical searching algorithms and other recent metaheuristic approaches show clear advantages of this proposal, fully recovering well-known pathways. Furthermore, solutions found can be analyzed in a simple way through graphical representations on the web.

Subject(s)

Ants/metabolism , Metabolic Networks and Pathways , Metabolomics/methods , Algorithms , Animals , Behavior, Animal , Feasibility Studies

17.

Whole genome analysis of codon usage in Echinococcus.

Maldonado, Lucas L; Stegmayer, Georgina; Milone, Diego H; Oliveira, Guilherme; Rosenzvit, Mara; Kamenetzky, Laura.

Mol Biochem Parasitol ; 225: 54-66, 2018 10.

Article in English | MEDLINE | ID: mdl-30081061

ABSTRACT

The species of the genus Echinococcus are parasitic platyhelminths that cause echinococcosis and exert a global burden on public and animal health. Here we performed codon usage bias and comparative genomic analyses using whole genome and expression data of three Echinococcus species. The study of 4,710,883 codons, two orders of magnitude more than in previous research works, showed that the codon usage in Echinococcus genes is biased towards codons ending in the pyrimidines T and C, with an average effective number of codons equal to 57, revealing a low codon usage bias. The gene annotations and the expression profile of 7613 genes allowed the accurate determination of 27 optimal codons for the Echinococcus species, most of them ending in G/C. Approximately 30% of the Echinococcus genes analysed exhibit higher codon usage bias as well as a higher expression profile. Neutrality plots demonstrated that selection pressure is the main evolutionary force shaping the codon usage, with a contribution of 80%. Comparative genome analyses among several tapeworm species revealed that codon usage patterns are a conserved trait in cestode parasites. Since cestode parasites take advantage of the host protein synthesis pathways, this study could provide valuable information associated with the parasite-host relationship that would be useful to determine which host factors are relevant for shaping the codon usage.

Subject(s)

Codon , DNA, Helminth/genetics , Echinococcus/genetics , Genome, Helminth , RNA, Helminth/genetics , Animals

18.

Inferring Unknown Biological Function by Integration of GO Annotations and Gene Expression Data.

Leale, Guillermo; Baya, Ariel Emilio; Milone, Diego H; Granitto, Pablo M; Stegmayer, Georgina.

IEEE/ACM Trans Comput Biol Bioinform ; 15(1): 168-180, 2018.

Article in English | MEDLINE | ID: mdl-27723603

ABSTRACT

Characterizing genes with semantic information is an important process regarding the description of gene products. Although the complete genomes of many organisms have already been sequenced, the biological functions of many of their genes are still unknown. Since experimentally studying the functions of those genes, one by one, would be infeasible, new computational methods for gene function inference are needed. We present here a novel computational approach for inferring biological function for a set of genes with previously unknown function, given a set of genes with well-known information. This approach is based on the premise that genes with similar behaviour should be grouped together. This is known as the guilt-by-association principle. Thus, it is possible to take advantage of clustering techniques to obtain groups of unknown genes that are co-clustered with genes that have well-known semantic information (GO annotations). Meaningful knowledge to infer unknown semantic information can therefore be provided by these well-known genes. We provide a method to explore the potential function of new genes according to those currently annotated. The results obtained indicate that the proposed approach could be a useful and effective tool when used by biologists to guide the inference of biological functions for recently discovered genes. Our work sets an important landmark in the field of identifying unknown gene functions through clustering, using an external source of biological input. A simple web interface to this proposal can be found at http://fich.unl.edu.ar/sinc/webdemo/gamma-am/.
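
The guilt-by-association principle mentioned in this abstract can be illustrated with a minimal sketch: once genes have been clustered, each unannotated gene borrows the most frequent annotation among the annotated genes in its cluster. The function name, the majority-vote rule and the toy labels below are assumptions made for this example; the abstract does not describe the authors' actual inference procedure at this level of detail.

```python
from collections import Counter, defaultdict


def infer_annotations(clusters, known):
    """Majority-vote annotation transfer within clusters (illustrative).

    clusters: dict mapping gene name -> cluster id
    known:    dict mapping annotated gene name -> annotation label
    Returns a dict mapping each unannotated gene to the most common
    label among annotated genes in its cluster. Genes whose cluster
    holds no annotated member are left out.
    """
    votes = defaultdict(Counter)
    for gene, cluster_id in clusters.items():
        if gene in known:
            votes[cluster_id][known[gene]] += 1

    inferred = {}
    for gene, cluster_id in clusters.items():
        if gene not in known and votes[cluster_id]:
            inferred[gene] = votes[cluster_id].most_common(1)[0][0]
    return inferred


# Toy data: hypothetical gene names and annotation labels.
clusters = {"g1": 0, "g2": 0, "g3": 0, "u1": 0, "g4": 1, "u2": 1, "u3": 2}
known = {"g1": "term_A", "g2": "term_A", "g3": "term_B", "g4": "term_C"}
predictions = infer_annotations(clusters, known)
```

In this toy run, `u1` and `u2` receive labels from their annotated cluster mates, while `u3` gets no prediction because its cluster contains no annotated gene.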

Subject(s)

Computational Biology/methods , Gene Ontology , Genes/physiology , Machine Learning , Transcriptome/physiology , Arabidopsis/genetics , Arabidopsis/metabolism , Cluster Analysis , Databases, Genetic , Gene Expression Profiling/methods , Genes/genetics , Models, Genetic , Molecular Sequence Annotation/methods , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Transcriptome/genetics

19.

Computational Prediction of Novel miRNAs from Genome-Wide Data.

Stegmayer, Georgina; Yones, Cristian; Kamenetzky, Laura; Macchiaroli, Natalia; Milone, Diego H.

Methods Mol Biol ; 1654: 29-37, 2017.

Article in English | MEDLINE | ID: mdl-28986781

ABSTRACT

The computational prediction of novel microRNAs (miRNAs) within a full genome involves identifying sequences having the highest chance of being bona fide miRNA precursors (pre-miRNAs). These sequences are usually named miRNA candidates. Well-known pre-miRNAs are usually few in comparison to the hundreds of thousands of potential miRNA candidates that have to be analyzed. Although the selection of positive labeled examples is straightforward, it is very difficult to build a set of negative examples in order to obtain a good set of training samples for a supervised method. In this chapter we describe an approach to this problem, based on the unsupervised clustering of unlabeled sequences from genome-wide data, and the well-known miRNA precursors for the organism under study. The protocol developed therefore allows for quick identification of the best miRNA candidates as those sequences clustered together with known precursors.

Subject(s)

Computational Biology/methods , MicroRNAs/genetics , RNA, Long Noncoding/genetics , Animals , Humans

20.

microRNA analysis of Taenia crassiceps cysticerci under praziquantel treatment and genome-wide identification of Taenia solium miRNAs.

Pérez, Matías Gastón; Macchiaroli, Natalia; Lichtenstein, Gabriel; Conti, Gabriela; Asurmendi, Sebastián; Milone, Diego Humberto; Stegmayer, Georgina; Kamenetzky, Laura; Cucher, Marcela; Rosenzvit, Mara Cecilia.

Int J Parasitol ; 47(10-11): 643-653, 2017 09.

Article in English | MEDLINE | ID: mdl-28526608

ABSTRACT

MicroRNAs (miRNAs) are small non-coding RNAs that have emerged as important regulators of gene expression and perform critical functions in development and disease. In spite of the increased interest in miRNAs from helminth parasites, no information is available on miRNAs from Taenia solium, the causative agent of cysticercosis, a neglected disease affecting millions of people worldwide. Here we performed a comprehensive analysis of miRNAs from Taenia crassiceps, a laboratory model for T. solium studies, and identified miRNAs in the T. solium genome. Moreover, we analysed the effect of praziquantel, one of the two main drugs used for cysticercosis treatment, on the miRNA expression profile of T. crassiceps cysticerci. Using small RNA-seq and two independent algorithms for miRNA prediction, as well as northern blot validation, we found transcriptional evidence of 39 miRNA loci in T. crassiceps. Since miRNAs were mapped to the T. solium genome, these miRNAs are considered common to both parasites. The miRNA expression profile of T. crassiceps was biased towards the same set of highly expressed miRNAs reported in other cestodes. We found significantly altered expression of miR-7b under praziquantel treatment. In addition, we searched for miRNAs predicted to target genes related to drug response. We performed a detailed target prediction for miR-7b and found genes related to drug action. We report an initial approach to study the effect of sub-lethal drug treatment on miRNA expression in a cestode parasite, which provides a platform for further studies of miRNA involvement in drug effects. The results of our work could be applied to drug development and provide basic knowledge of cysticercosis and other neglected helminth infections.

Subject(s)

MicroRNAs/genetics , Praziquantel/pharmacology , RNA, Helminth/genetics , Taenia/genetics , Animals , Anthelmintics/pharmacology , Gene Expression Regulation/physiology
