Hierarchical deep learning for predicting GO annotations by integrating protein knowledge.

Merino, Gabriela A; Saidi, Rabie; Milone, Diego H; Stegmayer, Georgina; Martin, Maria J.

Bioinformatics ; 38(19): 4488-4496, 2022 09 30.

Article in English | MEDLINE | ID: mdl-35929781

ABSTRACT

MOTIVATION: Experimental testing and manual curation are the most precise ways of assigning Gene Ontology (GO) terms describing protein functions. However, they are expensive, time-consuming and cannot cope with the exponential growth of data generated by high-throughput sequencing methods. Hence, researchers need reliable computational systems to help fill the gap with automatic function prediction. The results of the last Critical Assessment of Function Annotation challenge revealed that GO term prediction remains a very challenging task. Recent developments in deep learning are pushing the frontiers of protein research, yielding new knowledge thanks to the integration of data from multiple sources. However, deep models hitherto developed for functional prediction are mainly focused on sequence data and have not achieved breakthrough performances yet. RESULTS: We propose DeeProtGO, a novel deep-learning model for predicting GO annotations by integrating protein knowledge. DeeProtGO was trained for solving 18 different prediction problems, defined by the three GO sub-ontologies, the type of proteins, and the taxonomic kingdom. Our experiments reported higher prediction quality when more protein knowledge is integrated. We also benchmarked DeeProtGO against state-of-the-art methods on public datasets, and showed it can effectively improve the prediction of GO annotations. AVAILABILITY AND IMPLEMENTATION: DeeProtGO and a use case are available at https://github.com/gamerino/DeeProtGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Deep Learning , Gene Ontology , Computational Biology/methods , Molecular Sequence Annotation , Proteins/metabolism

Novel SARS-CoV-2 encoded small RNAs in the passage to humans.

Merino, Gabriela A; Raad, Jonathan; Bugnon, Leandro A; Yones, Cristian; Kamenetzky, Laura; Claus, Juan; Ariel, Federico; Milone, Diego H; Stegmayer, Georgina.

Bioinformatics ; 36(24): 5571-5581, 2021 04 05.

Article in English | MEDLINE | ID: mdl-33244583

ABSTRACT

MOTIVATION: The Severe Acute Respiratory Syndrome-Coronavirus 2 (SARS-CoV-2) has recently emerged as the agent responsible for the pandemic outbreak of the coronavirus disease 2019. This virus is closely related to coronaviruses infecting bats and Malayan pangolins, species suspected to be an intermediate host in the passage to humans. Several genomic mutations affecting viral proteins have been identified, contributing to the understanding of the recent animal-to-human transmission. However, the capacity of SARS-CoV-2 to encode functional putative microRNAs (miRNAs) remains largely unexplored. RESULTS: We have used deep learning to discover 12 candidate stem-loop structures hidden in the viral protein-coding genome. Among the precursors, the expression of eight mature miRNA-like sequences was confirmed in small RNA-seq data from SARS-CoV-2 infected human cells. Predicted miRNAs are likely to target a subset of human genes of which 109 are transcriptionally deregulated upon infection. Remarkably, 28 of those genes potentially targeted by SARS-CoV-2 miRNAs are down-regulated in infected human cells. Interestingly, most of them have been related to respiratory diseases and viral infection, including several afflictions previously associated with SARS-CoV-1 and SARS-CoV-2. The comparison of SARS-CoV-2 pre-miRNA sequences with those from bat and pangolin coronaviruses suggests that single nucleotide mutations could have helped its progenitors jump inter-species boundaries, allowing the gain of novel mature miRNAs targeting human mRNAs. Our results suggest that the recent acquisition of novel miRNA-like sequences in the SARS-CoV-2 genome may have contributed to modulating the transcriptional reprogramming of the new host upon infection. AVAILABILITY AND IMPLEMENTATION: https://github.com/sinc-lab/sarscov2-mirna-discovery. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

COVID-19 , Coronavirus , Animals , Betacoronavirus , Coronavirus/genetics , Genome, Viral , Humans , Pandemics , SARS-CoV-2

Genetic Diversity, Population Structure and Linkage Disequilibrium Assessment among International Sunflower Breeding Collections.

Filippi, Carla V; Merino, Gabriela A; Montecchia, Juan F; Aguirre, Natalia C; Rivarola, Máximo; Naamati, Guy; Fass, Mónica I; Álvarez, Daniel; Di Rienzo, Julio; Heinz, Ruth A; Contreras Moreira, Bruno; Lia, Verónica V; Paniego, Norma B.

Genes (Basel) ; 11(3), 2020 03 06.

Article in English | MEDLINE | ID: mdl-32155892

ABSTRACT

Sunflower germplasm collections are valuable resources for broadening the genetic base of commercial hybrids and ameliorating the risk of climate events. Nowadays, the most studied worldwide sunflower pre-breeding collections belong to INTA (Argentina), INRA (France), and USDA-UBC (United States of America-Canada). In this work, we assess the amount and distribution of genetic diversity (GD) available within and between these collections to estimate the distribution pattern of global diversity. A mixed genotyping strategy was implemented, by combining proprietary genotyping-by-sequencing data with public whole-genome-sequencing data, to generate an integrative 11,834-common single nucleotide polymorphism matrix including the three breeding collections. In general, the GD estimates obtained were moderate. An analysis of molecular variance provided evidence of population structure between breeding collections. However, the optimal number of subpopulations, studied via discriminant analysis of principal components (K = 12), the Bayesian STRUCTURE algorithm (K = 6) and distance-based methods (K = 9), remains unclear, since no single unifying characteristic is apparent for any of the inferred groups. Different overall patterns of linkage disequilibrium (LD) were observed across chromosomes, with Chr10, Chr17, Chr5, and Chr2 showing the highest LD. This work represents the largest and most comprehensive inter-breeding collection analysis of genomic diversity for cultivated sunflower conducted to date.

Subject(s)

Helianthus/genetics , Linkage Disequilibrium , Polymorphism, Genetic , Seed Bank , Chromosomes, Plant/genetics , Plant Breeding/methods

Massive integrative gene set analysis enables functional characterization of breast cancer subtypes.

Rodriguez, Juan C; Merino, Gabriela A; Llera, Andrea S; Fernández, Elmer A.

J Biomed Inform ; 93: 103157, 2019 05.

Article in English | MEDLINE | ID: mdl-30928514

ABSTRACT

The availability of large-scale repositories and integrated cancer genome efforts has created unprecedented opportunities to study and describe cancer biology. In this sense, the aim of translational researchers is the integration of multiple omics data to achieve a better identification of homogeneous subgroups of patients in order to develop adequate diagnostic and treatment strategies from the personalized medicine perspective. So far, existing integrative methods have grouped together omics data information, leaving out individual omics data phenotypic interpretation. Here, we present the Massive and Integrative Gene Set Analysis (MIGSA) R package. This tool can analyze several high throughput experiments in a comprehensive way through a functional analysis strategy, relating a phenotype to its biological function counterpart defined by means of gene sets. By simultaneously querying multiple omics data from the same or different groups of patients, common and specific functional patterns for each studied phenotype can be obtained. The usefulness of MIGSA was demonstrated by applying the package to functionally characterize the intrinsic breast cancer PAM50 subtypes. For each subtype, specific functional transcriptomic profiles and gene sets enriched by transcriptomic and proteomic data were identified. To achieve this, transcriptomic and proteomic data from 28 datasets were analyzed using MIGSA. As a result, enriched gene sets and important genes were consistently found as related to a specific subtype across experiments or data types and thus can be used as molecular signature biomarkers.

Subject(s)

Breast Neoplasms/genetics , Biomarkers, Tumor/metabolism , Breast Neoplasms/classification , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Datasets as Topic , Female , Humans

A benchmarking of workflows for detecting differential splicing and differential expression at isoform level in human RNA-seq studies.

Merino, Gabriela A; Conesa, Ana; Fernández, Elmer A.

Brief Bioinform ; 20(2): 471-481, 2019 03 22.

Article in English | MEDLINE | ID: mdl-29040385

ABSTRACT

Over the last few years, RNA-seq has been used to study alterations in alternative splicing related to several diseases. Bioinformatics workflows used to perform these studies can be divided into two groups, those finding changes in the absolute isoform expression and those studying differential splicing. Many computational methods for transcriptomics analysis have been developed, evaluated and compared; however, there are not enough reports of systematic and objective assessment of processing pipelines as a whole. Moreover, comparative studies have been performed considering separately the changes in absolute or relative isoform expression levels. Consequently, no consensus exists about the best practices and appropriate workflows to analyse alternative and differential splicing. To assist in choosing an adequate pipeline, we present here a benchmarking of nine commonly used workflows to detect differential isoform expression and splicing. We evaluated workflow performance over different experimental scenarios where changes in absolute and relative isoform expression occurred simultaneously. In addition, the effects of the number of isoforms per gene and the magnitude of the expression change on pipeline performance were also evaluated. Our results suggest that workflow performance is influenced by the number of replicates per condition and the heterogeneity of the conditions. In general, workflows based on DESeq2, DEXSeq, Limma and NOISeq performed well over a wide range of transcriptomics experiments. In particular, we suggest the use of workflows based on Limma when high precision is required, and DESeq2 and DEXSeq pipelines to prioritize sensitivity. When several replicates per condition are available, NOISeq and Limma pipelines are indicated.

Subject(s)

Alternative Splicing , Benchmarking/methods , Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Neoplasm Proteins/genetics , Prostatic Neoplasms/genetics , Sequence Analysis, RNA/methods , Case-Control Studies , Gene Expression Profiling , Humans , Male , Neoplasm Proteins/metabolism , Prostate/metabolism , Prostatic Neoplasms/metabolism , Protein Isoforms , Workflow

TarSeqQC: Quality control on targeted sequencing experiments in R.

Merino, Gabriela A; Murua, Yanina A; Fresno, Cristóbal; Sendoya, Juan M; Golubicki, Mariano; Iseas, Soledad; Coraglio, Mariana; Podhajcer, Osvaldo L; Llera, Andrea S; Fernández, Elmer A.

Hum Mutat ; 38(5): 494-502, 2017 05.

Article in English | MEDLINE | ID: mdl-28236343

ABSTRACT

Targeted sequencing (TS) is growing as a screening methodology used in research and medical genetics to identify genomic alterations causing human diseases. In general, a list of possible genomic variants is derived from mapped reads through a variant calling step. This processing step is usually based on variant coverage, although it may be affected by several factors. Therefore, undercovered relevant clinical variants may not be reported, affecting pathology diagnosis or treatment. Thus, a prior quality control of the experiment is critical to determine variant detection accuracy and to avoid erroneous medical conclusions. There are several quality control tools, but they are focused on issues related to whole-genome sequencing. However, in TS, quality control should assess experiment, gene, and genomic region performances based on achieved coverages. Here, we propose the TarSeqQC R package for quality control in TS experiments. The tool is freely available from the Bioconductor repository. TarSeqQC was used to analyze two datasets; low-performance primer pools and features were detected, enhancing the quality of experiment results. Read count profiles were also explored, showing TarSeqQC's effectiveness as an exploration tool. Our proposal may be a valuable bioinformatic tool for routine TS experiments in both research and medical genetics.

Subject(s)

Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing , Software , Computational Biology/standards , Datasets as Topic , Genomics/standards , Humans , Neoplasms/genetics , Quality Control , Reproducibility of Results , Software/standards , User-Computer Interface
