Results 1 - 18 of 18
1.
Comput Methods Programs Biomed ; 254: 108258, 2024 May 31.
Article in English | MEDLINE | ID: mdl-38851122

ABSTRACT

BACKGROUND AND OBJECTIVE: Differential expression analysis is one of the most popular activities in transcriptomic studies based on next-generation sequencing technologies. In fact, differentially expressed genes (DEGs) between two conditions represent ideal prognostic and diagnostic candidate biomarkers for many pathologies. As a result, several algorithms, such as DESeq2 and edgeR, have been developed to identify DEGs. Despite their widespread use, there is no consensus on which model performs best for different types of data, and many existing methods suffer from high False Discovery Rates (FDR). METHODS: We present a new algorithm, DeClUt, based on the intuition that the expression profile of a differentially expressed gene should form two reasonably compact and well-separated clusters. This, in turn, implies that the bipartition induced by the two conditions being compared should overlap with the clustering. The clustering algorithm underlying DeClUt was designed to be robust to the outliers typical of RNA-seq data. In particular, we used the average silhouette function to enforce the assignment of samples to the most appropriate condition. RESULTS: DeClUt was tested on real RNA-seq datasets and benchmarked against four of the most widely used methods (edgeR, DESeq2, NOISeq, and SAMseq). Experiments showed higher self-consistency of results than the competitors, as well as a significantly lower False Positive Rate (FPR). Moreover, when tested on a real prostate cancer RNA-seq dataset, DeClUt highlighted 8 DE genes, linked to the neoplastic process according to the DisGeNET database, that none of the other methods had identified. CONCLUSIONS: Our work presents a novel algorithm that builds upon basic concepts of data clustering and exhibits greater consistency and a significantly lower False Positive Rate than state-of-the-art methods. Additionally, DeClUt is able to highlight relevant differentially expressed genes not otherwise identified by other tools, contributing to improving the efficacy of differential expression analyses in various biological applications.
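
The clustering intuition can be illustrated with a minimal sketch (not the authors' implementation of DeClUt): for each gene, treat the two experimental conditions as a candidate bipartition of its per-sample expression values and use the average silhouette width to judge whether the two groups are compact and well separated. The helper name and the threshold below are assumptions for illustration only.

```python
# Minimal sketch of the clustering intuition behind DeClUt (not the authors' code):
# a gene whose per-sample expression values form two compact, well-separated groups
# that coincide with the two experimental conditions is a plausible DE candidate.
import numpy as np
from sklearn.metrics import silhouette_score

def de_candidate(expr, condition, min_sil=0.5):
    """expr: 1-D array of normalized counts for one gene (one value per sample);
    condition: array of 0/1 labels for the two conditions being compared;
    min_sil: illustrative silhouette threshold (an assumption, not DeClUt's default)."""
    x = np.asarray(expr, dtype=float).reshape(-1, 1)
    # Average silhouette of the bipartition induced by the experimental design.
    sil = silhouette_score(x, np.asarray(condition))
    return sil >= min_sil, sil

# Toy example: clearly separated expression between the two conditions.
expr = [5.1, 4.8, 5.3, 12.0, 11.4, 12.6]
cond = [0, 0, 0, 1, 1, 1]
print(de_candidate(expr, cond))
```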

2.
Phys Rev E ; 106(3-1): 034319, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36266916

ABSTRACT

Hypergraphs and simplicial complexes both capture the higher-order interactions of complex systems, ranging from higher-order collaboration networks to brain networks. One open problem in the field is what should drive the choice of the mathematical framework used to describe higher-order networks starting from data on higher-order interactions. Unweighted simplicial complexes typically involve a loss of information about the data, though they have the benefit of capturing the higher-order topology of the data. In this work we show that weighted simplicial complexes allow one to circumvent all the limitations of unweighted simplicial complexes in representing higher-order interactions. In particular, weighted simplicial complexes can represent higher-order networks without loss of information, while at the same time capturing the weighted topology of the data. The higher-order topology is probed by studying the spectral properties of suitably defined weighted Hodge Laplacians displaying a normalized spectrum. The higher-order spectrum of (weighted) normalized Hodge Laplacians is studied by combining cohomology theory with information theory. In the proposed framework we quantify and compare the information content of higher-order spectra of different dimension using higher-order spectral entropies and spectral relative entropies. The proposed methodology is tested on real higher-order collaboration networks and on the weighted version of the simplicial complex model "Network Geometry with Flavor."
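
A sketch of how spectral entropies can be computed from a (weighted, normalized) Hodge Laplacian spectrum is given below; the Shannon-style definitions used here are assumptions and may differ in detail from those adopted in the paper.

```python
# Illustrative sketch (assumed definitions, not necessarily identical to the paper's):
# a spectral entropy built from the eigenvalues of a normalized (Hodge) Laplacian,
# and a relative entropy for comparing two spectra of the same size.
import numpy as np

def spectral_entropy(eigenvalues, eps=1e-12):
    lam = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None)
    p = lam / (lam.sum() + eps)           # normalize the spectrum to a distribution
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())  # Shannon entropy of the spectral distribution

def spectral_relative_entropy(eig_p, eig_q, eps=1e-12):
    p = np.clip(np.asarray(eig_p, float), 0, None); p = p / (p.sum() + eps)
    q = np.clip(np.asarray(eig_q, float), 0, None); q = q / (q.sum() + eps)
    mask = p > eps
    return float((p[mask] * np.log(p[mask] / (q[mask] + eps))).sum())

# Toy spectra of two Laplacians of the same dimension.
print(spectral_entropy([0.0, 0.5, 1.0, 1.5]))
print(spectral_relative_entropy([0.0, 0.5, 1.0, 1.5], [0.0, 1.0, 1.0, 1.0]))
```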

3.
Genes (Basel) ; 12(12)2021 12 02.
Article in English | MEDLINE | ID: mdl-34946895

ABSTRACT

OBJECTIVES: Dilated cardiomyopathy (DCM) is characterized by a specific transcriptome. Since the DCM molecular network is largely unknown, the aim was to identify specific disease-related molecular targets by combining an original machine learning (ML) approach with a protein-protein interaction network. METHODS: The transcriptomic profiles of human myocardial tissues were investigated by integrating an original computational approach, based on the Custom Decision Tree algorithm, into a differential expression bioinformatic framework. Validation was performed by quantitative real-time PCR. RESULTS: Our preliminary study, using samples from transplanted tissues, allowed the discovery of specific DCM-related genes, including MYH6, NPPA, MT-RNR1 and NEAT1, already known to be involved in cardiomyopathies. Interestingly, combining these expression profiles with clinical characteristics showed a significant association between NEAT1 and left ventricular end-diastolic diameter (LVEDD) (Rho = 0.73, p = 0.05), according to severity classification (NYHA class III). CONCLUSIONS: The ML approach was useful for discovering preliminary specific genes that could lead to a rapid selection of molecular targets correlated with DCM clinical parameters. For the first time, NEAT1 under-expression was significantly associated with LVEDD in the human heart.
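
The reported NEAT1-LVEDD association is a rank correlation, which can be computed as in the following sketch; the expression and LVEDD values are hypothetical and serve only to show the test.

```python
# Minimal sketch of the reported NEAT1-LVEDD association test (Spearman rank
# correlation); the expression and LVEDD values below are hypothetical.
from scipy.stats import spearmanr

neat1_expr = [2.1, 1.8, 1.5, 1.2, 0.9, 0.7, 0.6]   # hypothetical NEAT1 levels
lvedd_mm   = [55, 58, 60, 63, 66, 70, 72]           # hypothetical LVEDD (mm)

rho, pval = spearmanr(neat1_expr, lvedd_mm)
print(f"Spearman rho = {rho:.2f}, p = {pval:.3f}")
```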


Subject(s)
Biomarkers/metabolism , Cardiomyopathy, Dilated/pathology , Computational Biology/methods , Machine Learning/standards , Protein Interaction Maps , Transcriptome , Adult , Cardiomyopathy, Dilated/genetics , Cardiomyopathy, Dilated/metabolism , Case-Control Studies , Female , Humans , Male , Middle Aged , Sequence Analysis, RNA/methods , Severity of Illness Index
4.
Comput Struct Biotechnol J ; 19: 5762-5790, 2021.
Article in English | MEDLINE | ID: mdl-34765093

ABSTRACT

We review the current applications of artificial intelligence (AI) in functional genomics. The recent explosion of AI follows the remarkable achievements made possible by "deep learning", along with a burst of "big data" that can meet its hunger. Biology is about to overtake astronomy as the paradigmatic big data producer. This has been made possible by huge advances in high-throughput technologies, applied to determine how the individual components of a biological system work together to accomplish different processes. The disciplines contributing to this bulk of data are collectively known as functional genomics. They consist of studies of: i) the information contained in the DNA (genomics); ii) the modifications that DNA can reversibly undergo (epigenomics); iii) the RNA transcripts originated by a genome (transcriptomics); iv) the ensemble of chemical modifications decorating different types of RNA transcripts (epitranscriptomics); v) the products of protein-coding transcripts (proteomics); and vi) the small molecules produced by cell metabolism (metabolomics) present in an organism or system at a given time, in physiological or pathological conditions. After reviewing the main applications of AI in functional genomics, we discuss important accompanying issues, including ethical, legal and economic issues and the importance of explainability.

5.
Comput Biol Med ; 133: 104352, 2021 06.
Article in English | MEDLINE | ID: mdl-33852974

ABSTRACT

MicroRNAs (miRNAs) are short endogenous RNA molecules that influence cell regulation by suppressing genes. Their ubiquity throughout all branches of the tree of life suggests a central role in many cellular functions. Nowadays, several personalized medicine applications rely on miRNAs as biomarkers for diagnosis, prognosis, and prediction of drug response. The increasing ease of sequencing miRNAs contrasts with the difficulty of accurately quantifying their concentration. The use of general-purpose aligners is only a partial solution, as they have limited ability to resolve ambiguous mappings accurately due to the short length of these sequences. We developed EZcount, an all-in-one software tool that, with a single command, performs the entire quantification process: from raw fastq files to read counts. Experiments show that EZcount is more sensitive and accurate than methods based on sequence alignment, independently of the library preparation protocol and sequencing machine. The parallel architecture of EZcount makes it fast enough to process a sample in minutes on a standard workstation. EZcount runs on all of the most common operating systems (Linux, Windows and MacOS) and is freely available for download at https://gitlab.com/BioAlgo/miR-pipe. A detailed description of the datasets, the raw experimental results, and all the scripts used for testing are available as supplementary material.
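
EZcount itself is the tool distributed at the GitLab URL above; as a purely generic illustration of alignment-free quantification (not EZcount's algorithm), the sketch below counts reads whose sequence exactly matches a known mature miRNA. Real tools must additionally handle adapters, isomiRs and ambiguous assignments.

```python
# Generic illustration of alignment-free miRNA counting (NOT EZcount itself):
# count reads whose sequence exactly matches a known mature miRNA.
from collections import Counter

def count_mirnas(reads, mature):
    """reads: iterable of read sequences; mature: dict {miRNA name: sequence}."""
    seq2name = {seq: name for name, seq in mature.items()}
    counts = Counter()
    for r in reads:
        name = seq2name.get(r)
        if name is not None:
            counts[name] += 1
    return counts

# Hypothetical mature sequences and reads.
mature = {"miR-A": "TGAGGTAGTAGGTTGTATAGTT", "miR-B": "TAGCTTATCAGACTGATGTTGA"}
reads = ["TGAGGTAGTAGGTTGTATAGTT", "TAGCTTATCAGACTGATGTTGA", "TGAGGTAGTAGGTTGTATAGTT"]
print(count_mirnas(reads, mature))
```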


Subject(s)
MicroRNAs , Software , High-Throughput Nucleotide Sequencing , MicroRNAs/genetics , Sequence Alignment
7.
Bioinformatics ; 35(6): 914-922, 2019 03 15.
Article in English | MEDLINE | ID: mdl-30165507

ABSTRACT

MOTIVATION: Large-scale sequencing projects have confirmed the hypothesis that eukaryotic DNA is rich in repetitions whose functional role needs to be elucidated. In particular, tandem repeats (TRs) (i.e. short, almost identical sequences that lie adjacent to each other) have been associated with many cellular processes and, indeed, are also involved in several genetic disorders. The need for comprehensive lists of TRs for association studies and the absence of a computational model able to capture their variability have revived research on discovery algorithms. RESULTS: Building upon the idea that sequence similarities can be easily displayed using graphical methods, we formalized the structure that TRs induce in dot-plot matrices in which a sequence is compared with itself. Leveraging the observation that a compact representation of these matrices can be built and searched in linear time, we developed Dot2dot: an accurate algorithm fast enough to be suitable for whole-genome discovery of TRs. Experiments on five manually curated collections of TRs have shown that Dot2dot is more accurate than other established methods, and it completes the analysis of the largest known reference genome in about one day on a standard PC. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are freely available upon paper acceptance at the URL: https://github.com/Gege7177/Dot2dot. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
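
The dot-plot structure exploited by Dot2dot can be illustrated with a minimal sketch (not the paper's linear-time compact representation): a tandem repeat of period p produces a long run of matches on the diagonal offset by p in the self-comparison matrix, so scoring off-diagonals reveals candidate periods.

```python
# Minimal sketch of the dot-plot idea (not Dot2dot's actual compact representation):
# a tandem repeat of period p shows up as a long run of matches on the diagonal
# offset by p in the self-comparison matrix of the sequence.
def best_period(seq, max_period=10):
    n = len(seq)
    scores = {}
    for p in range(1, min(max_period, n - 1) + 1):
        # Fraction of positions i where seq[i] == seq[i + p] (the p-th off-diagonal).
        matches = sum(seq[i] == seq[i + p] for i in range(n - p))
        scores[p] = matches / (n - p)
    return max(scores, key=scores.get), scores

seq = "ACGACGACGACGACG"          # an (ACG)x5 tandem repeat
period, scores = best_period(seq)
print(period, scores[period])     # expected period 3 with a perfect score
```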


Subject(s)
Software , Tandem Repeat Sequences , Algorithms , Eukaryota , Sequence Analysis, DNA
8.
PLoS One ; 13(7): e0200353, 2018.
Article in English | MEDLINE | ID: mdl-30048452

ABSTRACT

MicroRNAs are small non-coding RNAs that influence gene expression by binding to the 3' UTR of target mRNAs to repress protein synthesis. Soon after their discovery, microRNA dysregulation was associated with several pathologies. In particular, microRNAs have often been reported as differentially expressed between healthy and tumor samples. This suggested that microRNAs are likely to be good candidate biomarkers for cancer diagnosis and personalized medicine. With the advent of Next-Generation Sequencing (NGS), measuring the expression level of the whole miRNAome at once is now routine. Moreover, the collaborative effort of sharing data opens up the possibility of population-level analyses. This context motivated us to perform an in-silico study to distill cancer-specific panels of microRNAs that can serve as biomarkers. We observed that the problem of finding biomarkers can be modeled as a two-class classification task where, given the miRNAomes of a population of healthy and cancerous samples, we want to find the subset of microRNAs that leads to the highest classification accuracy. We fulfill this task by leveraging a combination of data mining tools. In particular, we used: differential evolution for candidate selection, component analysis to preserve the relationships among miRNAs, and SVM for sample classification. We identified 10 cancer-specific panels whose classification accuracy is always higher than 92%. These panels have very little overlap, suggesting that miRNAs are not only predictive of the onset of cancer but can be used for classification purposes as well. We experimentally validated the contribution of each of the employed tools to the selection of discriminating miRNAs. Moreover, we tested the significance of each panel for the corresponding cancer type. In particular, enrichment analysis showed that the selected miRNAs are involved in oncogenesis pathways, while survival analysis showed that miRNAs can be used to evaluate cancer severity. In summary, the results demonstrate that our method is able to produce cancer-specific panels that are promising candidates for subsequent in vitro validation.
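
A minimal sketch of the panel-evaluation step is shown below (component analysis followed by SVM classification under cross-validation); PCA stands in for the component-analysis step, the differential-evolution search over miRNA subsets is omitted, and the expression matrix is randomly generated for illustration.

```python
# Minimal sketch of the panel-evaluation step (component analysis + SVM with
# cross-validation); the differential-evolution search over miRNA subsets is
# omitted and the expression matrix below is random, for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))        # 60 samples x 15 candidate miRNAs
y = np.repeat([0, 1], 30)            # healthy vs tumor labels
X[y == 1, :5] += 1.5                 # make 5 miRNAs informative

panel = [0, 1, 2, 3, 4]              # a candidate panel proposed by the optimizer
model = make_pipeline(PCA(n_components=3), SVC(kernel="rbf"))
acc = cross_val_score(model, X[:, panel], y, cv=5).mean()
print(f"cross-validated accuracy of this panel: {acc:.2f}")
```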


Subject(s)
Early Detection of Cancer , High-Throughput Nucleotide Sequencing , MicroRNAs/genetics , Neoplasms/genetics , Biomarkers, Tumor/genetics , Biomarkers, Tumor/metabolism , Data Mining , Early Detection of Cancer/methods , Humans , MicroRNAs/metabolism , Neoplasms/classification , Neoplasms/metabolism , Support Vector Machine
9.
Front Genet ; 9: 155, 2018.
Article in English | MEDLINE | ID: mdl-29770143

ABSTRACT

Polymorphic Tandem Repeat (PTR) is a common form of polymorphism in the human genome. A PTR consists of a variation, found in an individual (or in a population), of the number of repeating units of a Tandem Repeat (TR) locus of the genome with respect to the reference genome. Several phenotypic traits and diseases have been found to be strongly associated with or caused by specific PTR loci. PTR are further divided into two main classes: Short Tandem Repeats (STR), when the repeating unit has a size of up to 6 base pairs, and Variable Number Tandem Repeats (VNTR), for repeating units of size above 6 base pairs. As larger and larger populations are screened via high-throughput sequencing projects, it becomes technically feasible and desirable to explore the association between PTR and a panoply of such traits and conditions. In order to facilitate these studies, we have devised a method for compiling catalogs of PTR from assembled genomes, and we have produced a catalog of PTR for genic regions (exons, introns, UTR and adjacent regions) of the human genome (GRCh38). We applied four different TR discovery software tools to uncover, in the first phase, 55,223,485 TR (after duplicate removal) in GRCh38, of which 373,173 were determined to be PTR in the second phase by comparison with five assembled human genomes. Of these, 263,266 are not included in state-of-the-art PTR catalogs. The new methodology is mainly based on a hierarchical and systematic application of alignment-based sequence comparisons to identify and measure the polymorphism of TR. While previous catalogs focus on the class of STR of small total size, we remove any size restrictions, aiming at the more general class of PTR, and we also target fuzzy TR by using specific detection tools. Similarly to other catalogs of human polymorphic loci, we orient our catalog toward applications in the discovery of disease-associated loci. Validation by cross-referencing with existing catalogs on common clinically relevant loci shows good concordance. Overall, this proposed census of human PTR in genic regions is a shared, web-accessible resource, complementary to existing catalogs, that facilitates future genome-wide studies involving PTR.
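
The phase-two polymorphism call can be sketched as follows (an illustration, not the pipeline's code): a TR locus is flagged as a PTR when its repeat-unit copy number differs between the reference and at least one other assembled genome, and the STR/VNTR class follows from the unit size. The copy numbers and the minimum-difference threshold are hypothetical.

```python
# Illustrative sketch of the phase-two polymorphism call: a TR locus is flagged as
# a PTR when its repeat-unit copy number differs between the reference and at least
# one other assembled genome. Copy numbers here are hypothetical; the real pipeline
# measures them via alignment-based comparison of the extracted loci.
def is_polymorphic(ref_copies, other_copies, min_diff=1):
    return any(abs(c - ref_copies) >= min_diff for c in other_copies)

def classify(unit, ref_copies, other_copies):
    cls = "STR" if len(unit) <= 6 else "VNTR"     # class by repeat-unit size
    return cls, is_polymorphic(ref_copies, other_copies)

print(classify("CAG", ref_copies=20, other_copies=[20, 23, 19, 20, 21]))
print(classify("ACGTACGTTT", ref_copies=4, other_copies=[4, 4, 4, 4, 4]))
```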

10.
Nucleic Acids Res ; 46(D1): D354-D359, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29036351

ABSTRACT

miRandola (http://mirandola.iit.cnr.it/) is a database of extracellular non-coding RNAs (ncRNAs) that was initially published in 2012, foreseeing the relevance of ncRNAs as non-invasive biomarkers. An increasing amount of experimental evidence shows that ncRNAs are frequently dysregulated in diseases. Furthermore, ncRNAs have been discovered in different extracellular forms, such as exosomes, which circulate in human body fluids. Thus, miRandola 2017 is an effort to update and collect the accumulating information on extracellular ncRNAs that is spread across scientific publications and different databases. Data are manually curated from 314 articles that describe miRNAs, long non-coding RNAs and circular RNAs. The database now includes fourteen organisms, as well as associations of ncRNAs with 25 drugs, 47 sample types and 197 diseases. miRandola also classifies extracellular RNAs based on their extracellular form: Argonaute2 protein, exosome, microvesicle, microparticle, membrane vesicle, high density lipoprotein and circulating. We also implemented a new web interface to improve the user experience.


Subject(s)
Databases, Genetic , Knowledge Bases , RNA, Untranslated , Biomarkers , Cell-Free Nucleic Acids , Data Curation , Humans , MicroRNAs , RNA , RNA, Circular , RNA, Long Noncoding , User-Computer Interface
12.
BMC Bioinformatics ; 17(Suppl 12): 372, 2016 Nov 08.
Article in English | MEDLINE | ID: mdl-28185552

ABSTRACT

BACKGROUND: Biological networks play an increasingly important role in the exploration of functional modularity and cellular organization at a systemic level. Quite often the first tools used to analyze these networks are clustering algorithms. We concentrate here on the specific task of predicting protein complexes (PC) in large protein-protein interaction networks (PPIN). Currently, many state-of-the-art algorithms work well for networks of small or moderate size. However, their performance on much larger networks, which are becoming increasingly common in modern proteome-wide studies, needs to be re-assessed. RESULTS AND DISCUSSION: We present a new fast algorithm for clustering large sparse networks: Core&Peel, which runs essentially in time and storage O(a(G)m+n) for a network G of n nodes and m arcs, where a(G) is the arboricity of G (which is roughly proportional to the maximum average degree of any induced subgraph of G). We evaluated Core&Peel on five PPI networks of large size and one of medium size, from both yeast and Homo sapiens, comparing its performance against that of ten state-of-the-art methods. We demonstrate that Core&Peel consistently outperforms the ten competitors in its ability to identify known protein complexes and in the functional coherence of its predictions. Our method is remarkably robust, being quite insensitive to the injection of random interactions. Core&Peel is also empirically efficient, attaining the second-best running time over large networks among the tested algorithms. CONCLUSIONS: Our algorithm Core&Peel pushes forward the state of the art in PPIN clustering, providing an algorithmic solution with polynomial running time that attains experimentally demonstrable good output quality and speed on challenging large real networks.
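
The algorithm's name suggests a core-decomposition step followed by peeling; the sketch below is written in that spirit (it is NOT the published Core&Peel algorithm and does not attain its O(a(G)m+n) bound): it keeps the high-core portion of a seed's neighborhood and then peels minimum-degree nodes until the candidate cluster is dense enough.

```python
# Illustrative sketch in the spirit of a core-decomposition + peeling strategy
# (NOT the published Core&Peel algorithm): start from a node's neighborhood,
# keep only high-core nodes, then peel minimum-degree nodes until the candidate
# cluster is dense enough.
import networkx as nx

def dense_neighborhood(G, seed, min_core=3, min_density=0.6):
    core = nx.core_number(G)
    cand = {seed} | {v for v in G[seed] if core[v] >= min_core}
    H = G.subgraph(cand).copy()
    while H.number_of_nodes() > 2 and nx.density(H) < min_density:
        # Peel the node of minimum degree in the current candidate subgraph.
        H.remove_node(min(H.degree, key=lambda kv: kv[1])[0])
    return set(H.nodes)

G = nx.karate_club_graph()                     # stand-in for a PPI network
print(dense_neighborhood(G, seed=0))
```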


Subject(s)
Protein Interaction Mapping/methods , Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Algorithms , Cluster Analysis , Humans , Protein Binding , Protein Interaction Maps , Proteome/metabolism , Saccharomyces cerevisiae/genetics
13.
PLoS One ; 10(4): e0122473, 2015.
Article in English | MEDLINE | ID: mdl-25848944

ABSTRACT

Genes and their expression regulation are among the key factors in understanding the genesis and development of complex diseases. In this context, microRNAs (miRNAs) are post-transcriptional regulators that play an important role in gene expression, since they are frequently deregulated in pathologies such as cardiovascular disease and cancer. In vitro validation of miRNA-target regulation is often too expensive and time consuming to be carried out for every possible alternative. As a result, a tool able to provide criteria for prioritizing trials is becoming a pressing need. Moreover, before planning in vitro experiments, the scientist needs to evaluate the miRNA-target gene interaction network. In this paper we describe the miRable method, whose purpose is to identify new potentially relevant genes, and their interaction networks, associated with a specific pathology. To achieve this goal miRable follows a systems biology approach, integrating general-purpose medical knowledge (literature, Protein-Protein Interaction networks, prediction tools) with pathology-specific data (gene expression data). A case study on prostate cancer has shown that miRable is able to: 1) find new potential miRNA-target pairs, 2) highlight novel genes potentially involved in a disease but never or little studied before, and 3) reconstruct all possible regulatory subnetworks starting from the literature, expanding the knowledge of miRNA regulatory mechanisms.
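
One integration step of this kind can be sketched as follows (an illustration, not the miRable implementation): predicted miRNA-target pairs are retained only when the miRNA and the gene are anti-correlated in the pathology-specific expression data. All names and values are hypothetical.

```python
# Minimal sketch of one integration step (not the miRable implementation): keep
# predicted miRNA-target pairs whose expression is anti-correlated in the
# pathology-specific dataset. All data below are hypothetical.
from scipy.stats import spearmanr

def filter_pairs(predicted_pairs, mirna_expr, gene_expr, max_rho=-0.5):
    kept = []
    for mir, gene in predicted_pairs:
        rho, _ = spearmanr(mirna_expr[mir], gene_expr[gene])
        if rho <= max_rho:                     # repression implies anti-correlation
            kept.append((mir, gene, rho))
    return kept

mirna_expr = {"miR-X": [1, 2, 3, 4, 5, 6]}
gene_expr = {"GENE-Y": [9, 8, 6, 5, 3, 2], "GENE-Z": [1, 3, 2, 4, 3, 5]}
pairs = [("miR-X", "GENE-Y"), ("miR-X", "GENE-Z")]
print(filter_pairs(pairs, mirna_expr, gene_expr))
```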


Subject(s)
Computational Biology/methods , Gene Regulatory Networks/genetics , MicroRNAs/genetics , Prostatic Neoplasms/genetics , Gene Expression Profiling , Humans , Male , RNA, Messenger/genetics , RNA, Messenger/metabolism
14.
Heart Views ; 14(1): 33-5, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23580924

ABSTRACT

Coronary artery anomalies are uncommon disorders. According to the literature, approximately 1% of the general population is affected by a coronary artery abnormality. Coronary artery anomalies are often not associated with clinical signs, symptoms, or complications; nevertheless, they can be associated with congenital heart diseases and lead to sudden death. More often, however, these anomalies are discovered as incidental findings at the time of coronary angiography or autopsy. The clinical relevance of coronary artery anomalies is closely related to their ability to provide an adequate blood supply to the myocardial tissue. We describe a complex left coronary artery anomaly, not previously reported in the medical literature, involving the origin, course, and distribution of this vessel.

15.
Bioinformatics ; 26(18): 2217-25, 2010 Sep 15.
Article in English | MEDLINE | ID: mdl-20624781

ABSTRACT

MOTIVATION: Single nucleotide polymorphisms are the most common form of variation in human DNA, and are relevant to many research fields, from molecular biology to medical therapy. The technological opportunity to deal with long DNA sequences using shotgun sequencing has raised the problem of fragment recombination. In this regard, the Single Individual Haplotyping (SIH) problem has received considerable attention over the past few years. RESULTS: In this article, we survey seven recent approaches to the SIH problem and evaluate them extensively using real human haplotype data from the HapMap project. We also implemented a data generator, tailored to current shotgun sequencing technology, that uses haplotypes from the HapMap project. AVAILABILITY: The data we used to compare the algorithms are available on demand, since we think they represent an important benchmark for comparing novel algorithmic ideas with the state of the art. Moreover, we had to re-implement six of the surveyed algorithms because the original code was not available to us. Five of these algorithms and the data generator used in this article, endowed with a Web interface, are available at http://bioalgo.iit.cnr.it/rehap.
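
The kind of data generator described can be sketched as follows (an illustration, not the ReHap generator): fragments are sampled from one of the two haplotypes, alleles are flipped with a given error rate, and some positions are masked as gaps. All parameters are hypothetical.

```python
# Minimal sketch of a shotgun-like fragment generator for the SIH problem (not the
# authors' generator): sample fragments from one of the two haplotypes, flip
# alleles with a given error rate and mask some positions as gaps ('-').
import random

def generate_fragments(hap0, hap1, n_frags=8, frag_len=5, err=0.1, gap=0.2, seed=1):
    random.seed(seed)
    haps, frags = (hap0, hap1), []
    for _ in range(n_frags):
        h = random.choice(haps)
        start = random.randrange(0, len(h) - frag_len + 1)
        frag = []
        for allele in h[start:start + frag_len]:
            if random.random() < gap:
                frag.append("-")                              # unread position
            elif random.random() < err:
                frag.append("1" if allele == "0" else "0")    # sequencing error
            else:
                frag.append(allele)
        frags.append((start, "".join(frag)))
    return frags

print(generate_fragments("0101100110", "1010011001"))
```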


Subject(s)
Algorithms , Haplotypes , Polymorphism, Single Nucleotide , Base Sequence , Humans , Software
16.
J Comput Biol ; 16(6): 859-73, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19522668

ABSTRACT

Microarray technology for profiling gene expression levels is a popular tool in modern biological research. Applications range from tissue classification to the detection of metabolic networks, and from drug discovery to time-critical personalized medicine. Given the increase in size and complexity of the data sets produced, their analysis is becoming problematic in terms of time/quality trade-offs. Clustering genes with similar expression profiles is a key initial step for subsequent manipulations, and the increasing volumes of data to be analyzed require methods that are at the same time efficient (completing an analysis in minutes rather than hours) and effective (identifying significant clusters with high biological correlations). In this paper, we propose K-Boost, a clustering algorithm based on a combination of the furthest-point-first (FPF) heuristic for solving the metric k-center problem, a stability-based method for determining the number of clusters, and a k-means-like cluster refinement. K-Boost runs in O(|N| x k) time, where N is the input matrix and k is the number of proposed clusters. Experiments show that this low complexity is usually coupled with very good quality of the computed clusterings, which we measure using both internal and external criteria. Supporting data can be found as online Supplementary Material at www.liebertonline.com.
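
The core of K-Boost, the furthest-point-first heuristic for metric k-center, is sketched below; the stability-based choice of k and the k-means-like refinement described in the abstract are omitted, and the toy expression profiles are invented.

```python
# Minimal sketch of the furthest-point-first (FPF) heuristic for metric k-center,
# the core step of K-Boost; the stability-based choice of k and the k-means-like
# refinement described in the abstract are omitted here.
import numpy as np

def fpf_centers(X, k, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [rng.integers(len(X))]               # arbitrary first center
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(1, k):
        nxt = int(dist.argmax())                   # point furthest from all centers
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    labels = np.array([np.argmin([np.linalg.norm(x - X[c]) for c in centers]) for x in X])
    return centers, labels

profiles = np.array([[0, 0], [0.1, 0.2], [5, 5], [5.2, 4.9], [10, 0], [9.8, 0.3]])
print(fpf_centers(profiles, k=3))
```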


Subject(s)
Algorithms , Computational Biology/methods , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis/methods , Cluster Analysis , Databases, Genetic , Fibroblasts/metabolism , Humans , Saccharomyces cerevisiae/genetics
17.
Article in English | MEDLINE | ID: mdl-18989037

ABSTRACT

Single nucleotide polymorphism (SNP) is the most frequent form of DNA variation. The set of SNPs present in a chromosome (called the haplotype) is of interest in a wide range of applications in molecular biology and biomedicine, including diagnostics and medical therapy. In this paper we propose a new heuristic method for the problem of haplotype reconstruction for (portions of) a pair of homologous human chromosomes from a single individual (SIH). The problem is well known in the literature, and exact algorithms have been proposed for the case when no (or few) gaps are allowed in the input fragments. These algorithms, though exact and of polynomial complexity, are slow in practice. When gaps are considered, no exact method of polynomial complexity is known. The problem is also hard to approximate with guarantees. Therefore fast heuristics have been proposed. In this paper we describe SpeedHap, a new heuristic method that is able to tackle the case of many gapped fragments and retains its effectiveness even when the input fragments have a high rate of read errors (up to 20%) and low coverage (as low as 3). We test SpeedHap on real data from the HapMap Project.
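
The basic step shared by SIH heuristics of this kind can be sketched as follows (not the actual multi-phase SpeedHap algorithm): gapped fragments are alternately assigned to the closer of two haplotypes, and each haplotype is rebuilt by column-wise majority vote.

```python
# Minimal sketch of a basic step shared by SIH heuristics (not the actual
# multi-phase SpeedHap algorithm): alternately assign gapped fragments to the
# closer of two haplotypes and rebuild each haplotype by column-wise majority vote.
def distance(frag, start, hap):
    return sum(1 for i, a in enumerate(frag) if a != "-" and hap[start + i] != a)

def majority(fragments, length):
    hap = []
    for j in range(length):
        votes = [f[j - s] for s, f in fragments if 0 <= j - s < len(f) and f[j - s] != "-"]
        hap.append(max(set(votes), key=votes.count) if votes else "0")
    return "".join(hap)

def reconstruct(fragments, length, rounds=5):
    hap0, hap1 = "0" * length, "1" * length          # crude initialization
    for _ in range(rounds):
        side0 = [f for f in fragments if distance(f[1], f[0], hap0) <= distance(f[1], f[0], hap1)]
        side1 = [f for f in fragments if f not in side0]
        hap0, hap1 = majority(side0, length), majority(side1, length)
    return hap0, hap1

frags = [(0, "0101-"), (2, "011-0"), (0, "10-00"), (3, "1001"), (1, "0100-")]
print(reconstruct(frags, length=7))
```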


Subject(s)
Algorithms , Chromosome Mapping/methods , DNA Mutational Analysis/methods , Haplotypes/genetics , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA/methods , Software , Base Sequence , Humans , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity
18.
Nucleic Acids Res ; 36(Web Server issue): W315-9, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18477631

ABSTRACT

The AMIC@ Web Server offers a lightweight multi-method clustering engine for microarray gene-expression data. AMIC@ is a highly interactive tool that stresses user-friendliness and robustness by adopting AJAX technology, thus allowing an effective interleaved execution of different clustering algorithms and inspection of results. Salient features of AMIC@ include: (i) automatic file format detection, (ii) suggestions on the number of clusters using a variant of the stability-based method of Tibshirani et al., (iii) intuitive visual inspection of the data via heatmaps, and (iv) measurement of clustering quality using cluster homogeneity. Large data sets can be processed efficiently by selecting algorithms (such as FPF-SB and k-Boost) specifically designed for this purpose. For very large data sets, the user can opt for batch-mode use of the system by means of the Clustering wizard, which runs all algorithms at once and delivers the results via email. AMIC@ is freely available and open to all users, with no login requirement, at the following URL: http://bioalgo.iit.cnr.it/amica.
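
A cluster-homogeneity measure of the kind AMIC@ reports can be sketched as follows, under the assumed definition of average Pearson correlation between each expression profile and its cluster centroid (the tool's exact formula may differ).

```python
# Sketch of a cluster-homogeneity measure (assumed definition: average Pearson
# correlation between each expression profile and its cluster centroid; the
# tool's exact formula may differ).
import numpy as np

def homogeneity(X, labels):
    X, labels = np.asarray(X, float), np.asarray(labels)
    sims = []
    for lab in np.unique(labels):
        members = X[labels == lab]
        centroid = members.mean(axis=0)
        for profile in members:
            sims.append(np.corrcoef(profile, centroid)[0, 1])
    return float(np.mean(sims))

X = [[1, 2, 3, 4], [1.1, 2.2, 2.9, 4.1], [4, 3, 2, 1], [3.9, 3.1, 2.2, 0.8]]
print(homogeneity(X, labels=[0, 0, 1, 1]))
```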


Subject(s)
Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Software , Algorithms , Cluster Analysis , Computer Graphics , Internet