Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 20
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Brief Bioinform ; 23(3)2022 05 13.
Artigo em Inglês | MEDLINE | ID: mdl-35381599

RESUMO

MOTIVATION: Biological networks topology yields important insights into biological function, occurrence of diseases and drug design. In the last few years, different types of topological measures have been introduced and applied to infer the biological relevance of network components/interactions, according to their position within the network structure. Although comparisons of such measures have been previously proposed, to what extent the topology per se may lead to the extraction of novel biological knowledge has never been critically examined nor formalized in the literature. RESULTS: We present a comparative analysis of nine outstanding topological measures, based on compact views obtained from the rank they induce on a given input biological network. The goal is to understand their ability in correctly positioning nodes/edges in the rank, according to the functional knowledge implicitly encoded in biological networks. To this aim, both internal and external (gold standard) validation criteria are taken into account, and six networks involving three different organisms (yeast, worm and human) are included in the comparison. The results show that a distinct handful of best-performing measures can be identified for each of the considered organisms, independently from the reference gold standard. AVAILABILITY: Input files and code for the computation of the considered topological measures and K-haus distance are available at https://gitlab.com/MaryBonomo/ranking. CONTACT: simona.rombo@unipa.it. SUPPLEMENTARY INFORMATION: Supplementary data are available at Briefings in Bioinformatics online.


Assuntos
Algoritmos
3.
Bioinformatics ; 38(4): 925-932, 2022 01 27.
Artigo em Inglês | MEDLINE | ID: mdl-34718420

RESUMO

MOTIVATION: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. RESULTS: By concentrating on a representative set of word-frequency-based AF functions, we perform the first coherent and uniform evaluation of the power, involving also Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the 15 functions that we have considered, only four stand out, with small differences between small and short sequence length scenarios. Finally, to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public. AVAILABILITY AND IMPLEMENTATION: The software is available at: https://github.com/pipp8/power_statistics. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Software , Análise de Sequência , Genômica
4.
BMC Bioinformatics ; 22(1): 144, 2021 Mar 22.
Artigo em Inglês | MEDLINE | ID: mdl-33752596

RESUMO

BACKGROUND: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a State of the Art is problematic. RESULTS: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results hold also for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System. CONCLUSIONS: Our Methods and the corresponding software substantially contribute to achieve space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Being our approach general, it is very likely that it can be applied also to FASTA/Q compression methods that will appear in the future. AVAILABILITY: The software and the datasets are available at https://github.com/fpalini/fastdoopc.


Assuntos
Compressão de Dados , Genômica , Software , Algoritmos , Big Data
5.
Bioinformatics ; 37(12): 1658-1665, 2021 Jul 19.
Artigo em Inglês | MEDLINE | ID: mdl-33471066

RESUMO

MOTIVATION: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. RESULTS: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It supports natively eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprises novel aspects of interest. Namely, (i) a considerable effort of distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (ii) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (iii) its ability to support data- and compute-intensive tasks. About this, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE. AVAILABILITYAND IMPLEMENTATION: The software and the datasets are available at https://github.com/fpalini/fade. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

6.
BMC Bioinformatics ; 20(Suppl 4): 138, 2019 Apr 18.
Artigo em Inglês | MEDLINE | ID: mdl-30999863

RESUMO

BACKGROUND: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mers counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collection of biological sequences, with arbitrary values of k. RESULTS: One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for a full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting a very good scalability. CONCLUSIONS: We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.


Assuntos
Análise de Dados , Bases de Dados de Ácidos Nucleicos , Genoma , Estatística como Assunto , Algoritmos , Sequência de Bases , Software , Fatores de Tempo
7.
Bioinformatics ; 34(20): 3454-3460, 2018 10 15.
Artigo em Inglês | MEDLINE | ID: mdl-30204840

RESUMO

Motivation: Although the nucleosome occupancy along a genome can be in part predicted by in vitro experiments, it has been recently observed that the chromatin organization presents important differences in vitro with respect to in vivo. Such differences mainly regard the hierarchical and regular structures of the nucleosome fiber, whose existence has long been assumed, and in part also observed in vitro, but that does not apparently occur in vivo. It is also well known that the DNA sequence has a role in determining the nucleosome occupancy. Therefore, an important issue is to understand if, and to what extent, the structural differences in the chromatin organization between in vitro and in vivo have a counterpart in terms of the underlying genomic sequences. Results: We present the first quantitative comparison between the in vitro and in vivo nucleosome maps of two model organisms (S. cerevisiae and C. elegans). The comparison is based on the construction of weighted k-mer dictionaries. Our findings show that there is a good level of sequence conservation between in vitro and in vivo in both the two organisms, in contrast to the abovementioned important differences in chromatin structural organization. Moreover, our results provide evidence that the two organisms predispose themselves differently, in terms of sequence composition and both in vitro and in vivo, for the nucleosome occupancy. This leads to the conclusion that, although the notion of a genome encoding for its own nucleosome occupancy is general, the intrinsic histone k-mer sequence preferences tend to be species-specific. Availability and implementation: The files containing the dictionaries and the main results of the analysis are available at http://math.unipa.it/rombo/material. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Genoma , Análise de Sequência , Animais , Caenorhabditis elegans/genética , Cromatina/genética , Células Eucarióticas , Histonas/genética , Nucleossomos , Saccharomyces cerevisiae/genética
8.
Bioinformatics ; 34(11): 1826-1833, 2018 06 01.
Artigo em Inglês | MEDLINE | ID: mdl-29342232

RESUMO

Motivation: Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data available now in applications demands to resort to parallel and distributed computing. Indeed, those type of algorithms have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes. Results: Following the well-established approach in many disciplines, and with a growing success also in bioinformatics, to resort to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform concurrently informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. Availability and implementation: The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http://www.di-srv.unisa.it/KCH. Contact: umberto.ferraro@uniroma1.it. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica/métodos , Linguística , Análise de Sequência de DNA/métodos , Software , Algoritmos , Animais , Bactérias/genética , Análise por Conglomerados , Epigenômica/métodos , Eucariotos/genética , Humanos , Metagenoma
9.
Bioinformatics ; 33(10): 1575-1577, 2017 May 15.
Artigo em Inglês | MEDLINE | ID: mdl-28093410

RESUMO

SUMMARY: MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and the need for managing efficiently sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the Literature, it offers versatility and efficiency. That is, it can handle collections of reads, with or without quality scores, as well as long genomic sequences while the existing routines concentrate mainly on NGS sequence data. Moreover, in the domain where a comparison is possible, the routines proposed here are faster than the available ones. In conclusion, FASTdoop is a much needed addition to Hadoop-BAM. AVAILABILITY AND IMPLEMENTATION: The software and the datasets are available at http://www.di.unisa.it/FASTdoop/ . CONTACT: umberto.ferraro@uniroma1.it. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Sistemas de Gerenciamento de Base de Dados , Genômica/métodos , Armazenamento e Recuperação da Informação , Análise de Sequência de DNA/métodos , Biblioteca Gênica , Genoma Humano , Humanos
10.
Bioinformatics ; 32(6): 835-42, 2016 03 15.
Artigo em Inglês | MEDLINE | ID: mdl-26576651

RESUMO

MOTIVATION: Thanks to research spanning nearly 30 years, two major models have emerged that account for nucleosome organization in chromatin: statistical and sequence specific. The first is based on elegant, easy to compute, closed-form mathematical formulas that make no assumptions of the physical and chemical properties of the underlying DNA sequence. Moreover, they need no training on the data for their computation. The latter is based on some sequence regularities but, as opposed to the statistical model, it lacks the same type of closed-form formulas that, in this case, should be based on the DNA sequence only. RESULTS: We contribute to close this important methodological gap between the two models by providing three very simple formulas for the sequence specific one. They are all based on well-known formulas in Computer Science and Bioinformatics, and they give different quantifications of how complex a sequence is. In view of how remarkably well they perform, it is very surprising that measures of sequence complexity have not even been considered as candidates to close the mentioned gap. We provide experimental evidence that the intrinsic level of combinatorial organization and information-theoretic content of subsequences within a genome are strongly correlated to the level of DNA encoded nucleosome organization discovered by Kaplan et al Our results establish an important connection between the intrinsic complexity of subsequences in a genome and the intrinsic, i.e. DNA encoded, nucleosome organization of eukaryotic genomes. It is a first step towards a mathematical characterization of this latter 'encoding'. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CONTACT: futro@us.ibm.com.


Assuntos
Genoma , Nucleossomos , Cromatina , DNA , Eucariotos
11.
Bioinformatics ; 31(18): 2939-46, 2015 Sep 15.
Artigo em Inglês | MEDLINE | ID: mdl-26007227

RESUMO

MOTIVATION: Information-theoretic and compositional analysis of biological sequences, in terms of k-mer dictionaries, has a well established role in genomic and proteomic studies. Much less so in epigenomics, although the role of k-mers in chromatin organization and nucleosome positioning is particularly relevant. Fundamental questions concerning the informational content and compositional structure of nucleosome favouring and disfavoring sequences with respect to their basic building blocks still remain open. RESULTS: We present the first analysis on the role of k-mers in the composition of nucleosome enriched and depleted genomic regions (NER and NDR for short) that is: (i) exhaustive and within the bounds dictated by the information-theoretic content of the sample sets we use and (ii) informative for comparative epigenomics. We analize four different organisms and we propose a paradigmatic formalization of k-mer dictionaries, providing two different and complementary views of the k-mers involved in NER and NDR. The first extends well known studies in this area, its comparative nature being its major merit. The second, very novel, brings to light the rich variety of k-mers involved in influencing nucleosome positioning, for which an initial classification in terms of clusters is also provided. Although such a classification offers many insights, the following deserves to be singled-out: short poly(dA:dT) tracts are reported in the literature as fundamental for nucleosome depletion, however a global quantitative look reveals that their role is much less prominent than one would expect based on previous studies. AVAILABILITY AND IMPLEMENTATION: Dictionaries, clusters and Supplementary Material are available online at http://math.unipa.it/rombo/epigenomics/. CONTACT: simona.rombo@unipa.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Montagem e Desmontagem da Cromatina/genética , Epigenômica , Nucleossomos/genética , Análise de Sequência de DNA/métodos , Animais , Genoma , Humanos
12.
Brief Bioinform ; 15(3): 390-406, 2014 May.
Artigo em Inglês | MEDLINE | ID: mdl-24347576

RESUMO

High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their implementation and availability. Performance of the various methods is also highlighted, although the state of the art does not lend itself to a consistent and coherent comparison among all the methods presented here.


Assuntos
Biologia Computacional/métodos , Compressão de Dados/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Algoritmos , Compressão de Dados/estatística & dados numéricos , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Metagenômica/estatística & dados numéricos , Alinhamento de Sequência , Software
13.
BMC Bioinformatics ; 14 Suppl 1: S6, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-23369037

RESUMO

BACKGROUND: Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research. RESULTS: A procedure is proposed for the assessment of the discriminative ability of a distance function. That is, the evaluation of the ability of a distance function to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as Balanced Misclassification Index (BMI, for short) and of a nontrivial modification of the well known Receiver Operating Curve (ROC, for short), which we refer to as Corrected ROC (CROC, for short). The main results are: (a) a quantitative and qualitative method to describe the intrinsic separation ability of a distance; (b) a quantitative method to assess the performance of a clustering algorithm in conjunction with the intrinsic separation ability of a distance function. The proposed procedure is more informative than the ones available in the literature due to the adopted tools. Indeed, the first one allows to map distances and clustering solutions as graphical objects on a plane, and gives information about the bias of the clustering algorithm with respect to a distance. The second tool is a new external validity index which shows similar performances with respect to the state of the art, but with more flexibility, allowing for a broader spectrum of applications. In fact, it allows not only to quantify the merit of each clustering solution but also to quantify the agglomerative or divisive errors due to the algorithm. CONCLUSIONS: The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance d2, Pearson correlation dr and mutual information dMI. Based on the results of the experiments, we have that the Euclidean and Pearson correlation distances have a good intrinsic discrimination ability. Conversely, the mutual information distance does not seem to offer the same flexibility and versatility as the other two distances. Apparently, that is due to well known problems in its estimation. since it requires that a dataset must have a substantial number of features to be reliable. Nevertheless, taking into account such a fact, together with results presented in Priness et al., one receives an indication that dMI may be superior to the other distances considered in this study only in conjunction with clustering algorithms specifically designed for its use. In addition, it results that K-means, Average Link, and Complete link clustering algorithms are in most cases able to improve the discriminative ability of the distances considered in this study with respect to clustering. The methodology has a range of applicability that goes well beyond microarray data since it is independent of the nature of the input data. The only requirement is that the input data must have the same format of a "feature matrix". In particular it can be used to cluster ChIP-seq data.


Assuntos
Algoritmos , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Análise por Conglomerados , Transcriptoma
14.
EMBO J ; 30(9): 1766-77, 2011 May 04.
Artigo em Inglês | MEDLINE | ID: mdl-21448136

RESUMO

The evolutionarily conserved ATP-dependent nucleosome remodelling factor ISWI can space nucleosomes affecting a variety of nuclear processes. In Drosophila, loss of ISWI leads to global transcriptional defects and to dramatic alterations in higher-order chromatin structure, especially on the male X chromosome. In order to understand if chromatin condensation and gene expression defects, observed in ISWI mutants, are directly correlated with ISWI nucleosome spacing activity, we conducted a genome-wide survey of ISWI binding and nucleosome positioning in wild-type and ISWI mutant chromatin. Our analysis revealed that ISWI binds both genic and intergenic regions. Remarkably, we found that ISWI binds genes near their promoters causing specific alterations in nucleosome positioning at the level of the Transcription Start Site, providing an important insights in understanding ISWI role in higher eukaryote transcriptional regulation. Interestingly, differences in nucleosome spacing, between wild-type and ISWI mutant chromatin, tend to accumulate on the X chromosome for all ISWI-bound genes analysed. Our study shows how in higher eukaryotes the activity of the evolutionarily conserved nucleosome remodelling factor ISWI regulates gene expression and chromosome organization genome-wide.


Assuntos
Adenosina Trifosfatases/metabolismo , Montagem e Desmontagem da Cromatina/fisiologia , Proteínas de Ligação a DNA/metabolismo , Proteínas de Drosophila/metabolismo , Regulação da Expressão Gênica/fisiologia , Nucleossomos/fisiologia , Ligação Proteica , Fatores de Transcrição/metabolismo , Cromossomo X/genética , Animais , Imunoprecipitação da Cromatina , Cruzamentos Genéticos , Drosophila , Genômica , Masculino , Regiões Promotoras Genéticas/genética
15.
Algorithms Mol Biol ; 6(1): 1, 2011 Jan 14.
Artigo em Inglês | MEDLINE | ID: mdl-21235792

RESUMO

BACKGROUND: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. RESULTS: Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus. That is, the closely related algorithm FC (Fast Consensus) that would have the same precision as Consensus with a substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision of Consensus. Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about NMF, shedding further light on its merits and limitations. CONCLUSIONS: In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e, number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures.

16.
Bioinformatics ; 25(13): 1575-86, 2009 Jul 01.
Artigo em Inglês | MEDLINE | ID: mdl-19251772

RESUMO

MOTIVATION: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort has been made for the application of textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks. RESULTS: The main focus of this review is on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used. When possible, a unifying organization of the main ideas and techniques is also provided. AVAILABILITY: It goes without saying that most of the research results reviewed here offer software prototypes to the bioinformatics community. The Supplementary Material provides pointers to software and benchmark datasets for a range of applications of broad interest. In addition to provide reference to software, the Supplementary Material also gives a brief presentation of some fundamental results and techniques related to this paper. It is at: http://www.math.unipa.it/ approximately raffaele/suppMaterial/compReview/


Assuntos
Biologia Computacional/métodos , Bases de Dados Factuais , Armazenamento e Recuperação da Informação , Software
17.
BMC Bioinformatics ; 9: 462, 2008 Oct 29.
Artigo em Inglês | MEDLINE | ID: mdl-18959783

RESUMO

BACKGROUND: Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. RESULTS: We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster Sum-of-Squares) and KL (Krzanowski and Lai index). We perform extensive experiments on six benchmark microarray datasets, using both Hierarchical and K-means clustering algorithms, and we provide an analysis assessing both the intrinsic ability of a measure to predict the correct number of clusters in a dataset and its merit relative to the other measures. We pay particular attention both to precision and speed. Moreover, we also provide various fast approximation algorithms for the computation of Gap, FOM and WCSS. The main result is a hierarchy of those measures in terms of precision and speed, highlighting some of their merits and limitations not reported before in the literature. CONCLUSION: Based on our analysis, we draw several conclusions for the use of those internal measures on microarray data. We report the main ones. Consensus is by far the best performer in terms of predictive power and remarkably algorithm-independent. Unfortunately, on large datasets, it may be of no use because of its non-trivial computer time demand (weeks on a state of the art PC). FOM is the second best performer although, quite surprisingly, it may not be competitive in this scenario: it has essentially the same predictive power of WCSS but it is from 6 to 100 times slower in time, depending on the dataset. The approximation algorithms for the computation of FOM, Gap and WCSS perform very well, i.e., they are faster while still granting a very close approximation of FOM and WCSS. The approximation algorithm for the computation of Gap deserves to be singled-out since it has a predictive power far better than Gap, it is competitive with the other measures, but it is at least two order of magnitude faster in time with respect to Gap. Another important novel conclusion that can be drawn from our analysis is that all the measures we have considered show severe limitations on large datasets, either due to computational demand (Consensus, as already mentioned, Clest and Gap) or to lack of precision (all of the other measures, including their approximations). The software and datasets are available under the GNU GPL on the supplementary material web page.


Assuntos
Biologia Computacional/métodos , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Algoritmos , Animais , Benchmarking/métodos , Análise por Conglomerados , Bases de Dados Genéticas , Humanos , Software , Estatística como Assunto/métodos
18.
Algorithms Mol Biol ; 2: 10, 2007 Sep 18.
Artigo em Inglês | MEDLINE | ID: mdl-17877802

RESUMO

This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks. Namely, local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. Moreover, it also supports filtering operations to select strings from a set and establish their statistical significance, via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not been implemented in a single and consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory and practice by providing an extensible and easy to use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available at http://www.math.unipa.it/~raffaele/BATS/ under the GNU GPL.

19.
BMC Bioinformatics ; 8: 252, 2007 Jul 13.
Artigo em Inglês | MEDLINE | ID: mdl-17629909

RESUMO

BACKGROUND: Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness is tested on various data sets yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM and mostly at a qualitative level, no comparison among UCD, NCD and CD is available and no comparison of USM with existing methods, both based on alignments and not, seems to be available. RESULTS: We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, that naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of USM on biological data. The main ones are reported next. CONCLUSION: UCD and NCD are indistinguishable, i.e., they yield nearly the same values of the statistical indexes we have used, accross experiments and data sets, while CD is almost always worse than both. UPGMA seems to yield better classification results with respect to NJ, i.e., better values of the statistical indexes (10% difference or above), on a substantial fraction of experiments, compressors and USM approximation choices. The compression program PPMd, based on PPM (Prediction by Partial Matching), for generic data and Gencompress for DNA, are the best performers among the compression algorithms we have used, although the difference in performance, as measured by statistical indexes, between them and the other algorithms depends critically on the data set and may not be as large as expected. PPMd used with UCD or NCD and UPGMA, on sequence data is very close, although worse, in performance with the alignment methods (less than 2% difference on the F-measure). Yet, it scales well with data set size and it can work on data other than sequences. In summary, our quantitative analysis naturally complements the rich theory behind USM and supports the conclusion that the methodology is worth using because of its robustness, flexibility, scalability, and competitiveness with existing techniques. In particular, the methodology applies to all biological data in textual format. The software and data sets are available under the GNU GPL at the supplementary material web page.


Assuntos
Algoritmos , Proteínas/química , Curva ROC , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Animais , Área Sob a Curva , Benchmarking , Bases de Dados de Proteínas , Humanos , Estrutura Secundária de Proteína , Estrutura Terciária de Proteína , Software
20.
BMC Bioinformatics ; 6: 289, 2005 Dec 07.
Artigo em Inglês | MEDLINE | ID: mdl-16336639

RESUMO

BACKGROUND: Clustering is a key step in the analysis of gene expression data, and in fact, many classical clustering algorithms are used, or more innovative ones have been designed and validated for the task. Despite the widespread use of artificial intelligence techniques in bioinformatics and, more generally, data analysis, there are very few clustering algorithms based on the genetic paradigm, yet that paradigm has great potential in finding good heuristic solutions to a difficult optimization problem such as clustering. RESULTS: GenClust is a new genetic algorithm for clustering gene expression data. It has two key features: (a) a novel coding of the search space that is simple, compact and easy to update; (b) it can be used naturally in conjunction with data driven internal validation methods. We have experimented with the FOM methodology, specifically conceived for validating clusters of gene expression data. The validity of GenClust has been assessed experimentally on real data sets, both with the use of validation measures and in comparison with other algorithms, i.e., Average Link, Cast, Click and K-means. CONCLUSION: Experiments show that none of the algorithms we have used is markedly superior to the others across data sets and validation measures; i.e., in many cases the observed differences between the worst and best performing algorithm may be statistically insignificant and they could be considered equivalent. However, there are cases in which an algorithm may be better than others and therefore worthwhile. In particular, experiments for GenClust show that, although simple in its data representation, it converges very rapidly to a local optimum and that its ability to identify meaningful clusters is comparable, and sometimes superior, to that of more sophisticated algorithms. In addition, it is well suited for use in conjunction with data driven internal validation measures and, in particular, the FOM methodology.


Assuntos
Biologia Computacional/métodos , Regulação da Expressão Gênica , Algoritmos , Análise por Conglomerados , DNA Complementar/metabolismo , Perfilação da Expressão Gênica , Modelos Estatísticos , Mutação , Análise de Sequência com Séries de Oligonucleotídeos , Oligonucleotídeos/química , Fases de Leitura Aberta , Reconhecimento Automatizado de Padrão , Alinhamento de Sequência , Software
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...