Results 1 - 17 of 17
1.
IEEE Comput Graph Appl ; 42(3): 19-28, 2022.
Article in English | MEDLINE | ID: mdl-35671278

ABSTRACT

Graphs and other structured data have come to the forefront of machine learning in recent years, as novel representation learning methods have boosted prediction performance across a variety of tasks. Representation learning methods embed the nodes of a graph in a low-dimensional real-valued space, enabling the application of traditional machine learning methods to graphs. These representations have been widely presumed to be suited for graph visualization as well. However, no benchmarks or comprehensive studies on this topic exist. We present an empirical study comparing several state-of-the-art representation learning methods with two recent graph layout algorithms, using readability and distance-based measures as well as link prediction performance. No method consistently outperformed the others across quality measures. The graph layout methods provided qualitatively superior layouts compared with the representation learning methods. Embedding graphs in a higher-dimensional space and applying t-distributed stochastic neighbor embedding (t-SNE) for visualization improved the preservation of local neighborhoods, albeit at substantially higher computational cost.
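As a minimal, hypothetical sketch of the kind of distance-based quality measure such a study relies on: the fraction of each node's graph neighbors that also appear among its k nearest neighbors in the 2-D layout (the paper's exact measures are not reproduced here).

```python
# Hypothetical neighborhood-preservation score for a graph layout:
# how many of a node's graph neighbors are also nearby in the embedding.
import math

def knn(points, i, k):
    """Indices of the k nearest points to point i (Euclidean distance)."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    return set(order[1:k + 1])  # skip the point itself

def neighborhood_preservation(adjacency, points, k=2):
    """Mean overlap between graph neighborhoods and embedding neighborhoods."""
    scores = []
    for i, nbrs in adjacency.items():
        embedded = knn(points, i, k)
        scores.append(len(nbrs & embedded) / max(len(nbrs), 1))
    return sum(scores) / len(scores)

# Toy example: a path graph 0-1-2-3 laid out on a line preserves
# every local neighborhood perfectly, so the score is 1.0.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
layout = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(neighborhood_preservation(adj, layout))
```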


Subject(s)
Algorithms , Machine Learning , Benchmarking , Empirical Research , Research Design
2.
Big Data ; 2022 Mar 10.
Article in English | MEDLINE | ID: mdl-35271383

ABSTRACT

Network representation learning methods map network nodes to vectors in an embedding space that can preserve specific properties and enable traditional downstream prediction tasks. The quality of the learned representations is then generally showcased through results on these downstream tasks. Commonly used benchmark tasks such as link prediction and network reconstruction, however, involve complex evaluation pipelines and an abundance of design choices. This, together with a lack of standardized evaluation setups, can obscure real progress in the field. In this article, we investigate the impact of a variety of such design choices on performance and conduct an extensive, consistent evaluation that sheds light on the state of the art in network representation learning. Our evaluation reveals that only limited progress has been made in recent years, with embedding-based approaches struggling to outperform basic heuristics in many scenarios.
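An illustrative sketch of one of the basic heuristics such evaluations compare against: scoring a candidate link by the number of neighbors its endpoints share. The toy graph and candidate set below are invented for illustration.

```python
# Common-neighbors heuristic: a classic non-learned baseline for
# link prediction, requiring no embedding at all.
def common_neighbors_score(adjacency, u, v):
    """Number of neighbors shared by u and v."""
    return len(adjacency.get(u, set()) & adjacency.get(v, set()))

# Toy graph: nodes 0 and 2 share two neighbors (1 and 4), so the
# heuristic ranks the candidate link (0, 2) highest.
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 4}, 3: {4}, 4: {0, 2, 3}}
candidates = [(0, 2), (0, 3), (2, 3)]
ranked = sorted(candidates,
                key=lambda e: common_neighbors_score(adj, *e),
                reverse=True)
print(ranked[0])
```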

3.
Mach Learn ; 110(10): 2905-2940, 2021.
Article in English | MEDLINE | ID: mdl-34840420

ABSTRACT

Dimensionality reduction and manifold learning methods such as t-distributed stochastic neighbor embedding (t-SNE) are frequently used to map high-dimensional data into a two-dimensional space in order to visualize and explore that data. Going beyond the specifics of t-SNE, there are two substantial limitations of any such approach: (1) not all information can be captured in a single two-dimensional embedding, and (2) to well-informed users, the salient structure of such an embedding is often already known, preventing any real new insights from being obtained. Currently, it is not known how to extract the remaining information in a similarly effective manner. We introduce conditional t-SNE (ct-SNE), a generalization of t-SNE that discounts prior information in the form of labels. This enables obtaining more informative and more relevant embeddings. To achieve this, we propose a conditioned version of the t-SNE objective, obtaining an elegant method with a single integrated objective. We show how to efficiently optimize the objective and study the effects of the extra parameter that ct-SNE has over t-SNE. Qualitative and quantitative empirical results on synthetic and real data show that ct-SNE is scalable, effective, and achieves its goal: it allows complementary structure to be captured in the embedding and provides new insights into real data.
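A minimal sketch of the underlying idea, not the actual ct-SNE formulation: if pairwise affinities between same-label points are down-weighted before embedding, label structure no longer dominates the result. The discount factor `beta` and the tiny affinity matrix below are purely illustrative.

```python
# Illustrative label-discounting of a pairwise affinity matrix.
# This is NOT the exact ct-SNE objective, only the intuition:
# same-label affinities are scaled down, then everything is renormalized.
def discount_same_label(p, labels, beta=0.25):
    """Scale p[i][j] by beta when labels agree (i != j), then renormalize."""
    n = len(p)
    q = [[p[i][j] * (beta if labels[i] == labels[j] and i != j else 1.0)
          for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in q)
    return [[v / total for v in row] for row in q]

# Three points, the first two sharing a label: their mutual affinity
# is discounted relative to the cross-label pairs.
p = [[0.0, 0.25, 0.25],
     [0.25, 0.0, 0.25],
     [0.25, 0.25, 0.0]]
q = discount_same_label(p, labels=[0, 0, 1])
```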

4.
PLoS One ; 16(9): e0256922, 2021.
Article in English | MEDLINE | ID: mdl-34469486

ABSTRACT

The democratization of AI tools for content generation, combined with unrestricted access to mass media for all (e.g. through microblogging and social media), makes it increasingly hard for people to distinguish fact from fiction. This raises the question of how individual opinions evolve in such a networked environment without grounding in a known reality. The dominant approach to studying this problem takes simple models from the social sciences of how individuals change their opinions when exposed to their social neighborhood, and applies them to large social networks. We propose a novel model that incorporates two known social phenomena: (i) Biased Assimilation: the tendency of individuals to adopt other opinions if those are similar to their own; (ii) the Backfire Effect: the fact that an opposing opinion may further entrench people in their stances, making their opinions more extreme instead of moderating them. To the best of our knowledge, this is the first DeGroot-type opinion formation model that captures the Backfire Effect. A thorough theoretical and empirical analysis of the proposed model reveals intuitive conditions for polarization and consensus to exist, as well as the properties of the resulting opinions.
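A toy DeGroot-style simulation in the spirit of the above, with an entirely hypothetical functional form (the paper's actual update rule, thresholds and rates are not reproduced): nearby opinions attract (biased assimilation), while opinions beyond a disagreement threshold repel (backfire effect).

```python
# Illustrative opinion-dynamics step, NOT the paper's exact model.
# Opinions live in [-1, 1]; similar neighbors attract, distant ones repel.
def step(opinions, neighbors, threshold=0.6, rate=0.1):
    new = []
    for i, x in enumerate(opinions):
        pull = 0.0
        for j in neighbors[i]:
            d = opinions[j] - x
            if abs(d) <= threshold:
                pull += d * (1 - abs(d))   # assimilate, more so when similar
            else:
                pull -= d * 0.5            # backfire: move away, entrench
        x_new = x + rate * pull / len(neighbors[i])
        new.append(min(1.0, max(-1.0, x_new)))  # clamp to opinion range
    return new

# A path 0-1-2: nodes 0 and 1 start close together, node 2 holds an
# extreme opposing view and only entrenches further -- polarization.
ops = [0.0, 0.2, 1.0]
nbrs = {0: [1], 1: [0, 2], 2: [1]}
for _ in range(50):
    ops = step(ops, nbrs)
```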


Subject(s)
Attitude , Models, Psychological , Online Social Networking , Prejudice/psychology , Humans , Social Media
5.
Entropy (Basel) ; 21(6)2019 Jun 05.
Article in English | MEDLINE | ID: mdl-33267280

ABSTRACT

Numerical time series data are pervasive, originating from sources as diverse as wearable devices and medical equipment to sensors in industrial plants. In many cases, time series contain interesting information in the form of subsequences that recur in approximate form, so-called motifs. Major open challenges in this area include how to formalize the interestingness of such motifs and how to find the most interesting ones. We introduce a novel approach that tackles these issues. We formalize the notion of such subsequence patterns in an intuitive manner and present an information-theoretic approach for quantifying their interestingness with respect to any prior expectation a user may have about the time series. The resulting interestingness measure is thus a subjective measure, enabling a user to find motifs that are truly interesting to them. Although finding the best motif appears computationally intractable, we develop relaxations and a branch-and-bound approach implemented in a constraint programming solver. As shown in experiments on synthetic data and two real-world datasets, this enables us to mine interesting patterns in small- or mid-sized time series.
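To make the notion of a recurring subsequence concrete, here is a naive exhaustive sketch that finds the closest pair of non-overlapping windows under Euclidean distance. It illustrates only what a motif is; the paper's subjective, information-theoretic interestingness measure and branch-and-bound search are far more involved.

```python
# Naive motif discovery: the pair of non-overlapping length-w windows
# with the smallest Euclidean distance between them.
import math

def best_motif_pair(series, w):
    best = (math.inf, None, None)
    for i in range(len(series) - w + 1):
        for j in range(i + w, len(series) - w + 1):  # non-overlapping only
            d = math.dist(series[i:i + w], series[j:j + w])
            if d < best[0]:
                best = (d, i, j)
    return best

# The bump 0-1-2-1-0 occurs twice (at indices 0 and 6); the naive
# search recovers it as the best motif pair with distance zero.
ts = [0, 1, 2, 1, 0, 5, 0, 1, 2, 1, 0]
dist, i, j = best_motif_pair(ts, 5)
print(i, j)
```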

6.
PLoS One ; 5(12): e14243, 2010 Dec 08.
Article in English | MEDLINE | ID: mdl-21170383

ABSTRACT

BACKGROUND: A trend towards automation of scientific research has recently resulted in what has been termed "data-driven inquiry" in various disciplines, including physics and biology. The automation of many tasks has also been identified as a possible future for the humanities and the social sciences, particularly in those disciplines concerned with the analysis of text, due to the recent availability of millions of books and news articles in digital format. In the social sciences, the analysis of news media is done largely by hand and in a hypothesis-driven fashion: the scholar needs to formulate a very specific assumption about the patterns that might be in the data, and then set out to verify whether they are present or not. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we report what we believe is the first large-scale content analysis of cross-linguistic text in the social sciences, using various artificial intelligence techniques. We analyse 1.3 million news articles in 22 languages and detect a clear structure in the choice of stories covered by the various outlets. This structure is significantly affected by objective national, geographic, economic and cultural relations among outlets and countries; for example, outlets from countries sharing strong economic ties are more likely to cover the same stories. We also show that the deviation from average content is significantly correlated with membership in the eurozone, as well as with the year of accession to the EU. CONCLUSIONS/SIGNIFICANCE: While independently making a multitude of small editorial decisions, the leading media of the 27 EU countries, over a period of six months, shaped the contents of the EU mediasphere in a way that reflects its deep geographic, economic and cultural relations. Detecting these subtle signals in a statistically rigorous way would be out of the reach of traditional methods. This analysis demonstrates the power of the available methods for significant automation of media content analysis.


Subject(s)
Culture , Data Collection , Mass Media , Automation , Books , European Union , Humans , Research , Social Sciences
7.
Ann N Y Acad Sci ; 1158: 29-35, 2009 Mar.
Article in English | MEDLINE | ID: mdl-19348629

ABSTRACT

Thanks to the availability of high-throughput omics data, bioinformatics approaches are able to hypothesize thus-far undocumented genetic interactions. However, due to the amount of noise in these data, inferences based on a single data source are often unreliable. A popular approach to overcoming this problem is to integrate different data sources. In this study, we describe DISTILLER, a novel framework for data integration that simultaneously analyzes microarray and motif information to find modules consisting of genes that are co-expressed in a subset of conditions, together with their corresponding regulators. By applying our method to publicly available data, we evaluated the condition-specific transcriptional network of Escherichia coli. DISTILLER confirmed 62% of the 736 interactions described in RegulonDB and predicted 278 novel interactions.


Subject(s)
Computational Biology/methods , Escherichia coli/genetics , Gene Expression Regulation, Bacterial , Gene Regulatory Networks , Algorithms , Databases, Genetic , Gene Expression Profiling , Models, Genetic , Oligonucleotide Array Sequence Analysis
8.
Genome Biol ; 10(3): R27, 2009.
Article in English | MEDLINE | ID: mdl-19265557

ABSTRACT

We present DISTILLER, a data integration framework for the inference of transcriptional module networks. Experimental validation of predicted targets for the well-studied fumarate nitrate reductase regulator showed the effectiveness of our approach in Escherichia coli. In addition, the condition dependency and modularity of the inferred transcriptional network were studied. Surprisingly, the level of regulatory complexity seemed lower than would be expected from RegulonDB, indicating that complex regulatory programs tend to decrease the degree of modularity.


Subject(s)
Computational Biology/methods , Escherichia coli/genetics , Gene Regulatory Networks , Regulon/genetics , Software , Chromatin Immunoprecipitation , Gene Expression Regulation, Bacterial , Transcription Factors
9.
BMC Bioinformatics ; 10 Suppl 1: S30, 2009 Jan 30.
Article in English | MEDLINE | ID: mdl-19208131

ABSTRACT

BACKGROUND: The detection of cis-regulatory modules (CRMs) that mediate transcriptional responses in eukaryotes remains a key challenge in the postgenomic era. A CRM is characterized by a set of co-occurring transcription factor binding sites (TFBS). In silico methods have been developed to search for CRMs by determining the combinations of TFBS that are statistically overrepresented in a given gene set. Most of these methods solve this combinatorial problem by relying on computationally intensive optimization methods. As a result, their usage is limited to finding CRMs in small datasets (containing only a few genes) and using binding sites for a restricted number of transcription factors (TFs) out of which the optimal module will be selected. RESULTS: We present an itemset mining-based strategy for computationally detecting cis-regulatory modules (CRMs) in a set of genes. We tested our method by applying it to a large benchmark dataset derived from a ChIP-chip analysis and compared its performance with other well-known cis-regulatory module detection tools. CONCLUSION: We show that by exploiting the computational efficiency of an itemset mining approach and combining it with a well-designed statistical scoring scheme, we were able to prioritize the biologically valid CRMs in a large set of coregulated genes, using binding sites for a large number of potential TFs as input.
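The itemset-mining view can be sketched in a few lines: treat each gene's set of predicted binding sites as a transaction and enumerate TF combinations supported by at least `minsup` genes. The gene/TF data below are invented, and the paper's statistical scoring scheme is omitted.

```python
# Frequent-itemset sketch of CRM detection: which TF combinations
# co-occur in at least `minsup` genes' binding-site sets?
from itertools import combinations

def frequent_tf_sets(gene_tfbs, size, minsup):
    tfs = sorted(set().union(*gene_tfbs.values()))
    result = {}
    for combo in combinations(tfs, size):
        support = sum(1 for sites in gene_tfbs.values()
                      if set(combo) <= sites)
        if support >= minsup:
            result[combo] = support
    return result

# Hypothetical toy data: TF1 and TF2 co-occur in three of four genes.
genes = {"geneA": {"TF1", "TF2", "TF3"},
         "geneB": {"TF1", "TF2"},
         "geneC": {"TF2", "TF3"},
         "geneD": {"TF1", "TF2"}}
print(frequent_tf_sets(genes, size=2, minsup=3))
```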


Subject(s)
Algorithms , Regulatory Elements, Transcriptional , Software , Transcription Factors/metabolism , Binding Sites , Databases, Genetic , Models, Genetic
10.
Pac Symp Biocomput ; : 166-77, 2008.
Article in English | MEDLINE | ID: mdl-18229684

ABSTRACT

To investigate the combination of cetuximab, capecitabine and radiotherapy in the preoperative treatment of patients with rectal cancer, forty tumour samples were gathered: before treatment (T0), after one dose of cetuximab but before radiotherapy with capecitabine (T1), and at the moment of surgery (T2). At each time point, the tumour and plasma samples were subjected to Affymetrix microarray and Luminex proteomics analysis, respectively. At surgery, the Rectal Cancer Regression Grade (RCRG) was registered. We used a kernel-based method with Least Squares Support Vector Machines to predict the RCRG based on the integration of microarray and proteomics data at T0 and T1. We demonstrated that combining multiple data sources improves predictive power. The best model was based on 5 genes and 10 proteins at T0 and T1 and could predict the RCRG with an accuracy of 91.7%, a sensitivity of 96.2% and a specificity of 80%.
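As a reminder of how the three reported metrics relate, here is their definition from a confusion matrix. The counts below are a hypothetical example chosen only to be consistent with the reported percentages, not the study's actual confusion matrix.

```python
# Accuracy, sensitivity and specificity from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    return accuracy, sensitivity, specificity

# Illustrative counts consistent with ~91.7% / 96.2% / 80%.
acc, sens, spec = metrics(tp=25, tn=8, fp=2, fn=1)
print(round(acc, 3), round(sens, 3), spec)
```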


Subject(s)
Antibodies, Monoclonal/therapeutic use , Antineoplastic Agents/therapeutic use , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Proteomics/statistics & numerical data , Rectal Neoplasms/therapy , Algorithms , Antibodies, Monoclonal, Humanized , Artificial Intelligence , Capecitabine , Cetuximab , Combined Modality Therapy , Computational Biology , Data Interpretation, Statistical , Databases, Factual , Deoxycytidine/analogs & derivatives , Deoxycytidine/therapeutic use , Fluorouracil/analogs & derivatives , Fluorouracil/therapeutic use , Humans , Least-Squares Analysis , Models, Statistical , Rectal Neoplasms/genetics , Rectal Neoplasms/metabolism
11.
Bioinformatics ; 23(13): i125-32, 2007 Jul 01.
Article in English | MEDLINE | ID: mdl-17646288

ABSTRACT

MOTIVATION: Hunting disease genes is a problem of primary importance in biomedical research. Biologists usually approach this problem in two steps: first a set of candidate genes is identified using traditional positional cloning or high-throughput genomics techniques; second, these genes are further investigated and validated in the wet lab, one by one. To speed up discovery and limit the number of costly wet lab experiments, biologists must test the candidate genes starting with the most probable candidates. So far, biologists have relied on literature studies, extensive queries to multiple databases and hunches about expected properties of the disease gene to determine such an ordering. Recently, we have introduced the data mining tool ENDEAVOUR (Aerts et al., 2006), which performs this task automatically by relying on different genome-wide data sources, such as Gene Ontology, literature, microarray, sequence and more. RESULTS: In this article, we present a novel kernel method that operates in the same setting: based on a number of different views on a set of training genes, a prioritization of test genes is obtained. We furthermore provide a thorough learning theoretical analysis of the method's guaranteed performance. Finally, we apply the method to the disease data sets on which ENDEAVOUR (Aerts et al., 2006) has been benchmarked, and report a considerable improvement in empirical performance. AVAILABILITY: The MATLAB code used in the empirical results will be made publicly available.


Subject(s)
Biomarkers/metabolism , Chromosome Mapping/methods , Database Management Systems , Databases, Factual , Disease Susceptibility/metabolism , Information Storage and Retrieval/methods , Models, Biological , Computer Simulation , Genetic Predisposition to Disease/genetics , Humans
12.
PLoS One ; 1: e85, 2006 Dec 20.
Article in English | MEDLINE | ID: mdl-17183716

ABSTRACT

Gene families are groups of homologous genes that are likely to have highly similar functions. Differences in family size due to lineage-specific gene duplication and gene loss may provide clues to the evolutionary forces that have shaped mammalian genomes. Here we analyze the gene families contained within the whole genomes of human, chimpanzee, mouse, rat, and dog. In total we find that more than half of the 9,990 families present in the mammalian common ancestor have either expanded or contracted along at least one lineage. Additionally, we find that a large number of families are completely lost from one or more mammalian genomes, and a similar number of gene families have arisen subsequent to the mammalian common ancestor. Along the lineage leading to modern humans we infer the gain of 689 genes and the loss of 86 genes since the split from chimpanzees, including changes likely driven by adaptive natural selection. Our results imply that humans and chimpanzees differ by at least 6% (1,418 of 22,000 genes) in their complement of genes, which stands in stark contrast to the oft-cited 1.5% difference between orthologous nucleotide sequences. This genomic "revolving door" of gene gain and loss represents a large number of genetic differences separating humans from our closest relatives.


Subject(s)
Biological Evolution , Mammals/genetics , Multigene Family , Animals , Dogs , Humans , Mice , Pan troglodytes/genetics , Phylogeny , Primates/genetics , Rats , Rodentia/genetics , Selection, Genetic
13.
Genome Biol ; 7(5): R37, 2006.
Article in English | MEDLINE | ID: mdl-16677396

ABSTRACT

'ReMoDiscovery' is an intuitive algorithm that links regulatory programs, i.e. regulators and their corresponding motifs, to sets of co-expressed genes. It concurrently exploits three independent data sources: ChIP-chip data, motif information and gene expression profiles. Compared with published module discovery algorithms, ReMoDiscovery is fast and easily tunable. We evaluated our method on yeast data, where it was shown to generate biologically meaningful findings and allowed the prediction of potential novel roles for transcriptional regulators.


Subject(s)
Algorithms , Chromatin Immunoprecipitation , Gene Expression Profiling , Gene Expression Regulation , Oligonucleotide Array Sequence Analysis , Amino Acids/metabolism , Cell Cycle , Galactose/metabolism , Regulatory Elements, Transcriptional , Ribosomes/metabolism , Software , Transcription Factors/metabolism , Yeasts/genetics , Yeasts/growth & development , Yeasts/metabolism
14.
Bioinformatics ; 22(10): 1269-71, 2006 May 15.
Article in English | MEDLINE | ID: mdl-16543274

ABSTRACT

SUMMARY: We present CAFE (Computational Analysis of gene Family Evolution), a tool for the statistical analysis of the evolution of the size of gene families. It uses a stochastic birth and death process to model the evolution of gene family sizes over a phylogeny. For a specified phylogenetic tree, and given the gene family sizes in the extant species, CAFE can estimate the global birth and death rate of gene families, infer the most likely gene family size at all internal nodes, identify gene families that have accelerated rates of gain and loss (quantified by a p-value) and identify which branches cause the p-value to be small for significant families. AVAILABILITY: Software is available from http://www.bio.indiana.edu/~hahnlab/Software.html


Subject(s)
Algorithms , Chromosome Mapping/methods , DNA Mutational Analysis/methods , Evolution, Molecular , Multigene Family/genetics , Software , Genetic Variation/genetics , Phylogeny , User-Computer Interface
15.
Genome Res ; 15(8): 1153-60, 2005 Aug.
Article in English | MEDLINE | ID: mdl-16077014

ABSTRACT

Comparison of whole genomes has revealed that changes in the size of gene families among organisms are quite common. However, there are as yet no models of gene family evolution that make it possible to estimate ancestral states or to infer upon which lineages gene families have contracted or expanded. In addition, large differences in family size have generally been attributed to the effects of natural selection, without a strong statistical basis for these conclusions. Here we use a model of stochastic birth and death for gene family evolution and show that it can be efficiently applied to multispecies genome comparisons. This model takes into account the lengths of branches on phylogenetic trees, as well as duplication and deletion rates, and hence provides expectations for divergence in gene family size among lineages. The model offers both the opportunity to identify large-scale patterns in genome evolution and the ability to make stronger inferences regarding the role of natural selection in gene family expansion or contraction. We apply our method to data from the genomes of five yeast species to show its applicability.
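A discrete-time sketch of the stochastic birth-death idea: along a branch, every gene in a family independently duplicates or is lost with small per-step probabilities, so family size drifts over time. The rates, step count and seed below are arbitrary illustrations, not the model's fitted parameters.

```python
# Toy simulation of gene family size under a birth-death process.
import random

def evolve_family_size(n, steps, p_birth=0.01, p_death=0.01, rng=None):
    """Evolve a family of n genes over `steps` discrete time steps."""
    rng = rng or random.Random()
    for _ in range(steps):
        births = sum(rng.random() < p_birth for _ in range(n))
        deaths = sum(rng.random() < p_death for _ in range(n))
        n = max(0, n + births - deaths)  # size can shrink to extinction
    return n

# Five independent lineages starting from a 10-gene ancestral family.
rng = random.Random(42)
sizes = [evolve_family_size(10, 200, rng=rng) for _ in range(5)]
print(sizes)
```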


Subject(s)
Evolution, Molecular , Genomics/methods , Models, Genetic , Multigene Family/genetics , Likelihood Functions , Phylogeny , Saccharomyces/genetics , Stochastic Processes
16.
Pac Symp Biocomput ; : 483-94, 2005.
Article in English | MEDLINE | ID: mdl-15759653

ABSTRACT

We present a method for inference of transcriptional modules from heterogeneous data sources. It identifies the responsible set of regulators in combination with their corresponding DNA recognition sites (motifs) and target genes. Our approach distinguishes itself from previous work in the literature by fully exploiting the knowledge of three independently acquired data sources: ChIP-chip data; motif information as obtained by phylogenetic shadowing; and gene expression profiles obtained using microarray experiments. Moreover, these three data sources are dealt with in a new and fully integrated manner. By avoiding approaches that take the different data sources into account sequentially or iteratively, the transparency of the method and the interpretability of the results are ensured. Applying our method to biological data demonstrated the biological relevance of the inference.


Subject(s)
Oligonucleotide Array Sequence Analysis , Saccharomyces cerevisiae/genetics , Transcription, Genetic , Algorithms , Cell Cycle/genetics , Fungal Proteins/genetics , Models, Genetic , Reproducibility of Results , Ribosomes/genetics , Saccharomyces cerevisiae/cytology
17.
Bioinformatics ; 20(16): 2626-35, 2004 Nov 01.
Article in English | MEDLINE | ID: mdl-15130933

ABSTRACT

MOTIVATION: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. RESULTS: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in a way that minimizes a statistical loss function. These methods exploit semidefinite programming techniques to reduce the problem of finding optimal kernel combinations to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained on all of these data to recognize particular classes of proteins (membrane proteins and ribosomal proteins) performs significantly better than the same algorithm trained on any single type of data. AVAILABILITY: Supplementary data at http://noble.gs.washington.edu/proj/sdp-svm
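The key operation can be sketched simply: a convex combination of kernel (Gram) matrices from different data sources is again a valid kernel. The semidefinite-programming machinery that learns the weights is beyond a few lines, so the weights and the tiny 2x2 matrices below are fixed by hand for illustration.

```python
# Entry-wise convex combination of same-sized kernel matrices.
def combine_kernels(kernels, weights):
    """Weighted sum of Gram matrices; convex weights preserve validity."""
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Hypothetical kernels over the same two genes, from two data sources.
k_seq = [[1.0, 0.2], [0.2, 1.0]]   # e.g. sequence similarity
k_exp = [[1.0, 0.8], [0.8, 1.0]]   # e.g. expression correlation
combined = combine_kernels([k_seq, k_exp], [0.5, 0.5])
print(combined)
```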


Subject(s)
Algorithms , Chromosome Mapping/methods , Databases, Protein , Gene Expression Profiling/methods , Models, Genetic , Proteins/genetics , Sequence Analysis, Protein/methods , Artificial Intelligence , Databases, Genetic , Fungal Proteins/chemistry , Fungal Proteins/genetics , Genomics/methods , Information Storage and Retrieval/methods , Membrane Proteins/genetics , Membrane Proteins/metabolism , Models, Statistical , Pattern Recognition, Automated , Proteins/analysis , Proteins/chemistry , Proteins/classification , Ribosomal Proteins/chemistry , Ribosomal Proteins/genetics , Sequence Alignment , Sequence Homology, Amino Acid , Systems Integration