Results 1 - 20 of 37
1.
Bioinformatics ; 39(10), 2023 10 03.
Article in English | MEDLINE | ID: mdl-37792497

ABSTRACT

MOTIVATION: Nuclear magnetic resonance (NMR) spectroscopy is widely used to analyze metabolites in biological samples, but the analysis requires specific expertise, is time-consuming, and can be inaccurate. Here, we present a powerful automated tool, SPatial clustering Algorithm-Statistical TOtal Correlation SpectroscopY (SPA-STOCSY), which overcomes challenges faced when analyzing NMR data and identifies metabolites in a sample with high accuracy. RESULTS: As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset. It first investigates the covariance pattern among datapoints and then calculates the optimal threshold with which to cluster datapoints belonging to the same structural unit, i.e. the metabolite. Generated clusters are then automatically linked to a metabolite library to identify candidates. To assess SPA-STOCSY's efficiency and accuracy, we applied it to synthesized spectra and to spectra acquired on Drosophila melanogaster tissue and human embryonic stem cells. In the synthesized spectra, SPA outperformed Statistical Recoupling of Variables (SRV), an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the biological data, SPA-STOCSY performed comparably to the operator-based Chenomx analysis while avoiding operator bias, and it required <7 min of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in NMR spectra. It may thus accelerate the use of NMR for scientific discoveries, medical diagnostics, and patient-specific decision making. AVAILABILITY AND IMPLEMENTATION: The code for SPA-STOCSY is available at https://github.com/LiuzLab/SPA-STOCSY.
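The clustering step the abstract describes (grouping adjacent spectral datapoints whose covariance pattern indicates a shared structural unit) can be sketched as follows. This is an illustrative simplification, not the published implementation: the correlation threshold is fixed here, whereas SPA-STOCSY estimates the optimal threshold from the data, and the function name and synthetic spectra are hypothetical.

```python
import numpy as np

def stocsy_clusters(spectra, threshold=0.8):
    """Group adjacent spectral points whose correlation across samples
    exceeds a threshold -- a simplified stand-in for SPA-STOCSY's
    data-driven clustering (the real method estimates the threshold).

    spectra: (n_samples, n_points) array of NMR intensities.
    Returns a list of (start, end) index ranges, end exclusive.
    """
    n_points = spectra.shape[1]
    # Correlation of each spectral point with its right-hand neighbour.
    corr = np.array([
        np.corrcoef(spectra[:, j], spectra[:, j + 1])[0, 1]
        for j in range(n_points - 1)
    ])
    clusters, start = [], 0
    for j in range(n_points - 1):
        if corr[j] < threshold:          # covariance pattern breaks here
            clusters.append((start, j + 1))
            start = j + 1
    clusters.append((start, n_points))
    return clusters

# Two synthetic "metabolites": points 0-2 co-vary, points 3-5 co-vary.
rng = np.random.default_rng(0)
a = rng.normal(size=(50, 1))
b = rng.normal(size=(50, 1))
X = np.hstack([a, a * 1.1, a * 0.9, b, b * 1.2, b * 0.8])
print(stocsy_clusters(X))  # → [(0, 3), (3, 6)]
```

Each returned range plays the role of one candidate structural unit that would then be matched against a metabolite library.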


Subject(s)
Drosophila melanogaster , Magnetic Resonance Imaging , Animals , Humans , Magnetic Resonance Spectroscopy/methods , Cluster Analysis , Metabolomics/methods
2.
PLoS Genet ; 19(5): e1010760, 2023 05.
Article in English | MEDLINE | ID: mdl-37200393

ABSTRACT

Heterozygous variants in the glucocerebrosidase (GBA) gene are common and potent risk factors for Parkinson's disease (PD). GBA also causes the autosomal recessive lysosomal storage disorder (LSD), Gaucher disease, and emerging evidence from human genetics implicates many other LSD genes in PD susceptibility. We have systematically tested 86 conserved fly homologs of 37 human LSD genes for requirements in the aging adult Drosophila brain and for potential genetic interactions with neurodegeneration caused by α-synuclein (αSyn), which forms Lewy body pathology in PD. Our screen identifies 15 genetic enhancers of αSyn-induced progressive locomotor dysfunction, including knockdown of fly homologs of GBA and other LSD genes with independent support as PD susceptibility factors from human genetics (SCARB2, SMPD1, CTSD, GNPTAB, SLC17A5). For several genes, results from multiple alleles suggest dose-sensitivity and context-dependent pleiotropy in the presence or absence of αSyn. Homologs of two genes causing cholesterol storage disorders, Npc1a / NPC1 and Lip4 / LIPA, were independently confirmed as loss-of-function enhancers of αSyn-induced retinal degeneration. The enzymes encoded by several modifier genes are upregulated in αSyn transgenic flies, based on unbiased proteomics, revealing a possible, albeit ineffective, compensatory response. Overall, our results reinforce the important role of lysosomal genes in brain health and PD pathogenesis, and implicate several metabolic pathways, including cholesterol homeostasis, in αSyn-mediated neurotoxicity.


Subject(s)
Parkinson Disease , alpha-Synuclein , Animals , Humans , alpha-Synuclein/genetics , alpha-Synuclein/metabolism , Animals, Genetically Modified , Drosophila/genetics , Drosophila/metabolism , Glucosylceramidase/genetics , Glucosylceramidase/metabolism , Lysosomes/metabolism , Parkinson Disease/pathology , Transferases (Other Substituted Phosphate Groups)/metabolism , Aging/metabolism
3.
bioRxiv ; 2023 Feb 22.
Article in English | MEDLINE | ID: mdl-36865102

ABSTRACT

Nuclear Magnetic Resonance (NMR) spectroscopy is widely used to analyze metabolites in biological samples, but the analysis can be cumbersome and inaccurate. Here, we present a powerful automated tool, SPA-STOCSY (Spatial Clustering Algorithm - Statistical Total Correlation Spectroscopy), which overcomes these challenges by identifying metabolites in each sample with high accuracy. As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset, first investigating the covariance pattern and then calculating the optimal threshold with which to cluster data points belonging to the same structural unit, i.e., the metabolite. The generated clusters are then automatically linked to a compound library to identify candidates. To assess SPA-STOCSY’s efficiency and accuracy, we applied it to synthesized and real NMR data obtained from Drosophila melanogaster brains and human embryonic stem cells. In the synthesized spectra, SPA outperforms Statistical Recoupling of Variables, an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the real spectra, SPA-STOCSY performs comparably to operator-based Chenomx analysis but avoids operator bias and completes the analysis in less than seven minutes of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in NMR spectra. As such, it might accelerate the utilization of NMR for scientific discoveries, medical diagnostics, and patient-specific decision making.

4.
Biometrics ; 79(4): 3846-3858, 2023 12.
Article in English | MEDLINE | ID: mdl-36950906

ABSTRACT

Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.


Subject(s)
Alzheimer Disease , Humans , Aged , Alzheimer Disease/genetics , Genomics , Cluster Analysis
5.
PLoS Comput Biol ; 18(10): e1010577, 2022 10.
Article in English | MEDLINE | ID: mdl-36191044

ABSTRACT

Consensus clustering has been widely used in bioinformatics and other applications to improve the accuracy, stability, and reliability of clustering results. This approach ensembles cluster co-occurrences from multiple clustering runs on subsampled observations. For application to large-scale bioinformatics data, such as discovering cell types from single-cell sequencing data, consensus clustering has two significant drawbacks: (i) computational inefficiency due to repeatedly applying clustering algorithms, and (ii) lack of interpretability into the important features for differentiating clusters. In this paper, we address these two challenges by developing IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering. Our approach adopts three major innovations. We ensemble cluster co-occurrences from tiny subsets of both observations and features, termed minipatches, thus dramatically reducing computation time. Additionally, we develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings, as well as adaptive sampling schemes for features, which lead to interpretable solutions by quickly learning the most relevant features that differentiate clusters. We study our approach on synthetic data and a variety of real large-scale bioinformatics data sets; results show that our approach not only yields more accurate and interpretable cluster solutions but also substantially improves computational efficiency compared to standard consensus clustering approaches.
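The minipatch consensus idea above can be sketched roughly as follows. This is a hedged simplification, not the IMPACC implementation: sampling is uniform rather than adaptive, the base clusterer is a toy k-means, and all sizes and names are made up for illustration.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny Lloyd's k-means, just enough for the sketch below."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def minipatch_consensus(X, k=2, n_patches=200, m_frac=0.5, f_frac=0.5, seed=0):
    """Cluster many tiny random subsets of observations AND features
    (minipatches) and average co-cluster indicators into a consensus
    matrix -- the core ensembling move, without IMPACC's adaptivity."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    co = np.zeros((n, n))
    counts = np.zeros((n, n))
    for t in range(n_patches):
        obs = rng.choice(n, max(2 * k, int(m_frac * n)), replace=False)
        feats = rng.choice(p, max(1, int(f_frac * p)), replace=False)
        labels = kmeans(X[np.ix_(obs, feats)], k, seed=t)
        same = labels[:, None] == labels[None, :]
        co[np.ix_(obs, obs)] += same
        counts[np.ix_(obs, obs)] += 1
    # Fraction of co-appearances in which a pair was co-clustered.
    return np.divide(co, counts, out=np.zeros_like(co), where=counts > 0)

# Two well-separated blobs: consensus ~1 within a blob, ~0 across blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(5, 0.1, (10, 4))])
C = minipatch_consensus(X)
print(C.shape)  # → (20, 20)
```

Because each run touches only a fraction of the rows and columns, the per-run cost is much lower than clustering the full matrix, which is the computational point the abstract makes.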


Subject(s)
Algorithms , Computational Biology , Cluster Analysis , Computational Biology/methods , Consensus , Reproducibility of Results
6.
J Comput Biol ; 29(5): 465-482, 2022 05.
Article in English | MEDLINE | ID: mdl-35325552

ABSTRACT

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have yielded a powerful tool to measure the gene expression of individual cells. One major challenge of scRNA-seq data is that it usually contains a large number of zero expression values, which often impairs the effectiveness of downstream analyses. Numerous data imputation methods have been proposed to deal with these "dropout" events, but this is a difficult task for such high-dimensional and sparse data. Furthermore, there have been debates on the nature of the sparsity, about whether the zeros are due to technological limitations or represent actual biology. To address these challenges, we propose Single-cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information (SCENA), a novel approach that imputes the correlation matrix of the data of interest instead of the data itself. SCENA obtains a gene-by-gene correlation estimate by ensembling various individual estimates, some of which are based on known auxiliary information about gene expression networks. Our approach is a reliable method that makes no assumptions on the nature of sparsity in scRNA-seq data or the data distribution. Through extensive simulation studies and real data applications, we demonstrate that SCENA is not only superior in gene correlation estimation but also improves the accuracy and reliability of downstream analyses, including cell clustering, dimension reduction, and graphical model estimation to learn the gene expression network.
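SCENA's central move, imputing the gene-by-gene correlation matrix rather than the data, by combining an empirical estimate with auxiliary-information estimates, might be sketched as below. The weights and the auxiliary matrix here are placeholders; the published method learns the combination by ensemble learning.

```python
import numpy as np

def ensemble_correlation(X, aux_corrs, weights=None):
    """Average the empirical gene-by-gene correlation of the (sparse)
    expression matrix with auxiliary correlation estimates, e.g. derived
    from known gene networks. A hypothetical sketch of the ensembling
    idea; weights are fixed here rather than learned."""
    emp = np.corrcoef(X, rowvar=False)           # genes are columns
    estimates = [emp] + list(aux_corrs)
    if weights is None:
        weights = np.ones(len(estimates)) / len(estimates)
    out = sum(w * m for w, m in zip(weights, estimates))
    return (out + out.T) / 2                      # enforce symmetry

# Toy counts for 5 genes across 100 cells, plus a trivial auxiliary prior.
rng = np.random.default_rng(2)
X = rng.poisson(2.0, size=(100, 5)).astype(float)
prior = np.eye(5)                                 # hypothetical network prior
C = ensemble_correlation(X, [prior])
print(np.allclose(C, C.T))  # → True
```

The resulting matrix, not the raw counts, would then feed downstream steps such as clustering or graphical model estimation.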


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Cluster Analysis , Computer Simulation , RNA-Seq , Reproducibility of Results , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods
7.
Article in English | MEDLINE | ID: mdl-34746376

ABSTRACT

Ridge-like regularization often leads to improved generalization performance of machine learning models by mitigating overfitting. While ridge-regularized machine learning methods are widely used in many important applications, direct training via optimization could become challenging in huge data scenarios with millions of examples and features. We tackle such challenges by proposing a general approach that achieves ridge-like regularization through implicit techniques named Minipatch Ridge (MPRidge). Our approach is based on taking an ensemble of coefficients of unregularized learners trained on many tiny, random subsamples of both the examples and features of the training data, which we call minipatches. We empirically demonstrate that MPRidge induces an implicit ridge-like regularizing effect and performs nearly the same as explicit ridge regularization for a general class of predictors including logistic regression, SVM, and robust regression. Embarrassingly parallelizable, MPRidge provides a computationally appealing alternative to inducing ridge-like regularization for improving generalization performance in challenging big-data settings.
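The minipatch-ensemble idea behind MPRidge can be sketched as follows: average unregularized least-squares fits over many tiny random subsets of examples and features. This is an illustrative sketch under assumed sizes, not the paper's implementation, and the demo data are synthetic.

```python
import numpy as np

def mpridge(X, y, n_patches=300, m=30, f=5, seed=0):
    """Ensemble of unregularized least-squares coefficients fit on tiny
    random subsets (minipatches) of rows and columns. The averaging
    induces a ridge-like shrinkage effect implicitly."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    counts = np.zeros(p)
    for _ in range(n_patches):
        rows = rng.choice(n, m, replace=False)
        cols = rng.choice(p, f, replace=False)
        # Unregularized fit on the minipatch only.
        b, *_ = np.linalg.lstsq(X[np.ix_(rows, cols)], y[rows], rcond=None)
        beta[cols] += b
        counts[cols] += 1
    return beta / np.maximum(counts, 1)           # average per feature

# Synthetic regression where only the first feature matters.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 200)
b = mpridge(X, y)
print(np.argmax(np.abs(b)))  # → 0
```

Each minipatch fit is independent, which is what makes the scheme embarrassingly parallelizable as the abstract notes.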

8.
Article in English | MEDLINE | ID: mdl-34734115

ABSTRACT

Boosting methods are among the best general-purpose and off-the-shelf machine learning approaches, gaining widespread popularity. In this paper, we seek to develop a boosting method that yields comparable accuracy to popular AdaBoost and gradient boosting methods, yet is faster computationally and whose solution is more interpretable. We achieve this by developing MP-Boost, an algorithm loosely based on AdaBoost that learns by adaptively selecting small subsets of instances and features, or what we term minipatches (MP), at each iteration. By sequentially learning on tiny subsets of the data, our approach is computationally faster than other classic boosting algorithms. Also as it progresses, MP-Boost adaptively learns a probability distribution on the features and instances that upweight the most important features and challenging instances, hence adaptively selecting the most relevant minipatches for learning. These learned probability distributions also aid in interpretation of our method. We empirically demonstrate the interpretability, comparative accuracy, and computational time of our approach on a variety of binary classification tasks.
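The adaptive-minipatch loop described above might look roughly like the sketch below: each round samples a small instance/feature subset, fits a weak learner, and upweights instances the ensemble still misclassifies. This is a loose illustration with invented sizes, plain uniform voting, and a toy decision stump, not the MP-Boost algorithm itself (which also learns a feature distribution).

```python
import numpy as np

def stump_fit(X, y):
    """Best single-feature threshold (depth-1 tree) for labels in {-1,+1}."""
    best = (0, 0.0, 1, np.inf)               # (feature, threshold, sign, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, -s, s)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (j, t, s, err)
    return best[:3]

def stump_predict(stump, X):
    j, t, s = stump
    return np.where(X[:, j] <= t, -s, s)

def mp_boost(X, y, rounds=25, m=20, f=2, seed=0):
    """Each round: sample a minipatch of instances (weighted toward current
    mistakes) and a random feature subset, fit a stump, then upweight the
    instances the ensemble still gets wrong."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.ones(n) / n                        # instance sampling distribution
    stumps = []
    for _ in range(rounds):
        rows = rng.choice(n, m, replace=False, p=w)
        cols = rng.choice(p, f, replace=False)
        j, t, s = stump_fit(X[np.ix_(rows, cols)], y[rows])
        stumps.append((cols[j], t, s))        # map back to original feature
        vote = np.sign(sum(stump_predict(st, X) for st in stumps))
        w = np.where(vote != y, w * 2.0, w)   # focus on challenging instances
        w = w / w.sum()
    return stumps

def predict(stumps, X):
    return np.sign(sum(stump_predict(st, X) for st in stumps))

# Toy binary task that depends on feature 0 only.
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 4))
y = np.where(X[:, 0] > 0, 1, -1)
model = mp_boost(X, y)
acc = (predict(model, X) == y).mean()
print(acc)
```

The learned sampling weights `w` are also the interpretability hook: instances (and, in the real method, features) that keep being sampled are the ones driving the model.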

9.
J Mach Learn Res ; 22, 2021 Jan.
Article in English | MEDLINE | ID: mdl-34744522

ABSTRACT

In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.

10.
Article in English | MEDLINE | ID: mdl-34278383

ABSTRACT

Central venous pressure (CVP) is the blood pressure in the venae cavae, near the right atrium of the heart. This signal waveform is commonly collected in clinical settings, and yet there has been limited discussion of using this data for detecting arrhythmia and other cardiac events. In this paper, we develop a signal processing and feature engineering pipeline for CVP waveform analysis. Through a case study on pediatric junctional ectopic tachycardia (JET), we show that our extracted CVP features reliably detect JET with comparable results to the more commonly used electrocardiogram (ECG) features. This machine learning pipeline can thus improve the clinical diagnosis and ICU monitoring of arrhythmia. It also corroborates and complements the ECG-based diagnosis, especially when the ECG measurements are unavailable or corrupted.
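As a hypothetical example of the kind of feature such a pipeline might extract from a CVP waveform, the sketch below detects pressure peaks and derives a beat rate and its variability; tachycardia detection would key off features like these. The function, parameters, and synthetic waveform are all illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def cvp_rate_feature(wave, fs):
    """Detect pressure peaks by simple local-maximum picking above a
    height cut, then return (beats per minute, interval std in seconds).
    A real pipeline would filter the signal and engineer many more
    features."""
    height = wave.mean() + 0.5 * wave.std()
    peaks = [i for i in range(1, len(wave) - 1)
             if wave[i] > wave[i - 1] and wave[i] >= wave[i + 1]
             and wave[i] > height]
    intervals = np.diff(peaks) / fs              # seconds between beats
    return 60.0 / intervals.mean(), intervals.std()

# Synthetic pressure-like wave: 2 Hz oscillation (120 bpm) at 100 Hz sampling.
fs = 100
t = np.arange(0, 10, 1 / fs)
wave = np.sin(2 * np.pi * 2.0 * t)
rate, variability = cvp_rate_feature(wave, fs)
print(round(rate))  # → 120
```

An abnormally high rate or interval variability from features like these is the sort of signal a classifier could combine with ECG features, or fall back on when the ECG is corrupted.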

11.
PLoS One ; 15(11): e0241707, 2020.
Article in English | MEDLINE | ID: mdl-33152028

ABSTRACT

Even though there is a clear link between Alzheimer's Disease (AD) related neuropathology and cognitive decline, numerous studies have observed that healthy cognition can exist in the presence of extensive AD pathology, a phenomenon sometimes called Cognitive Resilience (CR). To better understand and study CR, we develop the Alzheimer's Disease Cognitive Resilience Score (AD-CR Score), which we define as the difference between the observed and expected cognition given the observed level of AD pathology. Unlike other definitions of CR, our AD-CR Score is a fully non-parametric, stand-alone, individual-level quantification of CR that is derived independently of other factors or proxy variables. Using data from two ongoing, longitudinal cohort studies of aging, the Religious Orders Study (ROS) and the Rush Memory and Aging Project (MAP), we validate our AD-CR Score by showing strong associations with known factors related to CR such as baseline and longitudinal cognition, non AD-related pathology, education, personality, APOE, parkinsonism, depression, and life activities. Even though the proposed AD-CR Score cannot be directly calculated during an individual's lifetime because it uses postmortem pathology, we also develop a machine learning framework that achieves promising results in terms of predicting whether an individual will have an extremely high or low AD-CR Score using only measures available during the lifetime. Given this, our AD-CR Score can be used for further investigations into mechanisms of CR, and potentially for subject stratification prior to clinical trials of personalized therapies.
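The definition above, observed cognition minus expected cognition given pathology, can be sketched directly. The paper estimates the expectation non-parametrically; this sketch substitutes a simple linear fit purely for illustration, and the variable names and synthetic data are assumptions.

```python
import numpy as np

def ad_cr_score(cognition, pathology):
    """Residual of cognition after regressing out pathology level:
    positive scores indicate better-than-expected cognition (resilience).
    Linear fit used here only as a stand-in for the paper's
    non-parametric expectation."""
    A = np.column_stack([np.ones_like(pathology), pathology])
    coef, *_ = np.linalg.lstsq(A, cognition, rcond=None)
    expected = A @ coef
    return cognition - expected

# Synthetic cohort: cognition declines with pathology, plus noise.
rng = np.random.default_rng(5)
pathology = rng.uniform(0, 1, 200)
cognition = 1.0 - 0.8 * pathology + rng.normal(0, 0.05, 200)
scores = ad_cr_score(cognition, pathology)
print(scores.shape)  # → (200,)
```

Individuals with high pathology but near-intact cognition land in the right tail of `scores`, which is the stratification the abstract proposes exploiting.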


Subject(s)
Alzheimer Disease/diagnosis , Alzheimer Disease/physiopathology , Cognitive Dysfunction/diagnosis , Cognitive Dysfunction/physiopathology , Cohort Studies , Humans , Longitudinal Studies
12.
J Comput Graph Stat ; 29(1): 87-96, 2020.
Article in English | MEDLINE | ID: mdl-32982130

ABSTRACT

Convex clustering is a promising new approach to the classical problem of clustering, combining strong performance in empirical studies with rigorous theoretical foundations. Despite these advantages, convex clustering has not been widely adopted, due to its computationally intensive nature and its lack of compelling visualizations. To address these impediments, we introduce Algorithmic Regularization, an innovative technique for obtaining high-quality estimates of regularization paths using an iterative one-step approximation scheme. We justify our approach with a novel theoretical result, guaranteeing global convergence of the approximate path to the exact solution under easily-checked non-data-dependent assumptions. The application of algorithmic regularization to convex clustering yields the Convex Clustering via Algorithmic Regularization Paths (CARP) algorithm for computing the clustering solution path. On example data sets from genomics and text analysis, CARP delivers over a 100-fold speed-up over existing methods, while attaining a finer approximation grid than standard methods. Furthermore, CARP enables improved visualization of clustering solutions: the fine solution grid returned by CARP can be used to construct a convex clustering-based dendrogram, as well as forming the basis of a dynamic path-wise visualization based on modern web technologies. Our methods are implemented in the open-source R package clustRviz, available at https://github.com/DataSlingers/clustRviz.

13.
ACM BCB ; 2020, 2020 Sep.
Article in English | MEDLINE | ID: mdl-34278382

ABSTRACT

Single-cell RNA sequencing is a powerful technique that measures the gene expression of individual cells in a high-throughput fashion. However, due to sequencing inefficiency, the data suffers from dropout events: technical artifacts in which genes erroneously appear to have zero expression. Many data imputation methods have been proposed to alleviate this issue. Yet, effective imputation can be difficult and biased because the data is sparse and high-dimensional, resulting in major distortions in downstream analyses. In this paper, we propose a completely novel approach that imputes the gene-by-gene correlations rather than the data itself. We call this method SCENA: Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information. The SCENA gene-by-gene correlation matrix estimate is obtained by model stacking of multiple imputed correlation matrices based on known auxiliary information about gene connections. In an extensive simulation study based on real scRNA-seq data, we demonstrate that SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.

14.
Neuroimage ; 197: 330-343, 2019 08 15.
Article in English | MEDLINE | ID: mdl-31029870

ABSTRACT

Advanced brain imaging techniques make it possible to measure individuals' structural connectomes in large cohort studies non-invasively. Given the availability of large scale data sets, it is extremely interesting and important to build a set of advanced tools for structural connectome extraction and statistical analysis that emphasize both interpretability and predictive power. In this paper, we developed and integrated a set of toolboxes, including an advanced structural connectome extraction pipeline and a novel tensor network principal components analysis (TN-PCA) method, to study relationships between structural connectomes and various human traits such as alcohol and drug use, cognition and motion abilities. The structural connectome extraction pipeline produces a set of connectome features for each subject that can be organized as a tensor network, and TN-PCA maps the high-dimensional tensor network data to a lower-dimensional Euclidean space. Combined with classical hypothesis testing, canonical correlation analysis and linear discriminant analysis techniques, we analyzed over 1100 scans of 1076 subjects from the Human Connectome Project (HCP) and the Sherbrooke test-retest data set, as well as 175 human traits measuring different domains including cognition, substance use, motor, sensory and emotion. The test-retest data validated the developed algorithms. With the HCP data, we found that structural connectomes are associated with a wide range of traits, e.g., fluid intelligence, language comprehension, and motor skills are associated with increased cortical-cortical brain structural connectivity, while the use of alcohol, tobacco, and marijuana are associated with decreased cortical-cortical connectivity. We also demonstrated that our extracted structural connectomes and analysis method can give superior prediction accuracies compared with alternative connectome constructions and other tensor and network regression methods.


Subject(s)
Brain/anatomy & histology , Connectome/methods , Diffusion Tensor Imaging/methods , Image Processing, Computer-Assisted/methods , Personality/physiology , Brain/diagnostic imaging , Data Interpretation, Statistical , Female , Humans , Male , Models, Neurological , Neural Pathways/anatomy & histology , Principal Component Analysis
15.
PLoS One ; 13(9): e0203007, 2018.
Article in English | MEDLINE | ID: mdl-30204756

ABSTRACT

Several modern genomic technologies, such as DNA-methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead, scientists seek to discover biologically interpretable genomic regions comprised of contiguous groups of probes, which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. We use ideas from fusion penalties and convex clustering to introduce a method for Spatial Convex Clustering, or SpaCC. Our method is specifically tailored to detecting multi-subject regions of methylation, but we also test our approach on the well-studied problem of detecting segments of copy number variation. We formulate our method as a convex optimization problem, develop a massively parallelizable algorithm to find its solution, and introduce automated approaches for handling missing values and determining tuning parameters. Through simulation studies based on real methylation and copy number variation data, we show that SpaCC exhibits significant performance gains relative to existing methods. Finally, we illustrate SpaCC's advantages as a pre-processing technique that reduces large-scale genomics data into a smaller number of genomic regions through several cancer epigenetics case studies on subtype discovery, network estimation, and epigenetic-wide association.
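The probe-to-region mapping SpaCC performs can be caricatured as below: adjacent probes whose subject-wise profiles are close get fused into one region. This toy version uses a hard threshold on adjacent-probe gaps where the real method uses a convex fusion penalty; the function, tolerance, and data are illustrative only.

```python
import numpy as np

def spatial_regions(M, tol=0.5):
    """Map contiguous probes to shared regions across all subjects by
    starting a new region wherever the mean absolute gap between
    adjacent probe columns is large. A thresholding stand-in for
    SpaCC's fusion-penalty formulation.

    M: (n_subjects, n_probes) methylation matrix.
    Returns an integer region label per probe."""
    diffs = np.abs(np.diff(M, axis=1)).mean(axis=0)   # adjacent-probe gaps
    labels = np.zeros(M.shape[1], dtype=int)
    for j, d in enumerate(diffs):
        labels[j + 1] = labels[j] + (d > tol)          # new region at big gaps
    return labels

# Five probes, two subjects: probes 0-1, 2-3, and 4 form three regions.
M = np.array([[0.1, 0.1, 0.9, 0.9, 0.2],
              [0.2, 0.2, 0.8, 0.8, 0.1]])
print(spatial_regions(M).tolist())  # → [0, 0, 1, 1, 2]
```

The region labels are exactly the dimension-reducing output the abstract describes: downstream analyses operate on per-region summaries instead of hundreds of thousands of probes.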


Subject(s)
Genomics/methods , Breast Neoplasms/genetics , Cluster Analysis , Computer Simulation , DNA Copy Number Variations , DNA Methylation , Female , Genome , Humans , Ovarian Neoplasms/genetics , Spatial Analysis , Unsupervised Machine Learning
16.
Bioinformatics ; 34(7): 1141-1147, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29617963

ABSTRACT

Motivation: Batch effects are one of the major sources of technical variation that affect measurements in high-throughput studies such as RNA sequencing. It has been well established that batch effects can be caused by different experimental platforms, laboratory conditions, sources of samples, and personnel differences. These differences can confound the outcomes of interest and lead to spurious results. A critical input for batch correction algorithms is the knowledge of batch factors, which in many cases are unknown or inaccurate. Hence, the primary motivation of our paper is to detect hidden batch factors that can be used in standard techniques to accurately capture the relationship between gene expression and other modeled variables of interest. Results: We introduce a new algorithm based on data-adaptive shrinkage and semi-Non-negative Matrix Factorization for the detection of unknown batch effects. We test our algorithm on three different datasets: (i) Sequencing Quality Control, (ii) Topotecan RNA-Seq, and (iii) single-cell RNA sequencing (scRNA-Seq) on Glioblastoma Multiforme. We demonstrate superior performance in identifying hidden batch effects compared to existing algorithms for batch detection in all three datasets. In the Topotecan study, we identified a new batch factor that was missed by the original study, leading to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrated the power of our method in detecting subtle batch effects. Availability and implementation: The DASC R package is available via Bioconductor or at https://github.com/zhanglabNKU/DASC. Contact: zhanghan@nankai.edu.cn or zhandonl@bcm.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Gene Expression Profiling/methods , Quality Control , Research Design , Sequence Analysis, RNA/methods , Glioblastoma/genetics , Humans , Topotecan/pharmacology
17.
BMC Bioinformatics ; 18(Suppl 11): 405, 2017 Oct 03.
Article in English | MEDLINE | ID: mdl-28984189

ABSTRACT

The 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) was held on December 8-10, 2016 in Houston, Texas, USA. ICIBM included eight scientific sessions, four tutorials, one poster session, four highlighted talks and four keynotes that covered topics on 3D genomics structural analysis, next generation sequencing (NGS) analysis, computational drug discovery, medical informatics, cancer genomics, and systems biology. Here, we present a summary of the nine research articles selected from ICIBM 2016 program for publishing in BMC Bioinformatics.


Subject(s)
Biology , Congresses as Topic , Internationality , Medicine , Statistics as Topic , Algorithms , DNA Copy Number Variations/genetics , Humans , Machine Learning , Neural Networks, Computer , RNA Splicing/genetics , Sequence Analysis, RNA
18.
BMC Genomics ; 18(Suppl 6): 703, 2017 Oct 03.
Article in English | MEDLINE | ID: mdl-28984207

ABSTRACT

In this editorial, we first summarize the 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) that was held on December 8-10, 2016 in Houston, Texas, USA, and then briefly introduce the ten research articles included in this supplement issue. ICIBM 2016 included four workshops or tutorials, four keynote lectures, four conference invited talks, eight concurrent scientific sessions and a poster session for 53 accepted abstracts, covering current topics in bioinformatics, systems biology, intelligent computing, and biomedical informatics. Through our call for papers, a total of 77 original manuscripts were submitted to ICIBM 2016. After peer review, 11 articles were selected in this special issue, covering topics such as single cell RNA-seq analysis method, genome sequence and variation analysis, bioinformatics method for vaccine development, and cancer genomics.


Subject(s)
Genomics , Inventions , Medicine
19.
Biometrics ; 73(1): 10-19, 2017 03.
Article in English | MEDLINE | ID: mdl-27163413

ABSTRACT

In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, the problem of identifying structure in high-dimensional genomic data motivates this work. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees, features that arguably are lacking in the current alternative algorithms. We demonstrate the advantages of our approach, which include stably and reproducibly identifying biclusterings, on simulated and real microarray data.


Subject(s)
Cluster Analysis , Data Interpretation, Statistical , Gene Regulatory Networks , Algorithms , Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis
20.
BMC Syst Biol ; 10 Suppl 3: 69, 2016 08 26.
Article in English | MEDLINE | ID: mdl-27586041

ABSTRACT

BACKGROUND: Technological advances in medicine have led to a rapid proliferation of high-throughput "omics" data. Tools to mine this data and discover disrupted disease networks are needed as they hold the key to understanding complicated interactions between genes, mutations and aberrations, and epigenetic markers. RESULTS: We developed an R software package, XMRF, that can be used to fit Markov networks to various types of high-throughput genomics data. Encoding the models and estimation techniques of the recently proposed exponential family Markov Random Fields (Yang et al., 2012), our software can be used to learn genetic networks from RNA-sequencing data (counts via Poisson graphical models), mutation and copy number variation data (categorical via Ising models), and methylation data (continuous via Gaussian graphical models). CONCLUSIONS: XMRF is the only tool that allows network structure learning using the native distribution of the data instead of the standard Gaussian. Moreover, the parallelization feature of the implemented algorithms computes large-scale biological networks efficiently. XMRF is available from CRAN and GitHub ( https://github.com/zhandong/XMRF ).


Subject(s)
Genomics , High-Throughput Nucleotide Sequencing , Markov Chains , Sequence Analysis, RNA , Software , Statistics as Topic/methods , Computer Graphics , DNA Copy Number Variations , Mutation , Poisson Distribution