Classification of histogram-valued data with support histogram machines.

Kang, Ilsuk; Park, Cheolwoo; Yoon, Young Joo; Park, Changyi; Kwon, Soon-Sun; Choi, Hosik.

J Appl Stat ; 50(3): 675-690, 2023.

Article in English | MEDLINE | ID: mdl-36819077

ABSTRACT

The current large amounts of data and advanced technologies have produced new types of complex data, such as histogram-valued data. The paper focuses on classification problems when predictors are observed as or aggregated into histograms. Because conventional classification methods take vectors as input, a natural approach converts histograms into vector-valued data using summary values, such as the mean or median. However, this approach forgoes the distributional information available in histograms. To address this issue, we propose a margin-based classifier called support histogram machine (SHM) for histogram-valued data. We adopt the support vector machine framework and the Wasserstein-Kantorovich metric to measure distances between histograms. The proposed optimization problem is solved by a dual approach. We then test the proposed SHM via simulated and real examples and demonstrate its superior performance to summary-value-based methods.
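
The distance computation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes both histograms share the same bin centers, treats each bin's mass as concentrated at its center, and uses the order-1 Wasserstein distance, which may differ from the order used in the paper. The function name `wasserstein1` is hypothetical.

```python
from itertools import accumulate


def wasserstein1(centers, p, q):
    """Order-1 Wasserstein distance between two histograms on shared bin centers.

    Assumption: each bin's mass sits at its center, so the distance equals the
    area between the two cumulative distribution functions.
    """
    if not (len(centers) == len(p) == len(q)):
        raise ValueError("centers, p and q must have the same length")
    total_p, total_q = sum(p), sum(q)
    # Normalize the bin masses and accumulate them into CDFs.
    cdf_p = list(accumulate(x / total_p for x in p))
    cdf_q = list(accumulate(x / total_q for x in q))
    # Sum |F - G| times the gap between consecutive bin centers.
    return sum(
        abs(cdf_p[i] - cdf_q[i]) * (centers[i + 1] - centers[i])
        for i in range(len(centers) - 1)
    )
```

Unlike a comparison of means or medians, this distance changes whenever the shape of either histogram changes, which is the distributional information the abstract says summary values discard.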

Network analysis for count data with excess zeros.

Choi, Hosik; Gim, Jungsoo; Won, Sungho; Kim, You Jin; Kwon, Sunghoon; Park, Changyi.

BMC Genet ; 18(1): 93, 2017 11 06.

Article in English | MEDLINE | ID: mdl-29110633

ABSTRACT

BACKGROUND: Undirected graphical models or Markov random fields have been a popular class of models for representing conditional dependence relationships between nodes. In particular, Markov networks help us to understand complex interactions between genes in biological processes of a cell. Local Poisson models seem to be promising in modeling positive as well as negative dependencies for count data. Furthermore, when zero counts are more frequent than expected, excess zeros should be considered in the model. METHODS: We present a penalized Poisson graphical model for zero-inflated count data and derive an expectation-maximization (EM) algorithm built on coordinate descent. Our method is shown to be effective through simulated and real data analysis. RESULTS: Results from the simulated data indicate that our method outperforms the local Poisson graphical model in the presence of excess zeros. In an application to RNA sequencing data, we also investigate the gender effect by comparing the estimated networks according to different genders. Our method may help us in identifying biological pathways linked to sex hormone regulation and thus understanding underlying mechanisms of the gender differences. CONCLUSIONS: We have presented a penalized version of zero-inflated spatial Poisson regression and derived an efficient EM algorithm built on coordinate descent. We discuss possible improvements of our method as well as potential research directions associated with our findings from the RNA sequencing data.
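
As a rough illustration of the EM idea for excess zeros, the sketch below fits a single zero-inflated Poisson distribution with no covariates, no penalty, and no graph structure. It is therefore much simpler than the penalized graphical model described in the abstract and is not the authors' algorithm. The function name `fit_zip` and the parameter names are hypothetical.

```python
import math


def fit_zip(counts, n_iter=200):
    """EM estimates (pi, lam) for a zero-inflated Poisson distribution.

    pi is the probability of a structural (excess) zero and lam is the
    Poisson mean of the remaining component.
    """
    n = len(counts)
    total = sum(counts)
    n_zero = sum(1 for y in counts if y == 0)
    pi, lam = 0.5, max(total / n, 1e-8)  # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that an observed zero is structural.
        w = pi / (pi + (1 - pi) * math.exp(-lam))
        z_sum = n_zero * w
        # M-step: closed-form updates of the mixing weight and Poisson mean.
        pi = z_sum / n
        lam = total / (n - z_sum)
    return pi, lam
```

When the data contain many more zeros than a plain Poisson fit would predict, the estimated `lam` exceeds the sample mean and `pi` absorbs the surplus zeros.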

Subject(s)

Algorithms , Gene Expression Profiling/methods , Gene Regulatory Networks , High-Throughput Nucleotide Sequencing/methods , Models, Statistical , Sequence Analysis, RNA/methods , Computer Simulation , Female , Humans , Male , Poisson Distribution

Improving Disease Prediction by Incorporating Family Disease History in Risk Prediction Models with Large-Scale Genetic Data.

Gim, Jungsoo; Kim, Wonji; Kwak, Soo Heon; Choi, Hosik; Park, Changyi; Park, Kyong Soo; Kwon, Sunghoon; Park, Taesung; Won, Sungho.

Genetics ; 207(3): 1147-1155, 2017 11.

Article in English | MEDLINE | ID: mdl-28899997

ABSTRACT

Despite the many successes of genome-wide association studies (GWAS), the known susceptibility variants identified by GWAS have modest effect sizes, leading to notable skepticism about the effectiveness of building a risk prediction model from large-scale genetic data. However, in contrast to genetic variants, the family history of diseases has been largely accepted as an important risk factor in clinical diagnosis and risk prediction. Nevertheless, the complicated structures of the family history of diseases have limited their application in clinical practice. Here, we developed a new method that enables incorporation of the general family history of diseases with a liability threshold model, and propose a new analysis strategy for risk prediction with penalized regression analysis that incorporates both large numbers of genetic variants and clinical risk factors. Application of our model to type 2 diabetes in the Korean population (1846 cases and 1846 controls) demonstrated that single-nucleotide polymorphisms accounted for 32.5% of the variation explained by the predicted risk scores in the test data set, and incorporation of family history led to an additional 6.3% improvement in prediction. Our results illustrate that family medical history provides valuable information on the variation of complex diseases and improves prediction performance.

Subject(s)

Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Medical History Taking/methods , Models, Genetic , Pedigree , Diabetes Mellitus, Type 2/genetics , Genetic Variation , Genome-Wide Association Study/standards , Humans , Medical History Taking/standards

Evaluation of Penalized and Nonpenalized Methods for Disease Prediction with Large-Scale Genetic Data.

Won, Sungho; Choi, Hosik; Park, Suyeon; Lee, Juyoung; Park, Changyi; Kwon, Sunghoon.

Biomed Res Int ; 2015: 605891, 2015.

Article in English | MEDLINE | ID: mdl-26346893

ABSTRACT

Owing to recent improvements in genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci, and these findings have substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which has been a major hurdle to building disease prediction models. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called "large P and small N" problem. Penalized regressions, including the least absolute shrinkage and selection operator (LASSO) and ridge regression, limit the space of parameters, and this constraint enables the estimation of effects for a very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods, at least for the diseases under consideration.
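
How the two penalties constrain the parameters can be seen in the one-dimensional case. The sketch below is an illustration only, not code from the paper; it assumes a single coefficient b with unpenalized estimate z and the objectives 0.5*(z - b)^2 + lam*|b| (lasso) and 0.5*(z - b)^2 + 0.5*lam*b^2 (ridge). The function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Lasso solution: shrink z toward zero by lam and set small values to 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Ridge solution: scale z toward zero without ever reaching exactly 0."""
    return z / (1.0 + lam)
```

The lasso update removes weak effects entirely, which is what makes it usable as a variable selector when there are far more SNPs than subjects, whereas ridge keeps every coefficient but shrinks it.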

Subject(s)

Genetic Predisposition to Disease , Models, Genetic , Polymorphism, Single Nucleotide , Animals , Humans , Predictive Value of Tests

Gradient lasso for Cox proportional hazards model.

Sohn, Insuk; Kim, Jinseog; Jung, Sin-Ho; Park, Changyi.

Bioinformatics ; 25(14): 1775-81, 2009 Jul 15.

Article in English | MEDLINE | ID: mdl-19447787

ABSTRACT

MOTIVATION: There has been an increasing interest in expressing a survival phenotype (e.g. time to cancer recurrence or death) or its distribution in terms of a subset of the expression data of a subset of genes. Due to the high dimensionality of gene expression data, however, there is a serious problem of collinearity in fitting a prediction model, e.g. Cox's proportional hazards model. To avoid the collinearity problem, several methods based on penalized Cox proportional hazards models have been proposed. However, those methods suffer from severe computational problems, such as slow or even failed convergence, because of the high-dimensional matrix inversions required for model fitting. We propose to implement the penalized Cox regression with a lasso penalty via the gradient lasso algorithm, which yields faster convergence to the global optimum than other algorithms do. Moreover, the gradient lasso algorithm is guaranteed to converge to the optimum under mild regularity conditions. Hence, our gradient lasso algorithm can be a useful tool in developing a prediction model based on high-dimensional covariates, including gene expression data. RESULTS: Results from simulation studies showed that the prediction model by gradient lasso recovers the prognostic genes. Also, results from diffuse large B-cell lymphoma datasets and the Norway/Stanford breast cancer dataset indicate that our method is very competitive compared with popular existing methods by Park and Hastie and Goeman in its computational time, prediction and selectivity. AVAILABILITY: R package glcoxph is available at http://datamining.dongguk.ac.kr/R/glcoxph.
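
For orientation, the objective that a lasso-penalized Cox regression minimizes can be sketched as follows. This is only the objective function, written in its penalized (Lagrangian) form and assuming no tied event times; it is not the gradient lasso algorithm and is unrelated to the glcoxph package's code. All names are hypothetical.

```python
import math


def penalized_neg_log_partial_likelihood(times, events, covariates, beta, lam):
    """Negative Cox log partial likelihood plus an L1 penalty on beta.

    times: observed times; events: 1 if the event occurred, 0 if censored;
    covariates: one row of covariate values per subject.
    """
    # Linear predictor for every subject.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    nll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: subjects still under observation at time t_i.
        risk = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        nll -= eta[i] - math.log(risk)
    return nll + lam * sum(abs(b) for b in beta)
```

No matrix inversion is needed to evaluate this function; the computational difficulties the abstract mentions arise in the optimization step that this sketch does not attempt.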

Subject(s)

Computational Biology/methods , Gene Expression Profiling/methods , Algorithms , Humans , Lymphoma, B-Cell/genetics , Proportional Hazards Models

A copula method for modeling directional dependence of genes.

Kim, Jong-Min; Jung, Yoon-Sung; Sungur, Engin A; Han, Kap-Hoon; Park, Changyi; Sohn, Insuk.

BMC Bioinformatics ; 9: 225, 2008 May 01.

Article in English | MEDLINE | ID: mdl-18447957

ABSTRACT

BACKGROUND: Genes interact with each other as basic building blocks of life, forming a complicated network. The relationship between groups of genes with different functions can be represented as gene networks. With the deposition of huge microarray data sets in public domains, study on gene networking is now possible. In recent years, there has been an increasing interest in the reconstruction of gene networks from gene expression data. Recent work includes linear models, Boolean network models, and Bayesian networks. Among them, Bayesian networks seem to be the most effective in constructing gene networks. A major problem with the Bayesian network approach is the excessive computational time. This problem is due to the interactive feature of the method that requires a large search space. Since fitting a model by using the copulas does not require iterations, elicitation of the priors, and complicated calculations of posterior distributions, the need for reference to extensive search spaces can be eliminated, leading to manageable computational efforts. The Bayesian network approach produces a discrete expression of conditional probabilities. Discreteness of the characteristics is not required in the copula approach, which involves use of uniform representation of the continuous random variables. Our method is able to overcome the limitation of the Bayesian network method for gene-gene interaction, i.e. information loss due to binary transformation. RESULTS: We analyzed the gene interactions for two gene data sets (one group is eight histone genes and the other group is 19 genes which include DNA polymerases, DNA helicase, type B cyclin genes, DNA primases, radiation sensitive genes, repair-related genes, replication protein A encoding gene, DNA replication initiation factor, securin gene, nucleosome assembly factor, and a subunit of the cohesin complex) by adopting a measure of directional dependence based on a copula function. We have compared our results with those from other methods in the literature. Although microarray results show a transcriptional co-regulation pattern and do not imply that the gene products are physically interactive, this tight genetic connection may suggest that each gene product has either direct or indirect connections with the other gene products. Indeed, recent comprehensive analysis of a protein interaction map revealed that those histone genes are physically connected with each other, supporting the results obtained by our method. CONCLUSION: The results illustrate that our method can be an alternative to Bayesian networks in modeling gene interactions. One advantage of our approach is that dependence between genes is not assumed to be linear. Another advantage is that our approach can detect directional dependence. We expect that our study may help to design artificial drug candidates, which can block or activate biologically meaningful pathways. Moreover, our copula approach can be extended to investigate the effects of local environments on protein-protein interactions. The copula mutual information approach will help to propose the new variant of ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks): an algorithm for the reconstruction of gene regulatory networks.

Subject(s)

Computational Biology/methods , Gene Regulatory Networks , Models, Statistical , Cell Cycle/genetics , Data Interpretation, Statistical , Gene Expression Profiling/methods , Gene Regulatory Networks/physiology , Genes, Fungal/physiology , Likelihood Functions , Models, Genetic , Multivariate Analysis , Pattern Recognition, Automated/methods , Saccharomyces cerevisiae/cytology , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism , Stochastic Processes
