Results 1 - 20 of 56
1.
Stat Med ; 43(18): 3503-3523, 2024 Aug 15.
Article in English | MEDLINE | ID: mdl-38857600

ABSTRACT

Analysis of competing risks data has been an important topic in survival analysis due to the need to account for the dependence among the competing events. Also, event times are often recorded on discrete time scales, rendering the models tailored for discrete-time nature useful in the practice of survival analysis. In this work, we focus on regression analysis with discrete-time competing risks data, and consider the errors-in-variables issue where the covariates are prone to measurement errors. Viewing the true covariate value as a parameter, we develop the conditional score methods for various discrete-time competing risks models, including the cause-specific and subdistribution hazards models that have been popular in competing risks data analysis. The proposed estimators can be implemented by efficient computation algorithms, and the associated large sample theories can be simply obtained. Simulation results show satisfactory finite sample performances, and the application with the competing risks data from the scleroderma lung study reveals the utility of the proposed methods.


Subject(s)
Computer Simulation , Proportional Hazards Models , Humans , Survival Analysis , Algorithms , Models, Statistical , Regression Analysis , Risk Assessment/methods , Scleroderma, Systemic
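
As a rough illustration of the discrete-time competing risks setting in the abstract above, the following minimal sketch computes cause-specific cumulative incidence from discrete cause-specific hazards. It shows only the standard textbook relationship and is not the conditional score estimator proposed in the paper; the function name `cumulative_incidence` and the input layout are assumptions made for this example.

```python
def cumulative_incidence(hazards):
    """Cumulative incidence per cause from discrete cause-specific hazards.

    hazards[t][k] is assumed to be the probability of failing from cause k
    in period t, given that no event occurred before period t.
    Returns a list with one per-cause cumulative incidence vector per period.
    """
    n_causes = len(hazards[0])
    surv = 1.0                 # probability of being event-free so far
    cif = [0.0] * n_causes     # running cumulative incidence per cause
    out = []
    for h in hazards:
        for k in range(n_causes):
            cif[k] += surv * h[k]   # fail from cause k in this period
        surv *= 1.0 - sum(h)        # survive all causes in this period
        out.append(list(cif))
    return out
```

At every period the per-cause cumulative incidences and the event-free probability sum to one, which is a quick sanity check on any hazard input.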
2.
J Bioinform Comput Biol ; 21(3): 2350013, 2023 06.
Article in English | MEDLINE | ID: mdl-37350314

ABSTRACT

Precision medicine has been a global trend of medical development, wherein cancer diagnosis plays an important role. With accurate diagnosis of cancer, we can provide patients with appropriate medical treatments for improving patients' survival. Since disease developments involve complex interplay among multiple factors such as gene-gene interactions, cancer classifications based on microarray gene expression profiling data are expected to be effective, and hence, have attracted extensive attention in computational biology and medicine. However, when using genomic data to build a diagnostic model, there exist several problems to be overcome, including the high-dimensional feature space and feature contamination. In this paper, we propose using the overlapping group screening (OGS) approach to build an accurate cancer diagnosis model and predict the probability of a patient falling into some disease classification category in the logistic regression framework. This new proposal integrates gene pathway information into the procedure for identifying genes and gene-gene interactions associated with the classification of cancer outcome groups. We conduct a series of simulation studies to compare the predictive accuracy of our proposed method for cancer diagnosis with some existing machine learning methods, and find that the proposed method performs better. We apply the proposed method to the genomic data of The Cancer Genome Atlas related to lung adenocarcinoma (LUAD), liver hepatocellular carcinoma (LHC), and thyroid carcinoma (THCA), to establish accurate cancer diagnosis models.


Subject(s)
Early Detection of Cancer , Neoplasms , Humans , Gene Expression Profiling/methods , Genomics , Computer Simulation , Neoplasms/genetics
3.
J Epidemiol ; 33(1): 52-61, 2023 01 05.
Article in English | MEDLINE | ID: mdl-34053962

ABSTRACT

BACKGROUND: This cohort was established to evaluate whether 38-year radiation exposure (since the start of nuclear reactor operations) is related to cancer risk in residents near three nuclear power plants (NPPs). METHODS: This cohort study enrolled all residents who lived within 8 km of any of the three NPPs in Taiwan from 1978 to 2016 (n = 214,502; person-years = 4,660,189). The control population (n = 257,475; person-years = 6,282,390) from three towns comprised all residents having lived more than 15 km from all three NPPs. Radiation exposure will be assessed via computer programs GASPAR-II and LADTAP-II by following methodologies provided in the United States Nuclear Regulatory Commission regulatory guides. We calculated the cumulative individual tissue organ equivalent dose and cumulative effective dose for each resident. This study presents the number of new cancer cases and prevalence in the residence-nearest NPP group and control group in the 38-year research observation period. CONCLUSION: TNPECS provides a valuable platform for research and opens unique possibilities for testing whether radiation exposure since the start of operations of nuclear reactors will affect health across the life course. The release of radioactive nuclear species caused by the operation of NPPs caused residents to have an effective dose between 10^-7 and 10^-3 mSv/year. The mean cumulative medical radiation exposure dose between the residence-nearest NPP group and the control group was not different (7.69; standard deviation, 18.39 mSv and 7.61; standard deviation, 19.17 mSv; P = 0.114).


Subject(s)
Neoplasms , Radiation Exposure , Humans , Cohort Studies , Japan , Neoplasms/epidemiology , Nuclear Power Plants , Radiation Exposure/adverse effects , Taiwan/epidemiology , United States
4.
Biom J ; 65(3): e2100361, 2023 03.
Article in English | MEDLINE | ID: mdl-36285659

ABSTRACT

Joint analysis of recurrent and nonrecurrent terminal events has attracted substantial attention in the literature. However, a formal methodology for such analysis is lacking when the event time data are on discrete scales, even though some modeling and inference strategies have been developed for discrete-time survival analysis. We propose a discrete-time joint modeling approach for the analysis of recurrent and terminal events where the two types of events may be correlated with each other. The proposed joint modeling assumes a shared frailty to account for the dependence among recurrent events and between the recurrent and the terminal events. Also, the joint modeling allows for time-dependent covariates and rich families of transformation models for the recurrent and terminal events. A major advantage of our approach is that it does not assume a distribution for the frailty, nor does it assume a Poisson process for the analysis of the recurrent event. The utility of the proposed analysis is illustrated by simulation studies and two real applications, where the application to the biochemists' rank promotion data jointly analyzes the biochemists' citation numbers and times to rank promotion, and the application to the scleroderma lung study data jointly analyzes the adverse events and off-drug time among patients with the symptomatic scleroderma-related interstitial lung disease.


Subject(s)
Frailty , Models, Statistical , Humans , Recurrence , Computer Simulation , Survival Analysis
5.
BMC Bioinformatics ; 23(1): 202, 2022 May 30.
Article in English | MEDLINE | ID: mdl-35637439

ABSTRACT

BACKGROUND: In the context of biomedical and epidemiological research, gene-environment (G-E) interaction is of great significance to the etiology and progression of many complex diseases. In high-dimensional genetic data, two general models, marginal and joint models, are proposed to identify important interaction factors. Most existing approaches for identifying G-E interactions are limited owing to the lack of robustness to outliers/contamination in response and predictor data. In particular, right-censored survival outcomes make the associated feature screening even more challenging. In this article, we utilize the overlapping group screening (OGS) approach to select important G-E interactions related to clinical survival outcomes by incorporating the gene pathway information under a joint modeling framework. RESULTS: Simulation studies under various scenarios are carried out to compare the performances of our proposed method with some commonly used methods. In the real data applications, we use our proposed method to identify G-E interactions related to the clinical survival outcomes of patients with head and neck squamous cell carcinoma, and esophageal carcinoma in The Cancer Genome Atlas clinical survival genetic data, and further establish corresponding survival prediction models. Both simulation and real data studies show that our method performs well and outperforms existing methods in the G-E interaction selection, effect estimation, and survival prediction accuracy. CONCLUSIONS: The OGS approach is useful for selecting important environmental factors, genes and G-E interactions in the ultra-high dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than existing methods. The same idea of the OGS approach can apply to other outcome models, such as the proportional odds survival time model, the logistic regression model for binary outcomes, and the multinomial logistic regression model for multi-class outcomes.


Subject(s)
Gene-Environment Interaction , Neoplasms , Computer Simulation , Genomics , Humans , Neoplasms/genetics , Research
6.
Biomolecules ; 11(12)2021 12 02.
Article in English | MEDLINE | ID: mdl-34944454

ABSTRACT

MicroRNAs (miRNAs), short non-coding RNAs, are involved in the initiation and progression of many human diseases that also play a key role in immune response and drug metabolism modulation [...].


Subject(s)
Cognition , Gene Expression Regulation, Neoplastic , Biomarkers
7.
Front Public Health ; 9: 680054, 2021.
Article in English | MEDLINE | ID: mdl-34291028

ABSTRACT

An adequate imputation of missing data would significantly preserve the statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a great tool to infer the missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics to evaluate imputation accuracy. However, the Cox proportional hazards model using various types requires deliberate study, and the validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and different missing mechanisms. The results revealed the type-I errors according to different imputation techniques in the survival data. The simulation results show that the non-parametric "missForest" based on the unsupervised imputation is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, other methods are not valid to test when the missing pattern is informative. Statistical analysis that is improperly conducted with missing data may lead to erroneous conclusions. This research provides a clear guideline for a valid survival analysis using the Cox proportional hazards model with machine learning-based imputations.


Subject(s)
Algorithms , Machine Learning , Computer Simulation , Proportional Hazards Models , Survival Analysis
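
The two imputation-accuracy statistics named in the abstract above can be sketched as follows. This is a minimal illustration that assumes the true values of the masked entries are known (as in a simulation study); the function names `rmse` and `pfc` are chosen for this example and are not taken from the paper or from any particular package.

```python
import math

def rmse(true_vals, imputed_vals):
    """Root mean square error over imputed continuous entries."""
    sq_err = [(t - i) ** 2 for t, i in zip(true_vals, imputed_vals)]
    return math.sqrt(sum(sq_err) / len(sq_err))

def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified entries among imputed categorical entries."""
    wrong = sum(1 for t, i in zip(true_labels, imputed_labels) if t != i)
    return wrong / len(true_labels)
```

Both functions are meant to be called only on the entries that were actually missing and then imputed, so that observed values do not dilute the error.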
8.
Environ Sci Pollut Res Int ; 28(29): 38679-38688, 2021 Aug.
Article in English | MEDLINE | ID: mdl-33735414

ABSTRACT

The effects of meteorological factors on health outcomes have gained popularity due to climate change, resulting in a general rise in temperature and abnormal climatic extremes. Instead of the conventional cross-sectional analysis that focuses on the association between a predictor and the single dependent variable, the distributed lag non-linear model (DLNM) has been widely adopted to examine the effect of multiple lag environmental factors on the health outcome. We propose several novel strategies to model mortality with the effects of distributed lag temperature measures and the delayed effect of mortality. Several attempts are derived by various statistical concepts, such as summation, autoregressive, principal component analysis, baseline adjustment, and modeling the offset in the DLNM. Five strategies are evaluated by simulation studies based on permutation techniques. The longitudinal climate and daily mortality data in Taipei, Taiwan, from 2012 to 2016 were implemented to generate the null distribution. According to simulation results, only one strategy, named MVDLNM, could yield valid type I errors, while the other four strategies demonstrated much more inflated type I errors. With a real-life application, the MVDLNM that incorporates both the current and lag mortalities revealed a more significant association than the conventional model that only fits the current mortality. The results suggest that, in public health or environmental research, not only may the exposure pose a delayed effect, but the outcome of interest could also provide the lag association signals. The joint modeling of the lag exposure and the delayed outcome enhances the power to discover such a complex association structure. The new approach MVDLNM models lag outcomes within 10 days and lag exposures up to 1 month and provides valid results.


Subject(s)
Mortality , Nonlinear Dynamics , China , Cross-Sectional Studies , Taiwan , Temperature
9.
Bioinformatics ; 37(15): 2150-2156, 2021 Aug 09.
Article in English | MEDLINE | ID: mdl-33595070

ABSTRACT

MOTIVATION: In high-dimensional genetic/genomic data, the identification of genes related to clinical survival traits is a challenging and important issue. In particular, right-censored survival outcomes and contaminated biomarker data make the relevant feature screening difficult. Several independence screening methods have been developed, but they fail to account for gene-gene dependency information, and may be sensitive to outlying feature data. RESULTS: We improve the inverse probability-of-censoring weighted (IPCW) Kendall's tau statistic by using Google's PageRank Markov matrix to incorporate feature dependency network information. Also, to tackle outlying feature data, the nonparanormal approach transforming the feature data to multivariate normal variates is utilized in the graphical lasso procedure to estimate the network structure in feature data. Simulation studies under various scenarios show that the proposed network-adjusted weighted Kendall's tau approach leads to more accurate feature selection and survival prediction than the methods without accounting for feature dependency network information and outlying feature data. The applications on the clinical survival outcome data of diffuse large B-cell lymphoma and of The Cancer Genome Atlas lung adenocarcinoma patients demonstrate clearly the advantages of the new proposal over the alternative methods. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
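
For readers unfamiliar with the PageRank Markov matrix mentioned in the abstract above, here is a minimal sketch of how such a matrix can be built from a feature network and how its stationary ranks can be found by power iteration. The abstract does not say how the matrix is combined with the IPCW Kendall's tau statistic, so that step is not shown; the damping value 0.85 and the names `google_matrix` and `pagerank` are assumptions for this illustration.

```python
def google_matrix(adj, damping=0.85):
    """Row-stochastic PageRank (Google) matrix from an adjacency matrix.

    adj[i][j] > 0 means feature i is linked to feature j.
    damping=0.85 is a conventional choice, assumed here.
    """
    n = len(adj)
    matrix = []
    for row in adj:
        total = sum(row)
        if total == 0:
            # Isolated node: jump uniformly to any node.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append([damping * a / total + (1.0 - damping) / n for a in row])
    return matrix

def pagerank(matrix, iters=100):
    """Approximate the stationary distribution by power iteration."""
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [sum(rank[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return rank
```

In a star-shaped network the hub receives the largest rank, which matches the intuition that strongly connected features carry more network information.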

10.
PLoS One ; 16(1): e0244094, 2021.
Article in English | MEDLINE | ID: mdl-33411794

ABSTRACT

In recent years, machine learning methods have been applied to various prediction scenarios in time-series data. However, some processing procedures such as cross-validation (CV) that rearrange the order of the longitudinal data might ruin the seriality and lead to a potentially biased outcome. Regarding this issue, a recent study investigated how different types of CV methods influence the predictive errors in conventional time-series data. Here, we examine a more complex distributed lag nonlinear model (DLNM), which has been widely used to assess the cumulative impacts of past exposures on the current health outcome. This research extends the DLNM into an artificial neural network (ANN) and investigates how the ANN model reacts to various CV schemes that result in different predictive biases. We also propose a newly designed permutation ratio to evaluate the performance of the CV in the ANN. This ratio mimics the concept of the R-square in conventional statistical regression models. The results show that as the complexity of the ANN increases, the predicted outcome becomes more stable, and the bias shows a decreasing trend. Among the different settings of hyperparameters, the novel strategy, Leave One Block Out Cross-Validation (LOBO-CV), demonstrated much better results, and the lowest mean square error was observed. The hyperparameters of the ANN trained by the LOBO-CV yielded the minimum number of prediction errors. The newly proposed permutation ratio indicates that LOBO-CV can contribute up to 34% of the prediction accuracy.


Subject(s)
Neural Networks, Computer , Nonlinear Dynamics , Reproducibility of Results
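
The idea of holding out whole blocks of a time series, rather than shuffled individual observations, can be sketched as below. The abstract does not define the exact LOBO-CV scheme, so this example assumes contiguous, roughly equal-sized blocks; the function name `leave_one_block_out` is hypothetical.

```python
def leave_one_block_out(n_samples, n_blocks):
    """Yield (train_idx, test_idx) pairs, holding out one contiguous block each time.

    Time order is preserved inside every block, so the serial structure of
    the held-out segment is not broken by shuffling.
    """
    # Block boundaries computed with integer arithmetic to avoid rounding surprises.
    bounds = [b * n_samples // n_blocks for b in range(n_blocks + 1)]
    for b in range(n_blocks):
        start, stop = bounds[b], bounds[b + 1]
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
```

Each observation appears in exactly one test block, so averaging the per-block errors gives a single cross-validated error estimate.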
11.
Curr Med Chem ; 28(27): 5648-5656, 2021.
Article in English | MEDLINE | ID: mdl-33208058

ABSTRACT

BACKGROUND: An association between migraine and Major Depression (MD) has been revealed in a number of clinical studies. Both diseases have affected a large global population. More understanding of the comorbidity mechanism of these two diseases can shed light on developing new therapies for their treatment. METHODS: To the best of our knowledge, there has not been any research in the literature based on microRNA (miRNA) biomarkers to investigate the relationship between MD and migraine. In this study, we have discussed the association between these two diseases based on their miRNA biomarkers. In addition to miRNA biomarkers, we have also demonstrated epidemiological evidence for their association based on Taiwan Biobank (TWB) data. RESULTS: Among the 12 migraine miRNA biomarkers, 11 are related to MD. Only miR-181a has no direct evidence to be involved in the mechanism of MD. In addition to the biological biomarker evidence, the statistical analysis using the large-scale epidemiologic data collected from TWB provides strong evidence on the relationship between MD and migraine. CONCLUSION: The evidence based on both molecular and epidemiological data reveals the significant association between MD and migraine. This result can help investigate the correlated underlying mechanism of these two diseases.


Subject(s)
Depressive Disorder, Major , MicroRNAs , Migraine Disorders , Biomarkers , Cohort Studies , Comorbidity , Depression , Depressive Disorder, Major/epidemiology , Depressive Disorder, Major/genetics , Humans , MicroRNAs/genetics , Migraine Disorders/epidemiology , Migraine Disorders/genetics
12.
Infect Drug Resist ; 13: 3887-3894, 2020.
Article in English | MEDLINE | ID: mdl-33149633

ABSTRACT

BACKGROUND: The number of COVID-19 infections worldwide has reached 10 million. COVID-19 caused by SARS-CoV-2 is more contagious than SARS-CoV-1. There is a dispute about the origin of COVID-19. Study results showed that all SARS-CoV-2 sequences around the world share a common ancestor towards the end of 2019. METHODS: Virus sequences from COVID-19 samples at the early time should be less diverse than those from samples at the later time because there might be more mutations when the virus evolves over time. The diversity of virus nucleotide sequences can be measured by the nucleotide substitution distance. To explore the diversity of SARS-CoV-2, we use different nucleotide substitution models to calculate the distances of SARS-CoV-2 samples from 3 different areas, China, Europe, and the USA. Then, we use these distances to infer the origin of COVID-19. RESULTS: It is known that COVID-19 originated in Wuhan, China and then spread to Europe and the USA. By using different substitution models, the distances of SARS-CoV-2 samples from these areas are significantly different. By ANOVA testing, the p-value is less than 2.2e-16. The analyzed results in most substitution models show that China has the lowest diversity, followed by Europe and lastly by the USA. This outcome coincides with the virus transmission time order that SARS-CoV-2 starts in China, then breaks out in Europe and finally in the USA. CONCLUSION: The magnitude of nucleotide substitution distance of SARS-CoV-2 is closely related to the transmission time order of SARS-CoV-2. This outcome reveals that the nucleotide substitution distance of SARS-CoV-2 may be used to infer the origin of COVID-19.
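
The notion of a nucleotide substitution distance can be illustrated with two simple measures for a pair of aligned sequences. The abstract does not name the substitution models used, so the raw proportion of differing sites (p-distance) and the Jukes-Cantor correction are shown here only as well-known examples; the function names are chosen for this sketch.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)

def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: corrects the p-distance for multiple hits."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Because the correction accounts for repeated substitutions at the same site, the Jukes-Cantor distance is never smaller than the raw p-distance.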

13.
Stat Med ; 39(29): 4372-4385, 2020 12 20.
Article in English | MEDLINE | ID: mdl-32871614

ABSTRACT

Survival analysis has been conventionally performed on a continuous time scale. In practice, the survival time is often recorded or handled on a discrete scale; when this is the case, the discrete-time survival analysis would provide analysis results more relevant to the actual data scale. Besides, data on time-dependent covariates in the survival analysis are usually collected through intermittent follow-ups, resulting in the missing and mismeasured covariate data. In this work, we propose the sufficient discrete hazard (SDH) approach to discrete-time survival analysis with longitudinal covariates that are subject to missingness and mismeasurement. The SDH method employs the conditional score idea available for dealing with mismeasured covariates, and the penalized least squares for estimating the missing covariate value using the regression spline basis. The SDH method is developed for the single event analysis with the logistic discrete hazard model, and for the competing risks analysis with the multinomial logit model. Simulation results reveal good finite-sample performances of the proposed estimator and the associated asymptotic theory. The proposed SDH method is applied to the scleroderma lung study data, where the time to medication withdrawal and time to death were recorded discretely in months, for illustration.


Subject(s)
Research Design , Computer Simulation , Humans , Proportional Hazards Models , Risk Assessment , Survival Analysis
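
The logistic discrete hazard model named in the abstract above can be illustrated with a minimal sketch that maps period-specific intercepts and a single covariate to hazards and then to a survival curve. It does not implement the SDH method (no conditional score, no spline-based imputation); the parameter names `alphas`, `beta`, and `x` are assumptions for this example, and the covariate is held fixed over time for simplicity.

```python
import math

def logistic_hazard(alpha_t, beta, x):
    """Discrete hazard in one period: P(event in t | event-free before t)."""
    return 1.0 / (1.0 + math.exp(-(alpha_t + beta * x)))

def survival_curve(alphas, beta, x):
    """Survival probabilities after each period, as the running product of (1 - hazard)."""
    surv = 1.0
    curve = []
    for alpha_t in alphas:
        surv *= 1.0 - logistic_hazard(alpha_t, beta, x)
        curve.append(surv)
    return curve
```

With all intercepts and the coefficient set to zero, every period has hazard 0.5, so the survival curve halves at each step.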
14.
Stat Med ; 39(22): 2936-2948, 2020 09 30.
Article in English | MEDLINE | ID: mdl-32578241

ABSTRACT

In controlled trials, "treatment switching" occurs when patients in one treatment group switch to alternative treatments during the trial, and poses challenges to treatment effect evaluation owing to crossover of the treatment groups. In this work, we assume that treatment switching can occur after some disease progression event and view the progression and death events as two semicompeting risks. The proposed model consists of a copula model for the joint distribution of time-to-progression (TTP) and overall survival (OS) up to the earlier of the two events, as well as a conditional hazard model for OS subsequent to progression. The copula model facilitates assessing the marginal distributions of TTP and OS separately from the association between the two events, and, in particular, the treatment effect on OS in the absence of treatment switching. The proposed conditional hazard model for death subsequent to progression allows us to assess the treatment switching (crossover) effect on OS given occurrence of progression and covariates. Semiparametric proportional hazards models are employed in the marginal models for TTP and OS. A nonparametric maximum likelihood procedure is developed for model inference, which is verified through asymptotic theory and simulation studies. The proposed analysis is applied to a lung cancer dataset to illustrate its real utility.


Subject(s)
Models, Statistical , Treatment Switching , Computer Simulation , Humans , Probability , Proportional Hazards Models
15.
Curr Med Chem ; 27(38): 6536-6547, 2020.
Article in English | MEDLINE | ID: mdl-32334497

ABSTRACT

A number of clinical studies have revealed that there is an association between major depression (MD) and gastroesophageal reflux disease (GERD). Both the diseases are shown to affect a large proportion of the global population. More advanced studies for understanding the comorbidity mechanism of these two diseases can shed light on developing new therapies of both diseases. To the best of our knowledge, there has not been any research work in the literature investigating the relationship between MD and GERD using their miRNA biomarkers. We adopt a phylogenetic analysis to analyze their miRNA biomarkers. From our analyzed results, the association between these two diseases can be explored through miRNA phylogeny. In addition to evidence from the phylogenetic analysis, we also demonstrate epidemiological evidence for the relationship between MD and GERD based on Taiwan biobank data.


Subject(s)
Gastroesophageal Reflux , Biomarkers , Depression , Gastroesophageal Reflux/epidemiology , Humans , MicroRNAs/genetics , Phylogeny
16.
Biom J ; 62(5): 1164-1175, 2020 09.
Article in English | MEDLINE | ID: mdl-32022280

ABSTRACT

We propose a joint analysis of recurrent and nonrecurrent event data subject to general types of interval censoring. The proposed analysis allows for general semiparametric models, including the Box-Cox transformation and inverse Box-Cox transformation models for the recurrent and nonrecurrent events, respectively. A frailty variable is used to account for the potential dependence between the recurrent and nonrecurrent event processes, while leaving the distribution of the frailty unspecified. We apply the pseudolikelihood for interval-censored recurrent event data, usually termed as panel count data, and the sufficient likelihood for interval-censored nonrecurrent event data by conditioning on the sufficient statistic for the frailty and using the working assumption of independence over examination times. Large sample theory and a computation procedure for the proposed analysis are established. We illustrate the proposed methodology by a joint analysis of the numbers of occurrences of basal cell carcinoma over time and time to the first recurrence of squamous cell carcinoma based on a skin cancer dataset, as well as a joint analysis of the numbers of adverse events and time to premature withdrawal from study medication based on a scleroderma lung disease dataset.


Subject(s)
Frailty , Models, Statistical , Carcinoma, Basal Cell , Carcinoma, Squamous Cell , Chronic Disease , Frailty/diagnosis , Frailty/epidemiology , Humans , Lung Diseases , Neoplasm Recurrence, Local , Scleroderma, Localized , Skin Neoplasms
17.
Bioinformatics ; 36(9): 2763-2769, 2020 05 01.
Article in English | MEDLINE | ID: mdl-31926011

ABSTRACT

MOTIVATION: In gene expression and genome-wide association studies, the identification of interaction effects is an important and challenging issue owing to its ultrahigh-dimensional nature. In particular, contaminated data and right-censored survival outcome make the associated feature screening even more challenging. RESULTS: In this article, we propose an inverse probability-of-censoring weighted Kendall's tau statistic to measure association of a survival trait with biomarkers, as well as a Kendall's partial correlation statistic to measure the relationship of a survival trait with an interaction variable conditional on the main effects. The Kendall's partial correlation is then used to conduct interaction screening. Simulation studies under various scenarios are performed to compare the performance of our proposal with some commonly available methods. In the real data application, we utilize our proposed method to identify epistasis associated with the clinical survival outcomes of non-small-cell lung cancer, diffuse large B-cell lymphoma and lung adenocarcinoma patients. Both simulation and real data studies demonstrate that our method performs well and outperforms existing methods in identifying main and interaction biomarkers. AVAILABILITY AND IMPLEMENTATION: R-package 'IPCWK' is available to implement this method, together with a reference manual describing how to use the 'IPCWK' package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Genome-Wide Association Study , Humans , Lung Neoplasms/genetics , Phenotype
18.
Stat Methods Med Res ; 28(1): 134-150, 2019 01.
Article in English | MEDLINE | ID: mdl-28671033

ABSTRACT

It is common in longitudinal studies that missing data occur due to subjects' nonresponse, missed visits, dropout, death or other reasons during the course of study. To perform valid analysis in this setting, data missing not at random (MNAR) have to be considered. However, models for data MNAR often suffer from the identifiability issue and hence result in difficulty in estimation and computational convergence. To ameliorate this issue, we propose the LASSO and ridge-regularized selection models that regularize the missing data mechanism model to handle data MNAR, with the regularization parameter selected via a cross-validation procedure. The proposed models can be also employed for sensitivity analysis to examine the effects on inference of different assumptions about the missing data mechanism. We illustrate the performance of the proposed models via simulation studies and the analysis of data from a randomized clinical trial.


Subject(s)
Data Accuracy , Data Interpretation, Statistical , Cough/etiology , Humans , Likelihood Functions , Longitudinal Studies , Models, Statistical , Patient Dropouts/statistics & numerical data , Regression Analysis , Scleroderma, Systemic/complications
19.
Stat Methods Med Res ; 28(8): 2247-2257, 2019 08.
Article in English | MEDLINE | ID: mdl-29488447

ABSTRACT

Semiparametric transformation models, which include the Cox proportional hazards and proportional odds models as special cases, are popular in current practice of survival analysis because, in contrast to parametric models, no assumption on the baseline distribution is required. Although sample size calculations for semiparametric survival analysis with right-censored data are available, no such calculation exists in the literature for semiparametric analysis with current status data, where only an examination time and whether the event occurs prior to the examination are observable. We develop sample size calculation for semiparametric two-group comparison or regression analysis with current status data. The proposed formula can be readily implemented with given effect size, power level, covariate group proportions, covariate-specific examination (censoring) time distributions, and proportions of events observed in the control group at a few knot points in the study period. Simulation results show that the proposed sample size calculation is adequate in the sense that it leads to studies with empirical power very close to the planned power level. We illustrate practical applications of the proposal through examples from an animal tumorigenicity study and a cross-sectional survey on osteoporosis status in the elderly.


Subject(s)
Proportional Hazards Models , Survival Analysis , Aged , Animals , Computer Simulation , Humans , Lung Neoplasms/mortality , Mice , Osteoporosis/epidemiology , Research Design , Sample Size
20.
BMC Bioinformatics ; 19(1): 335, 2018 Sep 21.
Article in English | MEDLINE | ID: mdl-30241463

ABSTRACT

BACKGROUND: The development of a disease is a complex process that may result from joint effects of multiple genes. In this article, we propose the overlapping group screening (OGS) approach to determining active genes and gene-gene interactions incorporating prior pathway information. The OGS method is developed to overcome the challenges in genome-wide data analysis that the number of the genes and gene-gene interactions is far greater than the sample size, and the pathways generally overlap with one another. The OGS method is further proposed for patients' survival prediction based on gene expression data. RESULTS: Simulation studies demonstrate that the performance of the OGS approach in identifying the true main and interaction effects is good and the survival prediction accuracy of OGS with the Lasso penalty is better than the ordinary Lasso method. In real data analysis, we identify several significant genes and/or epistasis interactions that are associated with clinical survival outcomes of diffuse large B-cell lymphoma (DLBCL) and non-small-cell lung cancer (NSCLC) by utilizing prior pathway information from the KEGG pathway and the GO biological process databases, respectively. CONCLUSIONS: The OGS approach is useful for selecting important genes and epistasis interactions in the ultra-high dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than existing methods. The OGS approach is generally applicable to various types of outcome data (quantitative, qualitative, censored event time data) and regression models (e.g. linear, logistic, and Cox's regression models).


Subject(s)
Carcinoma, Non-Small-Cell Lung/mortality , Epistasis, Genetic , Genetic Loci , Lung Neoplasms/mortality , Lymphoma, Large B-Cell, Diffuse/mortality , Transcriptome , Algorithms , Carcinoma, Non-Small-Cell Lung/genetics , Computer Simulation , Databases, Factual , Gene Expression Profiling , Humans , Lung Neoplasms/genetics , Lymphoma, Large B-Cell, Diffuse/genetics , Predictive Value of Tests , Survival Rate