Results 1 - 10 of 10
1.
Yale J Biol Med ; 96(3): 327-346, 2023 09.
Article in English | MEDLINE | ID: mdl-37781001

ABSTRACT

Objectives: To evaluate the comparative effectiveness of treatments, a randomized clinical trial remains the gold standard but can be challenged by a high cost, a limited sample size, an inability to fully reflect the real world, and feasibility concerns. The objective is to showcase a big data approach that takes advantage of large electronic medical record (EMR) data to emulate clinical trials. To overcome the limitations of regression analysis, a deep learning-based analysis pipeline was developed. Study Design and Setting: Lumpectomy (breast-conserving surgery) and mastectomy are the two most commonly used surgical procedures for early-stage female breast cancer patients. An emulation trial was designed using the Surveillance, Epidemiology, and End Results (SEER)-Medicare data to evaluate their relative effectiveness in overall survival. The analysis pipeline consisted of a propensity score step, a weighted survival analysis step, and a bootstrap inference step. Results: A total of 65,997 subjects were enrolled in the emulated trial, with 50,704 and 15,293 in the lumpectomy and mastectomy arms, respectively. The two surgery procedures had comparable effects in terms of overall survival (survival year change = 0.08, 95% confidence interval (CI): -0.08, 0.25) for the elderly SEER-Medicare early-stage female breast cancer patients. Conclusion: This study demonstrated the power of "mining large EMR data + deep learning-based analysis," and the proposed analysis strategy and technique can be potentially broadly applicable. It provided convincing evidence of the comparative effectiveness of lumpectomy and mastectomy.


Subject(s)
Breast Neoplasms , Deep Learning , Mastectomy , Aged , Female , Humans , Big Data , Breast Neoplasms/surgery , Mastectomy, Segmental , Medicare , United States , Comparative Effectiveness Research
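
The three-step pipeline described in the abstract above (propensity score, weighted analysis, bootstrap inference) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes a deep learning-based pipeline and a weighted survival analysis, whereas this sketch assumes a plain logistic-regression propensity model and an uncensored outcome, and it runs on simulated data. All function names and the simulated variables are hypothetical.

```python
import numpy as np

def fit_logistic(X, t, n_iter=25):
    """Logistic regression by Newton-Raphson (assumed stand-in for the
    deep-learning propensity model). Returns coefficients, intercept first."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        grad = Xb.T @ (t - p)
        hess = (Xb * (p * (1.0 - p))[:, None]).T @ Xb
        hess += 1e-8 * np.eye(Xb.shape[1])  # small ridge for numerical stability
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def ipw_difference(X, t, y):
    """Inverse-probability-weighted difference in mean outcome (arm 1 minus arm 0).
    Assumption: y is fully observed (no censoring), unlike real survival data."""
    beta = fit_logistic(X, t)
    Xb = np.column_stack([np.ones(len(X)), X])
    ps = 1.0 / (1.0 + np.exp(-Xb @ beta))
    ps = np.clip(ps, 0.01, 0.99)  # guard against extreme weights
    w1 = t / ps
    w0 = (1.0 - t) / (1.0 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def bootstrap_ci(X, t, y, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval; the propensity model is refit per resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        stats.append(ipw_difference(X[idx], t[idx], y[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Simulated example: the first covariate drives both treatment choice and
# outcome, and the true treatment effect is zero.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 2))
t = (rng.random(2000) < 1.0 / (1.0 + np.exp(-0.8 * X[:, 0]))).astype(float)
y = 10.0 + 2.0 * X[:, 0] + rng.normal(size=2000)

naive = y[t == 1].mean() - y[t == 0].mean()  # confounded comparison
est = ipw_difference(X, t, y)                # weighted comparison
lo, hi = bootstrap_ci(X, t, y)
print(f"naive={naive:.2f}  ipw={est:.2f}  95% CI=({lo:.2f}, {hi:.2f})")
```

Refitting the propensity model inside each bootstrap resample lets the interval reflect uncertainty from both steps, which is one plausible reason for a bootstrap inference step in such a pipeline.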
2.
J Biomed Inform ; 144: 104434, 2023 08.
Article in English | MEDLINE | ID: mdl-37391115

ABSTRACT

OBJECTIVE: Deep neural network (DNN) techniques have demonstrated significant advantages over regression and some other techniques. In recent studies, DNN-based analysis has been conducted on data with high-dimensional input such as omics measurements. In such analysis, regularization, in particular penalization, has been applied to regularize estimation and distinguish relevant input variables from irrelevant ones. A unique challenge arises from the "lack of information" attributable to high dimensionality of input and limited size of training data. For many data/studies, there exist other data/studies that may be relevant and can potentially provide additional information to boost performance. METHODS: In this study, we conduct integrative analysis of multiple independent datasets/studies, with the goal of borrowing information across each other and improving overall performance. Significantly different from regression-based integrative analysis (where alignment can be easily achieved based on covariates), alignment across multiple DNNs can be nontrivial. We develop ANNI, an Aligned DNN technique for Integrative analysis with high-dimensional input. Penalization is applied for regularized estimation, selection of important input variables, and, equally importantly, information borrowing across multiple DNNs. An effective computational algorithm is developed. RESULTS: Extensive simulations demonstrate competitive performance of the proposed technique. The analysis of cancer omics data further establishes its practical utility.


Subject(s)
Neoplasms , Neural Networks, Computer , Humans , Algorithms
3.
Biometrics ; 78(2): 512-523, 2022 06.
Article in English | MEDLINE | ID: mdl-33527365

ABSTRACT

In the analysis of gene expression data, network approaches take a system perspective and have played an irreplaceably important role. Gaussian graphical models (GGMs) have been popular in the network analysis of gene expression data. They investigate the conditional dependence between genes and "transform" the problem of estimating network structures into a sparse estimation of precision matrices. When there is a moderate to large number of genes, the number of parameters to be estimated may overwhelm the limited sample size, leading to unreliable estimation and selection. In this article, we propose incorporating information from previous studies (for example, those deposited at PubMed) to assist estimating the network structure in the present data. It is recognized that such information can be partial, biased, or even wrong. A penalization-based estimation approach is developed, shown to have consistency properties, and realized using an effective computational algorithm. Simulation demonstrates its competitive performance under various information accuracy scenarios. The analysis of TCGA lung cancer prognostic genes leads to network structures different from the alternatives.


Subject(s)
Gene Regulatory Networks , Models, Statistical , Algorithms , Gene Expression , Normal Distribution
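
The link the abstract above draws between conditional dependence and the precision matrix can be made concrete with a small sketch. In a Gaussian graphical model, a zero entry in the precision (inverse covariance) matrix means the two variables are conditionally independent given the rest. The example below assumes a three-variable chain and simply inverts the sample covariance; it does not implement the penalized, prior-informed estimator the article proposes, and the function name and threshold are hypothetical.

```python
import numpy as np

def partial_correlations(samples):
    """Partial correlations from the inverse of the sample covariance.
    Only sensible when the sample size comfortably exceeds the number of variables."""
    cov = np.cov(samples, rowvar=False)
    prec = np.linalg.inv(cov)
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Chain network 0 - 1 - 2: variables 0 and 2 are linked only through 1,
# so the (0, 2) entry of the true precision matrix is zero.
true_precision = np.array([[2.0, -1.0, 0.0],
                           [-1.0, 2.0, -1.0],
                           [0.0, -1.0, 2.0]])
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(3), np.linalg.inv(true_precision), size=5000)

corr = np.corrcoef(samples, rowvar=False)  # marginal correlations
pcor = partial_correlations(samples)       # conditional (partial) correlations
edges = np.abs(pcor) > 0.1                 # crude threshold, an assumed stand-in for penalization
print(np.round(corr, 2))
print(np.round(pcor, 2))
```

Variables 0 and 2 are marginally correlated yet have a partial correlation near zero, which is why the network is read off the precision matrix rather than the correlation matrix. With many genes and few samples, the plain inverse becomes unstable or undefined, which is the setting where sparse, penalized estimation is used instead.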
4.
Genet Epidemiol ; 45(6): 604-620, 2021 09.
Article in English | MEDLINE | ID: mdl-34174112

ABSTRACT

In the analysis of gene expression data, when there are two or more disease conditions/groups (e.g., diseased and normal, responder and nonresponder, and multiple stages/subtypes), differential analysis has been extensively conducted to identify key differences and has important implications. Network analysis takes a system perspective and can be more informative than that limited to simple statistics such as mean and variance. In differential network analysis, a common practice is to first estimate a gene expression network for each condition/group, and then spectral clustering can be applied to the network difference(s) to identify key genes and biological mechanisms that lead to the differences. Compared to "simple" analysis such as regression, differential network analysis can be more challenging with the significantly larger number of parameters. In this study, taking advantage of the increasing popularity of multidimensional profiling data, we develop an assisted analysis strategy and propose incorporating regulator information to improve the identification of key genes (that lead to the differences in gene expression networks). An effective computational algorithm is developed. Comprehensive simulation is conducted, showing that the proposed approach can outperform the benchmark alternatives in identification accuracy. With The Cancer Genome Atlas lung adenocarcinoma data, we analyze the expressions of genes in the KEGG cell cycle pathway, assisted by copy number variation data. The proposed assisted analysis leads to identification results similar to the alternatives but different estimations. Overall, this study can deliver an efficient and cost-effective way of improving differential network analysis.


Subject(s)
DNA Copy Number Variations , Gene Expression Profiling , Gene Expression , Gene Regulatory Networks , Humans , Models, Genetic
5.
Genet Epidemiol ; 45(4): 372-385, 2021 06.
Article in English | MEDLINE | ID: mdl-33527531

ABSTRACT

In the study of gene expression data, network analysis has played a uniquely important role. To accommodate the high dimensionality and low sample size and generate interpretable results, regularized estimation is usually conducted in the construction of gene expression Gaussian Graphical Models (GGM). Here we use GeO-GGM to represent gene-expression-only GGM. Gene expressions are regulated by regulators. Gene-expression-regulator GGMs (GeR-GGMs), which accommodate gene expressions as well as their regulators, have been constructed accordingly. In practical data analysis, with a "lack of information" caused by the large number of model parameters, limited sample size, and weak signals, the construction of both GeO-GGMs and GeR-GGMs is often unsatisfactory. In this article, we recognize that with the regulation between gene expressions and regulators, the sparsity structures of a GeO-GGM and its GeR-GGM counterpart can satisfy a hierarchy. Accordingly, we propose a joint estimation which reinforces the hierarchical structure and use the construction of a GeO-GGM to assist that of its GeR-GGM counterpart and vice versa. Consistency properties are rigorously established, and an effective computational algorithm is developed. In simulation, the assisted construction outperforms the separate construction of GeO-GGM and GeR-GGM. Two data sets from The Cancer Genome Atlas are analyzed, leading to findings different from the direct competitors.


Subject(s)
Algorithms , Models, Genetic , Computer Simulation , Gene Expression , Humans , Normal Distribution
6.
Brief Bioinform ; 22(3), 2021 05 20.
Article in English | MEDLINE | ID: mdl-32793970

ABSTRACT

Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.


Subject(s)
Gene Expression , Cluster Analysis , Data Analysis , Humans , Unsupervised Machine Learning
7.
BMC Cancer ; 18(1): 637, 2018 Jun 05.
Article in English | MEDLINE | ID: mdl-29871608

ABSTRACT

BACKGROUND: Growing evidence demonstrates that exposure to organophosphate flame retardants (PFRs) is widespread and that these chemicals can alter thyroid hormone regulation and function. We investigated the relationship between PFR exposure and thyroid cancer and whether individual or temporal factors predict PFR exposure. METHODS: We analyzed interview data and spot urine samples collected in 2010-2013 from 100 incident female, papillary thyroid cancer cases and 100 female controls of a Connecticut-based thyroid cancer case-control study. We measured urinary concentrations of six PFR metabolites with mass spectrometry. We estimated odds ratios (OR) and 95% confidence intervals (95% CI) for continuous and categories (low, medium, high) of concentrations of individual and summed metabolites, adjusting for potential confounders. We examined relationships between concentrations of PFR metabolites and individual characteristics (age, smoking status, alcohol consumption, body mass index [BMI], income, education) and temporal factors (season, year) using multiple linear regression analysis. RESULTS: No PFRs were significantly associated with papillary thyroid cancer risk. Results remained null when stratified by microcarcinomas (tumor diameter ≤ 1 cm) and larger tumor sizes (> 1 cm). We observed higher urinary PFR concentrations with increasing BMI and in the summer season. CONCLUSIONS: Urinary PFR concentrations, measured at time of diagnosis, are not linked to increased risk of thyroid cancer. Investigations in a larger population or with repeated pre-diagnosis urinary biomarker measurements would provide additional insights into the relationship between PFR exposure and thyroid cancer risk.


Subject(s)
Flame Retardants/analysis , Organophosphates/urine , Thyroid Cancer, Papillary/epidemiology , Thyroid Cancer, Papillary/urine , Adult , Aged , Case-Control Studies , Connecticut , Environmental Exposure , Female , Humans , Middle Aged
8.
Stat Med ; 36(3): 509-559, 2017 02 10.
Article in English | MEDLINE | ID: mdl-27667129

ABSTRACT

In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis utilizes information of multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets - or equivalently, similarity of model sparsity structures - across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance. Copyright © 2016 John Wiley & Sons, Ltd.


Subject(s)
Models, Statistical , Neoplasms/genetics , Biomarkers, Tumor/genetics , Breast Neoplasms/diagnosis , Breast Neoplasms/genetics , Data Interpretation, Statistical , Female , Genetic Markers/genetics , Genetic Predisposition to Disease/epidemiology , Genetic Predisposition to Disease/genetics , Humans , Prognosis
9.
BMC Health Serv Res ; 15: 69, 2015 Feb 20.
Article in English | MEDLINE | ID: mdl-25879667

ABSTRACT

BACKGROUND: Illness and the medical expenditure that follows have a profound impact on the well-being of individuals and households. China is a huge country with significant regional differences. The goal of this study is to investigate the associations of illness and medical expenditure with other categories of household expenditures, with special attention paid to the differences in observations between the western and eastern regions. METHODS: A survey was conducted in six major cities in China, three in the east and three in the west, in 2011. Data on demographics, illness conditions, and medical and other expenditures were collected from 12,515 households. RESULTS: In the analysis of the associations of illness conditions and medical expenditure with demographics, multiple significant associations were observed, and there are differences between the eastern and western regions. In univariate analyses, illness conditions and medical expenditure were found to have significant associations with other categories of expenditures. In multivariate analyses adjusting for household and household head characteristics, few associations were observed, and there exist differences between the regions. CONCLUSIONS: This study has provided empirical evidence on the associations of illness/medical expenditure with demographics and with other categories of expenditures. Differences across regions were observed in multiple aspects. The reasons underlying such differences are worth investigating further.


Subject(s)
Cost of Illness , Economics/statistics & numerical data , Health Expenditures/statistics & numerical data , Adolescent , Adult , Aged , Aged, 80 and over , Child , Child, Preschool , China , Cohort Studies , Female , Geography , Humans , Infant , Infant, Newborn , Male , Middle Aged , Socioeconomic Factors , Surveys and Questionnaires , Young Adult
10.
Brief Bioinform ; 16(5): 735-44, 2015 Sep.
Article in English | MEDLINE | ID: mdl-25552438

ABSTRACT

For cancer and many other complex diseases, a large number of gene signatures have been generated. In this study, we use cancer as an example and note that other diseases can be analyzed in a similar manner. For signatures generated in multiple independent studies on the same cancer type and outcome, and for signatures on different cancer types, it is of interest to evaluate their degree of overlap. Many of the existing studies simply count the number (or percentage) of overlapped genes shared by two signatures. Such an approach has serious limitations. In this study, as a demonstrating example, we consider cancer prognosis data under the Cox model. Lasso, which is representative of a large number of regularization methods, is adopted for generating gene signatures. We examine two families of measures for quantifying the degree of overlap. The first family is based on the Cox-Lasso estimates at the optimal tunings, and the second family is based on estimates across the whole solution paths. Within each family, multiple measures, which describe the overlap from different perspectives, are introduced. The analysis of TCGA (The Cancer Genome Atlas) data on five cancer types shows that the degree of overlap varies across measures, cancer types and types of (epi)genetic measurements. More investigations are needed to better describe and understand the overlaps among gene signatures.


Subject(s)
Gene Expression Profiling , Neoplasms/genetics , Humans , Prognosis , Proportional Hazards Models
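
The simple counting approach that the abstract above describes as having serious limitations (the number or percentage of genes shared by two signatures) can be sketched as follows. The Jaccard index is added here as a common set-overlap measure and is not named in the abstract; the gene labels are hypothetical placeholders, and none of the article's Cox-Lasso-based measures are implemented.

```python
def overlap_measures(signature_a, signature_b):
    """Count-based overlap between two gene signatures given as lists of gene labels."""
    a, b = set(signature_a), set(signature_b)
    shared = a & b
    union = a | b
    return {
        "n_shared": len(shared),
        "pct_of_a": len(shared) / len(a) if a else 0.0,
        "pct_of_b": len(shared) / len(b) if b else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Hypothetical signatures from two studies of the same cancer type.
study_1 = ["GENE1", "GENE2", "GENE3", "GENE4"]
study_2 = ["GENE3", "GENE4", "GENE5"]
print(overlap_measures(study_1, study_2))
```

Counts of this kind treat every selected gene as equally important and depend on a single tuning choice, which may be part of why the article turns to measures built on the estimates themselves and on whole solution paths.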