Results 1 - 20 of 41
1.
BMC Bioinformatics ; 25(1): 226, 2024 Jun 27.
Article in English | MEDLINE | ID: mdl-38937668

ABSTRACT

BACKGROUND: The matched case-control design, up until recently mostly pertinent to epidemiological studies, is becoming customary in biomedical applications as well. For instance, in omics studies, it is quite common to compare cancer and healthy tissue from the same patient. Furthermore, researchers today routinely collect data from various and variable sources that they wish to relate to the case-control status. This highlights the need to develop and implement statistical methods that can take these tendencies into account. RESULTS: We present an R package penalizedclr, that provides an implementation of the penalized conditional logistic regression model for analyzing matched case-control studies. It allows for different penalties for different blocks of covariates, and it is therefore particularly useful in the presence of multi-source omics data. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection for variable selection in the considered regression model. CONCLUSIONS: The proposed method fills a gap in the available software for fitting high-dimensional conditional logistic regression models accounting for the matched design and block structure of predictors/features. The output consists of a set of selected variables that are significantly associated with case-control status. These variables can then be investigated in terms of functional interpretation or validation in further, more targeted studies.


Subject(s)
Software , Logistic Models , Case-Control Studies , Humans , Genomics/methods , Computational Biology/methods
2.
Br J Radiol ; 97(1158): 1169-1179, 2024 May 29.
Article in English | MEDLINE | ID: mdl-38688660

ABSTRACT

OBJECTIVES: This study aimed to develop a model to predict World Health Organization/International Society of Urological Pathology (WHO/ISUP) low-grade or high-grade clear cell renal cell carcinoma (ccRCC) using 3D multiphase enhanced CT radiomics features (RFs). METHODS: CT data of 138 low-grade and 60 high-grade ccRCC cases were included. RFs were extracted from four CT phases: non-contrast phase (NCP), corticomedullary phase, nephrographic phase, and excretory phase (EP). Models were developed using various combinations of RFs and subjected to cross-validation. RESULTS: There were 107 RFs extracted from each phase of the CT images. The NCP-EP model had the best overall predictive value (AUC = 0.78), but did not significantly differ from that of the NCP model (AUC = 0.76). By considering the predictive ability of the model, the level of radiation exposure, and model simplicity, the overall best model was the Conventional image and clinical features (CICFs)-NCP model (AUC = 0.77; sensitivity 0.75, specificity 0.69, positive predictive value 0.85, negative predictive value 0.54, accuracy 0.73). The second-best model was the NCP model (AUC = 0.76). CONCLUSIONS: Combining clinical features with unenhanced CT images of the kidneys seems to be optimal for prediction of WHO/ISUP grade of ccRCC. This noninvasive method may assist in guiding more accurate treatment decisions for ccRCC. ADVANCES IN KNOWLEDGE: This study innovatively employed stability selection for RFs, enhancing model reliability. The CICFs-NCP model's simplicity and efficacy mark a significant advancement, offering a practical tool for clinical decision-making in ccRCC management.


Subject(s)
Carcinoma, Renal Cell , Kidney Neoplasms , Neoplasm Grading , Tomography, X-Ray Computed , Humans , Carcinoma, Renal Cell/diagnostic imaging , Carcinoma, Renal Cell/pathology , Kidney Neoplasms/diagnostic imaging , Kidney Neoplasms/pathology , Tomography, X-Ray Computed/methods , Male , Middle Aged , Female , Aged , World Health Organization , Retrospective Studies , Predictive Value of Tests , Adult , Imaging, Three-Dimensional/methods , Sensitivity and Specificity , Aged, 80 and over , Radiomics
3.
Med Image Anal ; 91: 103010, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37950937

ABSTRACT

Conventionally, analysis of functional MRI (fMRI) data relies on available information about the experimental paradigm to establish hypothesized models of brain activity. However, this information can be inaccurate, incomplete or unavailable in multiple scenarios such as resting-state, naturalistic paradigms or clinical conditions. In these cases, blind estimates of neuronal-related activity can be obtained with paradigm-free analysis methods such as hemodynamic deconvolution. Yet, current formulations of the hemodynamic deconvolution problem have three important limitations: (1) their efficacy strongly depends on the appropriate selection of regularization parameters, (2) being univariate, they do not take advantage of the information present across the brain, and (3) they do not provide any measure of statistical certainty associated with each detected event. Here we propose a novel approach that addresses all these limitations. Specifically, we introduce multivariate sparse paradigm free mapping (Mv-SPFM), a novel hemodynamic deconvolution algorithm that operates at the whole brain level and adds spatial information via a mixed-norm regularization term over all voxels. Additionally, Mv-SPFM employs a stability selection procedure that removes the need to select regularization parameters and also lets us obtain an estimate of the true probability of having a neuronal-related BOLD event at each voxel and time-point based on the area under the curve (AUC) of the stability paths. Besides, we present a formulation tailored for multi-echo fMRI acquisitions (MvME-SPFM), which allows us to better isolate fluctuations of BOLD origin on the basis of their linear dependence with the echo time (TE) and to assign physiologically interpretable units (i.e., changes in the apparent transverse relaxation ΔR2∗) to the resulting deconvolved events. Remarkably, we demonstrate that Mv-SPFM achieves comparable performance even when using a single-echo formulation. 
We demonstrate that this algorithm outperforms existing state-of-the-art deconvolution approaches, and shows higher spatial and temporal agreement with the activation maps and BOLD signals obtained with a standard model-based linear regression approach, even at the level of individual neuronal events. Furthermore, we show that by employing stability selection, the performance of the algorithm depends less on the selection of temporal and spatial regularization parameters λ and ρ. Consequently, the proposed algorithm provides more reliable estimates of neuronal-related activity, here in terms of ΔR2∗, for the study of the dynamics of brain activity when no information about the timings of the BOLD events is available. This algorithm will be made publicly available as part of the splora Python package.


Subject(s)
Brain Mapping , Brain , Humans , Brain Mapping/methods , Brain/diagnostic imaging , Brain/physiology , Magnetic Resonance Imaging/methods , Algorithms , Hemodynamics
4.
J R Stat Soc Ser C Appl Stat ; 72(5): 1375-1393, 2023 Nov.
Article in English | MEDLINE | ID: mdl-38143734

ABSTRACT

Stability selection represents an attractive approach to identify sparse sets of features jointly associated with an outcome in high-dimensional contexts. We introduce an automated calibration procedure via maximisation of an in-house stability score and accommodating a priori-known block structure (e.g. multi-OMIC) data. It applies to [Least Absolute Shrinkage Selection Operator (LASSO)] penalised regression and graphical models. Simulations show our approach outperforms non-stability-based and stability selection approaches using the original calibration. Application to multi-block graphical LASSO on real (epigenetic and transcriptomic) data from the Norwegian Women and Cancer study reveals a central/credible and novel cross-OMIC role of LRRN3 in the biological response to smoking. Proposed approaches were implemented in the R package sharp.

5.
Clin Epigenetics ; 15(1): 114, 2023 07 13.
Article in English | MEDLINE | ID: mdl-37443060

ABSTRACT

BACKGROUND: DNA methylation (DNAm) is robustly associated with chronological age in children and adults, and gestational age (GA) in newborns. This property has enabled the development of several epigenetic clocks that can accurately predict chronological age and GA. However, the lack of overlap in predictive CpGs across different epigenetic clocks remains elusive. Our main aim was therefore to identify and characterize CpGs that are stably predictive of GA. RESULTS: We applied a statistical approach called 'stability selection' to DNAm data from 2138 newborns in the Norwegian Mother, Father, and Child Cohort study. Stability selection combines subsampling with variable selection to restrict the number of false discoveries in the set of selected variables. Twenty-four CpGs were identified as being stably predictive of GA. Intriguingly, only up to 10% of the CpGs in previous GA clocks were found to be stably selected. Based on these results, we used generalized additive model regression to develop a new GA clock consisting of only five CpGs, which showed a similar predictive performance as previous GA clocks (R2 = 0.674, median absolute deviation = 4.4 days). These CpGs were in or near genes and regulatory regions involved in immune responses, metabolism, and developmental processes. Furthermore, accounting for nonlinear associations improved prediction performance in preterm newborns. CONCLUSION: We present a methodological framework for feature selection that is broadly applicable to any trait that can be predicted from DNAm data. We demonstrate its utility by identifying CpGs that are highly predictive of GA and present a new and highly performant GA clock based on only five CpGs that is more amenable to a clinical setting.


Subject(s)
DNA Methylation , Epigenesis, Genetic , Adult , Female , Child , Humans , Infant, Newborn , Cohort Studies , Gestational Age , Mothers , CpG Islands
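
The abstract above describes stability selection as subsampling combined with variable selection. A minimal Python sketch of that idea follows. It is not the authors' pipeline: the lasso base selector, the half-size subsamples, the penalty `lam`, the 0.8 selection threshold, and the helper names (`lasso_cd`, `stability_selection`, `make_toy_data`) are all assumptions chosen for illustration, and the data are synthetic.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=50):
    """Lasso by cyclic coordinate descent: min (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)  # residual y - X @ beta, with beta = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (feature j removed).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def stability_selection(X, y, lam, n_subsamples=50, threshold=0.8, seed=0):
    """Selection frequency of each feature over random half-size subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        Xs = X[idx] - X[idx].mean(axis=0)  # centre within the subsample
        ys = y[idx] - y[idx].mean()
        counts += lasso_cd(Xs, ys, lam) != 0
    freq = counts / n_subsamples
    return freq, np.flatnonzero(freq >= threshold)


def make_toy_data(n=200, p=30, seed=1):
    """Synthetic data (hypothetical): only features 0, 1 and 2 carry signal."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 2.0 * X[:, 2] + rng.standard_normal(n)
    return X, y


X, y = make_toy_data()
freq, selected = stability_selection(X, y, lam=0.5)
print("stably selected feature indices:", selected)
```

Features are kept only if the base selector picks them in a large fraction of subsamples, which is what makes the final set less sensitive to any single fit.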
6.
Diabetologia ; 66(9): 1643-1654, 2023 09.
Article in English | MEDLINE | ID: mdl-37329449

ABSTRACT

AIMS/HYPOTHESIS: The euglycaemic-hyperinsulinaemic clamp (EIC) is the reference standard for the measurement of whole-body insulin sensitivity but is laborious and expensive to perform. We aimed to assess the incremental value of high-throughput plasma proteomic profiling in developing signatures correlating with the M value derived from the EIC. METHODS: We measured 828 proteins in the fasting plasma of 966 participants from the Relationship between Insulin Sensitivity and Cardiovascular disease (RISC) study and 745 participants from the Uppsala Longitudinal Study of Adult Men (ULSAM) using a high-throughput proximity extension assay. We used the least absolute shrinkage and selection operator (LASSO) approach using clinical variables and protein measures as features. Models were tested within and across cohorts. Our primary model performance metric was the proportion of the M value variance explained (R2). RESULTS: A standard LASSO model incorporating 53 proteins in addition to routinely available clinical variables increased the M value R2 from 0.237 (95% CI 0.178, 0.303) to 0.456 (0.372, 0.536) in RISC. A similar pattern was observed in ULSAM, in which the M value R2 increased from 0.443 (0.360, 0.530) to 0.632 (0.569, 0.698) with the addition of 61 proteins. Models trained in one cohort and tested in the other also demonstrated significant improvements in R2 despite differences in baseline cohort characteristics and clamp methodology (RISC to ULSAM: 0.491 [0.433, 0.539] for 51 proteins; ULSAM to RISC: 0.369 [0.331, 0.416] for 67 proteins). A randomised LASSO and stability selection algorithm selected only two proteins per cohort (three unique proteins), which improved R2 but to a lesser degree than in standard LASSO models: 0.352 (0.266, 0.439) in RISC and 0.495 (0.404, 0.585) in ULSAM. 
Reductions in improvements of R2 with randomised LASSO and stability selection were less marked in cross-cohort analyses (RISC to ULSAM R2 0.444 [0.391, 0.497]; ULSAM to RISC R2 0.348 [0.300, 0.396]). Models of proteins alone were as effective as models that included both clinical variables and proteins using either standard or randomised LASSO. The single most consistently selected protein across all analyses and models was IGF-binding protein 2. CONCLUSIONS/INTERPRETATION: A plasma proteomic signature identified using a standard LASSO approach improves the cross-sectional estimation of the M value over routine clinical variables. However, a small subset of these proteins identified using a stability selection algorithm affords much of this improvement, especially when considering cross-cohort analyses. Our approach provides opportunities to improve the identification of insulin-resistant individuals at risk of insulin resistance-related adverse health consequences.


Subject(s)
Cardiovascular Diseases , Insulin Resistance , Male , Adult , Humans , Longitudinal Studies , Proteomics , Cross-Sectional Studies , Insulin
7.
Article in English | MEDLINE | ID: mdl-37090139

ABSTRACT

A novel variable selection method for low-dimensional generalized linear models is introduced. The new approach called AIC OPTimization via STABility Selection (OPT-STABS) repeatedly subsamples the data, minimizes Akaike's Information Criterion (AIC) over a sequence of nested models for each subsample, and includes in the final model those predictors selected in the minimum AIC model in a large fraction of the subsamples. New methods are also introduced to establish an optimal variable selection cutoff over repeated subsamples. An extensive simulation study examining a variety of proposed variable selection methods shows that, although no single method uniformly outperforms the others in all the scenarios considered, OPT-STABS is consistently among the best-performing methods in most settings while it performs competitively for the rest. This is in contrast to other candidate methods which either have poor performance across the board or exhibit good performance in some settings, but very poor in others. In addition, the asymptotic properties of the OPT-STABS estimator are derived, and its root-n consistency and asymptotic normality are proved. The methods are applied to two datasets involving logistic and Poisson regressions.
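
The subsample, minimize-AIC, and count loop described in this abstract can be sketched roughly as follows. This is an illustration only, not the OPT-STABS implementation: it assumes a Gaussian linear model, builds the nested sequence by ordering predictors on absolute marginal correlation, and uses a fixed cutoff of 0.8, whereas the abstract covers generalized linear models and proposes its own methods for choosing the cutoff. The function names are hypothetical.

```python
import numpy as np


def min_aic_support(X, y):
    """Predictor indices in the minimum-AIC model of one nested sequence."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Assumption: nest the models by decreasing absolute marginal correlation.
    score = np.abs(Xc.T @ yc) / np.linalg.norm(Xc, axis=0)
    order = np.argsort(-score)
    best_aic = n * np.log(yc @ yc / n) + 2  # intercept-only model
    best_k = 0
    for k in range(1, p + 1):
        cols = order[:k]
        coef, *_ = np.linalg.lstsq(Xc[:, cols], yc, rcond=None)
        rss = np.sum((yc - Xc[:, cols] @ coef) ** 2)
        aic = n * np.log(rss / n) + 2 * (k + 1)  # Gaussian AIC up to a constant
        if aic < best_aic:
            best_aic, best_k = aic, k
    return order[:best_k]


def subsample_aic_selection(X, y, n_subsamples=100, cutoff=0.8, seed=0):
    """Fraction of half-size subsamples whose minimum-AIC model has each predictor."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        counts[min_aic_support(X[idx], y[idx])] += 1
    freq = counts / n_subsamples
    return freq, np.flatnonzero(freq >= cutoff)


# Synthetic example (hypothetical data): predictors 0 and 1 carry the signal.
rng = np.random.default_rng(3)
X = rng.standard_normal((300, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(300)
freq, selected = subsample_aic_selection(X, y)
print("selection frequencies:", np.round(freq, 2))
```

AIC alone tends to admit a noise predictor now and then. Requiring a predictor to appear in the minimum-AIC model across most subsamples is what filters those occasional inclusions out.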

8.
Stat Methods Med Res ; 31(11): 2201-2216, 2022 11.
Article in English | MEDLINE | ID: mdl-36113157

ABSTRACT

In many biomedical studies, multiple views of data (e.g. genomics, proteomics) are available, and a particular interest might be the detection of sample subgroups characterized by specific groups of variables. Biclustering methods are well-suited for this problem as they assume that specific groups of variables might be relevant only to specific groups of samples. Many biclustering methods exist for detecting row-column clusters in a view but few methods exist for data from multiple views. The few existing algorithms are heavily dependent on regularization parameters for getting row-column clusters, and they impose an unnecessary burden on users, thus limiting their use in practice. We extend an existing biclustering method based on sparse singular value decomposition for single-view data to data from multiple views. Our method, integrative sparse singular value decomposition (iSSVD), incorporates stability selection to control Type I error rates, estimates the probability that samples and variables belong to a bicluster, finds stable biclusters, and results in interpretable row-column associations. Simulations and real data analyses show that integrative sparse singular value decomposition outperforms several other single- and multi-view biclustering methods and is able to detect meaningful biclusters. iSSVD is a user-friendly, computationally efficient algorithm that will be useful in many disease subtyping applications.


Subject(s)
Algorithms , Gene Expression Profiling , Cluster Analysis , Oligonucleotide Array Sequence Analysis/methods , Gene Expression Profiling/methods
9.
Front Neurosci ; 16: 895560, 2022.
Article in English | MEDLINE | ID: mdl-35812216

ABSTRACT

Cochlear nerve deficiency (CND) is often associated with variable outcomes of cochlear implantation (CI). We assessed previous investigations aiming to identify the main factors that determine CI outcomes, which would enable us to develop predictive models. Seventy patients with CND and normal cochlea who underwent CI surgery were retrospectively examined. First, using a data-driven approach, we collected demographic information, radiographic measurements, audiological findings, and audition and speech assessments. Next, CI outcomes were evaluated based on the scores obtained after 2 years of CI from the Categories of Auditory Performance index, Speech Intelligibility Rating, Infant/Toddler Meaningful Auditory Integration Scale or Meaningful Auditory Integration Scale, and Meaningful Use of Speech Scale. Then, we measured and averaged the audiological and radiographic characteristics of the patients to form feature vectors, adopting a multivariate feature selection method, called stability selection, to select the features that were consistent within a certain range of model parameters. Stability selection analysis identified two out of six characteristics, namely the vestibulocochlear nerve (VCN) area and the number of nerve bundles, which played an important role in predicting the hearing and speech rehabilitation results of CND patients. Finally, we used a parameter-optimized support vector machine (SVM) as a classifier to study the postoperative hearing and speech rehabilitation of the patients. For hearing rehabilitation, the accuracy rate was 71% for both the SVM classification and the area under the curve (AUC), whereas for speech rehabilitation, the accuracy rate for SVM classification and AUC was 93% and 94%, respectively. Our results identified that a greater number of nerve bundles and a larger VCN area were associated with better CI outcomes. The number of nerve bundles and VCN area can predict CI outcomes in patients with CND. 
These findings can help surgeons in selecting the side for CI and provide reasonable expectations for the outcomes of CI surgery.

10.
Proc Math Phys Eng Sci ; 478(2262): 20210916, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35756878

ABSTRACT

We present a statistical learning framework for robust identification of differential equations from noisy spatio-temporal data. We address two issues that have so far limited the application of such methods, namely their robustness against noise and the need for manual parameter tuning, by proposing stability-based model selection to determine the level of regularization required for reproducible inference. This avoids manual parameter tuning and improves robustness against noise in the data. Our stability selection approach, termed PDE-STRIDE, can be combined with any sparsity-promoting regression method and provides an interpretable criterion for model component importance. We show that the particular combination of stability selection with the iterative hard-thresholding algorithm from compressed sensing provides a fast and robust framework for equation inference that outperforms previous approaches with respect to accuracy, amount of data required, and robustness. We illustrate the performance of PDE-STRIDE on a range of simulated benchmark problems, and we demonstrate the applicability of PDE-STRIDE on real-world data by considering purely data-driven inference of the protein interaction network for embryonic polarization in Caenorhabditis elegans. Using fluorescence microscopy images of C. elegans zygotes as input data, PDE-STRIDE is able to learn the molecular interactions of the proteins.
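
The abstract pairs stability selection with iterative hard thresholding (IHT). As a rough illustration of the IHT step alone, here is a minimal Python sketch for sparse linear regression. It is not the PDE-STRIDE code: the fixed sparsity level `k`, the step size of one over the squared spectral norm, the function name `iht`, and the noiseless synthetic data are all assumptions made for the example.

```python
import numpy as np


def iht(X, y, k, n_iter=200):
    """Iterative hard thresholding: gradient step, then keep the k largest entries."""
    n, p = X.shape
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / largest eigenvalue of X^T X
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Gradient step on the least-squares loss 0.5 * ||y - X beta||^2.
        beta = beta + step * X.T @ (y - X @ beta)
        # Hard threshold: zero everything except the k largest magnitudes.
        keep = np.argpartition(np.abs(beta), -k)[-k:]
        pruned = np.zeros(p)
        pruned[keep] = beta[keep]
        beta = pruned
    return beta


# Synthetic, noiseless example (hypothetical data): three active terms out of ten.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
beta_true = np.zeros(10)
beta_true[[1, 4, 8]] = [3.0, -2.0, 2.5]
y = X @ beta_true
beta_hat = iht(X, y, k=3)
print("recovered support:", np.flatnonzero(beta_hat))
```

In an equation-inference setting the columns of `X` would be candidate terms of the differential equation evaluated on the data, and a stability-selection loop over subsamples would wrap a sparse solver such as this one.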

11.
Front Comput Neurosci ; 15: 735991, 2021.
Article in English | MEDLINE | ID: mdl-34795570

ABSTRACT

Structural MRI (sMRI) has been widely used to examine the cerebral changes that occur in Parkinson's disease (PD). However, previous studies have aimed for brain changes at the group level rather than at the individual level. Additionally, previous studies have been inconsistent regarding the changes they identified. It is difficult to identify which brain regions are the true biomarkers of PD. To overcome these two issues, we employed four different feature selection methods [ReliefF, graph-theory, recursive feature elimination (RFE), and stability selection] to obtain a minimal set of relevant and nonredundant features from gray matter (GM) and white matter (WM). Then, a support vector machine (SVM) was utilized to learn decision models from selected features. Using machine learning, this study not only extended group-level statistical analysis, which identifies group differences, to individual-level prediction that distinguishes patients with PD from healthy controls (HCs), but also identified the most informative brain regions through feature selection. Furthermore, we conducted horizontal and vertical analyses to investigate the stability of the identified brain regions. On the one hand, we compared the brain changes found by different feature selection methods and considered the brain regions commonly found by these methods as potential biomarkers related to PD. On the other hand, we compared these brain changes with previous findings reported by conventional statistical analysis to evaluate their stability. Our experiments have demonstrated that the proposed machine learning techniques achieve satisfactory and robust classification performance. The highest classification performance was 92.24% (specificity), 92.42% (sensitivity), 89.58% (accuracy), and 89.77% (AUC) for GM and 71.93% (specificity), 74.87% (sensitivity), 71.18% (accuracy), and 71.82% (AUC) for WM. 
Moreover, most brain regions identified by machine learning were consistent with previous findings, which means that these brain regions are related to the pathological brain changes characteristic of PD and can be regarded as potential biomarkers of PD. Besides, we also found the brain abnormality of superior frontal gyrus (dorsolateral, SFGdor) and lingual gyrus (LING), which have been confirmed in other studies of PD. This further demonstrates that machine learning models are beneficial for clinicians as a decision support system in diagnosing PD.

12.
Front Genet ; 12: 696956, 2021.
Article in English | MEDLINE | ID: mdl-34267783

ABSTRACT

Copy number variation (CNV) may contribute to the development of complex diseases. However, due to the complex mechanism of path association and the lack of sufficient samples, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene, and disease label data provides us with an opportunity to design a new machine learning framework to predict potential disease-related CNVs. In this paper, we developed a novel machine learning approach, namely, IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Mid-correlation and L1-regularized Logistic Regression under stability selection), to predict the CNV-disease path associations by using a data set containing CNV, disease state labels, and gene data. CNVs, genes, and diseases are connected through edges and then constitute a biological association network. To construct a biological network, we first used a self-adaptive biweight mid-correlation (BM) formula to calculate correlation coefficients between CNVs and genes. Then, we used logistic regression with L1 penalty (LLR) function to detect genes related to disease. We added stability selection strategy, which can effectively reduce false positives, when using self-adaptive BM and LLR. Finally, a weighted path search algorithm was applied to find top D path associations and important CNVs. The experimental results on both simulation and prostate cancer data show that IHI-BMLLR is significantly better than two state-of-the-art CNV detection methods (i.e., CCRET and DPtest) under false-positive control. Furthermore, we applied IHI-BMLLR to prostate cancer data and found significant path associations. Three new cancer-related genes were discovered in the paths, and these genes need to be verified by biological research in the future.

13.
J Neural Eng ; 18(4)2021 03 23.
Article in English | MEDLINE | ID: mdl-33690177

ABSTRACT

Objective. Categorical perception (CP) of audio is critical to understand how the human brain perceives speech sounds despite widespread variability in acoustic properties. Here, we investigated the spatiotemporal characteristics of auditory neural activity that reflects CP for speech (i.e. differentiates phonetic prototypes from ambiguous speech sounds). Approach. We recorded 64-channel electroencephalograms as listeners rapidly classified vowel sounds along an acoustic-phonetic continuum. We used support vector machine classifiers and stability selection to determine when and where in the brain CP was best decoded across space and time via source-level analysis of the event-related potentials. Main results. We found that early (120 ms) whole-brain data decoded speech categories (i.e. prototypical vs. ambiguous tokens) with 95.16% accuracy (area under the curve 95.14%; F1-score 95.00%). Separate analyses on left hemisphere (LH) and right hemisphere (RH) responses showed that LH decoding was more accurate and earlier than RH (89.03% vs. 86.45% accuracy; 140 ms vs. 200 ms). Stability (feature) selection identified 13 regions of interest (ROIs) out of 68 brain regions [including auditory cortex, supramarginal gyrus, and inferior frontal gyrus (IFG)] that showed categorical representation during stimulus encoding (0-260 ms). In contrast, 15 ROIs (including fronto-parietal regions, IFG, motor cortex) were necessary to describe later decision stages (later 300-800 ms) of categorization but these areas were highly associated with the strength of listeners' categorical hearing (i.e. slope of behavioral identification functions). Significance. Our data-driven multivariate models demonstrate that abstract categories emerge surprisingly early (∼120 ms) in the time course of speech processing and are dominated by engagement of a relatively compact fronto-temporal-parietal brain network.


Subject(s)
Auditory Cortex , Speech Perception , Acoustic Stimulation/methods , Auditory Cortex/physiology , Brain/physiology , Evoked Potentials, Auditory , Humans , Machine Learning , Speech , Speech Perception/physiology
14.
Biostatistics ; 22(2): 421-436, 2021 04 10.
Article in English | MEDLINE | ID: mdl-31631216

ABSTRACT

Identifying biomarkers as surrogates for clinical endpoints in randomized vaccine trials is useful for reducing study duration and costs, relieving participants of unnecessary discomfort, and understanding vaccine-effect mechanism. In this article, we use risk models with multiple vaccine-induced immune response biomarkers to measure the causal association between a vaccine's effects on these biomarkers and that on the clinical endpoint. In this setup, our main objective is to combine and select markers with high surrogacy from a list of many candidate markers, allowing us to get a more parsimonious model which can potentially increase the predictive quality of the true markers. To address the missing "potential" biomarker value if a subject receives placebo, we utilize the baseline immunogenicity predictor design augmented with a "closeout placebo vaccination" group. We then impute the missing potential marker values and conduct marker selection through a stepwise resampling and imputation method called stability selection. We test our proposed strategy under relevant simulation settings and on (partially simulated) biomarker data from a HIV vaccine trial (RV144).


Subject(s)
AIDS Vaccines , Biomarkers , Causality , Humans , Immunity , Randomized Controlled Trials as Topic , Research Design
15.
Stat Med ; 40(4): 897-919, 2021 02 20.
Article in English | MEDLINE | ID: mdl-33219557

ABSTRACT

In this article, we present a new variable selection method for regression and classification purposes, particularly for microbiome analysis. Our method, called subsampling ranking forward selection (SuRF), is based on LASSO penalized regression, subsampling and forward-selection methods. SuRF offers major advantages over existing variable selection methods in terms of both sparsity of selected models and model inference. We provide an R package that can implement our method for generalized linear models. We apply our method to classification problems from microbiome data, using a novel agglomeration approach to deal with the special tree-like correlation structure of the variables. Existing methods arbitrarily choose a taxonomic level a priori before performing the analysis, whereas by combining SuRF with these aggregated variables, we are able to identify the key biomarkers at the appropriate taxonomic level, as suggested by the data. We present simulations in multiple sparse settings to demonstrate that our approach performs better than several other popularly used existing approaches in recovering the true variables. We apply SuRF to two microbiome datasets: one about prediction of pouchitis and another for identifying samples from two healthy individuals. We find that SuRF can provide a better or comparable prediction with other methods while controlling the false positive rate of variable selection.


Subject(s)
Data Analysis , Microbiota , Humans
16.
BMC Res Notes ; 13(1): 521, 2020 Nov 10.
Article in English | MEDLINE | ID: mdl-33172489

ABSTRACT

OBJECTIVES: Numerous software tools have been developed to infer the gene regulatory network, a long-standing key topic in biology and computational biology. Yet the slowness and inaccuracy inherent in current software hamper its application to increasingly massive data. Here, we develop a software tool, FINET (Fast Inferring NETwork), to infer a network with high accuracy and rapidity from big data. RESULTS: The high accuracy results from integrating algorithms with stability-selection, elastic-net, and parameter optimization. Tested on a known biological network, FINET infers interactions with over 94% precision. The high speed comes from partnering parallel computations implemented with Julia, a new compiled language that runs much faster than existing languages used in the current software, such as R, Python, and MATLAB. Although FINET is implemented in Julia, users with no background in the language or computer science can easily operate it with a single user-friendly command line. In addition, FINET can infer other networks such as chemical networks and social networks. Overall, FINET provides a confident way to efficiently and accurately infer any type of network for any scale of data.


Subject(s)
Computational Biology , Gene Regulatory Networks , Algorithms , Computers , Software
17.
Mol Genet Genomic Med ; 8(10): e1400, 2020 10.
Article in English | MEDLINE | ID: mdl-32869517

ABSTRACT

BACKGROUND: Neurofibromatosis type 1 (NF1) is a tumor-predisposition disorder that arises due to pathogenic variants in tumor suppressor NF1. NF1 has variable expressivity that may be due, at least in part, to heritable elements such as modifier genes; however, few genetic modifiers have been identified to date. METHODS: In this study, we performed a genome-wide association analysis of the number of café-au-lait macules (CALM) that are considered a tumor-like trait as a clinical phenotype modifying NF1. RESULTS: A borderline genome-wide significant association was identified in the discovery cohort (CALM1, N = 112) between CALM number and rs12190451 (and rs3799603, r² = 1.0; p = 7.4 × 10⁻⁸) in the intronic region of RPS6KA2. Although this association was not replicated in the second cohort (CALM2, N = 59) and a meta-analysis did not show significantly associated variants in this region, a significant corroboration score (0.72) was obtained for the RPS6KA2 signal in the discovery cohort (CALM1) using Complementary Pairs Stability Selection for Genome-Wide Association Studies (ComPaSS-GWAS) analysis, suggesting that the lack of replication may be due to heterogeneity of the cohorts rather than type I error. CONCLUSION: rs12190451 is located in a melanocyte-specific enhancer and may influence RPS6KA2 expression in melanocytes, warranting further functional studies.


Subject(s)
Cafe-au-Lait Spots/genetics , Neurofibromatosis 1/genetics , Polymorphism, Single Nucleotide , Ribosomal Protein S6 Kinases, 90-kDa/genetics , Adult , Female , Humans , Male , Middle Aged
18.
Front Neurosci ; 14: 748, 2020.
Article in English | MEDLINE | ID: mdl-32765215

ABSTRACT

Speech perception in noisy environments depends on complex interactions between sensory and cognitive systems. In older adults, such interactions may be affected, especially in those individuals who have more severe age-related hearing loss. Using a data-driven approach, we assessed the temporal (when in time) and spatial (where in the brain) characteristics of cortical speech-evoked responses that distinguish older adults with or without mild hearing loss. We performed source analyses to estimate cortical surface signals from the EEG recordings during a phoneme discrimination task conducted under clear and noise-degraded conditions. We computed source-level ERPs (i.e., mean activation within each ROI) from each of the 68 ROIs of the Desikan-Killiany (DK) atlas, averaged over 100 trials chosen randomly without replacement to form feature vectors. We adopted a multivariate feature selection method called stability selection and control to choose features that are consistent over a range of model parameters. We used a parameter-optimized support vector machine (SVM) as a classifier to investigate the time course and brain regions that segregate groups and speech clarity. For clear speech perception, whole-brain data revealed a classification accuracy of 81.50% [area under the curve (AUC) 80.73%; F1-score 82.00%], distinguishing groups within ∼60 ms after speech onset (i.e., as early as the P1 wave). We observed lower accuracy of 78.12% [AUC 77.64%; F1-score 78.00%] and delayed classification performance when speech was embedded in noise, with group segregation at 80 ms. Separate analysis using left (LH) and right hemisphere (RH) regions showed that LH speech activity was better at distinguishing hearing groups than activity measured in the RH.
Moreover, stability selection analysis identified 12 brain regions (among 1428 total spatiotemporal features from 68 regions) where source activity segregated groups with >80% accuracy (clear speech); whereas 16 regions were critical for noise-degraded speech to achieve a comparable level of group segregation (78.7% accuracy). Our results identify critical time-courses and brain regions that distinguish mild hearing loss from normal hearing in older adults and confirm a larger number of active areas, particularly in RH, when processing noise-degraded speech information.

19.
Stat Methods Med Res ; 29(12): 3684-3694, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32646307

ABSTRACT

OBJECTIVE: We propose a data-driven method to detect temporal patterns of disease progression in high-dimensional claims data based on gradient boosting with stability selection. MATERIALS AND METHODS: We identified patients with chronic obstructive pulmonary disease in a German health insurance claims database with 6.5 million individuals and divided them into a group of patients with the highest disease severity and a group of control patients with lower severity. We then used gradient boosting with stability selection to determine variables correlating with a chronic obstructive pulmonary disease diagnosis of highest severity and subsequently model the temporal progression of the disease using the selected variables. RESULTS: We identified a network of 20 diagnoses (e.g. respiratory failure), medications (e.g. anticholinergic drugs) and procedures associated with a subsequent chronic obstructive pulmonary disease diagnosis of highest severity. Furthermore, the network successfully captured temporal patterns, such as disease progressions from lower to higher severity grades. DISCUSSION: The temporal trajectories identified by our data-driven approach are compatible with existing knowledge about chronic obstructive pulmonary disease showing that the method can reliably select relevant variables in a high-dimensional context. CONCLUSION: We provide a generalizable approach for the automatic detection of disease trajectories in claims data. This could help to diagnose diseases early, identify unknown risk factors and optimize treatment plans.


Subject(s)
Pulmonary Disease, Chronic Obstructive , Databases, Factual , Humans , Insurance, Health , Risk Factors , Severity of Illness Index
20.
Stat Appl Genet Mol Biol ; 18(5)2019 10 07.
Article in English | MEDLINE | ID: mdl-31586968

ABSTRACT

Instability in model selection is a major concern for data sets containing a large number of covariates. We focus on stability selection, a technique that improves variable selection performance for a range of selection methods by aggregating the results of applying a selection procedure to sub-samples of the data, here in settings where the observations are subject to right censoring. Accelerated failure time (AFT) models have proved useful in many contexts, including heavy censoring (as, for example, in cancer survival) and high dimensionality (as, for example, in microarray data). We implement the stability selection approach using three variable selection techniques (Lasso, ridge regression, and elastic net) applied to censored data using AFT models. We compare the performance of these regularized techniques with and without stability selection in simulation studies and two real data examples: a breast cancer data set and a diffuse large B-cell lymphoma data set. The results suggest that stability selection consistently yields a stable set of selected variables and that, as the dimension of the data increases, the performance of methods with stability selection also improves compared to methods without stability selection, irrespective of the collinearity between the covariates.
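
The sub-sample aggregation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an ordinary linear model without censoring (no AFT model), uses a plain coordinate-descent Lasso as the base selector, and all names (lasso_cd, stability_selection, lam, threshold) and the simulated data are hypothetical choices made for illustration only.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=50):
    """Lasso by cyclic coordinate descent.

    Minimizes (1/2n) * ||y - X b||^2 + lam * ||b||_1 (illustrative helper).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-column scale x_j'x_j / n
    resid = y.astype(float)                # residual y - X beta (beta = 0 at start)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of column j with the partial residual (j excluded)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def stability_selection(X, y, lam, n_subsamples=50, threshold=0.6, seed=0):
    """Selection frequency of each variable over random half-samples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)  # half-sample, no replacement
        Xs = X[idx] - X[idx].mean(axis=0)                # center within the sub-sample
        ys = y[idx] - y[idx].mean()
        counts += lasso_cd(Xs, ys, lam) != 0             # record which variables enter
    freq = counts / n_subsamples
    return freq, np.flatnonzero(freq >= threshold)

# Simulated data (assumption): 30 covariates, only the first three carry signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.standard_normal(200)

freq, selected = stability_selection(X, y, lam=0.3)
print("selection frequencies:", np.round(freq, 2))
print("stable variables:", selected)
```

Variables are kept only if they are selected in at least the chosen fraction of sub-samples, which is what makes the final set less sensitive to any single split of the data. Handling right censoring, as in the abstract, would require replacing the base selector with a penalized AFT fit.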


Subject(s)
Computer Simulation , Probability , Algorithms , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Breast Neoplasms/mortality , Breast Neoplasms/pathology , Female , Humans , Linear Models , Lymphoma, B-Cell/genetics , Lymphoma, B-Cell/metabolism , Lymphoma, B-Cell/mortality , Neoplasm Metastasis