Results 1 - 20 of 26
1.
ArXiv ; 2024 Jan 23.
Article in English | MEDLINE | ID: mdl-38344224

ABSTRACT

Recent advancements in protein docking site prediction have highlighted the limitations of traditional rigid docking algorithms, like PIPER, which often neglect critical stochastic elements such as solvent-induced fluctuations. These oversights can lead to inaccuracies in identifying viable docking sites due to the complexity of high-dimensional, stochastic energy manifolds with low regularity. To address this issue, our research introduces a novel model where the molecular shapes of ligands and receptors are represented using multi-variate Karhunen-Loève (KL) expansions. This method effectively captures the stochastic nature of energy manifolds, allowing for a more accurate representation of molecular interactions. Developed as a plugin for PIPER, our scientific computing software enhances the platform, delivering robust uncertainty measures for the energy manifolds of ranked binding sites. Our results demonstrate that top-ranked binding sites, characterized by lower uncertainty in the stochastic energy manifold, align closely with actual docking sites. Conversely, sites with higher uncertainty correlate with less optimal docking positions. This distinction not only validates our approach but also sets a new standard in protein docking predictions, offering substantial implications for future molecular interaction research and drug development.

2.
Entropy (Basel) ; 24(3)2022 Mar 18.
Article in English | MEDLINE | ID: mdl-35327933

ABSTRACT

We present a coupled variational autoencoder (VAE) method, which improves the accuracy and robustness of the model representation of handwritten numeral images. The improvement is measured in both increasing the likelihood of the reconstructed images and in reducing divergence between the posterior and a prior latent distribution. The new method weighs outlier samples with a higher penalty by generalizing the original evidence lower bound function using a coupled entropy function based on the principles of nonlinear statistical coupling. We evaluated the performance of the coupled VAE model using the Modified National Institute of Standards and Technology (MNIST) dataset and its corrupted modification C-MNIST. Histograms of the likelihood that the reconstruction matches the original image show that the coupled VAE improves the reconstruction and this improvement is more substantial when seeded with corrupted images. All five corruptions evaluated showed improvement. For instance, with the Gaussian corruption seed the accuracy improves by 10^14 (from 10^-57.2 to 10^-42.9) and robustness improves by 10^22 (from 10^-109.2 to 10^-87.0). Furthermore, the divergence between the posterior and prior distribution of the latent distribution is reduced. Thus, in contrast to the β-VAE design, the coupled VAE algorithm improves model representation, rather than trading off the performance of the reconstruction and latent distribution divergence.
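
Reading the reported likelihood values as powers of ten, the quoted improvement factors are differences of base-10 exponents. The following is a minimal sketch of that arithmetic only; the function name `improvement_exponent` is an assumption made for illustration and is not taken from the paper.

```python
def improvement_exponent(log10_before: float, log10_after: float) -> float:
    """Return the base-10 exponent of the ratio after / before."""
    return log10_after - log10_before


# Exponents quoted in the abstract for the Gaussian corruption seed.
accuracy_gain = improvement_exponent(-57.2, -42.9)
robustness_gain = improvement_exponent(-109.2, -87.0)

print(f"accuracy improves by about 10^{accuracy_gain:.1f}")
print(f"robustness improves by about 10^{robustness_gain:.1f}")
```

The exponents differ by roughly 14.3 and 22.2, which matches the rounded orders of magnitude given in the abstract.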

3.
Front Mol Biosci ; 8: 663532, 2021.
Article in English | MEDLINE | ID: mdl-34222331

ABSTRACT

Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g., from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue toward new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables ("rules") frequently used for classification. We first apply BowSaw to a simulated dataset and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn's disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.
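
The general idea of mining a trained forest for frequently co-used variables can be illustrated with a toy example. This is a sketch under stated assumptions, not the BowSaw algorithm itself: the nested-dict tree format, the helper names `path_variables` and `count_pairs`, and the restriction to pairs of variables are all hypothetical choices made here for brevity.

```python
from collections import Counter
from itertools import combinations


def path_variables(tree, sample):
    """Walk one toy tree; return the variables tested and the leaf label.

    Internal node (hypothetical format):
    {"var": name, "thr": threshold, "lo": subtree, "hi": subtree}.
    Any non-dict value is treated as a leaf label.
    """
    used = []
    node = tree
    while isinstance(node, dict):
        used.append(node["var"])
        node = node["lo"] if sample[node["var"]] <= node["thr"] else node["hi"]
    return used, node


def count_pairs(forest, samples, target_label):
    """Count variable pairs that share a decision path ending in target_label."""
    counts = Counter()
    for sample in samples:
        for tree in forest:
            used, label = path_variables(tree, sample)
            if label == target_label:
                for pair in combinations(sorted(set(used)), 2):
                    counts[pair] += 1
    return counts


# Toy forest: two trees encode "x1 AND x2", one tree uses x3 alone.
tree_a = {"var": "x1", "thr": 0.5, "lo": 0,
          "hi": {"var": "x2", "thr": 0.5, "lo": 0, "hi": 1}}
tree_b = {"var": "x2", "thr": 0.5, "lo": 0,
          "hi": {"var": "x1", "thr": 0.5, "lo": 0, "hi": 1}}
tree_c = {"var": "x3", "thr": 0.5, "lo": 0, "hi": 1}

samples = [
    {"x1": 1, "x2": 1, "x3": 0},
    {"x1": 1, "x2": 1, "x3": 1},
    {"x1": 0, "x2": 1, "x3": 1},
]
pair_counts = count_pairs([tree_a, tree_b, tree_c], samples, target_label=1)
print(pair_counts.most_common())
```

In this toy setting only the pair (x1, x2) accumulates counts, which is the kind of candidate "rule" one would then examine further.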

4.
Adv Comput Math ; 46(3)2020 Mar.
Article in English | MEDLINE | ID: mdl-32377059

ABSTRACT

In this paper we introduce concepts from uncertainty quantification (UQ) and numerical analysis for the efficient evaluation of stochastic high dimensional Newton iterates. In particular, we develop complex analytic regularity theory of the solution with respect to the random variables. This justifies the application of sparse grids for the computation of statistical measures. Convergence rates are derived and are shown to be subexponential or algebraic with respect to the number of realizations of random perturbations. Due to the accuracy of the method, sparse grids are well suited for computing low probability events with high confidence. We apply our method to the power flow problem. Numerical experiments on the non-trivial 39 bus New England power system model with large stochastic loads are consistent with the theoretical convergence rates. Moreover, compared to the Monte Carlo method our approach is at least 10^11 times faster for the same accuracy.

5.
Explor Med ; 1(1): 27-41, 2020.
Article in English | MEDLINE | ID: mdl-33554217

ABSTRACT

AIM: Racial disparities in opioid use disorder (OUD) management exist; however, there is limited research on factors that influence opioid cessation in different population groups. METHODS: We employed multiple machine learning prediction algorithms (least absolute shrinkage and selection operator, random forest, deep neural network, and support vector machine) to assess factors associated with ceasing opioid use in a sample of 1,192 African Americans (AAs) and 2,557 individuals of European ancestry (EAs) who met Diagnostic and Statistical Manual of Mental Disorders, 5th Edition criteria for OUD. Values for nearly 4,000 variables reflecting demographics, alcohol and other drug use, general health, non-drug use behaviors, and diagnoses for other psychiatric disorders were obtained for each participant from the Semi-Structured Assessment for Drug Dependence and Alcoholism, a detailed semi-structured interview. RESULTS: Support vector machine models performed marginally better on average than other machine learning methods with maximum prediction accuracies of 75.4% in AAs and 79.4% in EAs. Subsequent stepwise regression considered the 83 most highly ranked variables across all methods and models and identified less recent cocaine use (AAs: odds ratio (OR) = 1.82, P = 9.19 × 10^-5; EAs: OR = 1.91, P = 3.30 × 10^-15), shorter duration of opioid use (AAs: OR = 0.55, P = 5.78 × 10^-6; EAs: OR = 0.69, P = 3.01 × 10^-7), and older age (AAs: OR = 2.44, P = 1.41 × 10^-12; EAs: OR = 2.00, P = 5.74 × 10^-9) as the strongest independent predictors of opioid cessation in both AAs and EAs. Attending self-help groups for OUD was also an independent predictor (P < 0.05) in both population groups, while less gambling severity (OR = 0.80, P = 3.32 × 10^-2) was specific to AAs and post-traumatic stress disorder recovery (OR = 1.93, P = 7.88 × 10^-5), recent antisocial behaviors (OR = 0.64, P = 2.69 × 10^-3), and atheism (OR = 1.45, P = 1.34 × 10^-2) were specific to EAs.
Factors related to drug use comprised about half of the significant independent predictors in both AAs and EAs, with other predictors related to non-drug use behaviors, psychiatric disorders, overall health, and demographics. CONCLUSIONS: These proof-of-concept findings provide avenues for hypothesis-driven analysis, and will lead to further research on strategies to improve OUD management in EAs and AAs.

6.
Analyst ; 143(24): 5935-5939, 2018 Dec 03.
Article in English | MEDLINE | ID: mdl-30406772

ABSTRACT

This paper reviews methods to arrive at optimum decision tree or label tree structures to analyze large spectral histopathology (SHP) datasets. Supervised methods of analysis can utilize either sequential or (flat) multi-classifiers depending on the variance in the data, and on the number of spectral classes to be distinguished. For a small number of spectral classes, multi-classifiers have been used in the past, but for the analysis of datasets containing large numbers (∼20) of disease or tissue types, mixed decision tree structures were found to be advantageous. In these mixed structures, discrimination into classes and subclasses is achieved via hierarchical decision/label tree structures.


Subject(s)
Decision Trees , Pathology/methods , Algorithms , Breast Neoplasms/classification , Humans , Lung Neoplasms/classification
7.
mSystems ; 3(5)2018.
Article in English | MEDLINE | ID: mdl-30417106

ABSTRACT

Microbes affect each other's growth in multiple, often elusive, ways. The ensuing interdependencies form complex networks, believed to reflect taxonomic composition as well as community-level functional properties and dynamics. The elucidation of these networks is often pursued by measuring pairwise interactions in coculture experiments. However, the combinatorial complexity precludes an exhaustive experimental analysis of pairwise interactions, even for moderately sized microbial communities. Here, we used a machine learning random forest approach to address this challenge. In particular, we show how partial knowledge of a microbial interaction network, combined with trait-level representations of individual microbial species, can provide accurate inference of missing edges in the network and putative mechanisms underlying the interactions. We applied our algorithm to three case studies: an experimentally mapped network of interactions between auxotrophic Escherichia coli strains, a community of soil microbes, and a large in silico network of metabolic interdependencies between 100 human gut-associated bacteria. For this last case, 5% of the network was sufficient to predict the remaining 95% with 80% accuracy, and the mechanistic hypotheses produced by the algorithm accurately reflected known metabolic exchanges. Our approach, broadly applicable to any microbial or other ecological network, may drive the discovery of new interactions and new molecular mechanisms, both for therapeutic interventions involving natural communities and for the rational design of synthetic consortia. IMPORTANCE Different organisms in a microbial community may drastically affect each other's growth phenotypes, significantly affecting the community dynamics, with important implications for human and environmental health. 
Novel culturing methods and the decreasing costs of sequencing will gradually enable high-throughput measurements of pairwise interactions in systematic coculturing studies. However, a thorough characterization of all interactions that occur within a microbial community is greatly limited both by the combinatorial complexity of possible assortments and by the limited biological insight that interaction measurements typically provide without laborious specific follow-ups. Here, we show how a simple and flexible formal representation of microbial pairs can be used for the classification of interactions via machine learning. The approach we propose predicts with high accuracy the outcome of yet-to-be performed experiments and generates testable hypotheses about the mechanisms of specific interactions.

8.
Sensors (Basel) ; 16(9)2016 Sep 14.
Article in English | MEDLINE | ID: mdl-27649187

ABSTRACT

We still do not know how the brain and its computations are affected by nerve cell deaths and their compensatory learning processes, as these develop in neurodegenerative diseases (ND). Compensatory learning processes are ND symptoms usually observed at a point when the disease has already affected large parts of the brain. We can register symptoms of ND such as motor and/or mental disorders (dementias) and even provide symptomatic relief, though the structural effects of these are in most cases not yet understood. It is very important to obtain early diagnosis, which can provide several years in which we can monitor and partly compensate for the disease's symptoms, with the help of various therapies. In the case of Parkinson's disease (PD), in addition to classical neurological tests, measurements of eye movements are diagnostic. We have performed measurements of latency, amplitude, and duration in reflexive saccades (RS) of PD patients. We have compared the results of our measurement-based diagnoses with standard neurological ones. The purpose of our work was to classify how condition attributes predict the neurologist's diagnosis. For n = 10 patients, the patient age and parameters based on RS gave a global accuracy in predictions of neurological symptoms in individual patients of about 80%. Further, by adding three attributes partly related to patient 'well-being' scores, our prediction accuracies increased to 90%. Our predictive algorithms use rough set theory, which we have compared with other classifiers such as Naïve Bayes, Decision Trees/Tables, and Random Forests (implemented in KNIME/WEKA). We have demonstrated that RS are powerful biomarkers for assessment of symptom progression in PD.


Subject(s)
Machine Learning , Parkinson Disease/diagnosis , Algorithms , Female , Humans , Male , Middle Aged , Saccades/physiology
9.
Biomed Res Int ; 2015: 467514, 2015.
Article in English | MEDLINE | ID: mdl-25949998

ABSTRACT

New data sources for the analysis of cancer data are rapidly supplementing the large number of gene-expression markers used for current methods of analysis. Significant among these new sources are copy number variation (CNV) datasets, which typically enumerate several hundred thousand CNVs distributed throughout the genome. Several useful algorithms allow systems-level analyses of such datasets. However, these rich data sources have not yet been analyzed as deeply as gene-expression data. To address this issue, the extensive toolsets used for analyzing expression data in cancerous and noncancerous tissue (e.g., gene set enrichment analysis and phenotype prediction) could be redirected to extract a great deal of predictive information from CNV data, in particular those derived from cancers. Here we present a software package capable of preprocessing standard Agilent copy number datasets into a form to which essentially all expression analysis tools can be applied. We illustrate the use of this toolset in predicting the survival time of patients with ovarian cancer or glioblastoma multiforme and also provide an analysis of gene- and pathway-level deletions in these two types of cancer.


Subject(s)
DNA Copy Number Variations/genetics , Databases, Genetic , Glioblastoma/genetics , Ovarian Neoplasms/genetics , Software , Algorithms , Datasets as Topic , Female , Genome, Human , Humans
11.
Lab Invest ; 95(4): 406-21, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25664390

ABSTRACT

We report results of a study utilizing a novel tissue classification method, based on label-free spectral techniques, for the classification of lung cancer histopathological samples on a tissue microarray. The spectral diagnostic method allows reproducible and objective classification of unstained tissue sections. This is accomplished by acquiring infrared data sets containing thousands of spectra, each collected from tissue pixels ∼6 µm on edge; these pixel spectra contain an encoded snapshot of the entire biochemical composition of the pixel area. The hyperspectral data sets are subsequently decoded by methods of multivariate analysis that reveal changes in the biochemical composition between tissue types, and between various stages and states of disease. In this study, a detailed comparison between classical and spectral histopathology is presented, suggesting that spectral histopathology can achieve levels of diagnostic accuracy that are comparable to those of multipanel immunohistochemistry.


Subject(s)
Histological Techniques/methods , Lung Neoplasms/classification , Lung Neoplasms/pathology , Spectrophotometry, Infrared/methods , Tissue Array Analysis/methods , Humans , Multivariate Analysis
12.
Analyst ; 140(7): 2449-64, 2015 Apr 07.
Article in English | MEDLINE | ID: mdl-25664623

ABSTRACT

We report results on a statistical analysis of an infrared spectral dataset comprising a total of 388 lung biopsies from 374 patients. The method of correlating classical and spectral results and analyzing the resulting data has been referred to as spectral histopathology (SHP) in the past. Here, we show that standard bio-statistical procedures, such as strict separation of training and blinded test sets, result in a balanced accuracy of better than 95% for the distinction of normal, necrotic and cancerous tissues, and better than 90% balanced accuracy for the classification of small cell, squamous cell and adenocarcinomas. Preliminary results indicate that further sub-classification of adenocarcinomas should be feasible with similar accuracy once sufficiently large datasets have been collected.
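
Balanced accuracy, the metric quoted above, is the unweighted mean of per-class recall, so a large class cannot mask poor performance on a small one. The sketch below uses made-up labels purely for illustration; it does not reproduce the study's data or pipeline, and the function name is a choice made here.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical blinded test set: 4 normal, 2 necrotic, 4 cancerous samples.
y_true = ["normal"] * 4 + ["necrotic"] * 2 + ["cancer"] * 4
y_pred = (["normal"] * 4
          + ["necrotic", "cancer"]
          + ["cancer", "cancer", "cancer", "normal"])

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {score:.2f}, plain accuracy = {plain:.2f}")
```

Here the per-class recalls are 1.0, 0.5, and 0.75, so the balanced accuracy (0.75) is lower than the plain accuracy (0.80) because the small necrotic class is weighted equally.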


Subject(s)
Data Interpretation, Statistical , Lung Neoplasms/diagnosis , Lung Neoplasms/pathology , Algorithms , Artificial Intelligence , Humans , Spectrophotometry, Infrared
13.
Artif Intell Med ; 62(1): 23-31, 2014 Sep.
Article in English | MEDLINE | ID: mdl-24997860

ABSTRACT

OBJECTIVE: Although numerous studies related to cancer survival have been published, increasing the prediction accuracy of survival classes still remains a challenge. Integration of different data sets, such as microRNA (miRNA) and mRNA, might increase the accuracy of survival class prediction. Therefore, we suggested a machine learning (ML) approach to integrate different data sets, and developed a novel method based on feature selection with Cox proportional hazard regression model (FSCOX) to improve the prediction of cancer survival time. METHODS: FSCOX provides us with intermediate survival information, which is usually discarded when separating survival into 2 groups (short- and long-term), and allows us to perform survival analysis. We used an ML-based protocol for feature selection, integrating information from miRNA and mRNA expression profiles at the feature level. To predict survival phenotypes, we used the following classifiers: first, existing ML methods, support vector machine (SVM) and random forest (RF); second, a new median-based classifier using FSCOX (FSCOX_median); and third, an SVM classifier using FSCOX (FSCOX_SVM). We compared these methods using 3 types of cancer tissue data sets: (i) miRNA expression, (ii) mRNA expression, and (iii) combined miRNA and mRNA expression. The latter data set included features selected either from the combined miRNA/mRNA profile or independently from miRNA and mRNA profiles (IFS). RESULTS: In the ovarian data set, the accuracy of survival classification using the combined miRNA/mRNA profiles with IFS was 75% using RF, 86.36% using SVM, 84.09% using FSCOX_median, and 88.64% using FSCOX_SVM with a balanced 22 short-term and 22 long-term survivor data set. These accuracies are higher than those using miRNA alone (70.45%, RF; 75%, SVM; 75%, FSCOX_median; and 75%, FSCOX_SVM) or mRNA alone (65.91%, RF; 63.64%, SVM; 72.73%, FSCOX_median; and 70.45%, FSCOX_SVM).
Similarly, in the glioblastoma multiforme data, the accuracy of miRNA/mRNA using IFS was 75.51% (RF), 87.76% (SVM), 85.71% (FSCOX_median), and 85.71% (FSCOX_SVM). These results are higher than the results of using miRNA expression and mRNA expression alone. In addition, we predict 16 hsa-miR-23b and hsa-miR-27b target genes in ovarian cancer data sets, obtained by SVM-based feature selection through integration of sequence information and gene expression profiles. CONCLUSION: Among the approaches used, the integrated miRNA and mRNA data set yielded better results than the individual data sets. The best performance was achieved using the FSCOX_SVM method with independent feature selection, which uses intermediate survival information between short-term and long-term survival time and the combination of the 2 different data sets. The results obtained using the combined data set suggest that there are some strong interactions between miRNA and mRNA features that are not detectable in the individual analyses.


Subject(s)
Artificial Intelligence , Brain Neoplasms/mortality , Glioblastoma/mortality , Ovarian Neoplasms/mortality , Algorithms , Brain Neoplasms/genetics , Brain Neoplasms/metabolism , Datasets as Topic , Female , Glioblastoma/genetics , Glioblastoma/metabolism , Humans , MicroRNAs/metabolism , Ovarian Neoplasms/genetics , Ovarian Neoplasms/metabolism , Proportional Hazards Models , RNA, Messenger/metabolism , Sensitivity and Specificity , Survival Rate
15.
ScientificWorldJournal ; 2013: 769639, 2013.
Article in English | MEDLINE | ID: mdl-23431259

ABSTRACT

The volumes of current patient data as well as their complexity make clinical decision making more challenging than ever for physicians and other care givers. This situation calls for the use of biomedical informatics methods to process data and form recommendations and/or predictions to assist such decision makers. The design, implementation, and use of biomedical informatics systems in the form of computer-aided decision support have become essential and widely used over the last two decades. This paper provides a brief review of such systems, their application protocols and methodologies, and the future challenges and directions they suggest.


Subject(s)
Decision Making, Computer-Assisted , Decision Support Systems, Clinical , Medical Informatics/methods , Artificial Intelligence , Biomedical Technology , Computational Biology/methods , Computational Biology/trends , Data Collection , Decision Support Techniques , Dentistry/methods , Emergency Medicine , Humans , Image Processing, Computer-Assisted , Intensive Care Units , Neoplasms/therapy , Radiology/methods
16.
Methods Mol Biol ; 939: 233-51, 2013.
Article in English | MEDLINE | ID: mdl-23192550

ABSTRACT

A host of data on genetic variation from the Human Genome and International HapMap projects, and advances in high-throughput genotyping technologies, have made genome-wide association (GWA) studies technically feasible. GWA studies help in the discovery and quantification of the genetic components of disease risks, many of which have not been unveiled before and have opened a new avenue to understanding disease, treatment, and prevention. This chapter presents an overview of GWA, an important tool for discovering regions of the genome that harbor common genetic variants to confer susceptibility for various diseases or health outcomes in the post-Human Genome Project era. A tutorial on how to conduct a GWA study and some practical challenges specifically related to the GWA design is presented, followed by a detailed GWA case study involving the identification of loci associated with glioma as an example and an illustration of current technologies.


Subject(s)
Computational Biology/methods , Genome, Human , Genome-Wide Association Study/methods , Genetic Loci , Genetic Markers , Genetic Predisposition to Disease , Genotyping Techniques , HapMap Project , Humans , Linkage Disequilibrium , Meta-Analysis as Topic , Polymorphism, Single Nucleotide , Reproducibility of Results
17.
Biol Direct ; 7: 21, 2012 Jul 03.
Article in English | MEDLINE | ID: mdl-22759382

ABSTRACT

BACKGROUND: Molecular markers based on gene expression profiles have been used in experimental and clinical settings to distinguish cancerous tumors in stage, grade, survival time, metastasis, and drug sensitivity. However, most significant gene markers are unstable (not reproducible) among data sets. We introduce a standardized method for representing cancer markers as 2-level hierarchical feature vectors, with a basic gene level as well as a second level of (more stable) pathway markers, for the purpose of discriminating cancer subtypes. This extends standard gene expression arrays with new pathway-level activation features obtained directly from off-the-shelf gene set enrichment algorithms such as GSEA. Such so-called pathway-based expression arrays are significantly more reproducible across datasets. Such reproducibility will be important for clinical usefulness of genomic markers, and augment currently accepted cancer classification protocols. RESULTS: The present method produced more stable (reproducible) pathway-based markers for discriminating breast cancer metastasis and ovarian cancer survival time. Between two datasets for breast cancer metastasis, the intersection of standard significant gene biomarkers totaled 7.47% of selected genes, compared to 17.65% using pathway-based markers; the corresponding percentages for ovarian cancer datasets were 20.65% and 33.33% respectively. Three pathways, consisting of Type_1_diabetes mellitus, Cytokine-cytokine_receptor_interaction and Hedgehog_signaling (all previously implicated in cancer), are enriched in both the ovarian long survival and breast non-metastasis groups. In addition, integrating pathway and gene information, we identified five (ID4, ANXA4, CXCL9, MYLK, FBXL7) and six (SQLE, E2F1, PTTG1, TSTA3, BUB1B, MAD2L1) known cancer genes significant for ovarian and breast cancer respectively. 
CONCLUSIONS: Standardizing the analysis of genomic data in the process of cancer staging, classification and analysis is important as it has implications for both pre-clinical as well as clinical studies. The paradigm of diagnosis and prediction using pathway-based biomarkers as features can be an important part of the process of biomarker-based cancer analysis, and the resulting canonical (clinically reproducible) biomarkers can be important in standardizing genomic data. We expect that identification of such canonical biomarkers will improve clinical utility of high-throughput datasets for diagnostic and prognostic applications.


Subject(s)
Biomarkers, Tumor/analysis , Breast Neoplasms/diagnosis , Neoplasm Staging/methods , Ovarian Neoplasms/diagnosis , Signal Transduction , Algorithms , Biomarkers, Tumor/metabolism , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Diabetes Mellitus, Type 1/pathology , Female , Genes, Neoplasm , Hedgehog Proteins/metabolism , Humans , Neoplasm Grading , Neoplasm Metastasis/diagnosis , Ovarian Neoplasms/genetics , Ovarian Neoplasms/metabolism , Prognosis , RNA, Messenger/genetics , RNA, Messenger/metabolism , Receptors, Cytokine/metabolism , Reproducibility of Results , Survival Analysis , Time Factors , Transcriptome
18.
Lab Invest ; 92(9): 1358-73, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22751349

ABSTRACT

We report results of a study utilizing a recently developed tissue diagnostic method, based on label-free spectral techniques, for the classification of lung cancer histopathological samples from a tissue microarray. The spectral diagnostic method allows reproducible and objective diagnosis of unstained tissue sections. This is accomplished by acquiring infrared hyperspectral data sets containing thousands of spectra, each collected from tissue pixels about 6 µm on edge; these pixel spectra contain an encoded snapshot of the entire biochemical composition of the pixel area. The hyperspectral data sets are subsequently decoded by methods of multivariate analysis, which reveal changes in the biochemical composition between tissue types, and between various stages and states of disease. In this study, a detailed comparison between classical and spectral histopathology (SHP) is presented, which suggests SHP can achieve levels of diagnostic accuracy that are comparable to those of multi-panel immunohistochemistry.


Subject(s)
Lung Neoplasms/diagnosis , Spectrophotometry, Infrared/methods , Humans , Lung Neoplasms/classification
19.
BMC Bioinformatics ; 12: 375, 2011 Sep 23.
Article in English | MEDLINE | ID: mdl-21939564

ABSTRACT

BACKGROUND: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers. RESULTS: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. 
In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets. CONCLUSIONS: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.
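
The pair-ranking step behind top scoring pair classifiers can be sketched as follows. A pair of genes (i, j) is scored by how much the frequency of the ordering X_i < X_j differs between the two classes. This is an illustrative sketch of that commonly used score, not the authors' implementation: the function name `tsp_scores` and the toy expression values are assumptions, and the selection of k disjoint pairs and the downstream SVM are omitted.

```python
from itertools import combinations


def tsp_scores(samples, labels):
    """Rank gene pairs by |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.

    samples: list of per-sample expression lists (same gene order in each).
    labels:  list of 0/1 class labels, one per sample.
    Returns a list of (score, (i, j)) sorted from highest to lowest score.
    """
    n_genes = len(samples[0])
    groups = {
        0: [s for s, y in zip(samples, labels) if y == 0],
        1: [s for s, y in zip(samples, labels) if y == 1],
    }
    scored = []
    for i, j in combinations(range(n_genes), 2):
        freq = {c: sum(s[i] < s[j] for s in g) / len(g)
                for c, g in groups.items()}
        scored.append((abs(freq[0] - freq[1]), (i, j)))
    scored.sort(key=lambda item: -item[0])  # stable sort keeps pair order on ties
    return scored


# Toy data: three genes, three samples per class (values are made up).
samples = [
    [1, 2, 5], [2, 3, 1], [1, 4, 2],   # class 0
    [3, 1, 2], [5, 2, 4], [4, 1, 0],   # class 1
]
labels = [0, 0, 0, 1, 1, 1]

ranked = tsp_scores(samples, labels)
for score, pair in ranked:
    print(pair, round(score, 3))
```

Because the score depends only on the relative ordering within each sample, it is unaffected by monotone per-sample normalization. The top-ranked pairs could then be passed as features to another classifier, which is the hybrid idea the abstract describes.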


Subject(s)
Algorithms , Artificial Intelligence , Neoplasms/drug therapy , Neoplasms/genetics , Humans , Neoplasms/metabolism , Neoplasms/radiotherapy , Prognosis , Software , Support Vector Machine
20.
BMC Med Genomics ; 4: 63, 2011 Aug 09.
Article in English | MEDLINE | ID: mdl-21827660

ABSTRACT

BACKGROUND: Glioblastoma multiforme (GBM) tends to occur between the ages of 45 and 70. This relatively early onset and its poor prognosis make the impact of GBM on public health far greater than would be suggested by its relatively low frequency. Tissue and blood samples have now been collected for a number of populations, and predisposing alleles have been sought by several different genome-wide association (GWA) studies. The Cancer Genome Atlas (TCGA) at NIH has also collected a considerable amount of data. Because of the low concordance between the results obtained using different populations, only 14 predisposing single nucleotide polymorphism (SNP) candidates in five genomic regions have been replicated in two or more studies. The purpose of this paper is to present an improved approach to biomarker identification. METHODS: Association analysis was performed with control of population stratifications using the EIGENSTRAT package, under the null hypothesis of "no association between GBM and control SNP genotypes," based on an additive inheritance model. Genes that are strongly correlated with identified SNPs were determined by linkage disequilibrium (LD) or expression quantitative trait locus (eQTL) analysis. A new approach that combines meta-analysis and pathway enrichment analysis identified additional genes. RESULTS: (i) A meta-analysis of SNP data from TCGA and the Adult Glioma Study identifies 12 predisposing SNP candidates, seven of which are reported for the first time. These SNPs fall in five genomic regions (5p15.33, 9p21.3, 1p21.2, 3q26.2 and 7p15.3), three of which have not been previously reported. (ii) 25 genes are strongly correlated with these 12 SNPs, eight of which are known to be cancer-associated. (iii) The relative risk for GBM is highest for risk allele combinations on chromosomes 1 and 9. (iv) A combined meta-analysis/pathway analysis identified an additional four genes. 
All of these have been identified as cancer-related, but have not been previously associated with glioma. (v) Some SNPs that do not occur reproducibly across populations are in reproducible (invariant) pathways, suggesting that they affect the same biological process, and that population discordance can be partially resolved by evaluating processes rather than genes. CONCLUSION: We have uncovered 29 glioma-associated gene candidates, 12 of which are known to be cancer-related (p = 1.4 × 10^-6), providing additional statistical support for the relevance of the new candidates. This additional information on risk loci is potentially important for identifying Caucasian individuals at risk for glioma, and for assessing relative risk.
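
Under an additive inheritance model, each genotype is typically coded as the number of copies of a chosen allele (0, 1, or 2) before association testing. The sketch below shows only that coding step with made-up genotypes; the function name, the allele letters, and the two-character genotype format are assumptions for illustration, and no EIGENSTRAT functionality is reproduced here.

```python
def additive_code(genotype: str, risk_allele: str) -> int:
    """Count copies of risk_allele in a two-letter genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError("expected a diploid genotype of two alleles")
    return sum(1 for allele in genotype if allele == risk_allele)


# Hypothetical genotypes at a single SNP, with 'G' taken as the risk allele.
case_genotypes = ["GG", "AG", "GG", "AG"]
control_genotypes = ["AA", "AG", "AA", "AG"]

case_dosage = [additive_code(g, "G") for g in case_genotypes]
control_dosage = [additive_code(g, "G") for g in control_genotypes]

print("cases:   ", case_dosage, "mean =", sum(case_dosage) / len(case_dosage))
print("controls:", control_dosage, "mean =", sum(control_dosage) / len(control_dosage))
```

An association test would then ask whether the allele dosage differs between cases and controls more than expected under the null hypothesis of no association.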


Subject(s)
Chromosomes, Human, Pair 1/genetics , Chromosomes, Human, Pair 9/genetics , Glioblastoma/genetics , Aged , Genetic Predisposition to Disease , Genotype , Humans , Linkage Disequilibrium , Middle Aged , Polymorphism, Single Nucleotide , Quantitative Trait Loci