Results 1 - 9 of 9
1.
PLoS One ; 16(5): e0251902, 2021.
Article in English | MEDLINE | ID: mdl-34019571

ABSTRACT

The volume of Amharic digital documents has grown rapidly in recent years. As a result, automatic document categorization is essential. In this paper, we present a novel dimension reduction approach for improving classification accuracy by combining feature selection and feature extraction. The new dimension reduction method uses Information Gain (IG), Chi-square test (CHI), and Document Frequency (DF) to select important features and Principal Component Analysis (PCA) to refine the features that have been selected. We evaluate the proposed dimension reduction method with a dataset containing 9 news categories. Our experimental results verified that the proposed dimension reduction method outperforms other methods. Classification accuracy with the new dimension reduction is 92.60%, which is 13.48%, 16.51% and 10.19% higher than with IG, CHI, and DF respectively. Further work is required since classification accuracy still decreases as we reduce the feature size to save computational time.


Subject(s)
Data Mining/methods , Information Technology , Linguistics/statistics & numerical data , Multifactor Dimensionality Reduction/statistics & numerical data , Support Vector Machine , Datasets as Topic , Ethiopia , Humans , Language , Principal Component Analysis
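
The feature-selection step described in the abstract above can be illustrated with a minimal sketch. It computes only two of the three named scores, document frequency and the chi-square statistic of a term against one class, using the standard 2x2 contingency formula. The toy corpus, the function names, and the choice to rank by chi-square alone are assumptions made for illustration; the abstract does not say how the authors combine IG, CHI, and DF, and the PCA refinement step is not shown.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def chi_square(docs, labels, term, target):
    """Chi-square statistic for the 2x2 table
    (term present/absent) x (class is target/other)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, target class
        elif has_term:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus: four tokenized documents in two categories.
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["vote", "party", "team"],
    ["party", "election", "vote"],
]
labels = ["sport", "sport", "politics", "politics"]

df = document_frequency(docs)
scores = {term: chi_square(docs, labels, term, "sport") for term in df}
# Rank by descending chi-square, breaking ties alphabetically.
top_terms = sorted(scores, key=lambda t: (-scores[t], t))[:4]
```

In this toy corpus "team" has the highest document frequency but a low chi-square score, because it appears in both categories. That is one reason a method might combine several selection criteria rather than rely on a single one.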
2.
Nat Biomed Eng ; 5(6): 624-635, 2021 06.
Article in English | MEDLINE | ID: mdl-33139824

ABSTRACT

Dimensionality reduction is widely used in the visualization, compression, exploration and classification of data. Yet a generally applicable solution remains unavailable. Here, we report an accurate and broadly applicable data-driven algorithm for dimensionality reduction. The algorithm, which we named 'feature-augmented embedding machine' (FEM), first learns the structure of the data and the inherent characteristics of the data components (such as central tendency and dispersion), denoises the data, increases the separation of the components, and then projects the data onto a lower number of dimensions. We show that the technique is effective at revealing the underlying dominant trends in datasets of protein expression and single-cell RNA sequencing, computed tomography, electroencephalography and wearable physiological sensors.


Subject(s)
Algorithms , Biomedical Research/statistics & numerical data , Datasets as Topic , Multifactor Dimensionality Reduction/statistics & numerical data , Electroencephalography/statistics & numerical data , Humans , Protein Biosynthesis , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Tomography, X-Ray Computed/statistics & numerical data
3.
Biomed Res Int ; 2019: 4578983, 2019.
Article in English | MEDLINE | ID: mdl-31380425

ABSTRACT

In complex diseases such as hypertension, diabetes, and autism, deleterious phenotypes are unlikely to be due to the effects of single genes; they more likely arise from gene-gene interactions (GGIs), which are widely analyzed by multifactor dimensionality reduction (MDR). Early MDR methods mainly focused on binary traits. More recently, several extensions of MDR have been developed for analyzing various traits such as quantitative traits and survival times. Newer technologies, such as genome-wide association studies (GWAS), have now been developed for assessing multiple traits, to simultaneously identify genetic variants associated with various pathological phenotypes. It has also been well demonstrated that analyzing multiple traits has several advantages over single trait analysis. While there remains a need to find GGIs for multiple traits, such studies have become more difficult, due to a lack of novel methods and software. Herein, we propose a novel multi-CMDR method, by combining fuzzy clustering and MDR, to find GGIs for multiple traits. Multi-CMDR showed similar power to existing methods, when phenotypes followed bivariate normal distributions, and showed better power than others for skewed distributions. The validity of multi-CMDR was confirmed by analyzing real-life Korean GWAS data.


Subject(s)
Epistasis, Genetic/genetics , Genome-Wide Association Study/statistics & numerical data , Multifactor Dimensionality Reduction/statistics & numerical data , Quantitative Trait Loci/genetics , Algorithms , Cluster Analysis , Humans , Polymorphism, Single Nucleotide , Software
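
As background to the abstract above, the following is a minimal sketch of the classic binary-trait MDR step that the multi-CMDR method extends. It is not multi-CMDR itself: the fuzzy-clustering stage for multiple traits is not shown. Each multilocus genotype cell is labelled high risk when its case:control ratio exceeds the overall ratio, and the resulting one-dimensional rule is scored by balanced accuracy. The genotype coding, the data, and the function names are hypothetical.

```python
from collections import defaultdict


def mdr_high_risk_cells(genotypes, status):
    """Return the genotype cells whose case:control ratio exceeds
    the overall case:control ratio (classic binary-trait MDR)."""
    cases = defaultdict(int)
    controls = defaultdict(int)
    for cell, is_case in zip(genotypes, status):
        if is_case:
            cases[cell] += 1
        else:
            controls[cell] += 1
    n_cases = sum(cases.values())
    n_controls = sum(controls.values())
    high_risk = set()
    for cell in set(cases) | set(controls):
        # Cross-multiply so that cells with zero controls need no division.
        if cases[cell] * n_controls > controls[cell] * n_cases:
            high_risk.add(cell)
    return high_risk


def balanced_accuracy(genotypes, status, high_risk):
    """Mean of sensitivity and specificity of the high/low-risk rule."""
    tp = fn = tn = fp = 0
    for cell, is_case in zip(genotypes, status):
        predicted_case = cell in high_risk
        if is_case and predicted_case:
            tp += 1
        elif is_case:
            fn += 1
        elif predicted_case:
            fp += 1
        else:
            tn += 1
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Hypothetical data: (SNP1, SNP2) genotype pairs and case (1) / control (0).
genotypes = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 1), (0, 1), (0, 1), (1, 0)]
status = [0, 0, 1, 1, 0, 1, 0, 0]

high_risk = mdr_high_risk_cells(genotypes, status)
accuracy = balanced_accuracy(genotypes, status, high_risk)
```

In practice MDR evaluates this rule under cross-validation across many candidate SNP combinations; the sketch scores a single pair on the data used to build the rule, so the accuracy it reports is optimistic.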
4.
Article in English | MEDLINE | ID: mdl-29483354

ABSTRACT

Models and constructs of individual differences are numerous and diverse. But detecting commonalities, differences and interrelations is hindered by the common abstract terms (e.g. 'personality', 'temperament', 'traits') that do not reveal the particular phenomena denoted. This article applies a transdisciplinary paradigm for research on individuals that builds on complexity theory and epistemological complementarity. Its philosophical, metatheoretical and methodological frameworks provide concepts to differentiate various kinds of phenomena (e.g. physiology, behaviour, psyche, language). They are used to scrutinize the field's basic concepts and to elaborate methodological foundations for taxonomizing individual variations in humans and other species. This guide to developing comprehensive and representative models explores the decisions taxonomists must make about which individual variations to include, which to retain and how to model them. Selection and reduction approaches from various disciplines are classified by their underlying rationales, pinpointing possibilities and limitations. Analyses highlight that individuals' complexity cannot be captured by one universal model. Instead, multiple models phenotypically taxonomizing different kinds of variability in different kinds of phenomena are needed to explore their causal and functional interrelations and ontogenetic development that are then modelled in integrative and explanatory taxonomies. This research agenda requires the expertise of many disciplines and is inherently transdisciplinary. This article is part of the theme issue 'Diverse perspectives on diversity: multi-disciplinary approaches to taxonomies of individual differences'.


Subject(s)
Emotions/physiology , Individuality , Models, Psychological , Psychophysiology/methods , Temperament/physiology , Brain/physiology , Humans , Interdisciplinary Research/methods , Language , Multifactor Dimensionality Reduction/statistics & numerical data , Psychomotor Performance/physiology , Terminology as Topic
5.
Nucleic Acids Res ; 45(17): e156, 2017 Sep 29.
Article in English | MEDLINE | ID: mdl-28973464

ABSTRACT

While only recently developed, the ability to profile expression data in single cells (scRNA-Seq) has already led to several important studies and findings. However, this technology has also raised several new computational challenges. These include questions about the best methods for clustering scRNA-Seq data, how to identify unique groups of cells in such experiments, and how to determine the state or function of specific cells based on their expression profile. To address these issues, we develop and test a method based on neural networks (NN) for the analysis and retrieval of single cell RNA-Seq data. We tested various NN architectures, some of which incorporate prior biological knowledge, and used these to obtain a reduced dimension representation of the single cell expression data. We show that the NN method improves upon prior methods in both the ability to correctly group cells in experiments not used in the training and the ability to correctly infer cell type or state by querying a database of tens of thousands of single cell profiles. Such database queries (which can be performed using our web server) will enable researchers to better characterize cells when analyzing heterogeneous scRNA-Seq samples.


Subject(s)
Gene Expression Regulation , Multifactor Dimensionality Reduction/statistics & numerical data , Neural Networks, Computer , RNA/genetics , Single-Cell Analysis/methods , Software , Cluster Analysis , Computational Biology/methods , Databases, Genetic , Datasets as Topic , Gene Expression Profiling , Humans , Protein Interaction Mapping , RNA/metabolism , Sequence Analysis, RNA
6.
PLoS One ; 11(4): e0154222, 2016.
Article in English | MEDLINE | ID: mdl-27110937

ABSTRACT

UK Biobank includes 502,649 middle- and older-aged adults from the general population who have undergone detailed phenotypic assessment. The majority of participants completed tests of cognitive functioning, and on average four years later a sub-group of N = 20,346 participants repeated most of the assessment. These measures will be used in a range of future studies of health outcomes in this cohort. The format and content of the cognitive tasks were partly novel. The aim of the present study was to validate and characterize the cognitive data: to describe the inter-correlational structure of the cognitive variables at baseline assessment, and the degree of stability in scores across longitudinal assessment. Baseline cognitive data were used to examine the inter-correlational/factor-structure, using principal components analysis (PCA). We also assessed the degree of stability in cognitive scores in the subsample of participants with repeat data. The different tests of cognitive ability showed significant raw inter-correlations in the expected directions. PCA suggested a one-factor solution (eigenvalue = 1.60), which accounted for around 40% of the variance. Scores showed varying levels of stability across time-points (intraclass correlation range = 0.16 to 0.65). UK Biobank cognitive data has the potential to be a significant resource for researchers looking to investigate predictors and modifiers of cognitive abilities and associated health outcomes in the general population.


Subject(s)
Cognition Disorders/diagnosis , Cognition Disorders/epidemiology , Cognition/physiology , Multifactor Dimensionality Reduction/statistics & numerical data , Adult , Aged , Biological Specimen Banks , Cognition Disorders/physiopathology , Female , Humans , Longitudinal Studies , Male , Middle Aged , Principal Component Analysis , Psychological Tests , United Kingdom/epidemiology
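
The relation between the eigenvalue and the share of variance reported in the abstract above can be checked with a line of arithmetic. When PCA is run on the correlation matrix of p standardized variables, the total variance is p, so a component with eigenvalue lambda explains lambda / p of it. The abstract does not state how many cognitive variables entered the PCA; the value 4 below is an assumption chosen only because it is consistent with the reported figures (1.60 and about 40%).

```python
def variance_explained(eigenvalue, n_variables):
    """Share of total variance explained by one principal component
    when PCA is run on a correlation matrix (total variance = n_variables)."""
    return eigenvalue / n_variables


# n_variables=4 is an assumption, not a figure taken from the abstract.
share = variance_explained(1.60, 4)
```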
7.
J Biosci ; 40(4): 721-30, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26564974

ABSTRACT

Reduction of dimensionality has emerged as a routine process in modelling complex biological systems. A large number of feature selection techniques have been reported in the literature to improve model performance in terms of accuracy and speed. In the present article an unsupervised feature selection technique is proposed, using maximum information compression index as the dissimilarity measure and the well-known density-based cluster identification technique DBSCAN for identifying the largest natural group of dissimilar features. The algorithm is fast and less sensitive to the user-supplied parameters. Moreover, the method automatically determines the required number of features and identifies them. We used the proposed method for reducing dimensionality of a number of benchmark data sets of varying sizes. Its performance was also extensively compared with some other well-known feature selection methods.


Subject(s)
Algorithms , Computational Biology/statistics & numerical data , Multifactor Dimensionality Reduction/statistics & numerical data , Multigene Family , Arrhythmias, Cardiac/genetics , Cluster Analysis , Datasets as Topic , Gene Expression , Gene Expression Profiling , Humans , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis , Parkinson Disease/genetics
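
The dissimilarity measure named in the abstract above can be sketched as follows. The maximum information compression index between two features is taken here to be the smallest eigenvalue of their 2x2 covariance matrix, which is the usual definition of this index; the abstract itself does not spell the formula out. The value is zero when the two features are linearly dependent and grows as they become less redundant. The function name and the data are hypothetical, and the DBSCAN clustering step that the authors apply on top of this measure is not shown.

```python
import math


def mici(x, y):
    """Smallest eigenvalue of the 2x2 (population) covariance matrix of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    trace = var_x + var_y
    det = var_x * var_y - cov ** 2
    # Clamp tiny negative values caused by floating-point rounding.
    disc = max(trace ** 2 - 4.0 * det, 0.0)
    return 0.5 * (trace - math.sqrt(disc))


# Hypothetical features: the first pair is perfectly linearly dependent,
# the second pair is uncorrelated with unit variance.
dependent = mici([1, 2, 3, 4], [2, 4, 6, 8])
unrelated = mici([1, -1, 1, -1], [1, 1, -1, -1])
```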
8.
PLoS One ; 8(9): e73289, 2013.
Article in English | MEDLINE | ID: mdl-24058466

ABSTRACT

In contrast to most other sensory modalities, the basic perceptual dimensions of olfaction remain unclear. Here, we use non-negative matrix factorization (NMF), a dimensionality reduction technique, to uncover structure in a panel of odor profiles, with each odor defined as a point in multi-dimensional descriptor space. The properties of NMF are favorable for the analysis of such lexical and perceptual data, and lead to a high-dimensional account of odor space. We further provide evidence that odor dimensions apply categorically. That is, odor space is not occupied homogeneously, but rather in a discrete and intrinsically clustered manner. We discuss the potential implications of these results for the neural coding of odors, as well as for developing classifiers on larger datasets that may be useful for predicting perceptual qualities from chemical structures.


Subject(s)
Multifactor Dimensionality Reduction/statistics & numerical data , Odorants/analysis , Olfactory Perception/physiology , Smell/physiology , Algorithms , Cluster Analysis , Humans , Sensory Thresholds
9.
Asian Pac J Cancer Prev ; 13(5): 2031-7, 2012.
Article in English | MEDLINE | ID: mdl-22901167

ABSTRACT

BACKGROUND: Analysis of gene-gene and gene-environment interactions for complex multifactorial human disease faces challenges regarding statistical methodology. One major difficulty is partly due to the limitations of parametric statistical methods for detecting gene effects that depend solely or partially on interactions with other genes or environmental exposures. In our previous case-control study in Chongqing, China, we found an increased risk of colorectal cancer in individuals carrying a novel homozygous TT at locus rs1329149 and a known homozygous AA at locus rs671. METHODS: In this study, we proposed a statistical method, crossover analysis in combination with a logistic regression model, to further analyze our data and focus on assessing gene-environment interactions for colorectal cancer. RESULTS: The results of the crossover analysis showed that there are possible multiplicative interactions between loci rs671 and rs1329149 with alcohol consumption. Multifactorial logistic regression analysis also validated that loci rs671 and rs1329149 both exhibited a multiplicative interaction with alcohol consumption. Moreover, we also found additive interactions between any pair of two factors (among the four risk factors: gene loci rs671, rs1329149, age and alcohol consumption) through the crossover analysis, which was not evident on logistic regression. CONCLUSIONS: In conclusion, the method based on crossover analysis-logistic regression is successful in assessing additive and multiplicative gene-environment interactions, and in revealing synergistic effects of gene loci rs671 and rs1329149 with alcohol consumption in the pathogenesis and development of colorectal cancer.


Subject(s)
Colorectal Neoplasms/etiology , Gene-Environment Interaction , Genetic Predisposition to Disease , Multifactor Dimensionality Reduction/statistics & numerical data , Polymorphism, Single Nucleotide/genetics , Case-Control Studies , China , Cross-Over Studies , Female , Humans , Logistic Models , Male , Middle Aged , Prognosis , Risk Factors
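
The distinction between additive and multiplicative interaction drawn in the abstract above can be illustrated with a minimal sketch based on a 2x4 crossover table. Odds ratios for each exposure combination are computed against the doubly unexposed group. Additive interaction is then summarized by the relative excess risk due to interaction (RERI) and the attributable proportion (AP), and multiplicative interaction by the ratio of the joint odds ratio to the product of the single-exposure odds ratios. These are standard epidemiological measures and are not necessarily the exact statistics the authors report. All counts below are hypothetical and are not taken from the study.

```python
def odds_ratio(cases, controls, ref_cases, ref_controls):
    """Odds ratio of an exposure group relative to the reference group."""
    return (cases * ref_controls) / (controls * ref_cases)


def interaction_measures(or_gene, or_env, or_both):
    """Additive (RERI, AP) and multiplicative interaction measures."""
    reri = or_both - or_gene - or_env + 1.0   # > 0 suggests additive interaction
    ap = reri / or_both                       # share of joint effect due to interaction
    ratio = or_both / (or_gene * or_env)      # > 1 suggests multiplicative interaction
    return reri, ap, ratio


# Hypothetical 2x4 crossover table: (cases, controls) per exposure group.
neither = (20, 80)      # no risk genotype, no alcohol (reference)
gene_only = (30, 60)    # risk genotype only
env_only = (15, 40)     # alcohol only
both = (60, 40)         # risk genotype and alcohol

or_gene = odds_ratio(*gene_only, *neither)
or_env = odds_ratio(*env_only, *neither)
or_both = odds_ratio(*both, *neither)
reri, ap, ratio = interaction_measures(or_gene, or_env, or_both)
```

A logistic regression with a product term tests interaction on the multiplicative scale only, which is one reason an additive interaction can show up in a crossover analysis without being evident in the regression.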