Results 1 - 20 of 32
1.
IEEE Trans Cybern ; 54(5): 3211-3224, 2024 May.
Article in English | MEDLINE | ID: mdl-37134031

ABSTRACT

Software-defined networking (SDN) allows flexible and centralized control in cloud data centers. An elastic set of distributed SDN controllers is often required to provide sufficient yet cost-effective processing capacity. However, this introduces a new challenge: request dispatching among the controllers by SDN switches. It is essential to design a dispatching policy for each switch to guide the request distribution. Existing policies are designed under certain assumptions, including a single centralized agent, global network knowledge, and a fixed number of controllers, which often cannot be satisfied in practice. This article proposes MADRina, a Multiagent Deep Reinforcement Learning approach to request dispatching, to design policies with high dispatching adaptability and performance. First, we design a multiagent system to remove the reliance on a centralized agent with global network knowledge. Second, we propose a deep neural network-based adaptive policy that enables request dispatching over an elastic set of controllers. Third, we develop a new algorithm to train the adaptive policies in a multiagent context. We prototype MADRina and build a simulation tool to evaluate its performance using real-world network data and topology. The results show that MADRina can reduce response time by up to 30% compared with existing approaches.
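A minimal sketch of one plausible way such an adaptive policy can handle an elastic controller set (a shared per-controller scoring network with a softmax over whichever controllers currently exist); this is an assumption for illustration, not MADRina's actual architecture, and the feature dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class AdaptiveDispatchPolicy(nn.Module):
    def __init__(self, controller_feat_dim: int, hidden: int = 64):
        super().__init__()
        # A shared scorer applied to every controller, so the number of
        # controllers can grow or shrink without changing the architecture.
        self.scorer = nn.Sequential(
            nn.Linear(controller_feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, controller_feats: torch.Tensor) -> torch.Tensor:
        # controller_feats: (num_controllers, feat_dim); num_controllers may vary.
        scores = self.scorer(controller_feats).squeeze(-1)
        return torch.softmax(scores, dim=-1)  # dispatch probabilities

policy = AdaptiveDispatchPolicy(controller_feat_dim=4)
feats = torch.rand(3, 4)   # e.g. 3 controllers, 4 load/latency features each (toy values)
print(policy(feats))       # probability of dispatching the request to each controller
```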

2.
Article in English | MEDLINE | ID: mdl-37030879

ABSTRACT

Deep neural networks are envisaged for early disease diagnosis from medical images. However, in the early stage of a disease, the medical images of patients and healthy people show only subtle visual differences, so distinguishing medical images for early diagnosis belongs to the Fine-Grained Visual Classification (FGVC) task. Many recent works follow a standard FGVC learning paradigm: first locate the discriminative regions, then classify by fusing the information of these regions. This is still insufficient for medical images, because the shape and size of lesions are variable and the relationship between lesions and the background is complex. To address these problems, we propose a fine-grained lesion classification framework for early auxiliary diagnosis. We first locate and extract multiple lesions of different sizes and shapes from the original image and then fuse the lesion and background features with an attention mechanism. Experimental results on two real-world clinical datasets show that our model locates lesions accurately and achieves better classification performance.
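A rough sketch of attention-based fusion of a lesion feature vector with a background feature vector, in the spirit of the fusion step described above; the layer sizes and the exact form of the attention are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Produces one attention logit per source (lesion, background).
        self.attn = nn.Linear(2 * dim, 2)

    def forward(self, lesion_feat: torch.Tensor, background_feat: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.attn(torch.cat([lesion_feat, background_feat], dim=-1)), dim=-1)
        # Weighted sum of the two sources, weights learned per sample.
        return w[..., :1] * lesion_feat + w[..., 1:] * background_feat

fusion = AttentionFusion(dim=128)
fused = fusion(torch.rand(8, 128), torch.rand(8, 128))  # batch of 8 toy feature pairs
print(fused.shape)                                      # (8, 128) fused representation
```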

3.
Article in English | MEDLINE | ID: mdl-37028351

ABSTRACT

Multiview clustering algorithms have attracted intensive attention and achieved superior performance in various fields recently. Despite the great success of multiview clustering methods in realistic applications, we observe that most of them are difficult to apply to large-scale datasets because of their cubic complexity. Moreover, they usually use a two-stage scheme to obtain the discrete clustering labels, which inevitably causes a suboptimal solution. In light of this, an efficient and effective one-step multiview clustering (E2OMVC) method is proposed to directly obtain clustering indicators at a small time cost. Specifically, based on the anchor graphs, a smaller similarity graph for each view is constructed, from which low-dimensional latent features are generated to form the latent partition representation. By introducing a label discretization mechanism, the binary indicator matrix can be obtained directly from the unified partition representation, which is formed by fusing all latent partition representations from different views. In addition, by coupling the fusion of all latent information and the clustering task into a joint framework, the two processes can help each other and yield a better clustering result. Extensive experimental results demonstrate that the proposed method achieves performance comparable to or better than the state-of-the-art methods. The demo code of this work is publicly available at https://github.com/WangJun2023/EEOMVC.
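A simplified sketch of the anchor-graph idea behind this family of methods (per-view anchor similarity graph, low-dimensional embedding, fusion across views, then a single clustering step). It is illustrative only, with hypothetical parameters and toy data; the released EEOMVC code linked above implements the actual method, including the label discretization mechanism:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def view_embedding(X, n_anchors=50, dim=10, rng=np.random.default_rng(0)):
    # Build an n x m anchor similarity graph instead of an n x n graph.
    anchors = X[rng.choice(len(X), n_anchors, replace=False)]
    Z = rbf_kernel(X, anchors)
    Z /= Z.sum(axis=1, keepdims=True)              # row-normalise the anchor graph
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :dim]                              # low-dimensional latent features

views = [np.random.rand(500, 30), np.random.rand(500, 40)]     # toy two-view data
fused = np.hstack([view_embedding(X) for X in views])          # unified representation
labels = KMeans(n_clusters=5, n_init=10).fit_predict(fused)    # single clustering step
print(labels[:10])
```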

4.
Sci Rep ; 13(1): 6569, 2023 Apr 21.
Article in English | MEDLINE | ID: mdl-37085586

ABSTRACT

Improving energy efficiency is a crucial aspect of building a sustainable smart city and, more broadly, relevant for improving environmental, economic, and social well-being. Non-intrusive load monitoring (NILM) is a computing technique that estimates energy consumption in real time and helps raise energy awareness among users to facilitate energy management. Most NILM solutions still follow a single-machine approach and do not fit well in smart cities. This work proposes a model-agnostic hybrid federated learning framework to collaboratively train NILM models for city-wide energy-saving applications. The framework supports both centralised and decentralised training modes to provide a cluster-based, customisable and optimal learning solution for users. The proposed framework is evaluated on a real-world energy disaggregation dataset. The results show that all NILM models trained in our proposed framework outperform the locally trained ones in accuracy. The results also suggest that the NILM models trained in our framework are resistant to privacy leakage.
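A minimal FedAvg-style sketch of the collaborative-training pattern that federated NILM frameworks build on: each site trains locally and a coordinator averages parameters weighted by local sample counts. The function and variable names are assumptions for illustration, not part of the paper's framework:

```python
import numpy as np

def federated_average(local_params, sample_counts):
    """local_params: list of dicts {name: ndarray}; sample_counts: list of ints."""
    total = sum(sample_counts)
    avg = {}
    for name in local_params[0]:
        # Weighted average of each parameter tensor across clients.
        avg[name] = sum(p[name] * (n / total)
                        for p, n in zip(local_params, sample_counts))
    return avg

# Toy round: two clients, each contributing one weight matrix.
clients = [{"w": np.ones((2, 2))}, {"w": 3 * np.ones((2, 2))}]
print(federated_average(clients, sample_counts=[100, 300])["w"])  # -> 2.5 everywhere
```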

5.
IEEE Trans Cybern ; 53(8): 5250-5263, 2023 Aug.
Article in English | MEDLINE | ID: mdl-35994538

ABSTRACT

Hyperspectral band selection aims to identify an optimal subset of bands for hyperspectral images (HSIs). Most existing clustering-based band selection methods directly stretch each band into a single feature vector and employ the pixelwise features to address band redundancy. In this way, they neither take full account of the spatial information nor consider the varying importance of different regions in HSIs, which leads to a suboptimal selection. To address these issues, a region-aware hierarchical latent feature representation learning-guided clustering (HLFC) method is proposed. Specifically, in order to fully preserve the spatial information of HSIs, a superpixel segmentation algorithm is first adopted to segment HSIs into multiple regions. For each segmented region, a similarity graph is constructed to reflect the band-wise similarity, and its corresponding Laplacian matrix is generated for learning low-dimensional latent features in a hierarchical way. All latent features are then fused to form a unified feature representation of HSIs. Finally, k-means clustering is applied to the unified feature representation matrix to generate multiple clusters, from each of which the band with maximum information entropy is selected to form the final subset of bands. Extensive experimental results demonstrate that the proposed clustering method outperforms state-of-the-art representative methods on band selection. The demo code of this work is publicly available at https://github.com/WangJun2023/HLFC.
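A compact sketch of the cluster-then-select-by-entropy pattern described above, on a toy HSI cube; the region segmentation and hierarchical latent-feature learning are simplified away, so this is an illustration of the selection criterion only, not the HLFC method (see the linked repository for that):

```python
import numpy as np
from sklearn.cluster import KMeans

def band_entropy(band, bins=64):
    # Shannon entropy of a band's intensity histogram.
    hist, _ = np.histogram(band, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def select_bands(hsi, n_selected):
    """hsi: (rows, cols, bands) cube; returns indices of the selected bands."""
    bands = hsi.reshape(-1, hsi.shape[-1]).T                 # one feature vector per band
    labels = KMeans(n_clusters=n_selected, n_init=10).fit_predict(bands)
    selected = []
    for c in range(n_selected):
        members = np.where(labels == c)[0]
        ent = [band_entropy(hsi[..., b]) for b in members]
        selected.append(members[int(np.argmax(ent))])        # max-entropy band per cluster
    return sorted(selected)

print(select_bands(np.random.rand(20, 20, 50), n_selected=5))  # toy 50-band cube
```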

6.
IEEE Trans Pattern Anal Mach Intell ; 44(2): 955-968, 2022 Feb.
Article in English | MEDLINE | ID: mdl-32759080

ABSTRACT

Although great success has been achieved in image defocus blur detection, there are still several unsolved challenges, e.g., interference from background clutter, scale sensitivity, and missing boundary details of blur regions. To deal with these issues, we propose a deep neural network that recurrently fuses and refines multi-scale deep features (DeFusionNet) for defocus blur detection. We first fuse the features from different layers of an FCN into shallow features and semantic features, respectively. Then, the fused shallow features are propagated to deep layers to refine the details of detected defocus blur regions, and the fused semantic features are propagated to shallow layers to assist in better locating blur regions. The fusion and refinement are carried out recurrently. In order to narrow the gap between low-level and high-level features, we embed a feature adaptation module before feature propagation to exploit the complementary information as well as reduce the contradictory responses of different feature layers. Since different feature channels have different degrees of discriminative power for detecting blur regions, we design a channel attention module to select discriminative features for feature refinement. Finally, the outputs of each layer at the last recurrent step are fused to obtain the final result. We collect a new dataset consisting of various challenging images and their pixel-wise annotations to promote further study. Extensive experiments on two commonly used datasets and our newly collected one demonstrate both the efficacy and efficiency of DeFusionNet.
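A small squeeze-and-excitation-style channel attention block, sketched as one plausible form of the channel attention module mentioned above; the reduction ratio and layer layout are assumptions, not necessarily the authors' exact layer:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                    # global average pool per channel
        return x * w.unsqueeze(-1).unsqueeze(-1)           # re-weight the channels

feat = torch.rand(2, 64, 32, 32)        # toy feature map
print(ChannelAttention(64)(feat).shape) # (2, 64, 32, 32), channels re-weighted
```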

7.
IEEE Trans Cybern ; 51(9): 4515-4527, 2021 Sep.
Article in English | MEDLINE | ID: mdl-31880579

ABSTRACT

This article presents an efficient fingerprint identification system that implements an initial classification for search-space reduction followed by minutiae neighbor-based feature encoding and matching. Current state-of-the-art fingerprint classification methods use a deep convolutional neural network (DCNN) to assign a confidence to the classification prediction, and based on this prediction, the input fingerprint is matched against only the subset of the database that belongs to the predicted class. For DCNNs it can be observed that, as the architectures deepen, the farthest layers of the network learn more abstract information from the input images, which results in higher prediction accuracy. The downside, however, is that DCNNs are data-hungry and require large amounts of annotated (labeled) data to learn generalized network parameters for the deeper layers. In this article, a shallow multifeature view CNN (SMV-CNN) fingerprint classifier is proposed that extracts: 1) fine-grained features from the input image and 2) abstract features from explicitly derived representations obtained from the input image. The multifeature views are fed to a fully connected neural network (NN) to compute a global classification prediction. The classification results show that the SMV-CNN achieved an improvement of 2.8% over a baseline CNN consisting of a single grayscale view on an open-source database. Moreover, in comparison with the state-of-the-art residual network (ResNet-50) image classification model, the proposed method performs comparably while being less complex and more efficient during training. The result of classification-based fingerprint identification shows that the search space is reduced by over 50% without degradation of identification accuracy.
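A minimal sketch of the multifeature-view idea: per-view feature vectors are concatenated and fed to a small fully connected head for a global class prediction. The view dimensions and head sizes are hypothetical, not taken from the SMV-CNN:

```python
import torch
import torch.nn as nn

class MultiViewClassifier(nn.Module):
    def __init__(self, view_dims, n_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sum(view_dims), 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, views):                 # views: list of (B, dim_i) tensors
        return self.head(torch.cat(views, dim=-1))

# e.g. one grayscale view plus two explicitly derived views (toy dimensions)
clf = MultiViewClassifier(view_dims=[256, 128, 128], n_classes=4)
logits = clf([torch.rand(8, 256), torch.rand(8, 128), torch.rand(8, 128)])
print(logits.shape)                           # (8, 4) global class scores
```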


Subject(s)
Neural Networks, Computer
8.
Sci Rep ; 10(1): 8055, 2020 05 15.
Article in English | MEDLINE | ID: mdl-32415130

ABSTRACT

El Niño-Southern Oscillation (ENSO), one of the main drivers of Earth's inter-annual climate variability, often causes a wide range of climate anomalies, and the advance prediction of ENSO is an important and challenging scientific issue. Since a unified and complete ENSO theory has yet to be established, related indicators, such as the Niño 3.4 index and the Southern Oscillation Index (SOI), are often used to predict the development trends of ENSO through appropriate numerical simulation models. However, because ENSO is a highly complex and dynamic phenomenon and the Niño 3.4 index and SOI mix many low- and high-frequency components, the prediction accuracy of current popular numerical prediction methods is not high. Therefore, this paper proposes the ensemble empirical mode decomposition-temporal convolutional network (EEMD-TCN) hybrid approach, which decomposes the highly variable Niño 3.4 index and SOI into relatively flat subcomponents, uses the TCN model to predict each subcomponent in advance, and finally combines the sub-prediction results to obtain the final ENSO prediction. Niño 3.4 index and SOI reanalysis data from 1871 to 1973 were used for model training, and the data for 1984-2019 were predicted 1 month, 3 months, 6 months, and 12 months in advance. The results show that the accuracy of the 1-month-lead Niño 3.4 index prediction was the highest, the accuracy of the 12-month-lead SOI prediction was the lowest, and the correlation coefficient between the worst SOI prediction result and the actual value still reached 0.6406. Furthermore, the overall prediction accuracy on the Niño 3.4 index was better than that on the SOI, which may be because the SOI contains too many high-frequency components, making prediction difficult. Comparative experiments with the TCN, LSTM, and EEMD-LSTM methods showed that EEMD-TCN provides the best overall prediction of both the Niño 3.4 index and the SOI at 1-, 3-, 6-, and 12-month leads among all the methods considered. This means that the proposed approach performs well in the advance prediction of ENSO and will be of great value in studying it.
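A schematic of the decompose-predict-recombine pattern described above, on a toy signal. It assumes the PyEMD package (PyPI name "EMD-signal") for the EEMD step, and the TCN is replaced here by a simple lagged linear model purely for illustration, so this is not the paper's model:

```python
import numpy as np
from PyEMD import EEMD                       # assumed dependency: pip install EMD-signal
from sklearn.linear_model import LinearRegression

def lagged_forecast(series, lags=12, horizon=1):
    # Fit a linear model on lagged windows and forecast one step ahead.
    X = np.array([series[i:i + lags] for i in range(len(series) - lags - horizon + 1)])
    y = series[lags + horizon - 1:]
    model = LinearRegression().fit(X, y)
    return model.predict(series[-lags:].reshape(1, -1))[0]

signal = np.sin(np.linspace(0, 20 * np.pi, 600)) + 0.3 * np.random.randn(600)  # toy index
imfs = EEMD().eemd(signal)                              # relatively flat subcomponents
forecast = sum(lagged_forecast(imf) for imf in imfs)    # recombine the sub-predictions
print(forecast)
```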

9.
Front Cell Infect Microbiol ; 10: 586054, 2020.
Article in English | MEDLINE | ID: mdl-33747973

ABSTRACT

Background: The outbreak of coronavirus disease 2019 (COVID-19) has become a global public health concern. Many inpatients with COVID-19 have shown clinical symptoms related to sepsis, which aggravates the deterioration of their condition. We aim to diagnose viral sepsis caused by SARS-CoV-2 by analyzing laboratory test data of patients with COVID-19 and to establish an early predictive model for sepsis risk among patients with COVID-19. Methods: This study retrospectively investigated laboratory test data of 2,453 patients with COVID-19 from electronic health records. Extreme gradient boosting (XGBoost) was employed to build four models with different feature subsets of a total of 69 collected indicators. Meanwhile, the explainable SHapley Additive exPlanations (SHAP) method was adopted to interpret predictive results and to analyze the feature importance of risk factors. Findings: The model for classifying COVID-19 viral sepsis with seven coagulation function indicators achieved an area under the receiver operating characteristic curve (AUC) of 0.9213 (95% CI, 89.94-94.31%), sensitivity of 97.17% (95% CI, 94.97-98.46%), and specificity of 82.05% (95% CI, 77.24-86.06%). The model for identifying COVID-19 coagulation disorders with eight features provided early warning an average of 3.68 (±4.60) days in advance, with 0.9298 AUC (95% CI, 86.91-99.04%), 82.22% sensitivity (95% CI, 67.41-91.49%), and 84.00% specificity (95% CI, 63.08-94.75%). Interpretation: We found that an abnormality of coagulation function was related to the occurrence of sepsis, and that other routine laboratory tests, represented by inflammatory factors, had moderate predictive value for coagulopathy, which indicates that early warning of sepsis in COVID-19 patients can be achieved by our established model to improve patient prognosis and reduce mortality.
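A minimal sketch of the modelling pattern described above, fitting XGBoost on tabular indicators and ranking risk factors with SHAP. The data here are synthetic and the "seven coagulation indicators" are stand-ins; this is not the study's cohort or feature set:

```python
import numpy as np
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))                     # synthetic stand-in for 7 indicators
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X_tr, y_tr)

explainer = shap.TreeExplainer(model)             # SHAP values explain each prediction
shap_values = explainer.shap_values(X_te)
print(np.abs(shap_values).mean(axis=0))           # mean |SHAP| as a feature-importance ranking
```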


Subject(s)
COVID-19/blood , Sepsis/virology , Adult , Aged , COVID-19/diagnosis , COVID-19/epidemiology , COVID-19 Testing , China/epidemiology , Female , Humans , Logistic Models , Machine Learning , Male , Middle Aged , Prognosis , ROC Curve , Retrospective Studies , Risk Factors , SARS-CoV-2/isolation & purification , Sepsis/blood , Sepsis/diagnosis
10.
Sensors (Basel) ; 19(5)2019 Mar 09.
Article in English | MEDLINE | ID: mdl-30857334

ABSTRACT

The JPEG XR encoding process utilizes two types of transform operations: the Photo Overlap Transform (POT) and the Photo Core Transform (PCT). Using the Device Porting Kit (DPK) provided by Microsoft, we performed encoding and decoding processes on JPEG XR images. It was discovered that when the quantization parameter is greater than 1 (lossy compression conditions), the resulting image displays chequerboard block artefacts, border artefacts and corner artefacts. These artefacts are due to the nonlinearity of the transforms used by JPEG XR. Typically, they are not very visible; however, they can cause problems in copying and scanning applications, where the source and the target of the image have different configurations and the nonlinear transforms become apparent. Hence, it is important for document image processing pipelines to take such artefacts into account. Additionally, these artefacts are most problematic for high-quality settings and become more visible at high compression ratios. In this paper, we analyse the cause of the above artefacts. It was found that the main problem lies in the POT and quantization steps. To solve this problem, the use of a "uniform matrix" is proposed. After the POT (encoding) and before the inverse POT (decoding), an extra step is added to multiply by this uniform matrix. Results suggest that this is an easy and effective way to decrease chequerboard, border and corner artefacts, thereby improving the image quality of lossy JPEG XR encoding compared with the original DPK program, with no increase in computational complexity or file size.

11.
IEEE Trans Cybern ; 49(5): 1932-1943, 2019 May.
Article in English | MEDLINE | ID: mdl-29993676

ABSTRACT

Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled whereas the remaining are unlabeled, positive-unlabeled (PU) learning is required to model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utilities of proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.
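A bare-bones sketch of the adaptive-sampling loop in the spirit described above (simplified; the published AdaSampling procedure differs in detail): training instances are resampled with probability proportional to how much the current model trusts their observed label, so likely mislabeled instances are progressively down-weighted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ada_sampling(X, y, n_iter=5, rng=np.random.default_rng(0)):
    weights = np.ones(len(y))                        # initial trust in every label
    model = None
    for _ in range(n_iter):
        idx = rng.choice(len(y), size=len(y), p=weights / weights.sum())
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        proba = model.predict_proba(X)
        weights = proba[np.arange(len(y)), y]        # P(observed label | x) as new trust
    return model

X = np.random.randn(300, 5)
y = (X[:, 0] > 0).astype(int)
y[:30] = 1 - y[:30]                                  # inject some label noise
clf = ada_sampling(X, y)                             # model trained on trusted samples
```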


Subject(s)
Computational Biology/methods , Machine Learning , Proteins , Algorithms , Models, Statistical , Phosphoproteins/chemistry , Phosphoproteins/genetics , Phosphoproteins/metabolism , Phosphotransferases/chemistry , Phosphotransferases/genetics , Phosphotransferases/metabolism , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Transcription Factors
12.
Int J Bioinform Res Appl ; 11(1): 10-29, 2015.
Article in English | MEDLINE | ID: mdl-25667383

ABSTRACT

Single nucleotide polymorphism (SNP) studies have recently received a significant amount of attention from researchers in many life science disciplines. Previous research indicated that a series of SNPs from the same chromosome, called a haplotype, contains more information than individual SNPs. Hence, discovering ways to reconstruct reliable single individual haplotypes (SIHs) has become one of the core issues in whole-genome research. However, sequences obtained from current high-throughput sequencing technologies always contain inevitable sequencing errors and/or missing information. The SIH reconstruction problem can be formulated as bi-partitioning the input SNP fragment matrix into paternal and maternal sections to achieve minimum error correction, a problem that has been proved NP-hard. In this study, we introduce a greedy approach, named RadixHap, to handle data sets with high error rates. The experimental results show that RadixHap generates highly reliable results in most cases. Furthermore, the algorithm structure of RadixHap is particularly suitable for whole-genome scale data sets.


Subject(s)
Algorithms , Chromosome Mapping/methods , DNA Mutational Analysis/methods , DNA/genetics , Haplotypes/genetics , Polymorphism, Single Nucleotide/genetics , Base Sequence , Humans , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity , Software
13.
IEEE Trans Cybern ; 44(3): 445-55, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24108722

ABSTRACT

Data sampling is a widely used technique in a broad range of machine learning problems. Traditional sampling approaches generally rely on random resampling from a given dataset. However, these approaches do not take into consideration additional information, such as sample quality and usefulness. We recently proposed a data sampling technique, called sample subset optimization (SSO). The SSO technique relies on a cross-validation procedure for identifying and selecting the most useful samples as subsets. In this paper, we describe the application of SSO techniques to imbalanced and ensemble learning problems, respectively. For imbalanced learning, the SSO technique is employed as an under-sampling technique for identifying a subset of highly discriminative samples in the majority class. In ensemble learning, the SSO technique is utilized as a generic ensemble technique where multiple optimized subsets of samples from each class are selected for building an ensemble classifier. We demonstrate the utilities and advantages of the proposed techniques on a variety of bioinformatics applications where class imbalance, small sample size, and noisy data are prevalent.
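A hedged sketch of the under-sampling idea, under one plausible reading of a cross-validation-based sample-usefulness score (not the exact SSO optimisation): each majority-class sample is scored by the out-of-fold confidence a model assigns to its own class, and only the most useful samples are kept:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

def select_majority_subset(X, y, majority_label, keep_ratio=0.5):
    # Out-of-fold class probabilities; columns follow np.unique(y).
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")
    classes = np.unique(y)
    col = int(np.where(classes == majority_label)[0][0])
    maj_idx = np.where(y == majority_label)[0]
    usefulness = proba[maj_idx, col]                 # confidence in the sample's own class
    keep = maj_idx[np.argsort(usefulness)[::-1][: int(len(maj_idx) * keep_ratio)]]
    minority = np.where(y != majority_label)[0]
    sel = np.concatenate([keep, minority])
    return X[sel], y[sel]

X = np.random.randn(400, 5)
y = (np.random.rand(400) < 0.2).astype(int)          # imbalanced toy labels
Xb, yb = select_majority_subset(X, y, majority_label=0)
print(Xb.shape, yb.shape)                            # reduced, more balanced training set
```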


Subject(s)
Algorithms , Computational Biology/methods , Data Interpretation, Statistical , Machine Learning , Models, Statistical , Pattern Recognition, Automated/methods , Sample Size , Computer Simulation
14.
J Biomed Inform ; 45(5): 922-30, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22465411

ABSTRACT

Discovering ways to reconstruct reliable single individual haplotypes (SIHs) has become one of the core issues in whole-genome research, as previous research showed that haplotypes contain more information than individual single nucleotide polymorphisms (SNPs). Although advances in high-throughput sequencing technologies are making it easier to obtain sequence information in today's laboratories, sequences obtained from current technologies always contain inevitable sequencing errors and missing information. The SIH reconstruction problem can be formulated as bi-partitioning the input SNP fragment matrix into paternal and maternal sections so as to achieve minimum error correction (MEC), a problem that has been proved NP-hard. Several heuristic or greedy algorithms have already been designed and implemented to solve this problem; most of them, however, (1) cannot handle data sets with high error rates and/or (2) can only handle binary input matrices. In this study, we introduce a Genetic Algorithm (GA) based method, named GAHap, to reconstruct SIHs with the lowest MEC cost. GAHap is equipped with a well-designed fitness function to obtain better reconstruction rates. GAHap is also compared with existing methods to show its ability to generate highly reliable solutions.
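A minimal illustration of the minimum error correction (MEC) objective that a GA individual (a candidate bi-partition of the fragments) would be scored on: count the fragment entries that disagree with the majority-vote consensus of their assigned side. The toy encoding here (0/1 alleles, -1 for missing) is an assumption for illustration, not GAHap itself:

```python
import numpy as np

def mec(fragments, partition):
    """fragments: (n_frags, n_snps) array over {0, 1, -1}; -1 marks a missing call.
    partition: length-n_frags 0/1 array assigning each fragment to a haplotype."""
    cost = 0
    for side in (0, 1):
        rows = fragments[partition == side]
        if len(rows) == 0:
            continue
        # Majority-vote consensus over the observed entries of this side.
        ones = (rows == 1).sum(axis=0)
        zeros = (rows == 0).sum(axis=0)
        consensus = (ones >= zeros).astype(int)
        observed = rows != -1
        cost += int((observed & (rows != consensus)).sum())
    return cost

frags = np.array([[0, 1, -1], [0, 1, 1], [1, 0, 0], [1, -1, 0]])
print(mec(frags, np.array([0, 0, 1, 1])))   # -> 0 for a conflict-free bi-partition
```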


Subject(s)
Algorithms , Genomics/methods , Haplotypes , Models, Genetic , Genetic Testing , Humans , Polymorphism, Single Nucleotide , Sequence Analysis, DNA
15.
Theory Biosci ; 131(3): 193-203, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22127956

ABSTRACT

We have recently presented a framework for the information dynamics of distributed computation that locally identifies the component operations of information storage, transfer, and modification. We have observed that while these component operations exist to some extent in all types of computation, complex computation is distinguished in having coherent structure in its local information dynamics profiles. In this article, we conjecture that coherent information structure is a defining feature of complex computation, particularly in biological systems or artificially evolved computation that solves human-understandable tasks. We present a methodology for studying coherent information structure, consisting of state-space diagrams of the local information dynamics and a measure of structure in these diagrams. The methodology identifies both clear and "hidden" coherent structure in complex computation, most notably reconciling conflicting interpretations of the complexity of the Elementary Cellular Automata rule 22.


Subject(s)
Computational Biology/methods , Information Storage and Retrieval/methods , Models, Biological
16.
BMC Genomics ; 11 Suppl 4: S22, 2010 Dec 02.
Article in English | MEDLINE | ID: mdl-21143806

ABSTRACT

Changes to the glycosylation profile on HIV gp120 can influence viral pathogenesis and alter AIDS disease progression. The characterization of glycosylation differences at the sequence level is inadequate as the placement of carbohydrates is structurally complex. However, no structural framework is available to date for the study of HIV disease progression. In this study, we propose a novel machine-learning based framework for the prediction of AIDS disease progression in three stages (RP, SP, and LTNP) using the HIV structural gp120 profile. This new intelligent framework proves to be accurate and provides an important benchmark for predicting AIDS disease progression computationally. The model is trained using a novel HIV gp120 glycosylation structural profile to detect possible stages of AIDS disease progression for the target sequences of HIV+ individuals. The performance of the proposed model was compared to seven existing machine-learning models on the newly proposed gp120-Benchmark_1 dataset in terms of error rate (MSE), accuracy (CCI), stability (STD), and complexity (TBM). The novel framework showed better predictive performance with 67.82% CCI, 30.21 MSE, 0.8 STD, and 2.62 TBM on the three stages of AIDS disease progression of 50 HIV+ individuals. This framework is an invaluable bioinformatics tool that will be useful for the clinical assessment of viral pathogenesis.


Subject(s)
Acquired Immunodeficiency Syndrome/diagnosis , Computational Biology , HIV Envelope Protein gp120 , Software , Artificial Intelligence , Benchmarking , Cohort Studies , Disease Progression , Glycosylation , HIV Envelope Protein gp120/genetics , HIV Envelope Protein gp120/metabolism , HIV Seropositivity , Humans , Models, Biological , Predictive Value of Tests
17.
BMC Bioinformatics ; 11: 524, 2010 Oct 21.
Article in English | MEDLINE | ID: mdl-20961462

ABSTRACT

BACKGROUND: It has now become clear that gene-gene interactions and gene-environment interactions are ubiquitous and fundamental mechanisms for the development of complex diseases. Though considerable effort has been put into developing statistical models and algorithmic strategies for identifying such interactions, the accurate identification of these genetic interactions has proven very challenging. METHODS: In this paper, we propose a new approach for identifying such gene-gene and gene-environment interactions underlying complex diseases. This is a hybrid algorithm that combines a genetic algorithm (GA) with an ensemble of classifiers (called a genetic ensemble). Using this approach, the original problem of SNP interaction identification is converted into a data mining problem of combinatorial feature selection. By collecting the various single nucleotide polymorphism (SNP) subsets as well as environmental factors generated in multiple GA runs, patterns of gene-gene and gene-environment interactions can be extracted using a simple combinatorial ranking method. Also considered in this study is the idea of combining identification results obtained from multiple algorithms. A novel formula based on pairwise double fault is designed to quantify the degree of complementarity. CONCLUSIONS: Our simulation study demonstrates that the proposed genetic ensemble algorithm has comparable identification power to Multifactor Dimensionality Reduction (MDR) and is slightly better than Polymorphism Interaction Analysis (PIA), which are the two most popular methods for gene-gene interaction identification. More importantly, the identification results generated by our genetic ensemble algorithm are highly complementary to those obtained by PIA and MDR. Experimental results from our simulation studies and real-world data application also confirm the effectiveness of the proposed genetic ensemble algorithm, as well as the potential benefits of combining identification results from different algorithms.
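A small sketch of a pairwise double-fault style complementarity check, using the generic double-fault definition (the fraction of cases both methods get wrong) rather than the paper's exact formula: lower values mean the two methods' errors overlap less, i.e. they complement each other more. The example labels are toy values:

```python
import numpy as np

def double_fault(pred_a, pred_b, truth):
    # Fraction of cases misclassified by both methods simultaneously.
    both_wrong = (pred_a != truth) & (pred_b != truth)
    return both_wrong.mean()

truth    = np.array([1, 1, 0, 0, 1, 0])
method_a = np.array([1, 0, 0, 0, 1, 1])   # e.g. calls from one identification method
method_b = np.array([1, 1, 0, 1, 0, 0])   # e.g. calls from another method
print(double_fault(method_a, method_b, truth))   # -> 0.0: their errors do not overlap
```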


Subject(s)
Algorithms , Polymorphism, Single Nucleotide , Computer Simulation , Genes , Models, Genetic , Multifactor Dimensionality Reduction
18.
Chaos ; 20(3): 037109, 2010 Sep.
Article in English | MEDLINE | ID: mdl-20887075

ABSTRACT

Distributed computation can be described in terms of the fundamental operations of information storage, transfer, and modification. To describe the dynamics of information in computation, we need to quantify these operations on a local scale in space and time. In this paper we extend previous work regarding the local quantification of information storage and transfer, to explore how information modification can be quantified at each spatiotemporal point in a system. We introduce the separable information, a measure which locally identifies information modification events where separate inspection of the sources to a computation is misleading about its outcome. We apply this measure to cellular automata, where it is shown to be the first direct quantitative measure to provide evidence for the long-held conjecture that collisions between emergent particles therein are the dominant information modification events.

19.
Int J Comput Biol Drug Des ; 3(1): 1-14, 2010.
Article in English | MEDLINE | ID: mdl-20693606

ABSTRACT

Existing phylogenetic methods cannot realistically model the evolutionary process. This has become a serious issue for real-life applications that demand accurate phylogenetic results. It is desirable to have an integrative approach that can effectively incorporate multi-disciplinary analyses and synthesise results from various sources. A novel integrative and interactive computing system has been developed to address this issue. We introduce the concept of the super-quartet and explain how it is adopted to effectively integrate other computational methods into the algorithm during computation, a key issue in the development of integrative and interactive computing systems.


Subject(s)
Computational Biology/methods , Models, Genetic , Phylogeny , Algorithms , Evolution, Molecular
20.
BMC Bioinformatics ; 11 Suppl 1: S5, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122224

ABSTRACT

BACKGROUND: Feature selection techniques are critical to the analysis of high-dimensional datasets. This is especially true for gene selection from microarray data, which commonly have an extremely high feature-to-sample ratio. In addition to essential objectives such as reducing data noise, reducing data redundancy, improving sample classification accuracy, and improving model generalization, feature selection also helps biologists focus on the selected genes to further validate their biological hypotheses. RESULTS: In this paper we describe an improved hybrid system for gene selection. It is based on a recently proposed genetic ensemble (GE) system. To enhance the generalization property of the selected genes or gene subsets and to overcome the overfitting problem of the GE system, we devised a mapping strategy to fuse the goodness information of each gene provided by multiple filtering algorithms. This information is then used for the initialization and mutation operations of the genetic ensemble system. CONCLUSION: We used four benchmark microarray datasets (including both binary-class and multi-class classification problems) for proof of concept and model evaluation. The experimental results indicate that the proposed multi-filter enhanced genetic ensemble (MF-GE) system is able to improve sample classification accuracy, generate more compact gene subsets, and converge to the selection results more quickly. The MF-GE system is very flexible, as various combinations of multiple filters and classifiers can be incorporated based on the data characteristics and user preferences.
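A short sketch of fusing per-gene "goodness" scores from multiple filters into a single vector that could, for example, seed the inclusion probabilities of a GA population. This illustrates the general multi-filter fusion idea on synthetic data with two standard filters; the MF-GE mapping strategy itself is described in the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = make_classification(n_samples=100, n_features=200, n_informative=10, random_state=0)

def rank_scores(scores):
    # Convert raw filter scores to ranks in [0, 1] so different filters are comparable.
    return scores.argsort().argsort() / (len(scores) - 1)

f_scores, _ = f_classif(X, y)                               # filter 1: ANOVA F-test
mi_scores = mutual_info_classif(X, y, random_state=0)       # filter 2: mutual information
fused = (rank_scores(f_scores) + rank_scores(mi_scores)) / 2 # fused per-gene goodness
init_probs = fused / fused.sum()       # e.g. probability of including each gene initially
top_genes = np.argsort(fused)[::-1][:20]
print(top_genes)                       # genes the fused score considers most informative
```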


Subject(s)
Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Genes