Results 1 - 20 of 32
1.
Front Biosci (Landmark Ed) ; 29(5): 197, 2024 May 21.
Article in English | MEDLINE | ID: mdl-38812315

ABSTRACT

BACKGROUND: Ubiquitination is a crucial post-translational modification of proteins that regulates diverse cellular functions. Accurate identification of ubiquitination sites in proteins is vital for understanding fundamental biological mechanisms, such as cell cycle and DNA repair. Conventional experimental approaches are resource-intensive, whereas machine learning offers a cost-effective means of accurately identifying ubiquitination sites. The prediction of ubiquitination sites is species-specific, with many existing models being tailored for Arabidopsis thaliana (A. thaliana) and Homo sapiens (H. sapiens). However, these models have shortcomings in sequence window selection and feature extraction, leading to suboptimal performance. METHODS: This study initially employed the chi-square test to determine the optimal sequence window. Subsequently, a combination of six features was assessed: Binary Encoding (BE), Composition of K-Spaced Amino Acid Pair (CKSAAP), Enhanced Amino Acid Composition (EAAC), Position Weight Matrix (PWM), 531 Properties of Amino Acids (AA531), and Position-Specific Scoring Matrix (PSSM). Comparative evaluation involved three feature selection methods: Minimum Redundancy-Maximum Relevance (mRMR), Elastic net, and Null importances. Alongside these were four classifiers: Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The Null importances combined with the RF model exhibited superior predictive performance, and was denoted as UbNiRF (A. thaliana: ArUbNiRF; H. sapiens: HoUbNiRF). RESULTS: A comprehensive assessment indicated that UbNiRF is superior to existing prediction tools across five performance metrics. It notably excelled in the Matthews Correlation Coefficient (MCC), with values of 0.827 for the A. thaliana dataset and 0.781 for the H. sapiens dataset. 
Feature analysis underscores the significance of integrating six features and demonstrates their critical role in enhancing model performance. CONCLUSIONS: UbNiRF is a valuable predictive tool for identifying ubiquitination sites in both A. thaliana and H. sapiens. Its robust performance and species-specific discovery capabilities make it extremely useful for elucidating biological processes and disease mechanisms associated with ubiquitination.


Subjects
Arabidopsis, Ubiquitination, Arabidopsis/metabolism, Arabidopsis/genetics, Humans, Computational Biology/methods, Machine Learning, Arabidopsis Proteins/metabolism, Arabidopsis Proteins/genetics, Algorithms, Support Vector Machine, Random Forest
2.
J Bioinform Comput Biol ; 21(5): 2350024, 2023 10.
Article in English | MEDLINE | ID: mdl-37899352

ABSTRACT

O-glycosylation (Oglyc) plays an important role in various biological processes. The key to understanding the mechanisms of Oglyc is identifying the corresponding glycosylation sites. Two critical steps, feature selection and classifier design, greatly affect the accuracy of computational methods for predicting Oglyc sites. Based on an efficient feature selection algorithm and a classifier capable of handling imbalanced datasets, a new computational method, ChiMIC-based balanced decision table for O-glycosylation (CBDT-Oglyc), is proposed to predict Oglyc sites in proteins. Sequence characterization is performed by combining amino acid composition (AAC), undirected composition of k-spaced amino acid pairs (undirected-CKSAAP) and pseudo-position-specific scoring matrix (PsePSSM). The Chi-MIC-share algorithm is used for feature selection, which simplifies the model and improves predictive accuracy. For imbalanced classification, a backtracking method based on a local chi-square test is designed, and then cost-sensitive learning is incorporated to construct a novel classifier named ChiMIC-based balanced decision table (CBDT). Based on a 1:49 (positives:negatives) training set, the CBDT classifier achieves significantly better prediction performance than traditional classifiers. Moreover, the independent test results on separate human and mouse glycoproteins show that CBDT-Oglyc outperforms previous methods in global accuracy. CBDT-Oglyc shows great promise in predicting Oglyc sites and is expected to facilitate further experimental studies on protein glycosylation.


Subjects
Algorithms, Proteins, Humans, Animals, Mice, Glycosylation, Proteins/chemistry, Amino Acids, Computational Biology/methods
3.
Front Genet ; 14: 1165648, 2023.
Article in English | MEDLINE | ID: mdl-37576555

ABSTRACT

Background: The tumor microenvironment (TME) of breast cancer (BRCA) is a complex and dynamic micro-ecosystem that influences BRCA occurrence, progression, and prognosis through its cellular and molecular components. However, as the tumor progresses, the dynamic changes of stromal and immune cells in TME become unclear. Objective: The aim of this study was to identify differentially co-expressed genes (DCGs) associated with the proportion of stromal cells in TME of BRCA, to explore the patterns of cell proportion changes, and ultimately, their impact on prognosis. Methods: A new heuristic feature selection strategy (CorDelSFS) was combined with differential co-expression analysis to identify TME-key DCGs. The expression pattern and co-expression network of TME-key DCGs were analyzed across different TMEs. A prognostic model was constructed using six TME-key DCGs, and the correlation between the risk score and the proportion of stromal cells and immune cells in TME was evaluated. Results: TME-key DCGs mimicked the dynamic trend of BRCA TME and formed cell type-specific subnetworks. The IG gene-related subnetwork, plasmablast-specific expression, played a vital role in the BRCA TME through its adaptive immune function and tumor progression inhibition. The prognostic model showed that the risk score was significantly correlated with the proportion of stromal cells and immune cells in TME, and low-risk patients had stronger adaptive immune function. IGKV1D-39 was identified as a novel BRCA prognostic marker specifically expressed in plasmablasts and involved in adaptive immune responses. Conclusions: This study explores the role of proportionate-related genes in the tumor microenvironment using a machine learning approach and provides new insights for discovering the key biological processes in tumor progression and clinical prognosis.

4.
Nat Commun ; 14(1): 3930, 2023 07 04.
Article in English | MEDLINE | ID: mdl-37402793

ABSTRACT

Genetic improvement of grain quality is more challenging in hybrid rice than in inbred rice due to additional nonadditive effects such as dominance. Here, we describe a pipeline developed for joint analysis of phenotypes, effects, and generations (JPEG). As a demonstration, we analyze 12 grain quality traits of 113 inbred lines (male parents), five tester lines (female parents), and 565 (113×5) of their hybrids. We sequence the parents for single nucleotide polymorphism calling and infer the genotypes of the hybrids. Genome-wide association studies with JPEG identify 128 loci associated with at least one of the 12 traits, including 44, 97, and 13 loci with additive effects, dominant effects, and both additive and dominant effects, respectively. These loci together explain more than 30% of the genetic variation in hybrid performance for each of the traits. The JPEG statistical pipeline can help to identify superior crosses for breeding rice hybrids with improved grain quality.


Subjects
Oryza, Oryza/genetics, Genome-Wide Association Study, Plant Breeding, Phenotype, Genotype, Edible Grain/genetics
5.
PeerJ ; 11: e14706, 2023.
Article in English | MEDLINE | ID: mdl-36710872

ABSTRACT

Background: Identifying cell types using unsupervised methods is essential for scRNA-seq research. However, conventional similarity measures introduce challenges to single-cell data clustering because of the high dimensionality, high noise, and high dropout rate of the data. Methods: We proposed a clustering method for small scRNA-seq data based on Subspace and Weighted Distance (SSWD), which follows the assumption that the sets of gene subspace composed of similar density-distributing genes can better distinguish cell groups. To accurately capture the intrinsic relationship among cells or genes, a new distance metric that combines Euclidean and Pearson distance through a weighting strategy was proposed. The relative Calinski-Harabasz (CH) index was used to estimate the cluster numbers instead of the CH index because it is comparable across degrees of freedom. Results: We compared SSWD with seven prevailing methods on eight publicly available scRNA-seq datasets. The experimental results show that SSWD has better clustering accuracy and a better ability to partition cell groups. SSWD can be downloaded at https://github.com/ningzilan/SSWD.


Subjects
Gene Expression Profiling, Single-Cell Gene Expression Analysis, Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Cluster Analysis
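
The abstract of entry 5 mentions a distance that combines Euclidean and Pearson distance through a weighting strategy, without giving the formula. The sketch below is only an illustration of that general idea, under the assumption of a simple convex combination with a hypothetical weight `w`; it is not the SSWD implementation, and the function names are invented for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation (undefined if a profile is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    norm_a = math.sqrt(sum((u - mean_a) ** 2 for u in a))
    norm_b = math.sqrt(sum((v - mean_b) ** 2 for v in b))
    return 1.0 - cov / (norm_a * norm_b)


def weighted_distance(a, b, w=0.5):
    """Assumed combination: w * Euclidean + (1 - w) * Pearson distance."""
    return w * euclidean(a, b) + (1.0 - w) * pearson_distance(a, b)


# Illustrative profiles: same shape, different magnitude.
cell_a = [1.0, 2.0, 3.0, 4.0]
cell_b = [2.0, 4.0, 6.0, 8.0]
print(weighted_distance(cell_a, cell_b, w=0.5))
```

With these two profiles the Pearson term is essentially zero while the Euclidean term is not, which shows why mixing the two measures can separate "same pattern, different scale" from "different pattern".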
6.
BioData Min ; 15(1): 3, 2022 Feb 10.
Article in English | MEDLINE | ID: mdl-35144656

ABSTRACT

BACKGROUND: Lysine succinylation is a type of protein post-translational modification which is widely involved in cell differentiation, cell metabolism and other important physiological activities. To study the molecular mechanism of succinylation in depth, succinylation sites need to be accurately identified, and because experimental approaches are costly and time-consuming, there is a great demand for reliable computational methods. Feature extraction is a key step in building succinylation site prediction models, and the development of effective new features improves predictive accuracy. Because the number of false succinylation sites far exceeds that of true sites, traditional classifiers perform poorly, and designing a classifier to effectively handle highly imbalanced datasets has always been a challenge. RESULTS: A new computational method, iSuc-ChiDT, is proposed to identify succinylation sites in proteins. In iSuc-ChiDT, chi-square statistical difference table encoding is developed to extract positional features, and has a higher predictive accuracy and fewer features compared to common position-based encoding schemes such as binary encoding and physicochemical property encoding. Single amino acid and undirected pair-coupled amino acid composition features are supplemented to improve the fault tolerance for residue insertions and deletions. After feature selection by the Chi-MIC-share algorithm, the chi-square decision table (ChiDT) classifier is constructed for imbalanced classification. With a training set of 4748:50,551 (true:false sites), ChiDT clearly outperforms traditional classifiers in predictive accuracy, and runs fast. Using an independent testing set of experimentally identified succinylation sites, iSuc-ChiDT achieves a sensitivity of 70.47%, a specificity of 66.27%, a Matthews correlation coefficient of 0.205, and a global accuracy index Q9 of 0.683, showing a significant improvement in sensitivity and overall accuracy compared to PSuccE, Success, SuccinSite, and other existing succinylation site predictors. CONCLUSIONS: iSuc-ChiDT shows great promise in predicting succinylation sites and is expected to facilitate further experimental investigation of protein succinylation.
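
Entry 6 reports sensitivity, specificity and the Matthews correlation coefficient (MCC) on an imbalanced test set. As a reading aid, the sketch below computes these three metrics from a confusion matrix using their standard definitions. The counts are hypothetical, chosen only to illustrate how a classifier with roughly 70% sensitivity and 66% specificity can still have a modest MCC when negatives dominate; they are not the study's data.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true sites that are predicted as sites."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of false sites that are predicted as non-sites."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 1,000 true sites and 10,000 false sites.
tp, fn = 705, 295
tn, fp = 6627, 3373

print(f"sensitivity = {sensitivity(tp, fn):.4f}")
print(f"specificity = {specificity(tn, fp):.4f}")
print(f"MCC         = {mcc(tp, tn, fp, fn):.4f}")
```

Because MCC uses all four cells of the confusion matrix, the large number of false positives among the abundant negatives pulls it down even when both per-class rates look reasonable.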

7.
BMC Bioinformatics ; 23(1): 30, 2022 Jan 10.
Article in English | MEDLINE | ID: mdl-35012448

ABSTRACT

BACKGROUND: Plant variety identification is one of the most important aspects of agricultural systems. DNA marker profiles of released varieties need to be developed for comparison with candidate or future varieties. However, strictly speaking, most existing variety identification techniques have been used not for "identification" but for "distinction of a limited number of cultivars," and their generalization ability has rarely been well estimated. Many varieties have similar genetic backgrounds, and some essentially derived varieties (EDVs) are even involved, which makes identification and breeding progress difficult. A fast, accurate variety identification method that also performs well in EDV determination needs to be developed. RESULTS: In this study, following a "Divide and Conquer" strategy, a variety identification method, Conditional Random Selection (CRS), based on whole-genome SNPs of 3024 rice varieties, was developed and applied to the identification of essentially derived varieties (EDVs) of rice. CRS is a fast, efficient, and automated variety identification method. In practice, with the optimal identity score threshold found in this study, the SNP set (comprising 390 SNPs) showed optimal performance in EDV and non-EDV identification on two independent testing datasets. CONCLUSION: This approach first selected a minimal set of SNPs to discriminate non-EDVs in the 3000 Rice Genome Project, then united several simplified SNP sets to improve its generalization ability for EDV and non-EDV identification in testing datasets. The results suggested that the CRS method outperformed traditional feature selection methods. Furthermore, it provides a new way to screen out core SNP loci from the whole genome for DNA fingerprinting of crop varieties and should be useful for crop breeding.


Subjects
Oryza, DNA Fingerprinting, Genetic Markers, Plant Genome, Nucleotides, Oryza/genetics, Single Nucleotide Polymorphism
8.
Interdiscip Sci ; 14(1): 245-257, 2022 Mar.
Article in English | MEDLINE | ID: mdl-34694561

ABSTRACT

The weighted gene co-expression network analysis (WGCNA) method constructs co-expressed gene modules based on the linear similarity between paired gene expressions. Linear correlations are the main form of similarity between genes; however, nonlinear correlations also exist and have generally been ignored. We proposed a modified network analysis method, WGCNA-P + M, which combines Pearson's correlation coefficient and the maximum information coefficient (MIC) as the similarity measures to assess the linear and nonlinear correlations between genes, respectively. Taking two real datasets, GSE44861 and liver hepatocellular carcinoma (TCGA-LIHC), as examples, we compared the gene modules constructed by WGCNA-P + M and WGCNA from four perspectives: the "Usefulness" score, GO enrichment analysis on genes in the gray module, prediction performance of the top hub gene, survival analysis and literature reports on different hub genes. The results showed that the modules obtained by WGCNA-P + M are more biologically meaningful and that the hub genes obtained from WGCNA-P + M include more potential cancer genes.


Subjects
Hepatocellular Carcinoma, Liver Neoplasms, Hepatocellular Carcinoma/genetics, Gene Expression Profiling/methods, Gene Regulatory Networks, Neoplasm Genes/genetics, Humans, Liver Neoplasms/genetics
9.
R Soc Open Sci ; 8(2): 201424, 2021 Feb 10.
Article in English | MEDLINE | ID: mdl-33972855

ABSTRACT

The maximal information coefficient (MIC) captures both linear and nonlinear correlations between variable pairs. In this paper, we proposed the BackMIC algorithm for MIC estimation. The BackMIC algorithm adds a searching back process on the equipartitioned axis to obtain a better grid partition than the original implementation algorithm ApproxMaxMI. Similar to the ChiMIC algorithm, it terminates the grid search process by the χ²-test instead of the maximum number of bins B(n, α). Results on simulated data show that the BackMIC algorithm maintains the generality of MIC, and gives more reasonable grid partition and MIC values for independent and dependent variable pairs under comparable running times. Moreover, it is robust under different α in B(n, α). MIC calculated by the BackMIC algorithm reveals an improvement in statistical power and equitability. We applied (1-MIC) as the distance measurement in the K-means algorithm to perform a clustering of the cancer/normal samples. The results on four cancer datasets demonstrated that the MIC values calculated by the BackMIC algorithm can obtain better clustering results, indicating the correlations between samples measured by the BackMIC algorithm were more credible than those measured by other algorithms.

10.
Bull World Health Organ ; 98(7): 484-494, 2020 Jul 01.
Article in English | MEDLINE | ID: mdl-32742034

ABSTRACT

OBJECTIVE: To design a simple model to assess the effectiveness of measures to prevent the spread of coronavirus disease 2019 (COVID-19) to different regions of mainland China. METHODS: We extracted data on population movements from an internet company data set and the numbers of confirmed cases of COVID-19 from government sources. On 23 January 2020 all travel in and out of the city of Wuhan was prohibited to control the spread of the disease. We modelled two key factors affecting the cumulative number of COVID-19 cases in regions outside Wuhan by 1 March 2020: (i) the total number of people leaving Wuhan during 20-26 January 2020; and (ii) the number of seed cases from Wuhan before 19 January 2020, represented by the cumulative number of confirmed cases on 29 January 2020. We constructed a regression model to predict the cumulative number of cases in non-Wuhan regions in three assumed epidemic control scenarios. FINDINGS: Delaying the start date of control measures by only 3 days would have increased the estimated 30 699 confirmed cases of COVID-19 by 1 March 2020 in regions outside Wuhan by 34.6% (to 41 330 people). Advancing controls by 3 days would reduce infections by 30.8% (to 21 235 people) with basic control measures or 48.6% (to 15 796 people) with strict control measures. Based on standard residual values from the model, we were able to rank regions which were most effective in controlling the epidemic. CONCLUSION: The control measures in Wuhan combined with nationwide traffic restrictions and self-isolation reduced the ongoing spread of COVID-19 across China.


Subjects
Communicable Disease Control/organization & administration, Coronavirus Infections/epidemiology, Viral Pneumonia/epidemiology, Travel, Betacoronavirus, COVID-19, China/epidemiology, Cities, Communicable Disease Control/standards, Holidays, Humans, Pandemics, SARS-CoV-2
11.
Preprint in English | medRxiv | ID: ppmedrxiv-20029561

ABSTRACT

Since COVID-19 emerged in early December 2019 in Wuhan and swept across mainland China, a series of large-scale public health interventions, especially the Wuhan lock-down combined with nationwide traffic restrictions and the Stay At Home Movement, have been taken by the government to control the epidemic. Based on Baidu Migration data and confirmed case data, we identified two key factors affecting the later (e.g. February 27, 2020) cumulative confirmed cases in the non-Wuhan regions (y). One is the total number of travelers from Wuhan from January 20 to January 26 (x1), who had a higher probability of infection but lower transmission ability because the human-to-human transmission risk of COVID-19 was confirmed and announced on January 20. The other is the "seed cases" from Wuhan before January 19, which had higher transmission ability and could be represented by the confirmed cases before January 29 (x2), owing to a mean 10-day delay between infection and detection. A simple yet effective regression model was then established as follows: $y = 70.0916 + 0.0054\,x_1 + 2.3455\,x_2$ ($n = 44$, $R^2 = 0.9330$, $P < 10^{-7}$). Had the lock-down started only 3 days later or 3 days earlier, the estimated number of confirmed cases by February 27 in the non-Wuhan regions would have increased by 35.21% or decreased by 30.74%-48.59%, respectively. Although the above interventions greatly reduced human mobility, the Wuhan lock-down combined with nationwide traffic restrictions and the Stay At Home Movement did have a determining effect on the ongoing spread of COVID-19 across mainland China. The strategy adopted by China has changed the fast-rising curve of newly diagnosed cases; the international community should learn from the lessons of Wuhan and the experience of China. The efforts of 29 provinces and 44 prefecture-level cities against COVID-19 were also assessed preliminarily according to the interpretive model. Big data has played and will continue to play an important role in public health.
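
The regression quoted in the abstract of entry 11 can be evaluated directly. The sketch below plugs in the three coefficients as reported; the function name and the input values are hypothetical and serve only to show how the two predictors contribute to the estimate.

```python
def predict_cumulative_cases(x1: float, x2: float) -> float:
    """Evaluate y = 70.0916 + 0.0054*x1 + 2.3455*x2 from the abstract.

    x1: travelers from Wuhan during January 20-26
    x2: confirmed cases before January 29
    """
    return 70.0916 + 0.0054 * x1 + 2.3455 * x2


# Hypothetical region (illustrative inputs, not data from the study).
travelers, early_cases = 10_000, 50
estimate = predict_cumulative_cases(travelers, early_cases)

print(f"contribution of x1: {0.0054 * travelers:.1f}")
print(f"contribution of x2: {2.3455 * early_cases:.1f}")
print(f"estimated cumulative cases: {estimate:.1f}")
```

With coefficients of this size, each early confirmed case weighs several hundred times more than a single traveler, which matches the abstract's description of "seed cases" as having higher transmission ability.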

12.
RSC Adv ; 10(34): 19852-19860, 2020 May 26.
Article in English | MEDLINE | ID: mdl-35520405

ABSTRACT

Quantitative structure-activity relationship models are used in toxicology to predict the effects of organic compounds on aquatic organisms. Common filter feature selection methods use correlation statistics to rank features, but this approach considers only the correlation between a single feature and the response variable and does not take into account feature redundancy. Although the minimal redundancy maximal relevance approach considers the redundancy among features, direct removal of the redundant features may result in loss of prediction accuracy, and cross-validation of training sets to select an optimal subset of features is time-consuming. In this paper, we describe the development of a feature selection method, Chi-MIC-share, which can terminate feature selection automatically and is based on an improved maximal information coefficient and a redundant allocation strategy. We validated Chi-MIC-share using three environmental toxicology datasets and a support vector regression model. The results show that Chi-MIC-share is more accurate than other feature selection methods. We also performed a significance test on the model and analyzed the single-factor effects of the reserved descriptors.

13.
Biol Direct ; 14(1): 6, 2019 04 11.
Article in English | MEDLINE | ID: mdl-30975175

ABSTRACT

BACKGROUND: Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches developed for splice site prediction have achieved satisfactory accuracy, further improvement in predictive accuracy is significant because it contributes to more accurate prediction of gene structure. Determining a proper window size before prediction is necessary. An overly long window may introduce irrelevant features, which would reduce predictive accuracy, while a short window with maximum information may perform better in terms of predictive accuracy and time cost. Furthermore, because the number of false splice sites following the GT-AG rule far exceeds that of true splice sites, accurate and rapid prediction of splice sites using imbalanced large samples has always been a challenge. Therefore, based on the short window size and imbalanced large samples, we developed a new computational method named chi-square decision table (χ²-DT) for donor splice site prediction. RESULTS: Using a short window size of 11 bp, χ²-DT extracts the improved positional features and compositional features based on the chi-square test, then introduces features one by one based on information gain, and constructs a balanced decision table aimed at implementing imbalanced pattern classification. With a 2000:271,132 (true sites:false sites) training set, χ²-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and takes a short computation time (89 s). χ²-DT also exhibits good independent test accuracy (92.40%) when validated with BG-570 mutated sequences with frameshift errors (nucleotide insertions and deletions). Moreover, χ²-DT is compared with the long-window size-based methods and the short-window size-based methods, and is found to perform better than all of them in terms of predictive accuracy.
CONCLUSIONS: Based on short window size and imbalanced large samples, the proposed method not only achieves higher predictive accuracy than some existing methods, but also has high computational speed and good robustness against nucleotide insertions and deletions. REVIEWERS: This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther.


Subjects
Computational Biology/methods, RNA Splice Sites, Chi-Square Distribution
14.
Front Genet ; 10: 1410, 2019.
Article in English | MEDLINE | ID: mdl-32082366

ABSTRACT

For precision medicine, there is a need to identify genes that accurately distinguish the physiological state or response to a particular therapy, but this can be challenging. Many methods of analyzing differential expression have been established and applied to this problem, such as t-test, edgeR, and DEseq2. A common feature of these methods is their focus on a linear relationship (differential expression) between gene expression and phenotype. However, they may overlook nonlinear relationships due to various factors, such as the degree of disease progression, sex, age, ethnicity, and environmental factors. Maximal information coefficient (MIC) was proposed to capture a wide range of associations of two variables in both linear and nonlinear relationships. However, with MIC it is difficult to highlight genes with nonlinear expression patterns as the genes giving the most strongly supported hits are linearly expressed, especially for noisy data. It is thus important to also efficiently identify nonlinearly expressed genes in order to unravel the molecular basis of disease and to reveal new therapeutic targets. We propose a novel nonlinearity measure called normalized differential correlation (NDC) to efficiently highlight nonlinearly expressed genes in transcriptome datasets. Validation using six real-world cancer datasets revealed that the NDC method could highlight nonlinearly expressed genes that could not be highlighted by t-test, MIC, edgeR, and DEseq2, although MIC could capture nonlinear correlations. The classification accuracy indicated that analysis of these genes could adequately distinguish cancer and paracarcinoma tissue samples. Furthermore, the results of biological interpretation of the identified genes suggested that some of them were involved in key functional pathways associated with cancer progression and metastasis. All of this evidence suggests that these nonlinearly expressed genes may play a central role in regulating cancer progression.

15.
PLoS One ; 13(5): e0198562, 2018.
Article in English | MEDLINE | ID: mdl-29847579

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pone.0189054.].

16.
PLoS One ; 12(12): e0189054, 2017.
Article in English | MEDLINE | ID: mdl-29240818

ABSTRACT

The use of heterosis has considerably increased the productivity of many crops; however, the biological mechanism underpinning the technique remains elusive. The North Carolina design III (NCIII) and the triple test cross (TTC) are powerful and popular genetic mating designs that can be used to decipher the genetic basis of heterosis. However, when using the NCIII design with the present quantitative trait locus (QTL) mapping method, if epistasis exists, the estimated additive or dominant effects are confounded with epistatic effects. Here, we propose a two-step approach to dissect all genetic effects of QTL and digenic interactions on a whole genome without sacrificing statistical power based on an augmented TTC (aTTC) design. Because the aTTC design has more transformation combinations than do the NCIII and TTC designs, it greatly enriches the QTL mapping for studying heterosis. When the basic population comprises recombinant inbred lines (RIL), we can use the same materials in the NCIII design for aTTC-design QTL mapping with transformation combinations Z1, Z2, and Z4 to obtain the genetic effects of QTL and digenic interactions. Compared with the RIL-based TTC design, the RIL-based aTTC design saves time, money, and labor for basic population crossed with F1. Several Monte Carlo simulation studies were carried out to confirm the proposed approach; the present genetic parameters could be identified with high statistical power, precision, and calculation speed, even at small sample size or low heritability. Additionally, two elite rice hybrid datasets covering nine agronomic traits were used for real data analysis. We dissected the genetic effects and calculated the dominance degree of each QTL and digenic interaction. Real mapping results suggested that the dominance degree in Z2, which mainly characterizes heterosis, showed overdominance and dominance for QTL and digenic interactions. Dominance and overdominance were the major genetic foundations of heterosis in rice.


Subjects
Genetic Epistasis, Quantitative Trait Loci, Hybrid Vigor, Genetic Models
17.
Sci Rep ; 7(1): 16437, 2017 11 27.
Article in English | MEDLINE | ID: mdl-29180805

ABSTRACT

Selecting informative genes, including individually discriminant genes and synergic genes, from expression data has been useful for medical diagnosis and prognosis. Detecting synergic genes is more difficult than selecting individually discriminant genes. Several efforts have recently been made to detect gene-gene synergies, such as dendrogram-based I(X1; X2; Y) (mutual information), doublets (gene pairs) and MIC(X1; X2; Y) based on the maximal information coefficient. It is unclear whether dendrogram-based I(X1; X2; Y) and doublets can capture synergies efficiently. Although MIC(X1; X2; Y) can capture a wide range of interactions, it has a high computational cost due to its 3-D search. In this paper, we developed a simple and fast approach based on the abs conversion type (i.e. Z = |X1 - X2|) and the t-test to detect interactions in simulated and real-world datasets. Our results showed that dendrogram-based I(X1; X2; Y) and doublets are ineffective at discovering pair-wise gene interactions, whereas our approach can discover typical pair-wise synergic genes efficiently. These synergic genes can reach accuracy comparable to that of the individually discriminant genes using the same number of genes. A classifier cannot learn well if synergic genes have not been converted properly. Combining individually discriminant and synergic genes can improve the prediction performance.


Subjects
Algorithms, Genes, Computer Simulation, Genetic Databases, Gene Expression Regulation, Humans, Neoplasms/genetics
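
The abs conversion described in entry 17, Z = |X1 - X2| followed by a t-test, can be illustrated on simulated data. The sketch below is an assumption-laden toy: it uses Welch's t statistic (the abstract does not say which t-test variant was used), invented function names, and a synthetic dataset in which neither gene separates the classes on its own but their absolute difference does.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )


def split_by_label(values, labels):
    """Return (class-1 values, class-0 values)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg


def synergy_t(x1, x2, labels):
    """t statistic of Z = |x1 - x2| between class 1 and class 0."""
    z = [abs(u - v) for u, v in zip(x1, x2)]
    return welch_t(*split_by_label(z, labels))


# Simulated data: in class 0 the two genes move together; in class 1
# they are pushed apart in a random direction by a fixed gap.
rng = random.Random(0)
x1, x2, labels = [], [], []
for i in range(200):
    y = i % 2
    base = rng.gauss(0.0, 1.0)
    gap = 2.0 if y == 1 else 0.0
    sign = rng.choice([-1.0, 1.0])
    x1.append(base + sign * gap / 2 + rng.gauss(0.0, 0.1))
    x2.append(base - sign * gap / 2 + rng.gauss(0.0, 0.1))
    labels.append(y)

t_single = welch_t(*split_by_label(x1, labels))
t_pair = synergy_t(x1, x2, labels)
print(f"gene 1 alone: t = {t_single:.2f}; |x1 - x2|: t = {t_pair:.2f}")
```

In this toy setting the converted variable Z gives a far larger t statistic than either gene alone, which is the kind of pair-wise synergy the abstract refers to.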
18.
Sci Rep ; 6: 30672, 2016 07 29.
Article in English | MEDLINE | ID: mdl-27470995

ABSTRACT

Informative gene selection can have important implications for the improvement of cancer diagnosis and the identification of new drug targets. Individual-gene-ranking methods ignore interactions between genes. Furthermore, popular pair-wise gene evaluation methods, e.g. TSP and TSG, are helpless for discovering pair-wise interactions. Several efforts to discover pair-wise synergy have been made based on the information approach, such as EMBP and FeatKNN. However, the methods which are employed to estimate mutual information, e.g. binarization, histogram-based and KNN estimators, depend on known data or domain characteristics. Recently, Reshef et al. proposed a novel maximal information coefficient (MIC) measure to capture a wide range of associations between two variables that has the property of generality. An extension from MIC(X; Y) to MIC(X1; X2; Y) is therefore desired. We developed an approximation algorithm for estimating MIC(X1; X2; Y) where Y is a discrete variable. MIC(X1; X2; Y) is employed to detect pair-wise synergy in simulation and cancer microarray data. The results indicate that MIC(X1; X2; Y) also has the property of generality. It can discover synergic genes that are undetectable by reference feature selection methods such as MIC(X; Y) and TSG. Synergic genes can distinguish different phenotypes. Finally, the biological relevance of these synergic genes is validated with GO annotation and OUgene database.


Subjects
Biostatistics/methods, Genetic Association Studies/methods, Microarray Analysis/methods, Neoplasms/pathology, Humans
19.
PLoS One ; 11(6): e0157567, 2016.
Article in English | MEDLINE | ID: mdl-27333001

ABSTRACT

The maximal information coefficient (MIC) captures dependences between paired variables, including both functional and non-functional relationships. In this paper, we develop a new method, ChiMIC, to calculate the MIC values. The ChiMIC algorithm uses the chi-square test to terminate grid optimization and thereby removes the maximal grid size limitation of the original ApproxMaxMI algorithm. Computational experiments show that the ChiMIC algorithm maintains the same MIC values for noiseless functional relationships, but gives much smaller MIC values for independent variables. For noisy functional relationships, the ChiMIC algorithm reaches the optimal partition much faster. Furthermore, the MCN values based on MIC calculated by ChiMIC capture the complexity of functional relationships better, and the statistical power of MIC calculated by ChiMIC is higher than that calculated by ApproxMaxMI. Moreover, the computational cost of ChiMIC is much lower than that of ApproxMaxMI. We apply the MIC values to feature selection and obtain better classification accuracy using features selected by the MIC values from ChiMIC.


Subjects
Algorithms, Information Theory, Theoretical Models, Databases as Topic, Humans, Neoplasms/classification, Statistics as Topic, Time Factors
20.
BMC Bioinformatics ; 17: 44, 2016 Jan 20.
Article in English | MEDLINE | ID: mdl-26792270

ABSTRACT

BACKGROUND: Selecting a parsimonious set of informative genes to build a classifier with high generalization performance is the most important task in the analysis of tumor microarray expression data. Many existing gene pair evaluation methods cannot highlight diverse patterns of gene pairs because they use only one of two strategies, vertical comparison or horizontal comparison, while individual-gene-ranking methods ignore redundancy and synergy among genes. RESULTS: Here we proposed a novel score measure named relative simplicity (RS). We evaluated gene pairs by integrating vertical comparison with horizontal comparison, and finally built an RS-based direct classifier (RS-based DC) based on a set of informative genes capable of binary discrimination with a paired votes strategy. Nine multi-class gene expression datasets involving human cancers were used to validate the performance of the new method. Compared with the nine reference models, RS-based DC achieved the highest average independent test accuracy (91.40%), the best generalization performance and the smallest average number of informative genes (20.56). Compared with the four reference feature selection methods, RS also achieved the highest average test accuracy with three classifiers (Naïve Bayes, k-Nearest Neighbor and Support Vector Machine), and only RS improved the performance of SVM. CONCLUSIONS: Diverse patterns of gene pairs can be highlighted more fully by integrating the vertical comparison strategy with the horizontal comparison strategy. The DC core classifier can effectively control over-fitting. The RS-based feature selection method combined with the DC classifier can lead to more robust selection of informative genes and classification accuracy.


Subjects
Neoplasm Genes, Neoplasms/classification, Neoplasms/genetics, Bayes Theorem, Genetic Databases, Humans, Molecular Models, Neoplasms/diagnosis, Support Vector Machine