Results 1 - 18 of 18
1.
J Comput Chem ; 45(19): 1643-1656, 2024 Jul 15.
Article in English | MEDLINE | ID: mdl-38551129

ABSTRACT

Ni-CeO2 nanoparticles (NPs) are promising nanocatalysts for water splitting and water gas shift reactions due to the ability of ceria to temporarily donate oxygen to the catalytic reaction and accept oxygen after the reaction is completed. Therefore, elucidating how different properties of the Ni-Ceria NPs relate to the activity and selectivity of the catalytic reaction is of crucial importance for the development of novel catalysts. In this work the active learning (AL) method based on machine learning regression and its uncertainty is used for the global optimization of Ce(4-x)NixO(8-x) (x = 1, 2, 3) nanoparticles, employing density functional theory calculations. Additionally, further investigation of the NPs by mass-scaled parallel-tempering Born-Oppenheimer molecular dynamics resulted in the same putative global minimum structures found by AL, demonstrating the robustness of our AL search to learn from small datasets and assist in the global optimization of complex electronic structure systems.

2.
J Comput Chem ; 45(15): 1193-1214, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38329198

ABSTRACT

This paper (i) explores the internal structure of two quantum mechanics datasets (QM7b, QM9), composed of several thousands of organic molecules and described in terms of electronic properties, and (ii) further explores an inverse design approach to molecular design consisting of using machine learning methods to approximate the atomic composition of molecules, using QM9 data. Understanding the structure and characteristics of this kind of data is important when predicting the atomic composition from physical-chemical properties in inverse molecular designs. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. They revealed that for both datasets the intrinsic dimensionality is several times smaller than the descriptive dimensions. The QM7b data is composed of well-defined clusters related to atomic composition. The QM9 data consists of an outer region predominantly composed of outliers, and an inner, core region that concentrates clustered inlier objects. A significant relationship exists between the number of atoms in the molecule and its outlier/inlier nature. The spatial structure exhibits a relationship with molecular weight. Despite the structural differences between the two datasets, the predictability of variables of interest for inverse molecular design is high. This is exemplified by models estimating the number of atoms of the molecule from both the original properties and from lower dimensional embedding spaces. In the generative approach the input is given by a set of desired properties of the molecule and the output is an approximation of the atomic composition in terms of its constituent chemical elements. This could serve as the starting region for further search in the huge space determined by the set of possible chemical compounds. The quantum mechanics dataset QM9 is used in the study, composed of 133,885 small organic molecules and 19 electronic properties.
Different multi-target regression approaches were considered for predicting the atomic composition from the properties, including feature engineering techniques in an auto-machine learning framework. High-quality models were found that predict the atomic composition of the molecules from their electronic properties, as well as from a subset of only 52.6% size. Feature selection worked better than feature generation. The results validate the generative approach to inverse molecular design.

3.
J Comput Chem ; 45(15): 1289-1302, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38357973

ABSTRACT

Reinforcement learning (RL) methods have helped to define the state of the art in the field of modern artificial intelligence, mostly after the breakthrough involving AlphaGo and the discovery of novel algorithms. In this work, we present an RL method, based on Q-learning, for the structural determination of adsorbate@substrate models in silico, where the minimization of the energy landscape resulting from adsorbate interactions with a substrate is made by actions on states (translations and rotations) chosen from an agent's policy. The proposed RL method is implemented in an early version of the reinforcement learning software for materials design and discovery (RLMaterial), developed in Python3.x. RLMaterial interfaces with deMon2k, DFTB+, ORCA, and Quantum Espresso codes to compute the adsorbate@substrate energies. The RL method was applied for the structural determination of (i) the amino acid glycine and (ii) 2-amino-acetaldehyde, both interacting with a boron nitride (BN) monolayer, (iii) host-guest interactions between phenylboronic acid and β-cyclodextrin and (iv) ammonia on naphthalene. Density functional tight binding calculations were used to build the complex search surfaces with a reasonably low computational cost for systems (i)-(iii) and DFT for system (iv). Artificial neural network and gradient boosting regression techniques were employed to approximate the Q-matrix or Q-table for better decision making (policy) on next actions. Finally, we have developed a transfer-learning protocol within the RL framework that allows learning from one chemical system and transferring the experience to another, as well as from different DFT or DFTB levels.
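
As an illustration of the Q-learning idea described in this abstract, the following is a minimal tabular sketch and not the RLMaterial implementation. It assumes a hypothetical one-dimensional grid of adsorbate positions with precomputed energies, three actions (stay, step left, step right) in place of the translations and rotations mentioned above, and a reward equal to the energy decrease of each move. All names (`q_learning`, `greedy_walk`, `ACTIONS`) and parameter values are assumptions made for this example.

```python
import random

# Hypothetical action set: stay, step left, step right on a 1D grid of positions.
ACTIONS = (0, -1, +1)


def greedy_action(q_row):
    """Index of the highest-valued action (first one wins on ties)."""
    return max(range(len(ACTIONS)), key=lambda a: q_row[a])


def step(state, action, n_states):
    """Apply an action and clamp the position to the grid."""
    return min(max(state + ACTIONS[action], 0), n_states - 1)


def q_learning(energies, episodes=2000, steps=20, alpha=0.5, gamma=0.9,
               epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy energy landscape (assumed reward: energy drop)."""
    rng = random.Random(seed)
    n = len(energies)
    q = [[0.0] * len(ACTIONS) for _ in range(n)]
    for _ in range(episodes):
        s = rng.randrange(n)  # random starting position
        for _ in range(steps):
            # Epsilon-greedy policy: explore sometimes, otherwise exploit.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy_action(q[s])
            s_next = step(s, a, n)
            reward = energies[s] - energies[s_next]
            # Standard Q-learning update.
            q[s][a] += alpha * (reward + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q


def greedy_walk(q, start, steps=20):
    """Follow the learned greedy policy from a starting position."""
    s = start
    for _ in range(steps):
        s = step(s, greedy_action(q[s]), len(q))
    return s


toy_energies = [3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]  # made-up landscape, minimum at index 3
q_table = q_learning(toy_energies)
final_position = greedy_walk(q_table, start=0)
```

In the work summarized above the energies come from electronic structure codes and the Q-table is approximated by regression models; here the table is stored explicitly because the toy state space is tiny.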

4.
J Chem Phys ; 159(18), 2023 Nov 14.
Article in English | MEDLINE | ID: mdl-37947508

ABSTRACT

Since the form of the exact functional in density functional theory is unknown, we must rely on density functional approximations (DFAs). In the past, very promising results have been reported by combining semi-local DFAs with exact, i.e. Hartree-Fock, exchange. However, the spin-state energy ordering and the predictions of global minima structures are particularly sensitive to the choice of the hybrid functional and to the amount of exact exchange. This has been already qualitatively described for single conformations, reactions, and a limited number of conformations. Here, we have analyzed the mixing of exact exchange in exchange functionals for a set of several hundred isomers of the transition metal carbide, Mo4C2. The analysis of the calculated energies and charges using PBE0-type functional with varying amounts of exact exchange yields the following insights: (1) The sensitivity of spin-energy splitting is strongly correlated with the amount of exact exchange mixing. (2) Spin contamination is exacerbated when correlation is omitted from the exchange-correlation functional. (3) There is not one ideal value for the exact exchange mixing which can be used to parametrize or choose among the functionals. Calculated energies and electronic structures are influenced by exact exchange at a different magnitude within a given distribution; therefore, to extend the application range of hybrid functionals to the full periodic table the spin-energy splitting energies should be investigated.

5.
Biosystems ; 232: 104989, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37544406

ABSTRACT

Drug design and optimization are challenging tasks that call for strategic and efficient exploration of the extremely vast search space. Multiple fragmentation strategies have been proposed in the literature to mitigate the complexity of the molecular search space. From an optimization standpoint, drug design can be considered as a multi-objective optimization problem. Deep reinforcement learning (DRL) frameworks have demonstrated encouraging results in the field of drug design. However, the scalability of these frameworks is impeded by substantial training intervals and inefficient use of sample data. In this paper, we (1) examine the core principles of deep or multi-objective RL methods and their applications in molecular design, (2) analyze the performance of a recent multi-objective DRL-based and fragment-based drug design framework, named DeepFMPO, in a real-world application by incorporating optimization of protein-ligand docking affinity with varying numbers of other objectives, and (3) compare this method with a single-objective variant. Through trials, our results indicate that the DeepFMPO framework (with docking score) can achieve success; however, it suffers from training instability. Our findings encourage additional exploration and improvement of the framework. Potential sources of the framework's instability and suggestions of further modifications to stabilize the framework are discussed.


Subject(s)
Drug Design , Reinforcement, Psychology
6.
J Chem Theory Comput ; 19(17): 5999-6010, 2023 Sep 12.
Article in English | MEDLINE | ID: mdl-37581570

ABSTRACT

Structural elucidation of chemical compounds is challenging experimentally, and theoretical chemistry methods have added important insight into molecules, nanoparticles, alloys, and materials geometries and properties. However, finding the optimum structures is a bottleneck due to the huge search space, and global search algorithms have been used successfully for this purpose. In this work, we present the quantum machine learning software/agent for materials design and discovery (QMLMaterial), intended for automatic structural determination in silico for several chemical systems: atomic clusters, atomic clusters and the spin multiplicity together, doping in clusters or solids, vacancies in clusters or solids, adsorption of molecules or adsorbents on surfaces, and finally atomic clusters on solid surfaces/materials or encapsulated in porous materials. QMLMaterial is an artificial intelligence (AI) software based on the active learning method, which uses machine learning regression algorithms and their uncertainties for decision making on the next unexplored structures to be computed, increasing the probability of finding the global minimum with few calculations as more data is obtained. The software has different acquisition functions for decision making (e.g., expected improvement and lower confidence bound). Also, the Gaussian process is available in the AI framework for regression, where the uncertainty is obtained analytically from Bayesian statistics. For the artificial neural network and support vector regressor algorithms, the uncertainty can be obtained by K-fold cross-validation or nonparametric bootstrap resampling methods. The software is interfaced with several quantum chemistry codes and atomic descriptors, such as the many-body tensor representation. 
QMLMaterial's capabilities are highlighted in the current work by its applications in the following systems: Na20, Mo6C3 (where the spin multiplicity was considered), H2O@CeNi3O5, Mg8@graphene, Na3Mg3@CNT (carbon nanotube).

7.
J Comput Chem ; 44(7): 814-823, 2023 Mar 15.
Article in English | MEDLINE | ID: mdl-36444916

ABSTRACT

Genetic algorithms (GAs) are stochastic global search methods inspired by biological evolution. They have been used extensively in chemistry and materials science coupled with theoretical methods, ranging from force-fields to high-throughput first-principles methods. The methodology allows an accurate and automated structural determination for molecules, atomic clusters, nanoparticles, and solid surfaces, fundamental to understanding chemical processes in catalysis and environmental sciences, for instance. In this work, we propose a new genetic algorithm software, GAMaterial, implemented in Python3.x, that performs global searches to elucidate the structures of atomic clusters, doped clusters or materials and atomic clusters on surfaces. For all these applications, it is possible to accelerate the GA search by using machine learning (ML), the ML@GA method, to build subsequent populations. Results for ML@GA applied for the dopant distributions in atomic clusters are presented. The GAMaterial software was applied for the automatic structural search for the Ti6O12 cluster, doping Al in Si11 (4Al@Si11) and Na10 supported on graphene (Na10@graphene), where DFTB calculations were used to sample the complex search surfaces with reasonably low computational cost. Finally, the global search by GA of the Mo8C4 cluster was considered, where DFT calculations were made with the deMon2k code, which is interfaced with GAMaterial.
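
The general shape of a genetic algorithm for a dopant-distribution search can be sketched as follows. This is a toy example and not GAMaterial: it assumes a hypothetical ring of sites, a made-up energy in which dopants repel each other as 1/distance (standing in for a DFTB or DFT calculation), and simple crossover, mutation, and elitism operators chosen for brevity. Every function name and parameter here is an assumption for illustration.

```python
import random
from itertools import combinations


def toy_energy(sites, n_sites):
    """Hypothetical energy: dopants on a ring repel each other as 1/distance."""
    total = 0.0
    for i, j in combinations(sites, 2):
        d = abs(i - j)
        d = min(d, n_sites - d)  # shortest distance around the ring
        total += 1.0 / d
    return total


def crossover(parent_a, parent_b, k, rng):
    """Child takes k dopant sites drawn from the union of both parents' sites."""
    pool = sorted(set(parent_a) | set(parent_b))
    return tuple(sorted(rng.sample(pool, k)))


def mutate(individual, n_sites, rng):
    """Move one randomly chosen dopant to a currently unoccupied site."""
    occupied = set(individual)
    free = [s for s in range(n_sites) if s not in occupied]
    out = list(individual)
    out[rng.randrange(len(out))] = rng.choice(free)
    return tuple(sorted(out))


def genetic_search(n_sites=12, k=4, pop_size=20, generations=40,
                   mutation_rate=0.3, seed=0):
    """Return the best dopant arrangement found and the best energy per generation."""
    rng = random.Random(seed)

    def fitness(ind):
        return toy_energy(ind, n_sites)

    pop = [tuple(sorted(rng.sample(range(n_sites), k))) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]   # truncation selection
        children = [pop[0]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, k, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, n_sites, rng)
            children.append(child)
        pop = children
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return best, history


best_sites, best_history = genetic_search()
```

Because the best individual is always carried over, the recorded best energy never increases from one generation to the next. An ML-accelerated variant such as the ML@GA method mentioned above would additionally use a regression model when building new populations, which this sketch does not attempt.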

8.
Biosystems ; 222: 104790, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36228831

ABSTRACT

The design of a new therapeutic agent is a time-consuming and expensive process. The rise of machine intelligence provides a grand opportunity of expeditiously discovering novel drug candidates through smart search in the vast molecular structural space. In this paper, we propose a new approach called adversarial deep evolutionary learning (ADEL) to search for novel molecules in the latent space of an adversarial generative model and keep improving the latent representation space. In ADEL, a custom-made adversarial autoencoder (AAE) model is developed and trained under a deep evolutionary learning (DEL) process. This involves an initial training of the AAE model, followed by an integration of multi-objective evolutionary optimization in the continuous latent representation space of the AAE rather than the discrete structural space of molecules. By using the AAE, an arbitrary distribution can be provided to the training of AAE such that the latent representation space is set to that distribution. This allows for a starting latent space from which new samples can be produced. Throughout the process of learning, new samples of high quality are generated after each iteration of training and then added back into the full dataset, thereby allowing for a more comprehensive procedure of understanding the data structure. This combination of evolving data and continuous learning not only enables improvement in the generative model, but the data as well. By comparing ADEL to the previous work in DEL, we see that ADEL can obtain better property distributions. We show that ADEL is able to design high-quality molecular structures which can be further used for virtual and experimental screenings.


Subject(s)
Machine Learning , Neural Networks, Computer , Drug Design , Artificial Intelligence , Learning
9.
Phys Chem Chem Phys ; 24(41): 25227-25239, 2022 Oct 27.
Article in English | MEDLINE | ID: mdl-36222106

ABSTRACT

Finding the optimum structures of non-stoichiometric or berthollide materials, such as (1D, 2D, 3D) materials or nanoparticles (0D), is challenging due to the huge chemical/structural search space. Computational methods coupled with global optimization algorithms have been used successfully for this purpose. In this work, we have developed an artificial intelligence method based on active learning (AL) or Bayesian optimization for the automatic structural elucidation of vacancies in solids and nanoparticles. AL uses machine learning regression algorithms and their uncertainties to take decisions (from a policy) on the next unexplored structures to be computed, increasing the probability of finding the global minimum with few calculations. The methodology allows an accurate and automated structural elucidation for vacancies, which are common in non-stoichiometric (berthollide) materials, helping to understand chemical processes in catalysis and environmental sciences, for instance. The AL vacancies method was implemented in the quantum machine learning software/agent for material design and discovery (QMLMaterial). Also, two additional acquisition functions for decision making were implemented, besides the expected improvement (EI): the lower confidence bound (LCB) and the probability of improvement (PI). The new software was applied for the automatic structural search for graphite (C36) with 3 (C36-3) and 4 (C36-4) carbon vacancies and C60 (C60-4) fullerene with 4 carbon vacancies. DFTB calculations were used to build the complex search surfaces with reasonably low computational cost. Furthermore, with the AL method for vacancies, it was possible to elucidate the optimum oxygen vacancy distribution in CaTiO3 perovskite by DFT, where a semiconductor behavior results from oxygen vacancies. 
Throughout the work, a Gaussian process with its uncertainty was employed in the AL framework using different acquisition functions (EI, LCB and PI), and taking into account different descriptors: Ewald sum matrix and sine matrix. Finally, the performance of the proposed AL method was compared to random search and genetic algorithm.
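
The three acquisition functions named in this abstract have standard closed forms when the surrogate model returns a Gaussian prediction. The sketch below writes them for energy minimization, assuming a predicted mean `mu`, a predicted standard deviation `sigma`, and the lowest energy observed so far `f_best`. The function names and the default `kappa` are assumptions for this example and are not taken from QMLMaterial.

```python
import math


def _normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, f_best):
    """EI for minimization: expected amount by which a candidate beats f_best."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    return (f_best - mu) * _normal_cdf(z) + sigma * _normal_pdf(z)


def probability_of_improvement(mu, sigma, f_best):
    """PI for minimization: probability that a candidate is below f_best."""
    if sigma <= 0.0:
        return 1.0 if mu < f_best else 0.0
    return _normal_cdf((f_best - mu) / sigma)


def lower_confidence_bound(mu, sigma, kappa=2.0):
    """LCB: optimistic energy estimate; the candidate with the smallest value is chosen."""
    return mu - kappa * sigma


# Made-up (mean, std) predictions for three unexplored structures.
candidates = {"A": (-1.0, 0.1), "B": (-0.8, 0.6), "C": (-0.5, 0.05)}
f_best = -1.0
next_by_ei = max(candidates, key=lambda c: expected_improvement(*candidates[c], f_best))
next_by_lcb = min(candidates, key=lambda c: lower_confidence_bound(*candidates[c]))
```

EI and PI are maximized while LCB is minimized, so a candidate with a slightly worse mean but a large uncertainty (such as "B" in the made-up data) can be preferred over one whose mean merely ties the current best.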

10.
Front Pharmacol ; 13: 920747, 2022.
Article in English | MEDLINE | ID: mdl-35860028

ABSTRACT

Drug discovery is a challenging process with a huge molecular space to be explored and numerous pharmacological properties to be appropriately considered. Among various drug design protocols, fragment-based drug design is an effective way of constraining the search space and better utilizing biologically active compounds. Motivated by fragment-based drug search for a given protein target and the emergence of artificial intelligence (AI) approaches in this field, this work advances the field of in silico drug design by (1) integrating a graph fragmentation-based deep generative model with a deep evolutionary learning process for large-scale multi-objective molecular optimization, and (2) applying protein-ligand binding affinity scores together with other desired physicochemical properties as objectives. Our experiments show that the proposed method can generate novel molecules with improved property values and binding affinities.

11.
J Mol Model ; 28(6): 178, 2022 Jun 03.
Article in English | MEDLINE | ID: mdl-35654918

ABSTRACT

Adsorbate interactions with substrates (e.g. surfaces and nanoparticles) are fundamental for several technologies, such as functional materials, supramolecular chemistry, and solvent interactions. However, modeling these kinds of systems in silico, such as finding the optimum adsorption geometry and energy, is challenging, due to the huge number of possibilities of assembling the adsorbate on the surface. In the current work, we have developed an artificial intelligence (AI) approach based on an active learning (AL) method for adsorption optimization on the surface of materials. AL uses machine learning (ML) regression algorithms and their uncertainties to make a decision (based on a policy) for the next unexplored structures to be computed, thereby increasing the probability of finding the global minimum with a small number of calculations. The methodology allows an accurate and automated structural elucidation of the adsorbate on the surface, based on the minimization of the total electronic energy. The new AL method for adsorption optimization was developed and implemented in the quantum machine learning software/agent for material design and discovery (QMLMaterial) program and was applied for C60@TiO2 anatase (101). It marks another software extension with a new feature in addition to the automatic structural elucidation of defects in materials and of nanoparticles as well. SCC-DFTB calculations were used to build the complex search surfaces with a reasonably low computational cost. An artificial neural network (NN) was employed in the AL framework evaluated together with two uncertainty quantification methods: K-fold cross-validation and non-parametric bootstrap (BS) resampling. Also, two different acquisition functions for decision-making were used: expected improvement (EI) and the lower confidence bound (LCB).
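
Non-parametric bootstrap resampling, one of the two uncertainty quantification methods named in this abstract, can be sketched in a few lines. The example below is an assumption-laden illustration: a one-dimensional least-squares line stands in for the neural network, the data are made up, and the names `fit_line` and `bootstrap_predict` are hypothetical. The idea is to refit the model on resampled copies of the training set and use the spread of the predictions as the uncertainty.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line; a simple stand-in for the regression model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate resample (all x identical): fall back to a constant model.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_predict(xs, ys, x_new, n_models=200, seed=0):
    """Mean and standard deviation of predictions over bootstrap-refitted models."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        # Resample the training set with replacement, then refit.
        idx = [rng.randrange(n) for _ in range(n)]
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(slope * x_new + intercept)
    return statistics.mean(predictions), statistics.stdev(predictions)


# Made-up training data, roughly y = 2x + 1 with noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_y = [1.2, 2.8, 5.1, 7.3, 8.7, 11.2, 12.9, 15.1]
mean_inside, std_inside = bootstrap_predict(train_x, train_y, x_new=3.5)
mean_outside, std_outside = bootstrap_predict(train_x, train_y, x_new=20.0)
```

The resulting mean and standard deviation can be passed to an acquisition function such as EI or LCB. The spread is expected to be larger for `x_new=20.0`, far outside the training range, than for `x_new=3.5`.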


Subject(s)
Artificial Intelligence , Machine Learning , Adsorption , Neural Networks, Computer , Software
12.
Annu Int Conf IEEE Eng Med Biol Soc ; 2018: 303-306, 2018 Jul.
Article in English | MEDLINE | ID: mdl-30440398

ABSTRACT

Targeted therapy is a treatment that targets the cancer's specific genes, proteins, or the tissue environment that contributes to cancer growth and survival. Identification of therapeutics targets is a very challenging problem in bioinformatics. An integrative and iterative approach for the identification of drug-gene modules (i.e., groups of genes and drugs such that genes in the same module may regulate each other and are targets of some of the drugs in the same module) is developed. Application to clear cell carcinoma of the ovary data reveals several drug-gene modules and a target network that may play important roles in treating this disease.


Subject(s)
Computational Biology , Gene Regulatory Networks , Female , Humans , Ovary
13.
BMC Bioinformatics ; 18(1): 174, 2017 Mar 16.
Article in English | MEDLINE | ID: mdl-28302069

ABSTRACT

BACKGROUND: Phenotypic studies in Triticeae have shown that low temperature-induced protective mechanisms are developmentally regulated and involve dynamic acclimation processes. Understanding these mechanisms is important for breeding cold-resistant wheat cultivars. In this study, we combined three computational techniques for the analysis of gene expression data from spring and winter wheat cultivars subjected to low temperature treatments. Our main objective was to construct a comprehensive network of cold response transcriptional events in wheat, and to identify novel cold tolerance candidate genes in wheat. RESULTS: We assigned novel cold stress-related roles to 35 wheat genes, uncovered novel transcription factor (TF)-gene interactions, and identified 127 genes representing known and novel candidate targets associated with cold tolerance in wheat. Our results also show that delays in terms of activation or repression of the same genes across wheat cultivars play key roles in phenotypic differences among winter and spring wheat cultivars, and adaptation to low temperature stress, cold shock and cold acclimation. CONCLUSIONS: Using three computational approaches, we identified novel putative cold-response genes and TF-gene interactions. These results provide new insights into the complex mechanisms regulating the expression of cold-responsive genes in wheat.


Subject(s)
Adaptation, Physiological , Computational Biology/methods , Triticum/genetics , Cold Temperature , Gene Expression Regulation, Plant , Gene Regulatory Networks , Linear Models , Plant Proteins/genetics , Plant Proteins/metabolism , Seasons , Stress, Physiological , Triticum/metabolism
14.
BMC Genomics ; 16: 299, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25887590

ABSTRACT

BACKGROUND: While the gargantuan multi-nation effort of sequencing T. aestivum gets close to completion, the annotation process for the vast number of wheat genes and proteins is in its infancy. Previous experimental studies carried out on model plant organisms such as A. thaliana and O. sativa provide a plethora of gene annotations that can be used as potential starting points for wheat gene annotations, provided that solid cross-species gene-to-gene and protein-to-protein correspondences are provided. RESULTS: DNA and protein sequences and corresponding annotations for T. aestivum and 9 other plant species were collected from Ensembl Plants release 22 and curated. Cliques of predicted 1-to-1 orthologs were identified and an annotation enrichment model was defined based on existing gene-GO term associations and phylogenetic relationships among wheat and 9 other plant species. A total of 13 cliques of size 10 were identified, which represent putative functionally equivalent genes and proteins in the 10 plant species. Eighty-five new and more specific GO terms were associated with wheat genes in the 13 cliques of size 10, which represent a 65% increase compared with the 130 previously known GO terms. Similar expression patterns for 4 genes from Arabidopsis, barley, maize and rice in cliques of size 10 provide experimental evidence to support our model. Overall, based on clique size equal to or larger than 3, our model enriched the existing gene-GO term associations for 7,838 (8%) wheat genes, of which 2,139 had no previous annotation. CONCLUSIONS: Our novel comparative genomics approach enriches existing T. aestivum gene annotations based on cliques of predicted 1-to-1 orthologs, phylogenetic relationships and existing gene ontologies from 9 other plant species.


Subject(s)
Genes, Plant , Triticum/genetics , Arabidopsis/genetics , Biological Evolution , DNA, Plant/chemistry , DNA, Plant/metabolism , Hordeum/genetics , Models, Genetic , Molecular Sequence Annotation , Oryza/genetics , Phylogeny , Plant Proteins/chemistry , Plant Proteins/metabolism , Triticum/classification , Zea mays/genetics
15.
BMC Bioinformatics ; 13: 54, 2012 Apr 04.
Article in English | MEDLINE | ID: mdl-22475802

ABSTRACT

BACKGROUND: Nowadays, it is possible to collect expression levels of a set of genes from a set of biological samples during a series of time points. Such data have three dimensions: gene-sample-time (GST). Thus they are called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space. RESULTS: We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster), for 3D short time-series data mining. OPTricluster is able to identify 3D clusters with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profile. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (Plasmodium chabaudi), systemic acquired resistance in Arabidopsis thaliana, similarities and differences between inner and outer cotyledon in Brassica napus during seed development, and to Brassica napus whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples. CONCLUSIONS: Our analysis showed that OPTricluster generally outperforms other well known clustering algorithms such as the TRICLUSTER, gTRICLUSTER and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in the 3D short time-series gene expression data.


Subject(s)
Algorithms , Data Mining , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis , Animals , Arabidopsis , Brassica napus/genetics , Brassica napus/growth & development , Cluster Analysis , Cotyledon/metabolism , Malaria/immunology , Mice
16.
BMC Bioinformatics ; 11: 229, 2010 May 06.
Article in English | MEDLINE | ID: mdl-20459620

ABSTRACT

BACKGROUND: Modern high throughput experimental techniques such as DNA microarrays often result in large lists of genes. Computational biology tools such as clustering are then used to group together genes based on their similarity in expression profiles. Genes in each group are probably functionally related. The functional relevance among the genes in each group is usually characterized by utilizing available biological knowledge in public databases such as Gene Ontology (GO), KEGG pathways, association between a transcription factor (TF) and its target genes, and/or gene networks. RESULTS: We developed GOAL: Gene Ontology AnaLyzer, a software tool specifically designed for the functional evaluation of gene groups. GOAL implements and supports efficient and statistically rigorous functional interpretations of gene groups through its integration with available GO, TF-gene association data, and association with KEGG pathways. In order to facilitate more specific functional characterization of a gene group, we implement three GO-tree search strategies rather than one as in most existing GO analysis tools. Furthermore, GOAL offers flexibility in deployment. It can be used as a standalone tool, a plug-in to other computational biology tools, or a web server application. CONCLUSION: We developed a functional evaluation software tool, GOAL, to perform functional characterization of a gene group. GOAL offers three GO-tree search strategies and combines its strength in function integration, portability and visualization, and its flexibility in deployment. Furthermore, GOAL can be used to evaluate and compare gene groups as the output from computational biology tools such as clustering algorithms.


Subject(s)
Genes , Genomics/methods , Software , Databases, Genetic , Gene Expression Profiling/methods , Gene Regulatory Networks , Oligonucleotide Array Sequence Analysis
17.
BMC Bioinformatics ; 10: 255, 2009 Aug 20.
Article in English | MEDLINE | ID: mdl-19695084

ABSTRACT

BACKGROUND: Time series gene expression data analysis is used widely to study the dynamics of various cell processes. Most of the time series data available today consist of few time points only, thus making the application of standard clustering techniques difficult. RESULTS: We developed two new algorithms that are capable of extracting biological patterns from short time point series gene expression data. The two algorithms, ASTRO and MiMeSR, are inspired by the rank order preserving framework and the minimum mean squared residue approach, respectively. However, ASTRO and MiMeSR differ from previous approaches in that they take advantage of the relatively small number of time points in order to reduce the problem from NP-hard to linear. Tested on well-defined short time expression data, we found that our approaches are robust to noise, as well as to random patterns, and that they can correctly detect the temporal expression profile of relevant functional categories. Evaluation of our methods was performed using Gene Ontology (GO) annotations and chromatin immunoprecipitation (ChIP-chip) data. CONCLUSION: Our approaches generally outperform both standard clustering algorithms and algorithms designed specifically for clustering of short time series gene expression data. Both algorithms are available at http://www.benoslab.pitt.edu/astro/.
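
The rank order preserving idea mentioned in this abstract can be illustrated with a short sketch. This is not the ASTRO algorithm itself, only a hedged example of the underlying concept: each gene is mapped to the ordering of its time points by expression level, and genes sharing the same ordering are grouped in a single pass. The gene names and values are made up, and the functions `rank_pattern` and `group_by_rank_pattern` are hypothetical.

```python
from collections import defaultdict


def rank_pattern(profile):
    """Time-point indices sorted by increasing expression (ties keep time order)."""
    return tuple(sorted(range(len(profile)), key=lambda t: profile[t]))


def group_by_rank_pattern(expression):
    """Group genes whose time points follow the same expression ordering."""
    groups = defaultdict(list)
    for gene, profile in expression.items():
        groups[rank_pattern(profile)].append(gene)
    return dict(groups)


# Made-up expression values for three genes over three time points.
expression = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [0.5, 4.0, 9.0],
    "g3": [3.0, 1.0, 2.0],
}
clusters = group_by_rank_pattern(expression)
# clusters == {(0, 1, 2): ["g1", "g2"], (1, 2, 0): ["g3"]}
```

With a fixed, small number of time points, sorting each profile costs a constant amount of work, so the grouping is linear in the number of genes. Genes `g1` and `g2` fall in the same group despite very different magnitudes because only the ordering of their time points matters.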


Subject(s)
Algorithms , Computational Biology/methods , Gene Expression , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Databases, Genetic , Gene Expression Profiling
18.
Mol Cancer Ther ; 7(1): 27-37, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18187805

ABSTRACT

One reason that ovarian cancer is such a deadly disease is because it is not usually diagnosed until it has reached an advanced stage. In this study, we developed a novel algorithm for group biomarkers identification using gene expression data. Group biomarkers consist of coregulated genes across normal and different stage diseased tissues. Unlike prior sets of biomarkers identified by statistical methods, genes in group biomarkers are potentially involved in pathways related to different types of cancer development. They may serve as an alternative to the traditional single biomarkers or combination of biomarkers used for the diagnosis of early-stage and/or recurrent ovarian cancer. We extracted group biomarkers by applying biclustering algorithms that we recently developed on the gene expression data of over 400 normal, cancerous, and diseased tissues. We identified several groups of coregulated genes that encode for secreted proteins and exhibit expression levels in ovarian cancer that are at least 2-fold (in log2 scale) higher than in normal ovary and nonovarian tissues. In particular, three candidate group biomarkers exhibited a conserved biological pattern that may be used for early detection or recurrence of ovarian cancer with specificity greater than 99% and sensitivity equal to 100%. We validated these group biomarkers using publicly available gene expression data sets downloaded from a NIH Web site (http://www.ncbi.nlm.nih.gov/geo). Statistical analysis showed that our methodology identified an optimum combination of genes that have the highest effect on the diagnosis of the disease compared with several computational techniques that we tested. Our study also suggests that single or group biomarkers correlate with the stage of the disease.


Subject(s)
Biomarkers, Tumor/genetics , Ovarian Neoplasms/diagnosis , Ovarian Neoplasms/genetics , Adolescent , Adult , Aged , Aged, 80 and over , Algorithms , Child , Early Diagnosis , Female , Humans , Middle Aged , ROC Curve , Reproducibility of Results