ABSTRACT
Peptides are promising frameworks for drug development, but their use has been hindered by intrinsic undesired properties, including hemolytic activity. We aimed to gain better insight into the chemical space of hemolytic peptides using a novel approach based on network science and data mining. Metadata networks (METNs) were used to characterize hemolytic peptides and to find general patterns associated with them, whereas Half-Space Proximal Networks (HSPNs) were used to represent the hemolytic peptide space. The best candidate HSPNs were used to extract various subsets of hemolytic peptides (scaffolds) on the basis of network centrality and peptide similarity. These scaffolds proved useful for developing robust similarity-based classifiers. Finally, using an alignment-free approach, we report 47 putative hemolytic motifs, which can be used as toxic signatures when developing novel peptide-based drugs. We provide evidence that the number of hemolytic motifs in a sequence might be related to its likelihood of being hemolytic.
Subjects
Data Mining, Hemolysis, Peptides, Data Mining/methods, Hemolysis/drug effects, Humans, Computational Biology/methods
ABSTRACT
Molecular features play an important role in different bio-chem-informatics tasks, such as Quantitative Structure-Activity Relationship (QSAR) modeling. Several pre-trained models have recently been created for use in downstream tasks, either by fine-tuning a specific model or by extracting features to feed traditional classifiers. In this regard, a new family of Evolutionary Scale Modeling models (termed ESM-2 models) was recently introduced, demonstrating outstanding results in protein structure prediction benchmarks. Herein, we studied the usefulness of the embeddings of different dimensionality derived from the ESM-2 models for classifying antimicrobial peptides (AMPs). To this end, we built a KNIME workflow that applies the same modeling methodology across experiments in order to guarantee fair analyses. As a result, the 640- and 1280-dimensional embeddings derived from the 30- and 33-layer ESM-2 models, respectively, were the most valuable, since the QSAR models built from them achieved statistically better performance. We also fused features of the different ESM-2 models and concluded that the fusion yields better QSAR models than features from a single ESM-2 model. Frequency studies revealed that only a portion of the ESM-2 embeddings is valuable for modeling tasks, since between 43% and 66% of the features were never used. Comparisons with state-of-the-art deep learning (DL) models confirm that, when methodologically principled studies are performed in the prediction of AMPs, non-DL-based QSAR models yield comparable-to-superior performance relative to DL-based QSAR models. The developed KNIME workflow is freely available at https://github.com/cicese-biocom/classification-QSAR-bioKom. This workflow can be valuable for avoiding unfair comparisons of new computational methods, as well as for proposing new non-DL-based QSAR models.
Subjects
Antimicrobial Peptides, Workflow
ABSTRACT
MOTIVATION: Antimicrobial peptides (AMPs) are promising molecules for treating infectious diseases caused by multidrug-resistant pathogens, some types of cancer, and other conditions. Computer-aided strategies are efficient tools for the high-throughput screening of AMPs. RESULTS: This report highlights StarPep Toolbox, an open-source and user-friendly software package for studying the bioactive chemical space of AMPs using complex network-based representations, clustering, and similarity-searching models. The novelty of this research lies in the combination of network science and similarity-searching techniques, which distinguishes it from conventional methods based on machine learning and other computational approaches. The network-based representation of the AMP chemical space presents promising opportunities for peptide drug repurposing, development, and optimization. This approach could serve as a baseline for the discovery of a new generation of therapeutic peptides. AVAILABILITY AND IMPLEMENTATION: All underlying code and installation files are accessible through GitHub (https://github.com/Grupo-Medicina-Molecular-y-Traslacional/StarPep) under the Apache 2.0 license.
Subjects
Peptides, Software, Cluster Analysis, Drug Repositioning, High-Throughput Screening Assays
ABSTRACT
This study introduces a set of fuzzy spherically truncated three-dimensional (3D) multi-linear descriptors for proteins. These indices codify geometric structural information from kth spherically truncated spatial-(dis)similarity two-tuple and three-tuple tensors. The coefficients of these truncated tensors are calculated by applying a smoothing value to the 3D structural encoding based on the relationships between two and three amino acids of a protein embedded in a sphere. Assuming that the geometrical center of the protein coincides with the center of the sphere, the distance between each amino acid involved in a specific interaction and the geometrical center of the protein can be computed. Then, the fuzzy membership degree of each amino acid in a spherical region of interest is computed with fuzzy membership functions (FMFs). Finally, the truncation value is obtained by combining the membership degrees of the interacting amino acids, using the arithmetic mean as the fusion rule. Several FMFs with diverse biases in the calculation of amino acid memberships (e.g., Z-shaped (close to the center), PI-shaped (middle region), and A-Gaussian (far from the center)) were considered, as well as traditional truncation functions (e.g., Switching). These truncation functions were comparatively evaluated by exploring (1) the frequency of membership degrees, (2) variability and orthogonality analyses among them, based on Shannon entropy and principal component analysis, respectively, and (3) the performance in the alignment-free prediction of protein folding rates and structural classes. These analyses revealed the singularity of the proposed fuzzy spherically truncated molecular descriptors (MDs) with respect to the classical (non-truncated) ones and to the MDs truncated with traditional functions. They also showed improved predictive power, attaining an external correlation coefficient of 95.82% in folding rate modeling and an accuracy of 100% in distinguishing structural protein classes. These outcomes are better than those attained by existing approaches, justifying the theoretical contribution of this report. Thus, the fuzzy spherically truncated protein descriptors from MuLiMs-MCoMPAs (http://tomocomd.com/mulims-mcompas) are promising alignment-free predictors for modeling protein functions and properties.
ABSTRACT
The increasing interest in bioactive peptides with therapeutic potential is reflected in the large variety of biological databases published in recent years. However, discovering knowledge from these heterogeneous data sources is a nontrivial task, and it is the focus of our research. We therefore devise a unified data model, based on molecular similarity networks, for representing a chemical reference space of bioactive peptides; this space carries implicit knowledge that is not explicitly accessible in existing biological databases. Our main contribution is a novel workflow for the automatic construction of such similarity networks, enabling visual graph mining techniques to uncover new insights from the "ocean" of known bioactive peptides. The workflow relies on the following sequential steps: (i) calculation of molecular descriptors by applying statistical and aggregation operators to amino acid property vectors; (ii) a two-stage unsupervised feature selection method that identifies an optimized subset of descriptors using the concepts of entropy and mutual information; (iii) generation of sparse networks in which nodes represent bioactive peptides and edges between two nodes denote their pairwise similarity/distance relationships in the defined descriptor space; and (iv) exploratory analysis using visual inspection in combination with clustering and network science techniques. For practical purposes, the proposed workflow has been implemented in our visual analytics software tool (http://mobiosd-hub.com/starpep/) to assist researchers in extracting useful information from an integrated collection of 45120 bioactive peptides, which is one of the largest and most diverse datasets in its field. Finally, we illustrate the applicability of the proposed workflow for discovering central nodes in molecular similarity networks that may represent a biologically relevant chemical space known to date.
Subjects
Algorithms, Antineoplastic Agents/chemistry, Computational Biology/methods, Computer Graphics, Chemical Models, Peptide Fragments/chemistry, Unsupervised Machine Learning, Computer Simulation, Factual Databases, Humans, Software
ABSTRACT
Drug-induced liver injury (DILI) is a key safety issue in the drug discovery pipeline and a regulatory concern. Thus, many in silico tools have been proposed to improve the hepatotoxicity prediction of organic-type chemicals. Here, classifiers for the prediction of DILI were developed by using QuBiLS-MAS 0-2.5D molecular descriptors and shallow machine learning techniques on a training set composed of 1075 molecules. The best ensemble model built, E13, showed good statistical parameters for the learning series: accuracy = 0.840, sensitivity = 0.890, specificity = 0.761, Matthews correlation coefficient = 0.660, and area under the ROC curve = 0.904. The model was also satisfactorily evaluated with the Y-scrambling test, repeated k-fold cross-validation, and repeated k-holdout validation. In addition, an exhaustive external validation was carried out by using two test sets and five external test sets, with an average accuracy of 0.854 (±0.062) and a coverage of 98.4% according to the applicability domain of the model. A statistical comparison of the performance of the E13 model against results and tools reported in the literature (e.g., Padel DDPredictor Software, Deep Learning DILIserver, and Vslead) was also performed. In general, E13 presented the best global performance in all experiments. The sum of ranking differences procedure provided a grouping pattern very similar to that of the M-ANOVA statistical analysis, in which E13 was identified as the best model for DILI predictions. Noncommercial and fully cross-platform software for DILI prediction was also developed and is freely available at http://tomocomd.com/apps/ptoxra. This software was used to screen seven data sets containing natural products, leads, toxic materials, and FDA-approved drugs, in order to assess the usefulness of the QSAR models in the DILI labeling of organic substances; 50-92% of the evaluated molecules were found to be DILI-positive compounds. All in all, it can be stated that the E13 model is a relevant method for the prediction of DILI risk in humans, as it shows the best results among all of the methods analyzed.
Subjects
Chemical and Drug Induced Liver Injury, Biological Models, Drug Discovery, Machine Learning, Quantitative Structure-Activity Relationship, Software
ABSTRACT
Quantum Chemical Topology (QCT) is a well-established structural theoretical approach, but the development of its reactivity component is still a challenge. The hypothesis of this work is that the reactivity of an atom within a molecule is a function of its electronic population, its delocalization in the rest of the molecule, and the way it polarizes within an atomic domain. In this paper, we present a topological reactivity predictor for carbonyl additions, κ. It is a measure of the polarization of the electron density with the carbonyl functional group. κ is a model obtained from a QSAR procedure, using quantum-topological atomic descriptors and reported hydration equilibrium constants of carbonyl compounds. To validate the predictive capability of κ, we applied it to organic reactions, including a multicomponent reaction. κ was the only property that predicted the reactivity in each reaction step. The shape of κ can be interpreted as the change between two electrophilic states of a functional group, reactive and non-reactive.
ABSTRACT
This article reports the advances made to the distributed, multi-core and fully cross-platform QuBiLS-MIDAS software v2.0 (http://tomocomd.com/qubils-midas) since the v1.0 release. The QuBiLS-MIDAS software is the only one that computes atom-pair and alignment-free geometrical MDs (3D-MDs) from several distance metrics other than the Euclidean distance, as well as alignment-free 3D-MDs that codify structural information regarding the relations among three and four atoms of a molecule. The most recent features added to the QuBiLS-MIDAS software v2.0 are (a) the calculation of atomic weightings from indices based on the vertex-degree invariant (e.g., Alikhanidi index); (b) the consideration of central chirality during the molecular encoding; (c) the use of measures based on clustering methods and statistical functions to codify structural information among more than two atoms; (d) the use of a novel method based on fuzzy membership functions to spherically truncate inter-atomic relations; and (e) the use of weighted and fuzzy aggregation operators to compute global 3D-MDs according to the importance and/or interrelation of the atoms of a molecule during the molecular encoding. Moreover, a novel module to compute QuBiLS-MIDAS 3D-MDs from their headings was also developed. This module can be used either through the graphical user interface or by means of the software library. By using the library, both the predictive models built with the QuBiLS-MIDAS 3D-MDs and the calculation of the QuBiLS-MIDAS 3D-MDs can be embedded in other tools. A set of predefined QuBiLS-MIDAS 3D-MDs with high information content and low redundancy on a set of 20,469 compounds is also provided for use in further cheminformatics tasks. This set of predefined 3D-MDs showed better performance than the entire universe of Dragon (v5.5) and PaDEL 0D-to-3D MDs in variability studies, whereas a linear independence study proved that these QuBiLS-MIDAS 3D-MDs codify chemical information orthogonal to the Dragon 0D-to-3D MDs. This set of predefined 3D-MDs will be periodically updated as new results are achieved. In general, this report highlights our continued efforts to provide a better tool for a more suitable characterization of compounds and, in this way, to contribute to obtaining better outcomes in future applications.
ABSTRACT
A novel spherical truncation method, based on fuzzy membership functions, is introduced to truncate interatomic (or inter-amino-acid) relations according to smoothing values computed from fuzzy membership degrees. In this method, each molecule is circumscribed in a sphere whose center is the geometric center of the molecule. The fuzzy membership degree of each atom (or amino acid) is computed from its distance to the geometric center of the molecule by using a fuzzy membership function. The smoothing value to be applied in the truncation of a relation (or interaction) is then computed by averaging the fuzzy membership degrees of the atoms (or amino acids) involved in the relation. This truncation method differs from existing ones in that it considers the geometric center of the whole molecule, and not only of atom groups, and in that it uses fuzzy membership functions to compute the smoothing values. A variability study on a set of 20,469 compounds (15,050 drug-like compounds, 2994 approved drugs, 880 natural products from African sources, and 1545 plant-derived natural compounds exhibiting anticancer activity) demonstrated that the proposed truncation method yields molecular encodings that discriminate among structurally different molecules better than the encodings obtained without truncation or with non-fuzzy truncation functions. Moreover, a principal component analysis revealed that the proposed method encodes orthogonal chemical information about the molecules. Lastly, a modeling study proved that the truncation method improves the modeling ability of existing geometric molecular descriptors, allowing the development of more robust models than those built using only non-truncated descriptors. In this sense, a comparison and statistical assessment were performed on eight chemical datasets. As a result, the models based on the truncated molecular encodings yielded statistically better results than 12 procedures considered from the literature. It can thus be stated that the proposed truncation method is a relevant strategy for obtaining better molecular encodings, which will ultimately be useful in enhancing the modeling ability of existing encodings for both small-to-medium size molecules and biomacromolecules. © 2019 Wiley Periodicals, Inc.
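The truncation procedure described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the QuBiLS-MIDAS implementation: the Z-shaped membership function is one standard choice, its parameters (0 and the sphere radius) are assumed here, and applying the smoothing value by multiplying it into a pairwise distance matrix is likewise an assumption. All function names are hypothetical.

```python
import math


def z_shaped(x, a, b):
    """Standard Z-shaped membership: 1 at or below a, 0 at or above b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    mid = (a + b) / 2.0
    if x <= mid:
        return 1.0 - 2.0 * ((x - a) / (b - a)) ** 2
    return 2.0 * ((x - b) / (b - a)) ** 2


def geometric_center(coords):
    """Arithmetic mean of the 3D coordinates."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))


def membership_degrees(coords):
    """Membership of each atom, from its distance to the geometric center.

    Assumption: the sphere radius is the largest atom-to-center distance,
    and the Z-shaped function runs from 0 to that radius.
    """
    center = geometric_center(coords)
    dists = [math.dist(p, center) for p in coords]
    radius = max(dists)
    return [z_shaped(d, 0.0, radius) for d in dists]


def truncate_pairwise(matrix, degrees):
    """Scale each pairwise relation by the mean membership of its two atoms.

    Assumption: the smoothing value is applied multiplicatively.
    """
    n = len(matrix)
    return [
        [matrix[i][j] * (degrees[i] + degrees[j]) / 2.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical five-atom geometry centered at the origin.
coords = [(0, 0, 0), (2, 0, 0), (-2, 0, 0), (0, 1, 0), (0, -1, 0)]
degrees = membership_degrees(coords)
distances = [[math.dist(p, q) for q in coords] for p in coords]
truncated = truncate_pairwise(distances, degrees)
print(degrees)
```

With a Z-shaped function, relations between atoms near the center keep most of their weight while relations between peripheral atoms are suppressed; a function biased toward the periphery would do the opposite.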
ABSTRACT
This report introduces the MuLiMs-MCoMPAs software (acronym for Multi-Linear Maps based on N-Metric and Contact Matrices of 3D Protein and Amino-acid weightings), designed to compute tensor-based 3D protein structural descriptors by applying two- and three-linear algebraic forms. These descriptors include generalizing components such as novel 3D protein structural representations; (dis)similarity metrics and multimetrics to extract geometry-related information between two and three amino acids; weighting schemes based on amino acid properties; matrix normalization procedures that consider simple-stochastic and mutual probability transformations; topological and geometrical cutoffs; amino acid and group-based MD calculations; and aggregation operators for merging amino acid and group MDs. The MuLiMs-MCoMPAs software, which belongs to the ToMoCoMD-CAMPS suite, was developed in Java (version 1.8) using the Chemistry Development Kit (CDK) (version 1.4.19) and the Jmol libraries. The software implements a divide-and-conquer strategy to parallelize the computation of the indices, as well as modules for data preprocessing and batch computing. Furthermore, it consists of two components: (i) a desktop graphical user interface (GUI) and (ii) an API library. The relevance of this novel approach is demonstrated through two analyses: a Shannon's entropy-based variability study and a principal component analysis. These studies showed that the MuLiMs-MCoMPAs three-linear descriptor family contains higher informational entropy than several other descriptors generated with available computation tools. Moreover, the MuLiMs-MCoMPAs indices capture information that is orthogonal to that codified by the available calculation approaches. As a result, two sets of suggested theoretical configurations, containing 13648 two-linear indices and 20263 three-linear indices, are available for download at tomocomd.com. Furthermore, as a demonstration of the applicability and easy integration of the MuLiMs library into a QSAR-based expert system, a software application (ProStAF) was generated to predict SCOP protein structural classes and folding rate. It can thus be anticipated that the MuLiMs-MCoMPAs framework will become a valuable contribution to the chem- and bioinformatics research fields.
Subjects
Computer Simulation, Proteins/chemistry, Software, Drug Design, Molecular Models, Protein Conformation, Proteins/metabolism
ABSTRACT
In this report, a new type of three-dimensional (3D) biomacro-molecular descriptors for proteins is proposed. These descriptors make use of multi-linear algebra concepts based on the application of 3-linear forms (i.e., Canonical Trilinear (Tr), Trilinear Cubic (TrC), Trilinear-Quadratic-Bilinear (TrQB) and so on) as a specific case of the N-linear algebraic forms. The kth 3-tuple similarity-dissimilarity spatial matrices (Tensor's Form) are used to transform and represent the chemical information available in the relationships between three amino acids of a protein. Several metrics (Minkowski-type, wave-edge, etc.) and multi-metrics (Triangle area, Bond-angle, etc.) are proposed for extracting interaction information, as well as probabilistic transformations (e.g., simple stochastic and mutual probability) to achieve matrix normalization. A generalized procedure is proposed in which amino acid level-based indices are fused by using aggregation operators to calculate the descriptors. The results demonstrated that the newly proposed 3D biomacro-molecular indices perform better than other approaches in SCOP-based discrimination and in the prediction of protein folding rates by using simple linear parametric models. It can be concluded that the proposed method allows the definition of 3D biomacro-molecular descriptors that contain orthogonal information capable of providing better models for applications in protein science.
Subjects
Computational Biology/methods, Protein Folding, Tertiary Protein Structure, Amino Acid Sequence, Discriminant Analysis, Linear Models, Spatial Analysis
ABSTRACT
The ability to support a replicator population in an extremely hostile environment is considered in a simple model of a prebiotic cell. Using a classical approach, we explore how replicator viability changes as a function of the cell radius. The model includes the interaction between two different species: a substrate that flows in from the exterior and a replicator that feeds on the substrate and is readily destroyed in the environment outside the cell. According to our results, replicators exist in the cell only when the radius exceeds some critical value [Formula: see text], which is, in general, a function of the substrate concentration, the diffusion constant of the replicator species, and the reproduction rate coefficient. The influence of other parameters on the replicator population is also considered. The viability of chemical replicators under such drastic conditions could be crucial in understanding the origin of the first primitive cells and the subsequent development of life on our planet. Key Words: Prebiotic cell, Chemical replicator, Environment, Reproduction rate. Astrobiology 18, 403-411.
Subjects
Cell Size, Environment, Biological Evolution, Computer Simulation, Biological Models, Origin of Life
ABSTRACT
AIM: Metronidazole is the most widely used drug in trichomoniasis therapy. However, the emergence of metronidazole-resistant Trichomonas vaginalis isolates calls for a search for new drugs to counter the pathogenicity of these parasites. RESULTS: Classification models for predicting the antitrichomonas activity of molecules were built. These models were employed to screen antiprotozoal drugs, of which 20 were classified as active. The in vitro experiments showed moderate to high activity for 19 of the molecules at 10 µg/ml, while 3 compounds yielded higher activity than the reference at 1 µg/ml. The 11 most active chemicals were evaluated in vivo using Naval Medical Research Institute (NMRI) mice. CONCLUSION: Benznidazole showed results similar to those of metronidazole and can thus be considered a potential candidate for antitrichomonas therapy.
Subjects
Antiprotozoal Agents/chemistry, Antiprotozoal Agents/pharmacology, Drug Repositioning/methods, Trichomonas Infections/drug therapy, Trichomonas vaginalis/drug effects, Animals, Antiprotozoal Agents/therapeutic use, Discriminant Analysis, Drug Resistance, Female, Humans, Metronidazole/chemistry, Metronidazole/pharmacology, Metronidazole/therapeutic use, Mice, Nitroimidazoles/chemistry, Nitroimidazoles/pharmacology, Nitroimidazoles/therapeutic use, Trichomonas Vaginitis/drug therapy
ABSTRACT
BACKGROUND: Molecular fingerprints are widely used in several areas of chemoinformatics, including diversity analysis and similarity searching. The fingerprint-based analysis of chemical libraries, in particular of large collections, usually requires the molecular representation of each compound in the library, which may lead to issues of storage space and redundant calculations. In fact, information redundancy is inherent to the data, resulting in binary digit positions in the fingerprint that carry no significant information. RESULTS: Herein, a general approach is proposed to represent an entire compound library with a single binary fingerprint. The development of the database fingerprint (DFP) is illustrated first using a short fingerprint (MACCS keys) for 10 data sets of general interest in chemistry. The application of the DFP is further shown with PubChem fingerprints for the data sets used in the primary example but with a larger number of compounds, up to 25,000 molecules. The performance of the DFP was studied through differential Shannon entropy, k-means clustering, and DFP/Tanimoto similarity. CONCLUSIONS: The DFP is designed to capture key information of the compound collection and can be used to compare and assess the diversity of molecular libraries. This Preliminary Communication shows the potential of the novel fingerprint for assessing inter-library relationships. A major future goal is to apply the DFP to virtual screening and to develop DFPs for other data sets based on several different types of fingerprints.
Graphical Abstract: Database fingerprint captures the key information of molecular databases to perform chemical space characterization and virtual screening.
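To make the idea of a library-level fingerprint and the DFP/Tanimoto comparison concrete, the following Python sketch collapses a set of binary fingerprints into one and compares two libraries. The abstract does not state how the DFP bits are derived, so the frequency-threshold rule used here (a bit is set when at least half of the compounds have it) is purely an assumption, as are the function names and the toy 4-bit fingerprints. The Tanimoto coefficient itself is the standard definition for binary vectors.

```python
def database_fingerprint(fingerprints, threshold=0.5):
    """Collapse a library of equal-length binary fingerprints into one.

    Assumption (not from the source): bit k is set to 1 when the fraction
    of compounds with bit k on is at least `threshold`.
    """
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    return [
        1 if sum(fp[k] for fp in fingerprints) / n >= threshold else 0
        for k in range(n_bits)
    ]


def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits over on-bits in either vector."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


# Toy 4-bit libraries for illustration only.
library_a = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
library_b = [[1, 0, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]]

dfp_a = database_fingerprint(library_a)
dfp_b = database_fingerprint(library_b)
similarity = tanimoto(dfp_a, dfp_b)
print(dfp_a, dfp_b, similarity)
```

Comparing two libraries then costs one fingerprint comparison rather than one comparison per pair of compounds, which is the storage and computation saving the abstract points to.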
ABSTRACT
QSAR studies reported in the literature are based on uni-modal approaches and therefore leave unanalyzed those datasets that contain different kinds of chemical information. This research applies the multi-modal approach to the development of QSAR studies for the first time and analyzes its behavior. To this end, a dataset of compounds with hepatotoxic activity was used, from which four modalities were built considering different molecular descriptors based on diverse theories and approaches. Several predictive models were developed with both uni-modal and multi-modal approaches, using classification algorithms reported in the literature and implemented in the R language. The parameters of each algorithm were optimized by parameter tuning with repeated grid-search cross-validation, while the models were validated by 10-fold cross-validation with 10 repetitions. It was statistically verified that the multi-modal approach improves the performance of the predictive models compared with some of the models derived from the single-modality datasets. (AU)
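The validation scheme named in this abstract, 10-fold cross-validation with 10 repetitions, produces 100 train/test partitions in total. The study was carried out in R; the sketch below only illustrates how such partitions can be generated, written in plain Python with a hypothetical function name and an arbitrary sample count and seed.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold CV.

    Each repetition reshuffles the samples and splits them into
    n_folds disjoint folds; every fold serves once as the test set.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Striding deals the shuffled indices into folds of near-equal size.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


# Hypothetical dataset size for illustration only.
splits = list(repeated_kfold(n_samples=50))
print(len(splits))
```

In a tuning procedure based on grid search, each candidate parameter setting would be scored on partitions of this kind and the setting with the best average score retained.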
Subjects
Multimodal Imaging/methods, Medical Informatics/education
ABSTRACT
QSAR studies reported in the literature are based on uni-modal approaches and therefore leave unanalyzed those datasets that contain different kinds of chemical information. This research applies the multi-modal approach to the development of QSAR studies for the first time and analyzes its behavior. To this end, a dataset of compounds with hepatotoxic activity was used, from which four modalities were built considering different molecular descriptors based on diverse theories and approaches. Several predictive models were developed with both uni-modal and multi-modal approaches, using classification algorithms reported in the literature and implemented in the R language. The parameters of each algorithm were optimized by parameter tuning with repeated grid-search cross-validation, while the models were validated by 10-fold cross-validation with 10 repetitions. It was statistically verified that the multi-modal approach improves the performance of the predictive models compared with some of the models derived from the single-modality datasets. (AU)
Subjects
Humans, Medical Informatics Applications, Software, Combined Modality Therapy/methods
ABSTRACT
The features and theoretical background of a new and free computational program for chemometric analysis, named IMMAN (acronym for Information theory-based CheMoMetrics ANalysis), are presented. This multi-platform software was developed in the Java programming language and designed with a user-friendly graphical interface for computing a collection of information-theoretic functions adapted for rank-based unsupervised and supervised feature selection tasks. A total of 20 feature selection parameters are presented, with the unsupervised and supervised frameworks represented by 10 approaches each. Several information-theoretic parameters traditionally used as molecular descriptors (MDs) are adapted for use as unsupervised rank-based feature selection methods. In addition, a generalization scheme for the previously defined differential Shannon's entropy is discussed, and the Jeffreys information measure is introduced for supervised feature selection. Moreover, well-known information-theoretic feature selection parameters, such as information gain, gain ratio, and symmetrical uncertainty, are incorporated into the IMMAN software (http://mobiosd-hub.com/imman-soft/), following an equal-interval discretization approach. IMMAN offers data pre-processing functionalities, such as missing values processing, dataset partitioning, and browsing. Moreover, single-parameter or ensemble (multi-criteria) ranking options are provided. Consequently, this software is suitable for tasks such as dimensionality reduction, feature ranking, and comparative diversity analysis of data matrices. Simple examples of applications performed with this program are presented. A comparative study between the IMMAN and WEKA feature selection tools using the Arcene dataset was performed, demonstrating similar behavior. In addition, the use of IMMAN unsupervised feature selection methods was found to improve the performance of both IMMAN and WEKA supervised algorithms. Graphic representation for Shannon's distribution of MD calculating software.
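A rank-based unsupervised selection of the kind described can be illustrated with Shannon's entropy computed after equal-interval discretization. The Python sketch below is an assumption-laden illustration rather than IMMAN's code: the function names, the number of bins, and the toy feature columns are hypothetical, and IMMAN's exact definitions may differ.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins=10):
    """Shannon entropy (in bits) of a feature after equal-interval binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no information
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bins).values())


def rank_features(columns, n_bins=10):
    """Rank features by decreasing entropy (higher means more variability)."""
    scores = {name: shannon_entropy(vals, n_bins) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy feature columns for illustration only.
columns = {
    "constant": [1.0] * 8,
    "two_level": [0, 0, 0, 0, 1, 1, 1, 1],
    "spread": [0, 1, 2, 3, 4, 5, 6, 7],
}
ranking = rank_features(columns, n_bins=8)
print(ranking)
```

Features at the bottom of such a ranking vary little across the dataset and are natural candidates for removal before any supervised step.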
Subjects
Theoretical Models, Software, Algorithms
ABSTRACT
The present report introduces a novel module of the QuBiLS-MIDAS software for the distributed computation of the 3D Multi-Linear algebraic molecular indices. The main motivation for developing this module is to deal with the computational complexity experienced when calculating the descriptors over large datasets. To accomplish this task, a multi-server computing platform named T-arenal was developed, which is suited to institutions with many workstations interconnected through a local network and without resources specifically dedicated to computation tasks. The new system was deployed on 337 workstations and fully integrated with the QuBiLS-MIDAS software. To illustrate the usability of the T-arenal platform, performance tests were carried out over a dataset of 15,000 compounds, yielding 52- and 60-fold reductions in the sequential processing time for the 2-Linear and 3-Linear indices, respectively. Therefore, it can be stated that the T-arenal-based distribution of computation tasks constitutes a suitable strategy for performing high-throughput calculations of 3D Multi-Linear descriptors over thousands of chemical structures for subsequent QSAR and/or ADME-Tox studies.
Subjects
Theoretical Models, Software
ABSTRACT
The present report introduces the QuBiLS-MIDAS software, which belongs to the ToMoCoMD-CARDD suite, for the calculation of three-dimensional molecular descriptors (MDs) based on the two-linear (bilinear), three-linear, and four-linear (multilinear or N-linear) algebraic forms. It is thus a unique software package for computing these tensor-based indices. These descriptors establish relations for two, three, and four atoms by using several (dis-)similarity metrics or multimetrics, matrix transformations, cutoffs, local calculations, and aggregation operators. The theoretical background of these N-linear indices is also presented. The QuBiLS-MIDAS software was developed in the Java programming language and employs the Chemistry Development Kit library for the manipulation of the chemical structures and the calculation of the atomic properties. The software is composed of a user-friendly desktop interface and an Application Programming Interface library. The former was created to simplify the configuration of the different options of the MDs, whereas the library was designed to allow easy integration into other software for chemoinformatics applications. The program provides functionalities for data cleaning tasks and for batch processing of the molecular indices. In addition, it offers parallel calculation of the MDs through the use of all available processors in current computers. Complexity studies of the main algorithms demonstrate that these were implemented efficiently with respect to their trivial implementation. Lastly, the performance tests reveal that the software behaves suitably as the number of processors is increased. Therefore, the QuBiLS-MIDAS software constitutes a useful application for the computation of molecular indices based on N-linear algebraic maps, and it can be used freely to perform chemoinformatics studies.