ABSTRACT
BACKGROUND: Simplified representations of compound databases have several applications in cheminformatics. Herein, we introduce an alternative and general method to build single-fingerprint representations of compound databases. The approach is inspired by the previously published modal fingerprints, which aim to capture the most significant bits of a fingerprint representation for a compound data set. The novelty of the statistical-based database fingerprint (SB-DFP) proposed herein is that it is generated from comparisons of binomial proportions, taking as reference the distribution of "1" bits in a large representative set of the chemical space. RESULTS: To illustrate the method, SB-DFPs were constructed for 28 epigenetic target data sets retrieved from a recently published epigenomics database of interest in probe and drug discovery. For each target data set, the SB-DFPs were built from two representative fingerprints of different design, using as reference a data set of more than 15 million compounds from ZINC. The application of SB-DFP was illustrated and compared to other methods through association relationships among the 28 epigenetic data sets and through similarity searching. Overall, SB-DFPs captured both the common features between data sets and the distinct features of each set. In similarity searching, SB-DFP equaled or outperformed other approaches for at least 20 of the 28 sets. CONCLUSIONS: SB-DFP is a general approach based on binomial proportion comparisons to represent a compound data set with a single fingerprint. SB-DFP can be developed, at least in principle, from any fingerprint and reference data set. SB-DFP is a good alternative for exploring relationships between targets through their associated compound data sets and for performing similarity searching.
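The idea of deriving one fingerprint from binomial proportion comparisons can be sketched as follows. This is a minimal illustration, not the published procedure: the one-sided z-test, the treatment of the reference proportion as a fixed population value, the significance level `alpha = 0.01`, and the function names `normal_sf` and `sb_dfp` are all assumptions made for the example.

```python
import math


def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def sb_dfp(dataset_fps, reference_freqs, alpha=0.01):
    """Build a single fingerprint for a data set, one bit position at a time.

    dataset_fps: list of equal-length lists of 0/1 bits, one per compound.
    reference_freqs: proportion of "1" bits per position in the reference set.
    alpha: significance level (an assumed value, not taken from the abstract).
    """
    n = len(dataset_fps)
    result = []
    for j, p_ref in enumerate(reference_freqs):
        p_obs = sum(fp[j] for fp in dataset_fps) / n
        # Standard error under the null hypothesis that the data set
        # has the same "1" proportion as the reference.
        se = math.sqrt(p_ref * (1.0 - p_ref) / n)
        if se == 0.0:
            # Degenerate reference proportion (0 or 1): compare directly.
            result.append(1 if p_obs > p_ref else 0)
            continue
        z = (p_obs - p_ref) / se
        # Set the bit only when "1" is significantly enriched in the data set.
        result.append(1 if normal_sf(z) < alpha else 0)
    return result


# Hypothetical toy data: 50 compounds, 3 bit positions.
fps = [[1 if i < 40 else 0, 1 if i < 10 else 0, 0] for i in range(50)]
reference = [0.2, 0.2, 0.5]
print(sb_dfp(fps, reference))  # [1, 0, 0]
```

In this toy case only the first position is set: it is "1" in 80% of the compounds against a reference proportion of 20%, whereas the second position matches the reference and the third is depleted.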
ABSTRACT
An algorithm is reported for the efficient computation of Canterakis-Zernike moments of theoretically computed molecular electron densities and of the rotationally invariant fingerprint indices derived from them. The algorithm is suitable for any density expressed in terms of Gaussian- or Slater-type functions within the Linear Combination of Atomic Orbitals framework at any level of computation. The electron density is expressed as a one-center expansion of real regular spherical harmonics times radial factors by means of translation techniques, which allows the moments to be computed efficiently through a single one-dimensional numerical integration. The performance of the algorithm is analyzed, showing that the computation of the radial factors at the quadrature points accounts for almost all of the computational time. The procedure is applicable to any density obtained with standard packages for molecular structure calculations. © 2018 Wiley Periodicals, Inc.
ABSTRACT
The 7th edition of the Union for International Cancer Control (UICC) staging system moved gastroesophageal junction (GEJ) cancers from the gastric to the esophageal group. Since clinical management is strongly influenced by this staging system, we examined the molecular fingerprints of GEJ tumors and compared them to gastric and esophageal profiles. We aimed to elucidate whether GEJ cancers cluster with the gastric or the esophageal group according to mRNA and microRNA expression patterns, since this might represent tumor identity. Clinical and expression data were downloaded from The Cancer Genome Atlas (TCGA), with 395 stomach, 184 esophagus and 521 colon samples for mRNA analyses and 392 stomach, 175 esophagus and 459 colon samples for microRNA comparisons. Both Principal Component Analysis (PCA) and heat map (HM) plots were generated in R using log2-transformed, RPKM-normalized data. Differential expression analysis was also performed in R, using raw data and the DESeq2 package. mRNAs and microRNAs were tagged as differentially expressed if they met the following criteria: i) FDR-adjusted p-value < 0.05; and ii) |log2(fold-change)| > 2. Esophageal squamous cell carcinoma (ESCC) clustered apart from the other tumors, while adenocarcinomas (AC) clustered together according to both mRNA and microRNA expression patterns. The HMs of the differentially expressed mRNAs and microRNAs also showed that ESCC belongs to a different group, while the molecular signature of esophageal AC resembles that of AC of the cardia and non-cardia regions. Even distal gastric cancers are quite similar to AC of the lower esophagus, demonstrating that esophageal AC lies much closer to gastric cancers than to esophageal cancers. Using robust molecular fingerprints, it was demonstrated that GEJ tumors resemble gastric cancers more than esophageal cancers, despite tumor heterogeneity.
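The two tagging criteria stated above can be written as a simple filter. The sketch below is in Python for illustration only (the study itself used R and DESeq2); the keys `padj` and `log2FoldChange` follow the column names of a DESeq2 results table, while the function names and the gene labels are hypothetical.

```python
import math


def is_differentially_expressed(padj, log2_fold_change,
                                alpha=0.05, lfc_cutoff=2.0):
    """Apply both criteria: adjusted p-value < alpha and |log2FC| > cutoff."""
    # An adjusted p-value can be missing for some genes; treat those as
    # not differentially expressed.
    if padj is None or math.isnan(padj):
        return False
    return padj < alpha and abs(log2_fold_change) > lfc_cutoff


def filter_results(rows):
    """Return the names of the genes that pass both criteria."""
    return [r["gene"] for r in rows
            if is_differentially_expressed(r["padj"], r["log2FoldChange"])]


# Hypothetical rows standing in for an exported results table.
rows = [
    {"gene": "geneA", "padj": 0.001, "log2FoldChange": 3.1},
    {"gene": "geneB", "padj": 0.001, "log2FoldChange": -2.5},
    {"gene": "geneC", "padj": 0.20, "log2FoldChange": 4.0},
    {"gene": "geneD", "padj": 0.01, "log2FoldChange": 1.2},
    {"gene": "geneE", "padj": float("nan"), "log2FoldChange": 5.0},
]
print(filter_results(rows))  # ['geneA', 'geneB']
```

Taking the absolute value of the fold change keeps both up- and down-regulated transcripts, which is why `geneB` passes despite its negative value.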
ABSTRACT
BACKGROUND: Molecular fingerprints are widely used in several areas of chemoinformatics, including diversity analysis and similarity searching. The fingerprint-based analysis of chemical libraries, in particular of large collections, usually requires a molecular representation of each compound in the library, which may lead to issues of storage space and redundant calculations. In fact, information redundancy is inherent to the data, resulting in binary digit positions in the fingerprint that carry no significant information. RESULTS: Herein, a general approach is proposed to represent an entire compound library with a single binary fingerprint. The development of the database fingerprint (DFP) is illustrated first using a short fingerprint (MACCS keys) for 10 data sets of general interest in chemistry. The application of the DFP is further shown with PubChem fingerprints for the data sets used in the primary example but with a larger number of compounds, up to 25,000 molecules. The performance of the DFP was studied through differential Shannon entropy, k-means clustering, and DFP/Tanimoto similarity. CONCLUSIONS: The DFP is designed to capture key information of the compound collection and can be used to compare and assess the diversity of molecular libraries. This Preliminary Communication shows the potential of the novel fingerprint to establish inter-library relationships. A major future goal is to apply the DFP to virtual screening and to develop DFPs for other data sets based on several different types of fingerprints. GRAPHICAL ABSTRACT: The database fingerprint captures the key information of molecular databases to perform chemical space characterization and virtual screening.
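Two of the quantities mentioned above, Tanimoto similarity between binary fingerprints and the Shannon entropy of a bit position, have standard definitions and can be sketched briefly. The last function, `threshold_fingerprint`, collapses a library into one fingerprint with a plain frequency cutoff; that cutoff is an assumption made for illustration and is not the construction described in the abstract. The differential Shannon entropy used in the study is likewise not reproduced here, and all function names are hypothetical.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: shared "1" bits over bits set in either one."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


def bit_frequencies(library):
    """Proportion of compounds with a "1" at each bit position."""
    n = len(library)
    return [sum(column) / n for column in zip(*library)]


def bit_entropy(p):
    """Shannon entropy (in bits) of a position that is "1" with frequency p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a constant bit carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def threshold_fingerprint(freqs, threshold=0.5):
    """ASSUMPTION: set a bit when its frequency reaches the cutoff."""
    return [1 if f >= threshold else 0 for f in freqs]


# Hypothetical library of four 4-bit fingerprints.
library = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
freqs = bit_frequencies(library)
print(freqs)                                  # [1.0, 0.5, 0.25, 0.25]
print(threshold_fingerprint(freqs))           # [1, 1, 0, 0]
print([bit_entropy(f) for f in freqs[:2]])    # [0.0, 1.0]
```

The first position is set in every compound, so its entropy is zero: it is exactly the kind of redundant bit position the background describes.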
ABSTRACT
BACKGROUND: Measuring the structural diversity of compound databases is relevant in drug discovery and many other areas of chemistry. Since molecular diversity depends on molecular representation, a comprehensive chemoinformatic analysis of the diversity of libraries uses multiple criteria. For instance, the diversity of molecular libraries is typically evaluated employing molecular scaffolds, structural fingerprints, and physicochemical properties. However, each criterion is analyzed independently, and it is not straightforward to provide an evaluation of the "global diversity". RESULTS: Herein, the Consensus Diversity Plot (CDP) is proposed as a novel method to represent in low dimensions the diversity of chemical libraries, considering multiple molecular representations simultaneously. We illustrate the application of CDPs to classify eight compound data sets and two subsets with different sizes and compositions using molecular scaffolds, structural fingerprints, and physicochemical properties. CONCLUSIONS: CDPs are general data mining tools that represent in two dimensions the global diversity of compound data sets using multiple metrics. These plots can be constructed using single or combined measures of diversity. An online version of the CDPs is freely available at: https://consensusdiversityplots-difacquim-unam.shinyapps.io/RscriptsCDPlots/. GRAPHICAL ABSTRACT: The Consensus Diversity Plot is a novel data mining tool that represents in two dimensions the global diversity of compound data sets using multiple metrics.