Search | VHL Regional Portal

1.

MolPipeline: A Python Package for Processing Molecules with RDKit in Scikit-learn.

Sieg, Jochen; Feldmann, Christian W; Hemmerich, Jennifer; Stork, Conrad; Sandfort, Frederik; Eiden, Philipp; Mathea, Miriam.

J Chem Inf Model ; 2024 Sep 17.

Article in English | MEDLINE | ID: mdl-39288001

ABSTRACT

The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to the machine learning model. We introduce the MolPipeline package, which extends this concept to cheminformatics by wrapping standard RDKit functionality, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package to create completely automated end-to-end pipelines that scale to large data sets. Particular emphasis was put on handling erroneous instances, where resolution would require manual intervention in default pipelines. MolPipeline provides the building blocks to enable seamless integration of common cheminformatics tasks within scikit-learn's pipeline framework, such as scaffold splits and molecular standardization, making pipeline building easily adaptable to diverse project requirements.
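The error-handling idea in this abstract (letting a failing instance pass through the pipeline instead of aborting the run) can be illustrated without the package itself. The following is a minimal sketch in plain Python and does not use MolPipeline's or scikit-learn's API: the `Invalid` placeholder, `run_pipeline`, and the two toy steps are hypothetical names introduced only for illustration, and the "parser" merely checks parentheses rather than reading real SMILES.

```python
from typing import Any, Callable, List, Sequence, Tuple


class Invalid:
    """Hypothetical placeholder marking an instance that failed in some step."""

    def __init__(self, step: str, reason: str) -> None:
        self.step = step
        self.reason = reason


def run_pipeline(
    steps: Sequence[Tuple[str, Callable[[Any], Any]]],
    inputs: Sequence[Any],
    fill_value: Any = None,
) -> List[Any]:
    """Apply each step to every instance; failed instances skip later steps."""
    results: List[Any] = list(inputs)
    for name, func in steps:
        next_results: List[Any] = []
        for item in results:
            if isinstance(item, Invalid):
                # Already failed earlier: pass the placeholder through untouched.
                next_results.append(item)
                continue
            try:
                next_results.append(func(item))
            except Exception as exc:  # record the failure instead of aborting
                next_results.append(Invalid(name, str(exc)))
        results = next_results
    # Output keeps the input order and length; failures become fill_value.
    return [fill_value if isinstance(r, Invalid) else r for r in results]


def toy_parse(text: str) -> str:
    """Toy stand-in for a parser: rejects empty or unbalanced input."""
    if not text:
        raise ValueError("empty input")
    depth = 0
    for ch in text:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:
            raise ValueError("unbalanced parentheses")
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return text


def toy_descriptor(parsed: str) -> int:
    """Toy stand-in for a descriptor: counts the letters 'C' and 'c'."""
    return sum(ch in "Cc" for ch in parsed)


results = run_pipeline(
    [("parse", toy_parse), ("descriptor", toy_descriptor)],
    ["CCO", "C(C", "", "c1ccccc1"],
)
print(results)  # [2, None, None, 6]
```

Because the output has the same length and order as the input, the failed rows can be traced back to their source records without manual bookkeeping.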

2.

Analysis of uncertainty of neural fingerprint-based models.

Feldmann, Christian W; Sieg, Jochen; Mathea, Miriam.

Faraday Discuss ; 2024 Sep 25.

Article in English | MEDLINE | ID: mdl-39320108

ABSTRACT

Machine learning has gained popularity for predicting molecular properties based on molecular structure. This study explores the uncertainty estimates of neural fingerprint-based models by comparing pure graph neural networks (GNN) to classical machine learning algorithms combined with neural fingerprints. We investigate the advantage of extracting the neural fingerprint from the GNN and integrating it into a method known for producing better-calibrated probability estimates. Comparisons are made using three classical machine learning methods and the Chemprop model, considering different molecular representations and calibration techniques. We utilize 19 datasets from Toxcast, reflecting real-world scenarios with balanced accuracies ranging from 0.6 to 0.8. Results demonstrate that neural fingerprints combined with classical machine learning methods exhibit a slight decrease in prediction performance compared to the native Chemprop model. However, these models provide significantly improved uncertainty estimates. Notably, uncertainty estimates of neural fingerprint-based methods remain relatively robust for molecules dissimilar to the training set. This suggests that methods like random forest with neural fingerprints can deliver strong prediction performance and reliable uncertainty estimates. When considering both performance and uncertainty, the calibrated Chemprop model and the combination of neural fingerprints with random forest or support vector classifier (SVC) yield comparable results. Surprisingly, the SVC method shows promising performance when combined with neural or count fingerprints. These findings are particularly relevant in real-world industrial projects where accurate predictions and reliable uncertainty estimates are crucial.
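Two quantities help when reading this abstract: balanced accuracy, which it reports, and a score for the quality of probability estimates. The sketch below computes balanced accuracy as the mean of sensitivity and specificity, and the Brier score as one common measure of probabilistic quality. The abstract does not say which uncertainty metrics were used, so the Brier score is an assumption chosen for illustration, and the labels and probabilities are made-up toy data.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the true-positive rate and the true-negative rate (binary labels)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (hypothetical): two actives, four inactives.
y_true = [1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.4, 0.2, 0.1, 0.6, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # threshold at 0.5

ba = balanced_accuracy(y_true, y_pred)
bs = brier_score(y_true, y_prob)
print(f"balanced accuracy = {ba:.3f}, Brier score = {bs:.3f}")
```

A lower Brier score indicates probabilities that sit closer to the observed outcomes, so two models with the same balanced accuracy can still differ in how trustworthy their probability estimates are.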

3.

Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years.

Sultan, Afnan; Sieg, Jochen; Mathea, Miriam; Volkamer, Andrea.

J Chem Inf Model ; 64(16): 6259-6280, 2024 Aug 26.

Article in English | MEDLINE | ID: mdl-39136669

ABSTRACT

Molecular Property Prediction (MPP) is vital for drug discovery, crop protection, and environmental science. Over the last decades, diverse computational techniques have been developed, from using simple physical and chemical properties and molecular fingerprints in statistical models and classical machine learning to advanced deep learning approaches. In this review, we aim to distill insights from current research on employing transformer models for MPP. We analyze the currently available models and explore key questions that arise when training and fine-tuning a transformer model for MPP. These questions encompass the choice and scale of the pretraining data, optimal architecture selections, and promising pretraining objectives. Our analysis highlights areas not yet covered in current research, inviting further exploration to enhance the field's understanding. Additionally, we address the challenges in comparing different models, emphasizing the need for standardized data splitting and robust statistical analysis.

Subject(s)

Machine Learning , Drug Discovery/methods , Deep Learning

4.

Searching similar local 3D micro-environments in protein structure databases with MicroMiner.

Sieg, Jochen; Rarey, Matthias.

Brief Bioinform ; 24(6), 2023 09 22.

Article in English | MEDLINE | ID: mdl-37833838

ABSTRACT

The available protein structure data are rapidly increasing. Within these structures, numerous local structural sites depict the details characterizing structure and function. However, searching and analyzing these sites extensively and at scale poses a challenge. We present a new method to search local sites in protein structure databases using residue-defined local 3D micro-environments. We implemented the method in a new tool called MicroMiner and demonstrate the capabilities of residue micro-environment search on the example of structural mutation analysis. Usually, experimental structures for both the wild-type and the mutant are unavailable for comparison. With MicroMiner, we extracted $>255 \times 10^{6}$ amino acid pairs in protein structures from the PDB, exemplifying single mutations' local structural changes for single chains and $>45 \times 10^{6}$ pairs for protein-protein interfaces. We further annotate existing data sets of experimentally measured mutation effects, like $\Delta \Delta G$ measurements, with the extracted structure pairs to combine the mutation effect measurement with the structural change upon mutation. In addition, we show how MicroMiner can bridge the gap between mutation analysis and structure-based drug design tools. MicroMiner is available as a command line tool and interactively on the https://proteins.plus/ webserver.

Subject(s)

Amino Acids , Proteins , Protein Databases , Proteins/genetics , Proteins/chemistry , Amino Acids/chemistry

5.

Modeling with Alternate Locations in X-ray Protein Structures.

Gutermuth, Torben; Sieg, Jochen; Stohn, Tim; Rarey, Matthias.

J Chem Inf Model ; 63(8): 2573-2585, 2023 04 24.

Article in English | MEDLINE | ID: mdl-37018549

ABSTRACT

In many molecular modeling applications, the standard procedure is still to handle proteins as single, rigid structures. While the importance of conformational flexibility is widely known, handling it remains challenging. Even the crystal structure of a protein usually contains variability exemplified in alternate side chain orientations or backbone segments. This conformational variability is encoded in PDB structure files by so-called alternate locations (AltLocs). Most modeling approaches either ignore AltLocs or resolve them with simple heuristics early on during structure import. We analyzed the occurrence and usage of AltLocs in the PDB and developed an algorithm to automatically handle AltLocs in PDB files enabling all structure-based methods using rigid structures to take the alternative protein conformations described by AltLocs into consideration. A respective software tool named AltLocEnumerator can be used as a structure preprocessor to easily exploit AltLocs. While the amount of data makes it difficult to show impact on a statistical level, handling AltLocs has a substantial impact on a case-by-case basis. We believe that the inspection and consideration of AltLocs is a very valuable approach in many modeling scenarios.

Subject(s)

Proteins , Software , X-Rays , Proteins/chemistry , Protein Conformation , Algorithms
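To make the AltLoc encoding mentioned in this record concrete, the sketch below reads the alternate-location indicator from fixed-column PDB `ATOM`/`HETATM` records (column 17, with occupancy in columns 55-60) and applies one simple heuristic of the kind the abstract says is common: keeping, per atom, the copy with the highest occupancy. This is not the AltLocEnumerator algorithm. The function names and the `make_atom_line` helper, which only builds demo records with zeroed coordinates, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def make_atom_line(serial: int, name: str, altloc: str, resname: str,
                   chain: str, resseq: int, occupancy: float) -> str:
    """Build a minimal fixed-column ATOM record for the demo (coordinates are 0)."""
    return (
        f"ATOM  {serial:>5} {name:<4}{altloc:1}{resname:>3} {chain:1}{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{occupancy:6.2f}{20.0:6.2f}"
    )


def altloc_ids_per_residue(lines: Iterable[str]) -> Dict[Tuple[str, int, str], List[str]]:
    """Map (chain, residue number, insertion code) to the AltLoc IDs seen there."""
    found = defaultdict(set)
    for line in lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        altloc = line[16]  # column 17: alternate location indicator
        if altloc != " ":
            found[(line[21], int(line[22:26]), line[26])].add(altloc)
    return {key: sorted(ids) for key, ids in found.items()}


def keep_highest_occupancy(lines: Iterable[str]) -> List[str]:
    """Simple heuristic: per atom, keep the record with the highest occupancy."""
    best: Dict[Tuple[str, str, str], Tuple[float, str]] = {}
    order: List[Tuple[str, str, str]] = []
    for line in lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        key = (line[21], line[22:27], line[12:16])  # chain, resSeq+iCode, atom name
        occupancy = float(line[54:60])  # columns 55-60
        if key not in best:
            order.append(key)
            best[key] = (occupancy, line)
        elif occupancy > best[key][0]:
            best[key] = (occupancy, line)
    return [best[key][1] for key in order]


demo = [
    make_atom_line(1, " N", " ", "SER", "A", 10, 1.00),
    make_atom_line(2, " OG", "A", "SER", "A", 10, 0.40),
    make_atom_line(3, " OG", "B", "SER", "A", 10, 0.60),
]
altlocs = altloc_ids_per_residue(demo)
kept = keep_highest_occupancy(demo)
print(altlocs)
print(len(kept), "records kept; OG altLoc chosen:", kept[1][16])
```

The heuristic discards the lower-occupancy conformation entirely, which is exactly the information loss the abstract argues against; enumerating the alternatives instead would keep both side-chain states available to downstream tools.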

6.

ProteinsPlus: a comprehensive collection of web-based molecular modeling tools.

Schöning-Stierand, Katrin; Diedrich, Konrad; Ehrt, Christiane; Flachsenberg, Florian; Graef, Joel; Sieg, Jochen; Penner, Patrick; Poppinga, Martin; Ungethüm, Annett; Rarey, Matthias.

Nucleic Acids Res ; 50(W1): W611-W615, 2022 07 05.

Article in English | MEDLINE | ID: mdl-35489057

ABSTRACT

With the ever-increasing number of publicly available experimentally determined and predicted protein and nucleic acid structures, the demand for easy-to-use tools to investigate these structural models is higher than ever before. The ProteinsPlus web server (https://proteins.plus) comprises a growing collection of molecular modeling tools focusing on protein-ligand interactions. It enables quick access to structural investigations ranging from structure analytics and search methods to molecular docking. It is by now well established in the community and constantly extended. The server gives easy access not only to experts but also to students and occasional users from the field of life sciences. Here, we describe its recently added features and tools, among them a novel method for on-the-fly molecular docking and a search method for single-residue substitutions in local regions of a protein structure throughout the whole Protein Data Bank. Finally, we provide a glimpse into new avenues for the annotation of AlphaFold structures, which are directly accessible via a RESTful service on the ProteinsPlus web server.

Subject(s)

Proteins , Software , Molecular Docking Simulation , Proteins/chemistry , Molecular Models , Internet

7.

Analyzing structural features of proteins from deep-sea organisms.

Sieg, Jochen; Sandmeier, Chris Claudius; Lieske, Julia; Meents, Alke; Lemmen, Christian; Streit, Wolfgang R; Rarey, Matthias.

Proteins ; 90(8): 1521-1537, 2022 08.

Article in English | MEDLINE | ID: mdl-35313380

ABSTRACT

Protein adaptations to extreme environmental conditions are drivers in biotechnological process optimization and essential to unravel the molecular limits of life. Most proteins with such desirable adaptations are found in extremophilic organisms inhabiting extreme environments. The deep sea is such an environment and a promising resource that imposes multiple extremes on its inhabitants. Conditions like high hydrostatic pressure and high or low temperature are prevalent, and many deep-sea organisms tolerate several of these extremes. While molecular adaptations to high temperature are comparatively well described, adaptations to other extremes like high pressure are not yet well understood. To fully unravel the molecular mechanisms of individual adaptations, it is probably necessary to disentangle multifactorial adaptations. In this study, we evaluate differences between protein structures from deep-sea organisms and their respective related proteins from non-deep-sea organisms. We created a data collection of 1281 experimental protein structures from 25 deep-sea organisms and paired them with orthologous proteins. We exhaustively evaluate differences between the protein pairs with machine learning and Shapley values to determine characteristic differences in sequence and structure. The results show a reasonable discrimination of deep-sea and non-deep-sea proteins, from which we distinguish correlations previously attributed to thermal stability from other signals potentially describing adaptations to high pressure. While some distinct correlations can be observed, the overall picture appears intricate.

Subject(s)

Physiological Adaptation , Proteins , Cold Temperature , Hot Temperature , Hydrostatic Pressure , Proteins/metabolism
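Since this study relies on Shapley values to attribute a model's output to individual features, a small worked example of the underlying definition may help. The sketch below computes exact Shapley values by enumerating all feature subsets for a toy value function. It is not the paper's analysis: the feature names `f1`, `f2`, `f3` and the scoring function are hypothetical, and practical tools approximate these values rather than enumerating subsets.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: weighted average marginal contribution per player."""
    n = len(players)
    result: Dict[str, float] = {}
    for player in players:
        others = [q for q in players if q != player]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size that excludes the player.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {player}) - value(coalition))
        result[player] = total
    return result


def toy_score(features: FrozenSet[str]) -> float:
    """Hypothetical model score: f1 adds 1, f2 adds 2, f1 and f2 together add 1 more."""
    score = 0.0
    if "f1" in features:
        score += 1.0
    if "f2" in features:
        score += 2.0
    if "f1" in features and "f2" in features:
        score += 1.0  # interaction term, split equally between f1 and f2
    return score


phi = shapley_values(["f1", "f2", "f3"], toy_score)
print(phi)  # f1 -> 1.5, f2 -> 2.5, f3 -> 0.0 (up to floating-point rounding)
```

The values sum to the score of the full feature set (4.0), and the unused feature `f3` receives zero, which are the two properties that make Shapley values attractive for comparing feature contributions.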

8.

In Need of Bias Control: Evaluating Chemical Data for Machine Learning in Structure-Based Virtual Screening.

Sieg, Jochen; Flachsenberg, Florian; Rarey, Matthias.

J Chem Inf Model ; 59(3): 947-961, 2019 03 25.

Article in English | MEDLINE | ID: mdl-30835112

ABSTRACT

Reports of successful applications of machine learning (ML) methods in structure-based virtual screening (SBVS) are increasing. ML methods such as convolutional neural networks show promising results and often outperform traditional methods such as empirical scoring functions in retrospective validation. However, trained ML models are often treated as black boxes and are not straightforwardly interpretable. In most cases, it is unknown which features in the data are decisive and whether a model's predictions are right for the right reason. Hence, we re-evaluated three widely used benchmark data sets in the context of ML methods and came to the conclusion that not every benchmark data set is suitable. Moreover, we demonstrate on two examples from current literature that bias is learned implicitly and unnoticed from standard benchmarks. On the basis of these results, we conclude that there is a need for eligible validation experiments and benchmark data sets suited to ML for more bias-controlled validation in ML-based SBVS. Therefore, we provide guidelines for setting up validation experiments and give a perspective on how new data sets could be generated.

Subject(s)

Bias , Machine Learning , Benchmarking/methods , Factual Databases , Ligands , Molecular Docking Simulation/methods , Molecular Structure , Retrospective Studies , Structure-Activity Relationship

9.

StructureProfiler: an all-in-one tool for 3D protein structure profiling.

Meyder, Agnes; Kampen, Stefanie; Sieg, Jochen; Fährrolfes, Rainer; Friedrich, Nils-Ole; Flachsenberg, Florian; Rarey, Matthias.

Bioinformatics ; 35(5): 874-876, 2019 03 01.

Article in English | MEDLINE | ID: mdl-30124779

ABSTRACT

MOTIVATION: Three-dimensional protein structures are important starting points for elucidating protein function and applications like drug design. Computational methods in this area rely on high-quality validation datasets, which are usually manually assembled. Due to the increase in published structures as well as the increasing demand for specially tailored validation datasets, automatic procedures should be adopted. RESULTS: StructureProfiler is a new tool for automatic, objective and customizable profiling of X-ray protein structures based on the most frequently applied selection criteria currently in use to assemble benchmark datasets. As examples, four dataset configurations (Astex, Iridium, Platinum, combined), all results of the combined tests and the list of all PDB IDs passing the combined criteria set are attached in the Supplementary Material. AVAILABILITY AND IMPLEMENTATION: StructureProfiler is available as part of the ProteinsPlus web service http://proteins.plus and as a standalone tool in the NAOMI ChemBio Suite. Dataset updates together with the tool can be found on http://www.zbh.uni-hamburg.de/structureprofiler. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Software , Computational Biology , Drug Design , Proteins
