Search | BVS Regional Portal (test)

Sub-Visible Particle Classification and Label Consistency Analysis for Flow-Imaging Microscopy Via Machine Learning Methods.

Lopez-Del Rio, Angela; Pacios-Michelena, Anabel; Picart-Armada, Sergio; Garidel, Patrick; Nikels, Felix; Kube, Sebastian.

J Pharm Sci ; 113(4): 880-890, 2024 Apr.

Article in English | MEDLINE | ID: mdl-37924976

ABSTRACT

Sub-visible particles can be a quality concern in pharmaceutical products, especially parenteral preparations. To quantify and characterize these particles, liquid samples may be passed through a flow-imaging microscopy instrument that also generates images of each detected particle. Machine learning techniques have increasingly been applied to this kind of data to detect changes in experimental conditions or classify specific types of particles, primarily focusing on silicone oil. That technique generally requires manual labeling of particle images by subject matter experts, a time-consuming and complex task. In this study, we created artificial datasets of silicone oil, protein particles, and glass particles that mimicked complex datasets of particles found in biopharmaceutical products. We used unsupervised learning techniques to effectively describe particle composition by sample. We then trained independent one-class classifiers to detect specific particle populations: silicone oil and glass particles. We also studied the consistency of the particle labels used to evaluate these models. Our results show that one-class classifiers are a reasonable choice for handling heterogeneous flow-imaging microscopy data and that unsupervised learning can aid in the labeling process. However, we found agreement among experts to be rather low, especially for smaller particles (< 8 µm for our Micro-Flow Imaging data). Given the fact that particle label confidence is not usually reported in the literature, we recommend more careful assessment of this topic in the future.
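The abstract reports low agreement among experts but does not say how agreement was measured. As a minimal sketch of one common measure, the block below computes Cohen's kappa for two annotators; the function name, the label names, and the example labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two experts for six particle images.
expert_1 = ["silicone", "silicone", "protein", "glass", "protein", "silicone"]
expert_2 = ["silicone", "protein", "protein", "glass", "silicone", "silicone"]
kappa = cohen_kappa(expert_1, expert_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so a value well below 1 can occur even when most labels match.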

Subjects

Microscopy , Silicone Oils , Microscopy/methods , Silicone Oils/analysis , Machine Learning , Glass , Proteins , Particle Size

Balancing Data on Deep Learning-Based Proteochemometric Activity Classification.

Lopez-Del Rio, Angela; Picart-Armada, Sergio; Perera-Lluna, Alexandre.

J Chem Inf Model ; 61(4): 1657-1669, 2021 04 26.

Article in English | MEDLINE | ID: mdl-33779173

ABSTRACT

In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand-target activity prediction models. However, bioactivity data sets used in proteochemometric modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target-compound activity classification models while controlling for the compound series bias through clustering. These strategies were (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering, and (4) semi_resampling. These schemas were evaluated in kinases, GPCRs, nuclear receptors, and proteases from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometric model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark.
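The abstract names the four balancing strategies without describing how each is implemented. The sketch below shows only the general idea of rebalancing a training set by random oversampling of the minority class; it is not the authors' semi_resampling procedure, and the function name and sample identifiers are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Candidates for duplication are the original samples of this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Hypothetical training pairs: 6 inactives (0) and 2 actives (1).
train_x = [f"pair_{i}" for i in range(8)]
train_y = [0, 0, 0, 0, 0, 0, 1, 1]
bal_x, bal_y = oversample_minority(train_x, train_y)
balanced_counts = Counter(bal_y)
```

Resampling only the training portion leaves the test set at its original class balance, which matters here because the abstract reports that the predicted proportion of positives followed the balance of the test data.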

Subjects

Deep Learning , Computer Simulation , Ligands , Machine Learning

Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction.

Lopez-Del Rio, Angela; Martin, Maria; Perera-Lluna, Alexandre; Saidi, Rabie.

Sci Rep ; 10(1): 14634, 2020 09 03.

Article in English | MEDLINE | ID: mdl-32884053

ABSTRACT

The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, whereas deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown. We propose and implement four novel types of padding for amino acid sequences. We then analysed the impact of different ways of padding the amino acid sequences in a hierarchical Enzyme Commission number prediction problem. Results show that padding has an effect on model performance even when convolutional layers are involved. In contrast to most deep learning work, which focuses mainly on architectures, this study highlights the relevance of padding, a process often deemed of low importance, and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.
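To make the zero-padding step concrete, here is a minimal sketch that pads integer-encoded sequences to a common length. The three modes shown are generic placements of the zeros and are not claimed to be the four padding types the study proposes, which the abstract does not name; the encoding table and function names are assumptions for illustration.

```python
def zero_pad(encoded, length, mode="post"):
    """Pad or truncate an integer-encoded sequence to a fixed length with zeros."""
    seq = list(encoded[:length])
    missing = length - len(seq)
    if mode == "post":  # zeros after the sequence
        return seq + [0] * missing
    if mode == "pre":  # zeros before the sequence
        return [0] * missing + seq
    if mode == "both":  # zeros split across both ends
        left = missing // 2
        return [0] * left + seq + [0] * (missing - left)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical encoding: each amino acid maps to a positive integer,
# so that 0 stays reserved for padding.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i + 1 for i, aa in enumerate(ALPHABET)}


def encode(sequence):
    """Convert a string of one-letter amino acid codes to integers."""
    return [CODE[aa] for aa in sequence]


peptide = encode("MKV")
padded_post = zero_pad(peptide, 6, mode="post")
padded_pre = zero_pad(peptide, 6, mode="pre")
padded_both = zero_pad(peptide, 6, mode="both")
```

All three outputs have the same shape but place the informative positions differently, which is the kind of difference the study evaluates.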

Subjects

Archaea/metabolism , Archaeal Proteins/metabolism , Deep Learning , Amino Acid Sequence

Evaluation of Cross-Validation Strategies in Sequence-Based Binding Prediction Using Deep Learning.

Lopez-Del Rio, Angela; Nonell-Canals, Alfons; Vidal, David; Perera-Lluna, Alexandre.

J Chem Inf Model ; 59(4): 1645-1657, 2019 04 22.

Article in English | MEDLINE | ID: mdl-30730731

ABSTRACT

Binding prediction between targets and drug-like compounds through deep neural networks has generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models is still an issue to be addressed. In this work, we explored how different cross-validation strategies applied to data from different molecular databases affect the performance of binding prediction proteochemometrics models. These strategies are (1) random splitting, (2) splitting based on K-means clustering (both of actives and inactives), (3) splitting based on source database, and (4) splitting based on both the clustering and the source database. These schemas are applied to a deep learning proteochemometrics model and to a simple logistic regression model used as a baseline. Additionally, two different ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our deep learning-based proteochemometrics model is comparable to the state of the art. Our results show that the lack of generalization of these models is due to a bias in public molecular databases and that a restrictive cross-validation schema based on compound clustering leads to worse but more robust and credible results. Our results also show better performance when representing molecules by their fingerprints.
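The difference between random and cluster-based splitting can be sketched as follows. The block assumes cluster labels have already been computed (for example by K-means) and assigns whole clusters to folds, so similar compounds never fall on both sides of a split. The greedy balancing rule and all names are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict


def cluster_folds(cluster_ids, n_folds):
    """Assign sample indices to folds so that each cluster stays within one fold."""
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest clusters first, each into the
    # fold that currently holds the fewest samples.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[cid])
    return folds


# Hypothetical cluster labels for eight compounds.
clusters = ["a", "a", "a", "b", "b", "c", "c", "d"]
folds = cluster_folds(clusters, n_folds=2)
```

A random split would scatter members of cluster "a" across folds, letting a model score well by recognising near-duplicates; keeping clusters intact gives the stricter estimate the abstract describes as worse but more credible.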

Subjects

Deep Learning , Informatics/methods , Drug Discovery , Quantitative Structure-Activity Relationship , Reproducibility of Results
