Results 1 - 20 of 115
1.
Sci Rep ; 14(1): 10744, 2024 05 10.
Article in English | MEDLINE | ID: mdl-38730063

ABSTRACT

Clinical databases typically include, for each patient, many heterogeneous features, for example blood exams, the clinical history before the onset of the disease, the evolution of the symptoms, the results of imaging exams, and many others. We here propose to exploit a recently developed statistical approach, the Information Imbalance, to compare different subsets of patient features and automatically select the set of features that is maximally informative for a given clinical purpose, especially in minority classes. We adapt the Information Imbalance approach to work in a clinical framework, where patient features are often categorical and are generally available only for a fraction of the patients. We apply this algorithm to a data set of ∼ 1300 patients treated for COVID-19 in Udine hospital before October 2021. Using this approach, we find combinations of features that, when used together, are maximally informative of the clinical fate and of the severity of the disease. The optimal number of features, which is determined automatically, turns out to be between 10 and 15. These features can be measured at admission. The approach can also be used if the features are available only for a fraction of the patients, does not require imputation and, importantly, is able to automatically select features with small inter-feature correlation. Clinical insights deriving from this study are also discussed.


Subject(s)
Algorithms , COVID-19 , SARS-CoV-2 , Severity of Illness Index , Humans , COVID-19/diagnosis , COVID-19/epidemiology , SARS-CoV-2/isolation & purification , Databases, Factual , Male , Female
2.
Proc Natl Acad Sci U S A ; 121(19): e2317256121, 2024 May 07.
Article in English | MEDLINE | ID: mdl-38687797

ABSTRACT

We introduce an approach which allows detecting causal relationships between variables for which the time evolution is available. Causality is assessed by a variational scheme based on the Information Imbalance of distance ranks, a statistical test capable of inferring the relative information content of different distance measures. We test whether the predictability of a putative driven system Y can be improved by incorporating information from a potential driver system X, without explicitly modeling the underlying dynamics and without the need to compute probability densities of the dynamic variables. This framework makes causality detection possible even between high-dimensional systems where only few of the variables are known or measured. Benchmark tests on coupled chaotic dynamical systems demonstrate that our approach outperforms other model-free causality detection methods, successfully handling both unidirectional and bidirectional couplings. We also show that the method can be used to robustly detect causality in human electroencephalography data.

3.
Neurol Sci ; 2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38285327

ABSTRACT

BACKGROUND AND OBJECTIVES: ASPECTs is a widely used marker to identify early stroke signs on non-enhanced computed tomography (NECT), yet it presents interindividual variability and it may be hard to use for non-experts. We introduce an algorithm capable of automatically estimating the NECT volumetric extension of early acute ischemic changes in the 3D space. We compared the power of this marker with ASPECTs evaluated by experienced practitioners in predicting the clinical outcome. METHODS: We analyzed and processed neuroimaging data of 153 patients admitted with acute ischemic stroke. All patients underwent a NECT at admission and on follow-up. The developed algorithm identifies the early ischemic hypodense region based on an automatic comparison of the gray level in the images of the two hemispheres, assumed to be an approximate mirror image of each other in healthy patients. RESULTS: In the two standard axial slices used to estimate the ASPECTs, the regions identified by the algorithm overlap significantly with those identified by experienced practitioners. However, in many patients, the regions identified automatically extend significantly to other slices. In these cases, the volume marker provides supplementary and independent information. Indeed, the clinical outcome of patients with volume marker = 0 can be distinguished with higher statistical confidence than the outcome of patients with ASPECTs = 10. CONCLUSION: The volumetric extension and the location of the acute ischemic region in the 3D space, automatically identified by our algorithm, provide data that are mostly in agreement with the ASPECTs value estimated by expert practitioners, and in some cases complementary and independent.
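
The mirror-image comparison described in the METHODS can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the function names, the fixed gray-level threshold, the vertical midline placed exactly at the image centre, and the voxel volume are hypothetical and are not taken from the paper, which presumably handles registration and noise far more carefully.

```python
def hypodense_mask(axial_slice, threshold):
    """Flag pixels that are darker than their mirror pixel by more than threshold.

    axial_slice is a list of rows of gray levels; the midline is assumed
    (hypothetically) to sit exactly at the centre column of the image.
    """
    width = len(axial_slice[0])
    return [
        [(row[width - 1 - x] - row[x]) > threshold for x in range(width)]
        for row in axial_slice
    ]


def hypodense_volume(volume, threshold, voxel_volume_ml):
    """Sum the flagged voxels over all slices and convert the count to millilitres."""
    count = sum(
        flagged
        for axial_slice in volume
        for mask_row in hypodense_mask(axial_slice, threshold)
        for flagged in mask_row
    )
    return count * voxel_volume_ml


# Toy volume: two 4x4 slices of uniform gray level 40,
# with two darker voxels on the left side of the first slice.
first = [[40] * 4 for _ in range(4)]
first[1][0] = 30
first[2][0] = 30
second = [[40] * 4 for _ in range(4)]
toy_volume_ml = hypodense_volume([first, second], threshold=5, voxel_volume_ml=0.5)
```

Because the comparison runs over every slice, a marker built this way can pick up hypodense regions outside the two standard slices used for ASPECTs, which is the point the RESULTS make.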

4.
PNAS Nexus ; 2(8): pgad239, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37545648

ABSTRACT

According to common physical chemistry wisdom, the solvent cavities hosting a solute are tightly sewn around it, practically coinciding with its van der Waals surface. Solvation entropy is primarily determined by the surface and the volume of the cavity while enthalpy is determined by the solute-solvent interaction. In this work, we challenge this picture, demonstrating by molecular dynamics simulations that the cavities surrounding the 20 amino acids deviate significantly from the molecular surface. Strikingly, the shape of the cavity alone can be used to predict the solvation free energy, entropy, enthalpy, and hydrophobicity. Solute-solvent interactions involving the different chemical moieties of the amino acid indirectly determine the cavity shape and the properties of the branches, but do not have to be taken explicitly into account in the prediction model.

5.
J Chem Theory Comput ; 19(14): 4596-4605, 2023 Jul 25.
Article in English | MEDLINE | ID: mdl-36920997

ABSTRACT

Machine-learning (ML) has become a key workhorse in molecular simulations. Building an ML model in this context involves encoding the information on chemical environments using local atomic descriptors. In this work, we focus on the Smooth Overlap of Atomic Positions (SOAP) and their application in studying the properties of liquid water both in the bulk and at the hydrophobic air-water interface. By using a statistical test aimed at assessing the relative information content of different distance measures defined on the same data space, we investigate if these descriptors provide the same information as some of the common order parameters that are used to characterize local water structure such as hydrogen bonding, density, or tetrahedrality to name a few. Our analysis suggests that the ML description and the standard order parameters of the local water structure are not equivalent. In particular, a combination of these order parameters probing local water environments can predict SOAP similarity only approximately, and vice versa, the environments that are similar according to SOAP are not necessarily similar according to the standard order parameters. We also elucidate the role of some of the metaparameters in the SOAP definition in encoding chemical information.

6.
Phys Rev Lett ; 130(6): 067401, 2023 Feb 10.
Article in English | MEDLINE | ID: mdl-36827575

ABSTRACT

Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensional reduction methods are designed for continuous spaces, and their use for discrete spaces can lead to errors and biases. In this Letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.

7.
Sci Rep ; 12(1): 20005, 2022 11 21.
Article in English | MEDLINE | ID: mdl-36411305

ABSTRACT

Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the id depends rather dramatically on the scale of the distances among data points. At short distances, the id can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method, Gride, that allows estimating the id as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among data points. Through simulation studies, we show that Gride is asymptotically unbiased, provides comparable estimates to other state-of-the-art methods, and is more robust to short-scale noise than other likelihood-based approaches.


Subject(s)
Likelihood Functions , Computer Simulation
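
To make the idea of estimating the intrinsic dimension from distances alone more concrete, here is a minimal sketch of the simpler two-nearest-neighbour estimator, which uses only the ratio between each point's second and first neighbour distances. It is not Gride itself: Gride works with ratios of distances to farther neighbours to probe larger scales, and that part is not reproduced here. The function name and the test data are illustrative assumptions.

```python
import math
import random


def two_nn_id(points):
    """Maximum-likelihood intrinsic dimension from the ratios mu = r2 / r1.

    Assuming locally uniform density, log(mu) is exponentially distributed
    with rate equal to the intrinsic dimension d, so d is estimated as
    N / sum(log(mu)).
    """
    n = len(points)
    log_mu_sum = 0.0
    for i in range(n):
        r1 = r2 = math.inf  # distances to the first and second neighbours
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        log_mu_sum += math.log(r2 / r1)
    return n / log_mu_sum


# A two-dimensional sheet linearly embedded in five dimensions.
random.seed(1)
sheet = []
for _ in range(400):
    u, v = random.random(), random.random()
    sheet.append((u, v, u + v, u - v, 2 * u))
estimate = two_nn_id(sheet)  # expected to be close to 2, not 5
```

Only pairwise distances enter the calculation, which is the property the abstract highlights. The brute-force neighbour search is quadratic in the number of points and would normally be replaced by a spatial index.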
8.
Patterns (N Y) ; 3(10): 100589, 2022 Oct 14.
Article in English | MEDLINE | ID: mdl-36277821

ABSTRACT

DADApy is a Python software package for analyzing and characterizing high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering, and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in a synthetic dataset and in a real-world application. DADApy is freely available under the open-source Apache 2.0 license.

9.
PLoS Comput Biol ; 18(10): e1010610, 2022 10.
Article in English | MEDLINE | ID: mdl-36260616

ABSTRACT

Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.


Subject(s)
Proteins , Databases, Protein , Proteins/genetics , Cluster Analysis , Amino Acid Sequence , Protein Domains
10.
Elife ; 11, 2022 09 12.
Article in English | MEDLINE | ID: mdl-36094473

ABSTRACT

Single-molecule force spectroscopy (SMFS) uses the cantilever tip of an atomic force microscope (AFM) to apply a force able to unfold a single protein. The obtained force-distance curve encodes the unfolding pathway, and from its analysis it is possible to characterize the folded domains. SMFS has been mostly used to study the unfolding of purified proteins, in solution or reconstituted in a lipid bilayer. Here, we describe a pipeline for analyzing membrane proteins based on SMFS, which involves the isolation of the plasma membrane of single cells and the harvesting of force-distance curves directly from it. We characterized and identified the embedded membrane proteins combining, within a Bayesian framework, the information of the shape of the obtained curves, with the information from mass spectrometry and proteomic databases. The pipeline was tested with purified/reconstituted proteins and applied to five cell types where we classified the unfolding of their most abundant membrane proteins. We validated our pipeline by overexpressing four constructs, and this allowed us to gather structural insights into the identified proteins, revealing variable elements in the loop regions. Our results set the basis for the investigation of the unfolding of membrane proteins in situ, and for performing proteomics from a membrane fragment.


Subject(s)
Lipid Bilayers , Membrane Proteins , Bayes Theorem , Membrane Proteins/chemistry , Microscopy, Atomic Force/methods , Protein Unfolding , Proteomics
11.
Front Immunol ; 13: 862851, 2022.
Article in English | MEDLINE | ID: mdl-35572587

ABSTRACT

Epitopes that bind simultaneously to all human alleles of Major Histocompatibility Complex class II (MHC II) are considered one of the key factors for the development of improved vaccines and cancer immunotherapies. To engineer MHC II multiple-allele binders, we developed a protocol called PanMHC-PARCE, based on the unsupervised optimization of the epitope sequence by single-point mutations, parallel explicit-solvent molecular dynamics simulations and scoring of the MHC II-epitope complexes. The key idea is accepting mutations that not only improve the affinity but also reduce the affinity gap between the alleles. We applied this methodology to enhance a Plasmodium vivax epitope for multiple-allele binding. In vitro rate-binding assays showed that four engineered peptides were able to bind with improved affinity toward multiple human MHC II alleles. Moreover, we demonstrated that mice immunized with the peptides exhibited interferon-gamma cellular immune response. Overall, the method enables the engineering of peptides with improved binding properties that can be used for the generation of new immunotherapies.


Subject(s)
HLA-D Antigens , Molecular Dynamics Simulation , Alleles , Animals , Epitopes , HLA-D Antigens/genetics , Mice , Peptides
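
The "key idea" of accepting mutations that both improve the affinity and narrow the gap between alleles can be written as a simple acceptance test. This is a hypothetical sketch, not the PanMHC-PARCE rule: the function name, the use of the mean score as the affinity measure, the max-minus-min definition of the gap, and the convention that lower scores mean stronger binding are all assumptions made for illustration.

```python
def accept_mutation(scores_before, scores_after):
    """Accept a mutation only if binding improves on average AND the gap shrinks.

    Both arguments map an allele name to a binding score; lower is assumed
    (hypothetically) to mean stronger binding.
    """
    def mean(scores):
        return sum(scores.values()) / len(scores)

    def gap(scores):
        return max(scores.values()) - min(scores.values())

    improves_affinity = mean(scores_after) < mean(scores_before)
    reduces_gap = gap(scores_after) < gap(scores_before)
    return improves_affinity and reduces_gap


# Hypothetical scores for two alleles, before and after a single-point mutation.
before = {"allele_A": -5.0, "allele_B": -2.0}
balanced = {"allele_A": -5.5, "allele_B": -4.0}   # better on both alleles, smaller gap
one_sided = {"allele_A": -8.0, "allele_B": -2.0}  # better on average, larger gap
decisions = (accept_mutation(before, balanced), accept_mutation(before, one_sided))
```

The second candidate shows why the extra condition matters: optimizing the average alone would favour a peptide that binds one allele very well and leaves the other behind.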
12.
Methods Mol Biol ; 2405: 335-359, 2022.
Article in English | MEDLINE | ID: mdl-35298821

ABSTRACT

Computational peptide design is useful for therapeutics, diagnostics, and vaccine development. To select the most promising peptide candidates, the key is describing accurately the peptide-target interactions at the molecular level. We here review a computational peptide design protocol whose key feature is the use of all-atom explicit solvent molecular dynamics for describing the different peptide-target complexes explored during the optimization. We describe the milestones behind the development of this protocol, which is now implemented in an open-source code called PARCE. We provide a basic tutorial to run the code for an antibody fragment design example. Finally, we describe three additional applications of the method to design peptides for different targets, illustrating the broad scope of the proposed approach.


Subject(s)
Molecular Dynamics Simulation , Peptides , Peptides/chemistry , Solvents
13.
Proc Natl Acad Sci U S A ; 119(4), 2022 01 25.
Article in English | MEDLINE | ID: mdl-35058369
14.
J Phys Chem Lett ; 13(1): 183-189, 2022 Jan 13.
Article in English | MEDLINE | ID: mdl-34965118

ABSTRACT

By using advanced data analysis techniques, we characterize the shape of the voids surrounding model polymers of different sizes in water, observed in molecular dynamics simulations. We find that even when the model polymer is folded, the voids are extremely rough, with branches that can extend to over 1 nm away from the polymer. Water molecules in contact with the void retain close-to-bulk properties in terms of local structure. The branches disappear, and the voids start resembling the quasispherical shape predicted by dewetting theory only when they surround particles with a radius ∼1 nm, well above the size occupied by a folded hydrophobic polymer. Our results provide fresh insights into the microscopic origins of the vapor-like interfaces underlying dewetting and drying transitions.

15.
PNAS Nexus ; 1(2): pgac039, 2022 May.
Article in English | MEDLINE | ID: mdl-36713323

ABSTRACT

Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for the successful application of many statistical learning approaches. We introduce a statistical test that can assess the relative information retained when using 2 different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This ranking can in turn be used to identify the most informative distance measure and, therefore, the most informative set of features, out of a pool of candidates. To illustrate the general applicability of our approach, we show that it reproduces the known importance ranking of policy variables for Covid-19 control, and also identifies compact yet informative descriptors for atomic structures. We further provide initial evidence that the information asymmetry measured by the proposed test can be used to infer relationships of causality between the features of a dataset. The method is general and should be applicable to many branches of science.
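
A small sketch may help make the rank-based test concrete. It assumes the commonly used form of the Information Imbalance, Δ(A→B) = (2/N)·⟨r^B | r^A = 1⟩: for each point, find its nearest neighbour according to distance A and record that neighbour's rank according to distance B. The abstract does not state this formula, so treat it as an assumption. The function names are hypothetical, and plain Euclidean distances on chosen coordinate subsets stand in for the two distance measures.

```python
import math
import random


def neighbour_ranks(points, i, dims):
    """Rank every other point by distance from point i (1 = nearest),
    using only the coordinates listed in dims."""
    distances = []
    for j, q in enumerate(points):
        if j != i:
            d = math.sqrt(sum((points[i][k] - q[k]) ** 2 for k in dims))
            distances.append((d, j))
    distances.sort()
    return {j: rank for rank, (_, j) in enumerate(distances, start=1)}


def information_imbalance(points, dims_a, dims_b):
    """Delta(A -> B): close to 0 if A predicts B, close to 1 if it does not."""
    n = len(points)
    total = 0
    for i in range(n):
        ranks_a = neighbour_ranks(points, i, dims_a)
        ranks_b = neighbour_ranks(points, i, dims_b)
        nearest_in_a = min(ranks_a, key=ranks_a.get)  # the point with rank 1 in A
        total += ranks_b[nearest_in_a]
    return 2.0 * total / n ** 2


# Two independent coordinates: the pair (x, y) carries more information than x alone.
random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
full_to_x = information_imbalance(data, [0, 1], [0])
x_to_full = information_imbalance(data, [0], [0, 1])
```

Comparing the two directions is what separates the cases listed in the abstract: both values small suggests equivalent distances, both near 1 suggests independent ones, and a clear asymmetry (here `full_to_x` below `x_to_full`) suggests that one distance is more informative than the other.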

16.
Chem Rev ; 121(16): 9722-9758, 2021 08 25.
Article in English | MEDLINE | ID: mdl-33945269

ABSTRACT

Unsupervised learning is becoming an essential tool to analyze the increasingly large amounts of data produced by atomistic and molecular simulations, in material science, solid state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss feature representation of molecular systems and present state-of-the-art algorithms of dimensionality reduction, density estimation, clustering, and kinetic models. We divide our discussion into self-contained sections, each discussing a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used, or can be used, to analyze molecular simulation data.

17.
BMC Bioinformatics ; 22(1): 121, 2021 Mar 12.
Article in English | MEDLINE | ID: mdl-33711918

ABSTRACT

BACKGROUND: The identification of protein families is of outstanding practical importance for in silico protein annotation and is at the basis of several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time consuming and it may suffer from a bias generated from the hand-curation itself, which is often guided by the available experimental evidence. RESULTS: We introduce a procedure that aims to automatically identify putative protein families. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically-generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results.
CONCLUSIONS: The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.


Subject(s)
Proteins , Sequence Alignment , Amino Acid Sequence , Cluster Analysis , Databases, Protein , Humans , Proteins/genetics
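
For readers unfamiliar with Density Peak Clustering, the following is a minimal sketch of the basic algorithm on toy two-dimensional points. It is not the pipeline of the paper, which starts from pairwise sequence alignments rather than coordinates; the Euclidean distance, the cutoff `d_c`, the fixed number of clusters, and the function name are assumptions made for illustration.

```python
import math
import random


def density_peak_clustering(points, d_c, n_clusters):
    """Label points by following each one to its nearest neighbour of higher density."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Decreasing density; ties are broken by index so the order is well defined.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest point that precedes i in the order
    parent = [-1] * n   # index of that point
    delta[order[0]] = max(dist[order[0]])
    for pos in range(1, n):
        i = order[pos]
        j = min(order[:pos], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    # Cluster centres combine high density with a large distance from denser points.
    gamma = [rho[i] * delta[i] for i in range(n)]
    gamma[order[0]] = math.inf  # the densest point is always a centre
    centres = sorted(range(n), key=lambda i: gamma[i], reverse=True)[:n_clusters]
    labels = [-1] * n
    for label, i in enumerate(centres):
        labels[i] = label
    for i in order:  # parents are labelled before their children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


# Two well-separated Gaussian blobs of 40 points each.
random.seed(2)
blobs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(40)]
blobs += [(random.gauss(20, 1), random.gauss(20, 1)) for _ in range(40)]
labels = density_peak_clustering(blobs, d_c=1.5, n_clusters=2)
```

The algorithm needs only a matrix of pairwise distances, which is why it can be driven by alignment-derived dissimilarities instead of coordinates. In practice the centres are usually chosen by inspecting the density and distance values rather than by fixing their number in advance.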
18.
J Chem Phys ; 154(7): 074114, 2021 Feb 21.
Article in English | MEDLINE | ID: mdl-33607903

ABSTRACT

Computational protein design has emerged as a powerful tool capable of identifying sequences compatible with pre-defined protein structures. The sequence design protocols, implemented in the Rosetta suite, have become widely used in the protein engineering community. To understand the strengths and limitations of the Rosetta design framework, we tested several design protocols on two distinct folds (SH3-1 and Ubiquitin). The sequence optimization, when started from native structures and natural sequences or polyvaline sequences, converges to sequences that are not recognized as belonging to the fold family of the target protein by standard bioinformatic tools, such as BLAST and Hmmer. The sequences generated from both starting conditions (native and polyvaline) are instead very similar to each other and recognized by Hmmer as belonging to the same "family." This demonstrates the capability of Rosetta to converge to similar sequences, even when sampling from distinct starting conditions, but, on the other hand, shows intrinsic inaccuracy of the scoring function that drifts toward sequences that lack identifiable natural sequence signatures. To address this problem, we developed a protocol embedding Rosetta Design simulations in a genetic algorithm, in which the sequence search is biased to converge to sequences that exist in nature. This protocol allows us to obtain sequences that have recognizable natural sequence signatures and, experimentally, the designed proteins are biochemically well behaved and thermodynamically stable.


Subject(s)
Drug Design , Proteins/chemistry , Amino Acid Sequence , Models, Molecular , Protein Conformation , Protein Folding , Thermodynamics
19.
Proc Math Phys Eng Sci ; 477(2250): 20210019, 2021 Jun.
Article in English | MEDLINE | ID: mdl-35153562

ABSTRACT

We apply two independent data analysis methodologies to locate stable climate states in an intermediate complexity climate model and analyse their interplay. First, drawing from the theory of quasi-potentials, and viewing the state space as an energy landscape with valleys and mountain ridges, we infer the relative likelihood of the identified multistable climate states and investigate the most likely transition trajectories as well as the expected transition times between them. Second, harnessing techniques from data science, and specifically manifold learning, we characterize the data landscape of the simulation output to find climate states and basin boundaries within a fully agnostic and unsupervised framework. Both approaches show remarkable agreement, and reveal, apart from the well known warm and snowball earth states, a third intermediate stable state in one of the two versions of PLASIM, the climate model used in this study. The combination of our approaches allows us to identify how the negative feedback of ocean heat transport and entropy production via the hydrological cycle drastically change the topography of the dynamical landscape of Earth's climate.

20.
J Phys Chem Lett ; 12(1): 65-72, 2021 Jan 14.
Article in English | MEDLINE | ID: mdl-33306377

ABSTRACT

We analyzed a 100 µs MD trajectory of the SARS-CoV-2 main protease by a non-parametric data analysis approach which allows characterizing a free energy landscape as a simultaneous function of hundreds of variables. We identified several conformations that, when visited by the dynamics, are stable for several hundred nanoseconds. We explicitly characterize and describe these metastable states. In some of these configurations, the catalytic dyad is less accessible. Stabilizing them by a suitable binder could lead to an inhibition of the enzymatic activity. In our analysis we keep track of relevant contacts between residues which are selectively broken or formed in the states. Some of these contacts are formed by residues which are far from the catalytic dyad and are accessible to the solvent. Based on this analysis we propose some relevant contact patterns and three possible binding sites which could be targeted to achieve allosteric inhibition.


Subject(s)
COVID-19 , Molecular Dynamics Simulation , Protease Inhibitors/pharmacology , SARS-CoV-2/metabolism , Viral Proteases/chemistry , Viral Proteases/metabolism , Binding Sites , Humans , Models, Molecular , Protease Inhibitors/chemistry , Protein Binding , Protein Conformation