Results 1 - 20 of 85
1.
Nucleic Acids Res ; 52(W1): W256-W263, 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38783081

ABSTRACT

Recent progress in solving macromolecular structures and assemblies by cryogenic electron microscopy techniques enables sampling of their conformations in different states that are relevant to their biological function. Knowing the transition path between these conformations would provide new avenues for drug discovery. While the experimental study of transition paths is intrinsically difficult, in-silico methods can be used to generate an initial guess for those paths. The Elastic Network Model (ENM), along with a coarse-grained (CG) representation of the structures, is among the most popular models to explore such possible paths. Here we propose an update to our software platform MinActionPath that generates non-linear transition paths based on ENM and CG models, using action minimization to solve the equations of motion. The new website enables the study of large structures such as ribosomes or entire virus envelopes. It provides direct visualization of the trajectories along with quantitative analyses of their behaviors at http://dynstr.pasteur.fr/servers/minactionpath/minactionpath2_submission.


Subject(s)
Macromolecular Substances , Software , Macromolecular Substances/chemistry , Models, Molecular , Ribosomes/metabolism , Ribosomes/chemistry , Cryoelectron Microscopy , Internet
2.
Viruses ; 15(6), 2023 06 13.
Article in English | MEDLINE | ID: mdl-37376665

ABSTRACT

The current SARS-CoV-2 pandemic highlights our fragility when we are exposed to emergent viruses either directly or through zoonotic diseases. Fortunately, our knowledge of the biology of those viruses is improving. In particular, we have more and more structural information on virions, i.e., the infective form of a virus that includes its genomic material and surrounding protective capsid, and on their gene products. It is important to have methods that enable the analyses of structural information on such large macromolecular systems. We review some of those methods in this paper. We focus on understanding the geometry of virions and viral structural proteins, their dynamics, and their energetics, with the ambition that this understanding can help design antiviral agents. We discuss those methods in light of the specificities of those structures, mainly that they are huge. We focus on three of our own methods based on the alpha shape theory for computing geometry, normal mode analyses to study dynamics, and modified Poisson-Boltzmann theories to study the organization of ions and co-solvent and solvent molecules around biomacromolecules. The corresponding software has computing times that are compatible with the use of regular desktop computers. We show examples of their applications on some outer shells and structural proteins of the West Nile Virus.


Subject(s)
COVID-19 , Humans , SARS-CoV-2 , Capsid Proteins , Capsid , Solvents
3.
J Chem Inf Model ; 63(3): 973-985, 2023 02 13.
Article in English | MEDLINE | ID: mdl-36638318

ABSTRACT

Geometry is crucial in our efforts to comprehend the structures and dynamics of biomolecules. For example, volume, surface area, and integrated mean and Gaussian curvature of the union of balls representing a molecule are used to quantify its interactions with the water surrounding it in the morphometric implicit solvent models. The Alpha Shape theory provides an accurate and reliable method for computing these geometric measures. In this paper, we derive homogeneous formulas for the expressions of these measures and their derivatives with respect to the atomic coordinates, and we provide algorithms that implement them into a new software package, AlphaMol. The only variables in these formulas are the interatomic distances, making them insensitive to translations and rotations. AlphaMol includes a sequential algorithm and a parallel algorithm. In the parallel version, we partition the atoms of the molecule of interest into 3D rectangular blocks, using a kd-tree algorithm. We then apply the sequential algorithm of AlphaMol to each block, augmented by a buffer zone to account for atoms whose ball representations may partially cover the block. The current parallel version of AlphaMol leads to a 20-fold speed-up compared to an independent serial implementation when using 32 processors. For instance, it takes 31 s to compute the geometric measures and derivatives of each atom in a viral capsid with more than 26 million atoms on 32 Intel processors running at 2.7 GHz. The presence of the buffer zones, however, leads to redundant computations, which ultimately limit the impact of using multiple processors. AlphaMol is available as open-source software.


Subject(s)
Algorithms , Software , Solvents , Water
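
As a rough illustration of what "volume of a union of balls" means in the abstract above, the sketch below handles only the two-ball case by inclusion-exclusion, using the standard closed-form lens volume. It is not the Alpha Shape algorithm and not AlphaMol code; the function names are hypothetical. Note that the result depends on the coordinates only through the center-to-center distance, which is the invariance property the abstract mentions.

```python
import math

def ball_volume(r: float) -> float:
    """Volume of a single ball of radius r."""
    return 4.0 / 3.0 * math.pi * r ** 3

def lens_volume(r1: float, r2: float, d: float) -> float:
    """Volume of the intersection of two balls whose centers are d apart."""
    if d >= r1 + r2:            # disjoint (or touching) balls
        return 0.0
    if d <= abs(r1 - r2):       # one ball lies inside the other
        return ball_volume(min(r1, r2))
    return (math.pi * (r1 + r2 - d) ** 2
            * (d * d + 2.0 * d * (r1 + r2) - 3.0 * (r1 - r2) ** 2)
            / (12.0 * d))

def union_volume(c1, r1, c2, r2) -> float:
    """Inclusion-exclusion for two balls: V1 + V2 - V(intersection)."""
    d = math.dist(c1, c2)       # only the interatomic distance enters
    return ball_volume(r1) + ball_volume(r2) - lens_volume(r1, r2, d)
```

For more than two overlapping balls, higher-order intersection terms appear, which is where a systematic framework such as the alpha shape becomes necessary.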
4.
J Chem Phys ; 157(5): 054105, 2022 Aug 07.
Article in English | MEDLINE | ID: mdl-35933198

ABSTRACT

We present a new method to sample conditioned trajectories of a system evolving under Langevin dynamics based on Brownian bridges. The trajectories are conditioned to end at a certain point (or in a certain region) in space. The bridge equations can be recast exactly in the form of a non-linear stochastic integro-differential equation. This equation can be very well approximated when the trajectories are closely bundled together in space, i.e., at low temperature, or for transition paths. The approximate equation can be solved iteratively using a fixed point method. We discuss how to choose the initial trajectories and show some examples of the performance of this method on some simple problems. This method allows us to generate conditioned trajectories with a high accuracy.

5.
J Comput Chem ; 42(23): 1643-1661, 2021 09 05.
Article in English | MEDLINE | ID: mdl-34117647

ABSTRACT

Coarse-grained normal mode analyses of protein dynamics rely on the idea that the geometry of a protein structure contains enough information for computing its fluctuations around its equilibrium conformation. This geometry is captured in the form of an elastic network (EN), namely a network of edges between its residues. The normal modes of a protein are then identified with the normal modes of its EN. Different approaches have been proposed to construct ENs, focusing on the choice of the edges that they are comprised of, and on their parameterizations by the force constants associated with those edges. Here we propose new tools to guide choices on these two facets of EN. We study first different geometric models for ENs. We compare cutoff-based ENs, whose edges have lengths that are smaller than a cutoff distance, with Delaunay-based ENs and find that the latter provide better representations of the geometry of protein structures. We then derive an analytical method for the parameterization of the EN such that its dynamics leads to atomic fluctuations that agree with experimental B-factors. To limit overfitting, we attach a parameter referred to as flexibility constant to each atom instead of to each edge in the EN. The parameterization is expressed as a non-linear optimization problem whose parameters describe both rigid-body and internal motions. We show that this parameterization leads to improved ENs, whose dynamics mimic MD simulations better than ENs with uniform force constants, and reduces the number of normal modes needed to reproduce functional conformational changes.


Subject(s)
Molecular Dynamics Simulation , Proteins/chemistry , Protein Conformation
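
The cutoff-based elastic network described in the abstract above can be sketched in a few lines: connect every pair of sites closer than a cutoff, then assemble the connectivity (Kirchhoff) matrix of the resulting graph. This is a minimal sketch with uniform force constants; the coordinates and the 5.0 cutoff are made up for illustration, and neither the Delaunay-based construction nor the per-atom flexibility constants of the paper are reproduced here.

```python
import math

def cutoff_network(coords, cutoff):
    """Return edges (i, j), i < j, whose length is below the cutoff."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) < cutoff]

def kirchhoff(n, edges):
    """Connectivity (graph Laplacian) matrix: degree on the diagonal, -1 per edge."""
    gamma = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        gamma[i][j] = gamma[j][i] = -1.0
        gamma[i][i] += 1.0
        gamma[j][j] += 1.0
    return gamma

# Hypothetical residue positions (angstroms), invented for this example.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]
edges = cutoff_network(coords, cutoff=5.0)
gamma = kirchhoff(len(coords), edges)
```

Each row of the matrix sums to zero, reflecting the fact that a rigid translation of the whole network costs no energy.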
6.
Phys Rev E ; 103(4-1): 042101, 2021 Apr.
Article in English | MEDLINE | ID: mdl-34005932

ABSTRACT

The linear assignment problem is a fundamental problem in combinatorial optimization with a wide range of applications, from operational research to data science. It consists of assigning "agents" to "tasks" on a one-to-one basis, while minimizing the total cost associated with the assignment. While many exact algorithms have been developed to identify such an optimal assignment, most of these methods are computationally prohibitive for large problems. In this paper, we propose an alternative approach to solving the assignment problem using techniques adapted from statistical physics. Our first contribution is to fully describe this formalism, including all the proofs of its main claims. In particular we derive a strongly concave effective free-energy function that captures the constraints of the assignment problem at a finite temperature. We prove that this free energy decreases monotonically as a function of β, the inverse of temperature, to the optimal assignment cost, providing a robust framework for temperature annealing. We prove also that for large enough β values the exact solution to the generic assignment problem can be derived using simple roundoff to the nearest integer of the elements of the computed assignment matrix. Our second contribution is to derive a provably convergent method to handle degenerate assignment problems, with a characterization of those problems. We describe computer implementations of our framework that are optimized for parallel architectures, one based on CPU, the other based on GPU. We show that the latter enables solving large assignment problems (with sizes on the order of a few tens of thousands) in computing times on the order of minutes.
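
To make the problem statement concrete, the sketch below solves a tiny assignment instance by exhaustive enumeration of all one-to-one assignments. This is deliberately not the statistical-physics method of the paper: enumeration costs O(n!) and is usable only for very small n, which is precisely why approximate or annealing-based approaches are of interest. The cost matrix is invented for illustration.

```python
from itertools import permutations

def brute_force_assignment(cost):
    """Return (assignment, total_cost) minimizing sum of cost[i][assignment[i]]."""
    n = len(cost)

    def total(p):
        return sum(cost[i][p[i]] for i in range(n))

    best = min(permutations(range(n)), key=total)
    return list(best), total(best)

# cost[i][j] is the cost of giving task j to agent i (made-up values).
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
assignment, total_cost = brute_force_assignment(cost)
# assignment[i] is the task given to agent i.
```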

7.
J Phys Chem B ; 125(19): 5052-5067, 2021 05 20.
Article in English | MEDLINE | ID: mdl-33973782

ABSTRACT

We present an extension of the Poisson-Boltzmann model in which the solute of interest is immersed in an assembly of self-orienting Langevin water dipoles, anions, cations, and hydrophobic molecules, all of variable densities. Interactions between charges are controlled by electrostatics, while hydrophobic interactions are modeled with a Yukawa potential. We impose steric constraints by assuming that the system is represented on a cubic lattice. We also assume incompressibility; i.e., all sites of the lattice are occupied. This model, which we refer to as the Hydrophobic Dipolar Poisson-Boltzmann Langevin (HDPBL) model, leads to a system of two equations whose solutions give the water dipole, salt, and hydrophobic molecule densities, all of them in the presence of the others in a self-consistent way. We use those to study the organization of the ions, cosolvent, and solvent molecules around proteins. In particular, peaks of densities are expected to reveal, simultaneously, the presence of compatible binding sites of different kinds on a protein. We have tested and validated the ability of HDPBL to detect pockets in proteins that bind to hydrophobic ligands, polar ligands, and charged small probes as well as to characterize the binding sites of lipids for membrane proteins.


Subject(s)
Proteins , Binding Sites , Hydrophobic and Hydrophilic Interactions , Solvents , Static Electricity
8.
Phys Rev E ; 103(1-1): 012113, 2021 Jan.
Article in English | MEDLINE | ID: mdl-33601576

ABSTRACT

Optimal transport (OT) has become a discipline by itself that offers solutions to a wide range of theoretical problems in probability and mathematics with applications in several applied fields such as imaging sciences, machine learning, and in data sciences in general. The traditional OT problem suffers from a severe limitation: its balance condition imposes that the two distributions to be compared be normalized and have the same total mass. However, it is important for many applications to be able to relax this constraint and allow for mass creation and/or destruction. This is true, for example, in all problems requiring partial matching. In this paper, we propose an approach to solving a generalized version of the OT problem, which we refer to as the discrete variable-mass optimal-transport (VMOT) problem, using techniques adapted from statistical physics. Our first contribution is to fully describe this formalism, including all the proofs of its main claims. In particular, we derive a strongly concave effective free-energy function that captures the constraints of the VMOT problem at a finite temperature. From its maximum we derive a weak distance (i.e., a divergence) between possibly unbalanced distribution functions. The temperature-dependent OT distance decreases monotonically to the standard variable-mass OT distance, providing a robust framework for temperature annealing. Our second contribution is to show that the implementation of this formalism has the same properties as the regularized OT algorithms in time complexity, making it a competitive approach to solving the VMOT problem. We illustrate applications of the framework to the problem of partial two- and three-dimensional shape-matching problems.

9.
Phys Rev Lett ; 123(4): 040603, 2019 Jul 26.
Article in English | MEDLINE | ID: mdl-31491256

ABSTRACT

Originally defined for the optimal allocation of resources, optimal transport (OT) has found many theoretical and practical applications in multiple domains of science and physics. In this Letter we develop a new method for solving the discrete version of this problem using techniques derived from statistical physics. We derive a strongly concave free energy function that captures the constraints of the OT problem at a finite temperature. Its maximum defines an optimal transport plan, or registration between the two discrete probability measures that are compared, as well as a pseudodistance between those measures that satisfies the triangular inequalities. The computation of this pseudodistance is fast and numerically stable. The temperature dependent OT pseudodistance is shown to decrease monotonically with respect to the inverse of the temperature and to converge to the standard OT distance at zero temperature, providing a robust framework for temperature annealing. We illustrate applications of this framework to the problem of image comparison.

10.
Phys Rev E ; 100(1-1): 013310, 2019 Jul.
Article in English | MEDLINE | ID: mdl-31499816

ABSTRACT

Optimal transport (OT) has become a discipline by itself that offers solutions to a wide range of theoretical problems in probability and mathematics. Despite its appealing theoretical properties, solving the OT problem involves the resolution of a linear program whose computational cost can quickly become prohibitive whenever the size of the problem exceeds a few hundred points. The recent introduction of entropy regularization, however, has led to the development of fast algorithms for solving an approximate OT problem. The successes of those algorithms have resulted in a popularization of the applications of OT in several applied fields such as imaging sciences and machine learning, and in data sciences in general. Problems remain, however, as to the numerical convergence of those regularized approximations towards the actual OT solution. In addition, the physical meaning of this regularization is unclear. In this paper, we propose an approach to solving the discrete OT problem using techniques adapted from statistical physics. Our first contribution is to fully describe this formalism, including all the proofs of its main claims. In particular we derive a strongly concave effective free energy function that captures the constraints of the optimal transport problem at a finite temperature. Its maximum defines a pseudo distance between the two sets of weighted points that are compared, which satisfies the triangular inequalities. The temperature dependent OT pseudo distance decreases monotonically to the standard OT distance, providing a robust framework for temperature annealing. Our second contribution is to show that the implementation of this formalism has the same properties as the regularized OT algorithms in time complexity, making it a competitive approach to solving the OT problem. We illustrate applications of the framework to the problem of protein fold recognition based on sequence information only.
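
For readers unfamiliar with the entropy-regularized algorithms that the abstract above compares against, here is a minimal sketch of the standard Sinkhorn matrix-scaling iteration. It is the regularized baseline, not the free-energy formalism proposed in the paper; the weights, cost matrix, and the value of `epsilon` are made up for illustration.

```python
import math

def sinkhorn(a, b, cost, epsilon, n_iter=500):
    """Entropy-regularized OT plan between discrete weights a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel: small epsilon approaches unregularized OT.
    K = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

a = [0.5, 0.5]                      # source weights (sum to 1)
b = [0.25, 0.75]                    # target weights (sum to 1)
cost = [[0.0, 1.0], [1.0, 0.0]]     # pairwise transport costs
plan = sinkhorn(a, b, cost, epsilon=0.5)
```

The returned plan has row sums equal to `a` and column sums equal to `b` up to numerical tolerance. For very small `epsilon` the kernel entries underflow, so practical implementations typically work in the log domain.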

11.
PLoS One ; 14(6): e0217838, 2019.
Article in English | MEDLINE | ID: mdl-31170208

ABSTRACT

Clustering large and complex data sets whose partitions may adopt arbitrary shapes remains a difficult challenge. Part of this challenge comes from the difficulty in defining a similarity measure between the data points that captures the underlying geometry of those data points. In this paper, we propose an algorithm, DCG++, that generates such a similarity measure that is data-driven and ultrametric. DCG++ uses Markov Chain Random Walks to capture the intrinsic geometry of data, scans possible scales, and combines all this information using a simple procedure that is shown to generate an ultrametric. We validate the effectiveness of this similarity measure within the context of clustering on synthetic data with complex geometry, on a real-world data set containing segmented audio records of frog calls described by mel-frequency cepstral coefficients, as well as on an image segmentation problem. The experimental results show a significant improvement in performance with the DCG-based ultrametric compared to using an empirical distance measure.


Subject(s)
Algorithms , Pattern Recognition, Automated , Animals , Anura/physiology , Cluster Analysis , Databases as Topic , Image Interpretation, Computer-Assisted , ROC Curve , Sound , Vocalization, Animal
12.
Prog Biophys Mol Biol ; 143: 20-37, 2019 05.
Article in English | MEDLINE | ID: mdl-30273615

ABSTRACT

While structural data on viruses are more and more common, information on their dynamics is much harder to obtain as those viruses form very large molecular complexes. In this paper, we propose a new method for computing the coarse-grained normal modes of such supra-molecules, NormalGo. A new formalism is developed to represent the Hessian of a quadratic potential using tensor products. This formalism is applied to the Tirion elastic potential, as well as to a Go-like potential. When combined with a fast method for computing a select set of eigenpairs of the Hessian, this new formalism enables the computation of thousands of normal modes of a full viral shell with more than one hundred thousand atoms in less than 2 h on a standard desktop computer. We then compare the two coarse-grained potentials. We show that, despite significant differences in their formulations, the Tirion and the Go-like potentials capture very similar dynamical characteristics of the molecule under study. However, we find that the Go-like potential should be preferred as it leads to fewer local deformations in the structure of the molecule during normal mode dynamics. Finally, we use NormalGo to characterize the structural transitions that occur when FAB fragments bind to the icosahedral outer shell of serotype 3 of the Dengue virus. We have identified residues at the surface of the outer shell that are important for the transition between the FAB-free and FAB-bound conformations, and therefore potentially useful for the design of antibodies to Dengue viruses.


Subject(s)
Dengue Virus/chemistry , Dengue Virus/metabolism , Models, Molecular , Molecular Conformation , Virion/chemistry , Virion/metabolism
13.
Proc Natl Acad Sci U S A ; 115(52): E12172-E12181, 2018 12 26.
Article in English | MEDLINE | ID: mdl-30541892

ABSTRACT

The pentameric ligand-gated ion channel (pLGIC) from Gloeobacter violaceus (GLIC) has provided insightful structure-function views on the permeation process and the allosteric regulation of the pLGICs family. However, GLIC is activated by pH instead of a neurotransmitter and a clear picture for the gating transition driven by protons is still lacking. We used an electrostatics-based (finite difference Poisson-Boltzmann/Debye-Hückel) method to predict the acidities of all aspartic and glutamic residues in GLIC, both in its active and closed-channel states. Those residues with a predicted pKa close to the experimental pH50 were individually replaced by alanine and the resulting variant receptors were titrated by ATR/FTIR spectroscopy. E35, located in front of loop F far away from the orthosteric site, appears as the key proton sensor with a measured individual pKa at 5.8. In the GLIC open conformation, E35 is connected through a water-mediated hydrogen-bond network first to the highly conserved electrostatic triad R192-D122-D32 and then to Y197-Y119-K248, both located at the extracellular domain-transmembrane domain interface. The second triad controls a cluster of hydrophobic side chains from the M2-M3 loop that is remodeled during the gating transition. We solved 12 crystal structures of GLIC mutants, 6 of them being trapped in an agonist-bound but nonconductive conformation. Combined with previous data, this reveals two branches of a continuous network originating from E35 that reach, independently, the middle transmembrane region of two adjacent subunits. We conclude that GLIC's gating proceeds by making use of loop F, already known as an allosteric site in other pLGICs, instead of the classic orthosteric site.


Subject(s)
Bacterial Proteins/chemistry , Bacterial Proteins/metabolism , Cyanobacteria/metabolism , Ligand-Gated Ion Channels/chemistry , Ligand-Gated Ion Channels/metabolism , Bacterial Proteins/genetics , Cyanobacteria/chemistry , Cyanobacteria/genetics , Kinetics , Ligand-Gated Ion Channels/genetics , Models, Molecular , Protein Domains , Protons , Static Electricity
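
The link between a residue's pKa and its protonation state at a given pH, which underlies the comparison of predicted pKa values with the experimental pH50 in the abstract above, can be illustrated with the Henderson-Hasselbalch relation for a single, independent titratable site. This is a textbook simplification, not the finite-difference Poisson-Boltzmann/Debye-Hückel procedure used in the paper; the only value taken from the abstract is the measured pKa of 5.8 for E35.

```python
def protonated_fraction(pH: float, pKa: float) -> float:
    """Fraction of an isolated acidic site that carries its proton at this pH."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

E35_PKA = 5.8  # measured individual pKa reported in the abstract

half = protonated_fraction(5.8, E35_PKA)     # pH equal to pKa
acidic = protonated_fraction(4.8, E35_PKA)   # one pH unit below pKa
neutral = protonated_fraction(7.0, E35_PKA)  # near-neutral pH
```

At pH equal to the pKa the site is half protonated, and the fraction rises toward 1 as the pH drops. In a real protein, coupled sites shift each other's titration curves, so this single-site picture is only a first approximation.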
14.
PLoS One ; 13(11): e0207029, 2018.
Article in English | MEDLINE | ID: mdl-30462682

ABSTRACT

RNA SHAPE experiments have become important and successful sources of information for RNA structure prediction. In such experiments, chemical reagents are used to probe RNA backbone flexibility at the nucleotide level, which in turn provides information on base pairing and therefore secondary structure. Little is known, however, about the statistics of such SHAPE data. In this work, we explore different representations of noise in SHAPE data and propose a statistically sound framework for extracting reliable reactivity information from multiple SHAPE replicates. Our analyses of RNA SHAPE experiments underscore that a normal noise model is not adequate to represent their data. We propose instead a log-normal representation of noise and discuss its relevance. Under this assumption, we observe that processing simulated SHAPE data by directly averaging different replicates leads to bias. Such bias can be reduced by analyzing the data following a log transformation, either by log-averaging or Kalman filtering. Application of Kalman filtering has the additional advantage that a prior on the nucleotide reactivities can be introduced. We show that the performance of Kalman filtering is then directly dependent on the quality of that prior. We conclude the paper with guidelines on signal processing of RNA SHAPE data.


Subject(s)
Databases, Genetic , RNA/chemistry , Algorithms , Nucleic Acid Conformation
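
The bias that the abstract above attributes to direct averaging under multiplicative (log-normal) noise can be seen in a small deterministic example: two replicates that deviate from the true value by the same factor, once upward and once downward. The arithmetic mean overshoots, while averaging in log space recovers the true value. The numbers are invented for illustration, and the Kalman filtering step of the paper is not shown.

```python
import math
from statistics import fmean

def log_average(values):
    """Average in log space (geometric mean); requires strictly positive values."""
    return math.exp(fmean([math.log(v) for v in values]))

true_reactivity = 2.0
# Symmetric noise in log space: same factor up and down.
replicates = [true_reactivity * math.exp(0.5), true_reactivity * math.exp(-0.5)]

direct = fmean(replicates)            # arithmetic mean, biased upward
logavg = log_average(replicates)      # recovers the true value here
```

Because measured reactivities can be zero or negative after background subtraction, a practical pipeline needs a policy for such values before taking logarithms; that choice is left open here.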
15.
F1000Res ; 7, 2018.
Article in English | MEDLINE | ID: mdl-30079234

ABSTRACT

Connecting the dots among the amino acid sequence of a protein, its structure, and its function remains a central theme in molecular biology, as it would have many applications in the treatment of illnesses related to misfolding or protein instability. As a result of high-throughput sequencing methods, biologists currently live in a protein sequence-rich world. However, our knowledge of protein structure based on experimental data remains comparatively limited. As a consequence, protein structure prediction has established itself as a very active field of research to fill in this gap. This field, once thought to be reserved for theoretical biophysicists, is constantly reinventing itself, borrowing ideas informed by an ever-increasing assembly of scientific domains, from biology, chemistry, (statistical) physics, mathematics, computer science, statistics, bioinformatics, and more recently data sciences. We review the recent progress arising from this integration of knowledge, from the development of specific computer architecture to allow for longer timescales in physics-based simulations of protein folding to the recent advances in predicting contacts in proteins based on detection of coevolution using very large data sets of aligned protein sequences.

16.
J Chem Theory Comput ; 14(7): 3903-3919, 2018 Jul 10.
Article in English | MEDLINE | ID: mdl-29874072

ABSTRACT

Computational methods ranging from all-atom molecular dynamics simulations to coarse-grained normal-mode analyses based on simplified elastic networks provide a general framework for studying molecular dynamics. Despite recent successes in analyzing very large systems with up to 100 million atoms, those methods are currently limited to studying small- to medium-size molecular systems when used on standard desktop computers, because of computational limitations. The hope to circumvent those limitations rests on the development of improved algorithms with novel implementations that mitigate their computationally challenging parts. In this paper, we have addressed the computational challenges associated with computing coarse-grained normal modes of very large molecular systems, focusing on the calculation of the eigenpairs of the Hessian of the potential energy function from which the normal modes are computed. We have described and implemented a new method for handling this Hessian based on tensor products. This new formulation is shown to reduce space requirements and to improve the parallelization of its implementation. We have implemented and tested four different methods for computing some eigenpairs of the Hessian, namely, the standard, robust Lanczos method, a simple modification of this method based on polynomial filtering, a functional-based method recently proposed for normal-mode analyses of viruses, and a block Chebyshev-Davidson method with inner-outer restart. We have shown that the latter provides the most efficient implementation when computing eigenpairs of extremely large Hessian matrices corresponding to large viral capsids. We have also shown that, for those viral capsids, a large number of eigenpairs is actually needed, on the order of thousands, noticing however that this large number is still a small fraction of the total number of possible eigenpairs (a few percent).

17.
PLoS Comput Biol ; 14(3): e1006039, 2018 03.
Article in English | MEDLINE | ID: mdl-29596417

ABSTRACT

Quantitative reasoning and techniques are increasingly ubiquitous across the life sciences. However, new graduate researchers with a biology background are often not equipped with the skills that are required to utilize such techniques correctly and efficiently. In parallel, there are increasing numbers of engineers, mathematicians, and physical scientists interested in studying problems in biology with only basic knowledge of this field. Students from such varied backgrounds can struggle to engage proactively together to tackle problems in biology. There is therefore a need to establish bridges between those disciplines. It is our proposal that the beginning of graduate school is the appropriate time to initiate those bridges through an interdisciplinary short course. We have instigated an intensive 10-day course that brought together new graduate students in the life sciences from across departments within the National University of Singapore. The course aimed at introducing biological problems as well as some of the quantitative approaches commonly used when tackling those problems. We have run the course for three years with over 100 students attending. Building on this experience, we share 11 quick tips on how to run such an effective, interdisciplinary short course for new graduate students in the biosciences.


Subject(s)
Computational Biology/education , Computational Biology/methods , Biological Science Disciplines/education , Biology/education , Curriculum , Education, Graduate/methods , Engineering/education , Humans , Interdisciplinary Studies , Students
18.
Molecules ; 24(1), 2018 Dec 28.
Article in English | MEDLINE | ID: mdl-30597916

ABSTRACT

Residues in proteins that are in close spatial proximity are more prone to covary as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty-dimensional encoding with each amino acid represented with a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is less than the performance obtained with the 20-dimensional binary encoding. We highlight also the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.


Subject(s)
Amino Acid Sequence , Amino Acids/chemistry , Models, Statistical , Proteins/chemistry , Sequence Alignment , Algorithms , Normal Distribution
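
The twenty-dimensional binary encoding described in the abstract above (one along the amino acid's own dimension, zero elsewhere) can be sketched as follows. The alphabetical ordering of the one-letter codes is an arbitrary choice made for this example, and alignment gaps or non-standard residues are not handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues; order is arbitrary
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(residue: str) -> list:
    """Return a 20-dimensional vector with a single 1 at the residue's index."""
    vec = [0] * len(AMINO_ACIDS)
    vec[INDEX[residue]] = 1   # raises KeyError for gaps or unknown symbols
    return vec

def encode_sequence(seq: str) -> list:
    """Encode a sequence as a list of one-hot vectors, one per position."""
    return [one_hot(r) for r in seq]

encoded = encode_sequence("ACD")
```

An aligned sequence of length L thus becomes a vector of dimension 20 L, which is the kind of numerical representation a multivariate Gaussian model can be fitted to.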
19.
J Chem Phys ; 147(15): 152703, 2017 Oct 21.
Article in English | MEDLINE | ID: mdl-29055326

ABSTRACT

We propose a novel stochastic method to generate Brownian paths conditioned to start at an initial point and end at a given final point during a fixed time tf under a given potential U(x). These paths are sampled with a probability given by the overdamped Langevin dynamics. We show that these paths can be exactly generated by a local stochastic partial differential equation. This equation cannot be solved in general but we present several approximations that are valid either in the low temperature regime or in the presence of barrier crossing. We show that this method warrants the generation of statistically independent transition paths. It is computationally very efficient. We illustrate the method first on two simple potentials, the two-dimensional Mueller potential and the Mexican hat potential, and then on the multi-dimensional problem of conformational transitions in proteins using the "Mixed Elastic Network Model" as a benchmark.
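
For the special case of no potential, U(x) = 0, the conditioned dynamics reduce to the classical Brownian bridge, whose extra drift (x_f - x)/(t_f - t) is known in closed form. The sketch below integrates that one-dimensional case with a simple Euler-Maruyama scheme; it is only meant to illustrate what "conditioned to end at a given final point" means. The general case with a potential requires the equations and approximations of the paper, which are not reproduced here, and the function name and parameters are hypothetical.

```python
import math
import random

def free_brownian_bridge(x0, xf, tf, n_steps, diffusion=1.0, seed=0):
    """Euler-Maruyama path from x0 at t=0 that is steered toward xf at t=tf."""
    rng = random.Random(seed)
    dt = tf / n_steps
    sigma = math.sqrt(2.0 * diffusion * dt)   # noise amplitude per step
    x = x0
    path = [x]
    for k in range(n_steps):
        t = k * dt
        # Bridge drift: grows as t approaches tf, pulling the path to xf.
        drift = (xf - x) / (tf - t)
        x = x + drift * dt + sigma * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

path = free_brownian_bridge(x0=-1.0, xf=1.0, tf=1.0, n_steps=1000)
```

With this discretization the final point differs from x_f only by the noise of the last step, so the end-point error shrinks as the time step is reduced.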

20.
BMC Bioinformatics ; 18(1): 378, 2017 Aug 25.
Article in English | MEDLINE | ID: mdl-28841820

ABSTRACT

BACKGROUND: Alignment-free methods for comparing protein sequences have proved to be viable alternatives to approaches that first rely on an alignment of the sequences to be compared. Much work, however, needs to be done before those methods provide reliable fold recognition for proteins whose sequences share little similarity. We have recently proposed an alignment-free method based on the concept of string kernels, SeqKernel (Nojoomi and Koehl, BMC Bioinformatics, 2017, 18:137). In this previous study, we have shown that while SeqKernel performs better than standard alignment-based methods, its applications are potentially limited, because of biases due mostly to sequence length effects. METHODS: In this study, we propose improvements to SeqKernel that follow two directions. First, we developed a weighted version of the kernel, WSeqKernel. Second, we expand the concept of string kernels into a novel framework for deriving information on amino acids from protein sequences. RESULTS: Using a dataset that only contains remote homologs, we have shown that WSeqKernel performs remarkably well in fold recognition experiments. We have shown that with the appropriate weighting scheme, we can remove the length effects on the kernel values. WSeqKernel, just like any alignment-based sequence comparison method, depends on a substitution matrix. We have shown that this matrix can be optimized so that sequence similarity scores correlate well with structure similarity scores. Starting from no information on amino acid similarity, we have shown that we can derive a scoring matrix that echoes the physico-chemical properties of amino acids. CONCLUSION: We have made progress in characterizing and parametrizing string kernels as alignment-free methods for comparing protein sequences, and we have shown that they provide a framework for extracting sequence information from structure.


Subject(s)
Algorithms , Proteins/chemistry , Amino Acids/chemistry , Area Under Curve , Principal Component Analysis , Protein Folding , Proteins/metabolism , ROC Curve
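
As a minimal illustration of an alignment-free string kernel, the sketch below implements the simple k-mer spectrum kernel with cosine normalization. It is a stand-in chosen for brevity, not SeqKernel or WSeqKernel: those rely on a substitution matrix and a specific weighting scheme that are not reproduced here. Cosine normalization is one common way to reduce the dependence of kernel values on sequence length.

```python
import math
from collections import Counter

def kmer_counts(seq: str, k: int) -> Counter:
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s: str, t: str, k: int = 2) -> int:
    """Inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[w] * ct[w] for w in cs if w in ct)

def normalized_kernel(s: str, t: str, k: int = 2) -> float:
    """Cosine-normalized kernel in [0, 1]; both sequences need length >= k."""
    return spectrum_kernel(s, t, k) / math.sqrt(
        spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))

similarity = normalized_kernel("ACDEFG", "ACDEFGHIK")
```

No alignment is computed at any point: the two sequences are compared only through the short substrings they share.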