Results 1 - 13 of 13
1.
J Chem Phys ; 152(8): 084113, 2020 Feb 28.
Article in English | MEDLINE | ID: mdl-32113352

ABSTRACT

The evaluation of electrostatic energy for a set of point charges in a periodic lattice is a computationally expensive part of molecular dynamics simulations (and other applications) because of the long-range nature of the Coulomb interaction. A standard approach is to decompose the Coulomb potential into a near part, typically evaluated by direct summation up to a cutoff radius, and a far part, typically evaluated in Fourier space. In practice, all decomposition approaches involve approximations, such as cutting off the near-part direct sum, but it may be possible to find new decompositions with improved trade-offs between accuracy and performance. Here, we present the u-series, a new decomposition of the Coulomb potential that is more accurate than the standard (Ewald) decomposition for a given amount of computational effort and achieves the same accuracy as the Ewald decomposition with approximately half the computational effort. These improvements, which we demonstrate numerically using a lipid membrane system, arise because the u-series is smooth on the entire real axis and exact up to the cutoff radius. Additional performance improvements over the Ewald decomposition may be possible in certain situations because the far part of the u-series is a sum of Gaussians and can thus be evaluated using algorithms that require a separable convolution kernel; we describe one such algorithm that reduces communication latency at the expense of communication bandwidth and computation, a trade-off that may be advantageous on modern massively parallel supercomputers.
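Illustrative sketch (not from the paper): the standard Ewald near/far split of the Coulomb kernel, the baseline decomposition the u-series improves on. The splitting parameter beta and the sample radii are arbitrary choices for illustration, and the u-series' sum-of-Gaussians far part is not reproduced here.

```python
import numpy as np
from scipy.special import erf, erfc

def ewald_split(r, beta=0.35):
    """Split 1/r into a short-ranged 'near' part (direct-summed up to a cutoff)
    and a smooth 'far' part (evaluated in Fourier space in practice)."""
    near = erfc(beta * r) / r
    far = erf(beta * r) / r
    return near, far

r = np.linspace(0.5, 12.0, 25)
near, far = ewald_split(r)
assert np.allclose(near + far, 1.0 / r)  # the decomposition is exact at every r
```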

2.
J Chem Phys ; 139(16): 164106, 2013 Oct 28.
Article in English | MEDLINE | ID: mdl-24182003

ABSTRACT

In molecular dynamics simulations, control over temperature and pressure is typically achieved by augmenting the original system with additional dynamical variables to create a thermostat and a barostat, respectively. These variables generally evolve on timescales much longer than those of particle motion, but typical integrator implementations update the additional variables along with the particle positions and momenta at each time step. We present a framework that replaces the traditional integration procedure with separate barostat, thermostat, and Newtonian particle motion updates, allowing thermostat and barostat updates to be applied infrequently. Such infrequent updates provide a particularly substantial performance advantage for simulations parallelized across many computer processors, because thermostat and barostat updates typically require communication among all processors. Infrequent updates can also improve accuracy by alleviating certain sources of error associated with limited-precision arithmetic. In addition, separating the barostat, thermostat, and particle motion update steps reduces certain truncation errors, bringing the time-average pressure closer to its target value. Finally, this framework, which we have implemented on both general-purpose and special-purpose hardware, reduces software complexity and improves software modularity.
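A minimal sketch of the idea of applying thermostat updates only every few steps, assuming a velocity-Verlet particle update and a crude velocity-rescaling stand-in for a real thermostat; the names (forces_fn, thermostat_interval) are illustrative, this is not the paper's implementation, and the barostat step is omitted.

```python
import numpy as np

def run_md(positions, velocities, masses, forces_fn, dt, n_steps,
           target_T, kB=1.0, thermostat_interval=10):
    """Velocity-Verlet integration with an infrequent thermostat update."""
    f = forces_fn(positions)
    for step in range(n_steps):
        velocities += 0.5 * dt * f / masses[:, None]   # half kick
        positions += dt * velocities                   # drift
        f = forces_fn(positions)
        velocities += 0.5 * dt * f / masses[:, None]   # half kick
        if (step + 1) % thermostat_interval == 0:      # infrequent update
            kin = 0.5 * np.sum(masses[:, None] * velocities**2)
            T_inst = 2.0 * kin / (velocities.size * kB)
            velocities *= np.sqrt(target_T / T_inst)   # crude rescaling stand-in
    return positions, velocities
```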


Subject(s)
Molecular Dynamics Simulation , Pressure , Temperature , Artifacts
3.
J Chem Theory Comput ; 6(7): 2045-58, 2010 Jul 13.
Article in English | MEDLINE | ID: mdl-26615934

ABSTRACT

Since the behavior of biomolecules can be sensitive to temperature, the ability to accurately calculate and control the temperature in molecular dynamics (MD) simulations is important. Standard analysis of equilibrium MD simulations (even constant-energy simulations with negligible long-term energy drift) often yields different calculated temperatures for different motions, however, in apparent violation of the statistical mechanical principle of equipartition of energy. Although such analysis provides a valuable warning that other simulation artifacts may exist, it leaves the actual value of the temperature uncertain. We observe that Tolman's generalized equipartition theorem should hold for long stable simulations performed using velocity-Verlet or other symplectic integrators, because the simulated trajectory is thought to sample almost exactly from a continuous trajectory generated by a shadow Hamiltonian. From this we conclude that all motions should share a single simulation temperature, and we provide a new temperature estimator that we test numerically in simulations of a diatomic fluid and of a solvated protein. Apparent temperature variations between different motions observed using standard estimators do indeed disappear when using the new estimator. We use our estimator to better understand how thermostats and barostats can exacerbate integration errors. In particular, we find that with large (albeit widely used) time steps, the common practice of using two thermostats to remedy so-called hot solvent-cold solute problems can have the counterintuitive effect of causing temperature imbalances. Our results, moreover, highlight the utility of multiple-time step integrators for accurate and efficient simulation.
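For reference, a sketch of the conventional kinetic-energy temperature estimator based on equipartition, T = 2K/(Ndof kB); the paper's shadow-Hamiltonian-motivated estimator is not reproduced here, and the handling of constraints and center-of-mass motion is simplified.

```python
import numpy as np

def kinetic_temperature(velocities, masses, kB=0.0019872041):
    """Equipartition estimate T = 2K / (Ndof * kB), with kB in kcal/(mol K).
    Ignores constraints and center-of-mass removal for brevity."""
    ndof = velocities.size
    kinetic = 0.5 * np.sum(masses[:, None] * velocities**2)
    return 2.0 * kinetic / (ndof * kB)
```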

5.
J Comput Biol ; 12(7): 943-51, 2005 Sep.
Article in English | MEDLINE | ID: mdl-16201914

ABSTRACT

Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures that support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, the compressed suffix array (CSA), based on the Burrows-Wheeler transform, has been shown to require memory nearly equal to that of the original database, while supporting common sorts of query problems time-efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used (1 + 2 log₂|Σ|)(1 + ε) bits per character (where ε is a small fraction). The construction algorithm ran in as much as twice that space, in O(|Σ| n log n) time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, and only uses (1/2)(1 + |Σ|)(1 + ε) bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.
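Illustrative sketch of the Burrows-Wheeler transform that underlies the compressed suffix array, built here by naive rotation sorting; the paper's space-efficient construction algorithm is not reproduced.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorting all rotations.
    Quadratic-space toy version for small strings only."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("ACGTACGT"))  # small example; real genomes need the CSA machinery
```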


Subject(s)
Computational Biology/methods , Genomics/methods , Models, Genetic , Animals , Computational Biology/statistics & numerical data , Computer Simulation , Genomics/statistics & numerical data , Humans
6.
J Comput Biol ; 12(6): 762-76, 2005.
Article in English | MEDLINE | ID: mdl-16108715

ABSTRACT

Recent sequencing of the human and other mammalian genomes has brought about the necessity to align them, to identify and characterize their commonalities and differences. Programs that align whole genomes generally use a seed-and-extend technique: they start from exact or near-exact matches, select a reliable subset of these, called anchors, and then fill in the remaining portions between the anchors using a combination of local and global alignment algorithms. Their choices for the parameters, however, have so far been primarily heuristic. We present a statistical framework and practical methods for selecting a set of matches that is both sensitive and specific and can constitute a reliable set of anchors for a one-to-one mapping of two genomes from which a whole-genome alignment can be built. Starting from exact matches, we introduce a novel per-base repeat annotation, the Z-score, from which noise and repeat filtering conditions are explored. Dynamic programming-based chaining algorithms are also evaluated as context-based filters. We apply the methods described here to the comparison of two progressive assemblies of the human genome, NCBI build 28 and build 34 (www.genome.ucsc.edu), and show that a significant portion of the two genomes can be found in selected exact matches, with a very limited amount of sequence duplication.
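A minimal sketch of the exact-match seeding step that a seed-and-extend aligner starts from, using a simple k-mer index; anchor selection (e.g., the Z-score repeat filter) and chaining described above are not reproduced, and the value of k is an arbitrary illustrative choice.

```python
from collections import defaultdict

def exact_seed_matches(ref, query, k=12):
    """Return (ref_pos, query_pos) pairs of exact k-mer matches."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    matches = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            matches.append((i, j))
    return matches
```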


Subject(s)
Chromosome Mapping , Genome , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Amino Acid Motifs , Models, Genetic
7.
J Comput Biol ; 12(4): 407-15, 2005 May.
Article in English | MEDLINE | ID: mdl-15882139

ABSTRACT

The starting point for any alignment of mammalian genomes is the computation of exact matches satisfying various criteria. Time-efficient, O(n), data structures for this computation, such as the suffix tree, require O(n log n) space, several times the space of the genomes themselves. Thus, any reasonable whole-genome comparative project finds itself requiring tens of gigabytes of RAM to maintain time efficiency, which is beyond most modern workstations. With a new data structure, the compressed suffix array (CSA) implemented via the Burrows-Wheeler transform, we can trade time efficiency for space efficiency, taking O(n log n) time but running in O(n) space, typically in total space less than or equal to that of the genomes themselves. If space is more expensive than time, this is an appropriate approach to consider. The most space-efficient implementation of this data structure requires 5 bits per nucleotide character to build online, in the worst case, and 2.5 bits per character to store once built. We present a description of this data structure and how it is used to obtain matches. An implementation (called bbbwt) is demonstrated by aligning two mammalian genomes on a modest workstation equipped with under 2 GB of free RAM, in less time than implementations of other data structures.
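A toy sketch of exact matching by backward search over the Burrows-Wheeler transform, the kind of query the data structure above supports; this version stores the full BWT and suffix array in plain form rather than the roughly 2.5 bits per character of the bbbwt implementation.

```python
from bisect import bisect_left

def bwt_count(text, pattern):
    """Count occurrences of pattern in text by BWT backward search."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])      # suffix array (naive)
    bwt = "".join(s[i - 1] for i in sa)                  # last column
    first = "".join(sorted(s))                           # first column
    C = {c: bisect_left(first, c) for c in set(s)}       # chars smaller than c
    lo, hi = 0, len(s)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt.count(c, 0, lo)
        hi = C[c] + bwt.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(bwt_count("ACGTACGTAC", "ACG"))  # -> 2
```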


Subject(s)
Computational Biology/statistics & numerical data , Genome , Sequence Alignment/statistics & numerical data , Animals , Computational Biology/methods , Humans
8.
Genome Res ; 15(4): 454-62, 2005 Apr.
Article in English | MEDLINE | ID: mdl-15781572

ABSTRACT

The extent and patterns of linkage disequilibrium (LD) determine the feasibility of association studies to map genes that underlie complex traits. Here we present a comparison of the patterns of LD across four major human populations (African-American, Caucasian, Chinese, and Japanese) with a high-resolution single-nucleotide polymorphism (SNP) map covering almost the entire length of chromosomes 6, 21, and 22. We constructed metric LD maps formulated such that the units measure the extent of useful LD for association mapping. LD reaches almost twice as far in chromosome 6 as in chromosomes 21 or 22, in agreement with their differences in recombination rates. By all measures used, out-of-Africa populations showed over a third more LD than African-Americans, highlighting the role of the population's demography in shaping the patterns of LD. Despite those differences, the long-range contour of the LD maps is remarkably similar across the four populations, presumably reflecting common localization of recombination hot spots. Our results have practical implications for the rational design and selection of SNPs for disease association studies.
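For orientation, a sketch of the standard pairwise r^2 measure of linkage disequilibrium between two biallelic SNPs coded 0/1 on phased haplotypes; the metric LD maps described above use model-based map units rather than this simple pairwise statistic.

```python
import numpy as np

def ld_r2(hap_a, hap_b):
    """Pairwise r^2 between two biallelic SNPs (0/1 per haplotype).
    Assumes both SNPs are polymorphic in the sample."""
    a, b = np.asarray(hap_a, float), np.asarray(hap_b, float)
    pa, pb = a.mean(), b.mean()
    D = np.mean(a * b) - pa * pb
    return D**2 / (pa * (1 - pa) * pb * (1 - pb))
```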


Subject(s)
Chromosome Mapping , Chromosomes, Human, Pair 21 , Chromosomes, Human, Pair 22 , Chromosomes, Human, Pair 6 , Demography , Linkage Disequilibrium , Recombination, Genetic , Black or African American/genetics , Asian People/genetics , Black People/genetics , Genetics, Population , Humans , Polymorphism, Single Nucleotide , White People/genetics
9.
Cancer Res ; 64(24): 8891-900, 2004 Dec 15.
Article in English | MEDLINE | ID: mdl-15604249

ABSTRACT

Nearly one in eight US women will develop breast cancer in their lifetime. Most breast cancer is not associated with a hereditary syndrome, occurs in postmenopausal women, and is estrogen and progesterone receptor-positive. Estrogen exposure is an epidemiologic risk factor for breast cancer and estrogen is a potent mammary mitogen. We studied single nucleotide polymorphisms (SNPs) in estrogen receptors in 615 healthy subjects and 1011 individuals with histologically confirmed breast cancer, all from New York City. We analyzed 13 SNPs in the progesterone receptor gene (PGR), 17 SNPs in estrogen receptor 1 gene (ESR1), and 8 SNPs in the estrogen receptor 2 gene (ESR2). We observed three common haplotypes in ESR1 that were associated with a decreased risk for breast cancer [odds ratio (OR), approximately 0.4; 95% confidence interval (CI), 0.2-0.8; P < 0.01]. Another haplotype was associated with an increased risk of breast cancer (OR, 2.1; 95% CI, 1.2-3.8; P < 0.05). A unique risk haplotype was present in approximately 7% of older Ashkenazi Jewish study subjects (OR, 1.7; 95% CI, 1.2-2.4; P < 0.003). We narrowed the ESR1 risk haplotypes to the promoter region and first exon. We define several other haplotypes in Ashkenazi Jews in both ESR1 and ESR2 that may elevate susceptibility to breast cancer. In contrast, we found no association between any PGR variant or haplotype and breast cancer. Genetic epidemiology study replication and functional assays of the haplotypes should permit a better understanding of the role of steroid receptor genetic variants and breast cancer risk.
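As a worked illustration of how odds ratios and confidence intervals like those quoted above are typically computed from case-control counts (the paper's haplotype-level analysis is more involved):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR and approximate 95% CI from a 2x2 table: a = exposed cases,
    b = unexposed cases, c = exposed controls, d = unexposed controls."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, (lo, hi)
```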


Subject(s)
Breast Neoplasms/genetics , Estrogen Receptor alpha/genetics , Estrogen Receptor beta/genetics , Base Sequence , Case-Control Studies , Cell Transformation, Neoplastic/genetics , Ethnicity/genetics , Female , Genetic Predisposition to Disease , Genotype , Haplotypes , Humans , Linkage Disequilibrium , Male , Middle Aged , Polymorphism, Single Nucleotide , Receptors, Progesterone/genetics , Reproducibility of Results
10.
Genome Res ; 14(8): 1633-40, 2004 Aug.
Article in English | MEDLINE | ID: mdl-15289481

ABSTRACT

It is widely hoped that the study of sequence variation in the human genome will provide a means of elucidating the genetic component of complex diseases and variable drug responses. A major stumbling block to the successful design and execution of genome-wide disease association studies using single-nucleotide polymorphisms (SNPs) and linkage disequilibrium is the enormous number of SNPs in the human genome. This results in unacceptably high costs for exhaustive genotyping and presents a challenging problem of statistical inference. Here, we present a new method for optimally selecting minimum informative subsets of SNPs, also known as "tagging" SNPs, that is efficient for genome-wide selection. We contrast this method to published methods including haplotype block tagging, that is, grouping SNPs into segments of low haplotype diversity and typing a subset of the SNPs that can discriminate all common haplotypes within the blocks. Because our method does not rely on a predefined haplotype block structure and makes use of the weaker correlations that occur across neighboring blocks, it can be effectively applied across chromosomal regions with both high and low local linkage disequilibrium. We show that the number of tagging SNPs selected is substantially smaller than previously reported using block-based approaches and that selecting tagging SNPs optimally can result in a two- to threefold savings over selecting random SNPs.
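A simple greedy sketch of tag-SNP selection by pairwise r^2 coverage, included only to illustrate the task; the paper's method selects tags optimally and without relying on a haplotype-block structure, which this sketch does not reproduce. The r2_threshold value is an arbitrary illustrative choice.

```python
import numpy as np

def greedy_tag_snps(genotypes, r2_threshold=0.8):
    """Greedily pick tag SNPs until every SNP is covered (r^2 >= threshold)
    by some tag. `genotypes`: individuals x SNPs matrix of allele counts;
    assumes every SNP column is polymorphic."""
    G = np.asarray(genotypes, float)
    r2 = np.corrcoef(G, rowvar=False) ** 2
    n = G.shape[1]
    uncovered, tags = set(range(n)), []
    while uncovered:
        best = max(range(n),
                   key=lambda s: sum(r2[s, t] >= r2_threshold for t in uncovered))
        tags.append(best)
        uncovered.discard(best)
        uncovered -= {t for t in uncovered if r2[best, t] >= r2_threshold}
    return tags
```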


Subject(s)
Haplotypes , Polymorphism, Single Nucleotide , Algorithms , Chromosomes, Human, Pair 22 , Genetic Variation , Humans , Linkage Disequilibrium , Models, Genetic , Research Design
11.
Proc Natl Acad Sci U S A ; 101(7): 1916-21, 2004 Feb 17.
Article in English | MEDLINE | ID: mdl-14769938

ABSTRACT

We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860-921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.


Subject(s)
Computational Biology , Genome, Human , Human Genome Project , Computational Biology/standards , Contig Mapping/standards , Humans , RNA, Messenger/analysis , Software
12.
Proc Natl Acad Sci U S A ; 99(22): 13980-9, 2002 Oct 29.
Article in English | MEDLINE | ID: mdl-12374863

ABSTRACT

When comparing two sequences, a natural approach is to count the number of k-letter words the two sequences have in common. No positional information is used in the count, but it has the virtue that the comparison time is linear in sequence length. For this reason this statistic, D2, and certain transformations of D2 are used for EST sequence database searches. In this paper we begin the rigorous study of the statistical distribution of D2. Using an independence model of DNA sequences, we derive limiting distributions by means of the Stein and Chen-Stein methods and identify three asymptotic regimes, including compound Poisson and normal. The compound Poisson distribution arises when the word size k is large and word matches are rare. The normal distribution arises when the word size is small and matches are common. Explicit expressions for what is meant by large and small word sizes are given in the paper. However, when word size is small and the letters are uniformly distributed, the anticipated limiting normal distribution does not always occur; in this situation the uniform letter distribution is the exception among letter distributions. Therefore a naive, "one distribution fits all" approach to D2 statistics could easily create serious errors in estimating significance.
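A direct sketch of the D2 statistic itself, counting k-letter word matches between two sequences (equivalently, the sum over words of the product of their counts in each sequence); the choice of k here is illustrative.

```python
from collections import Counter

def d2_statistic(seq_a, seq_b, k=6):
    """D2 = sum over k-words w of count_A(w) * count_B(w)."""
    count_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    count_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(count_a[w] * count_b[w] for w in count_a.keys() & count_b.keys())
```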


Subject(s)
Computer Simulation , DNA/analysis , Models, Statistical , Mathematical Computing , Sequence Analysis, DNA
13.
Brief Bioinform ; 3(1): 23-31, 2002 Mar.
Article in English | MEDLINE | ID: mdl-12002221

ABSTRACT

With the consensus human genome sequenced and many other sequencing projects at varying stages of completion, greater attention is being paid to the genetic differences among individuals and the abilities of those differences to predict phenotypes. A significant obstacle to such work is the difficulty and expense of determining haplotypes (sets of variants genetically linked because of their proximity on the genome) for large numbers of individuals for use in association studies. This paper presents some algorithmic considerations in a new approach for haplotype determination: inferring haplotypes from localised polymorphism data gathered from short genome 'fragments.' Formalised models of the biological system under consideration are examined, given a variety of assumptions about the goal of the problem and the character of optimal solutions. Some theoretical results and algorithms for handling haplotype assembly given the different models are then sketched. The primary conclusion is that some important simplified variants of the problem yield tractable problems while more general variants tend to be intractable in the worst case.
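A sketch of one of the tractable simplified variants alluded to above: with error-free fragments, haplotype assembly reduces to 2-coloring the fragment conflict graph, since fragments that overlap and disagree must come from different haplotypes. The data layout (dicts mapping SNP index to allele) is an illustrative assumption, not the paper's formalism.

```python
def conflict(frag_a, frag_b):
    """Fragments (dict: SNP index -> allele 0/1) conflict if they overlap
    and disagree at some shared site."""
    shared = frag_a.keys() & frag_b.keys()
    return bool(shared) and any(frag_a[s] != frag_b[s] for s in shared)

def bipartition(fragments):
    """2-color the conflict graph; returns {fragment index: 0 or 1} giving the
    two haplotype classes, or None if the data are inconsistent (graph not
    bipartite), in which case the harder, error-tolerant variants apply."""
    color = {}
    n = len(fragments)
    for start in range(n):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if v != u and conflict(fragments[u], fragments[v]):
                    if v not in color:
                        color[v] = 1 - color[u]
                        stack.append(v)
                    elif color[v] == color[u]:
                        return None
    return color
```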


Subject(s)
Algorithms , Haplotypes , Polymorphism, Single-Stranded Conformational , Base Sequence , DNA , Models, Theoretical