Search | VHL Regional Portal

1.

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Koren, Sergey; Walenz, Brian P; Berlin, Konstantin; Miller, Jason R; Bergman, Nicholas H; Phillippy, Adam M.

Genome Res ; 27(5): 722-736, 2017 May.

Article in English | MEDLINE | ID: mdl-28298431

ABSTRACT

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
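The contig NG50 quoted in the abstract is a standard continuity statistic: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the genome size. The Python sketch below illustrates that definition only. The function name `ng50` and the example numbers are hypothetical and are not taken from Canu.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or None if it is undefined.

    NG50 is the length of the contig at which the cumulative length of
    contigs, sorted longest first, first reaches half of genome_size.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    # The contigs together cover less than half of the genome.
    return None


if __name__ == "__main__":
    # Hypothetical contig lengths for a genome of size 100.
    print(ng50([40, 30, 20, 10], 100))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so values from assemblies of different total size can be compared.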

Subject(s)

Contig Mapping/methods , Genomics/methods , Sequence Analysis, DNA/methods , Software , Animals , Contig Mapping/standards , Drosophila melanogaster/genetics , Genome, Bacterial , Genomics/standards , Humans , Repetitive Sequences, Nucleic Acid , Sequence Analysis, DNA/standards

2.

Corrigendum: Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Berlin, Konstantin; Koren, Sergey; Chin, Chen-Shan; Drake, James P; Landolin, Jane M; Phillippy, Adam M.

Nat Biotechnol ; 33(10): 1109, 2015 Oct.

Article in English | MEDLINE | ID: mdl-26448093

3.

Information content of long-range NMR data for the characterization of conformational heterogeneity.

Andralojc, Witold; Berlin, Konstantin; Fushman, David; Luchinat, Claudio; Parigi, Giacomo; Ravera, Enrico; Sgheri, Luca.

J Biomol NMR ; 62(3): 353-71, 2015 Jul.

Article in English | MEDLINE | ID: mdl-26044033

ABSTRACT

Long-range NMR data, namely residual dipolar couplings (RDCs) from external alignment and paramagnetic data, are becoming increasingly popular for the characterization of conformational heterogeneity of multidomain biomacromolecules and protein complexes. The question addressed here is how much information is contained in these averaged data. We have analyzed and compared the information content of conformationally averaged RDCs caused by steric alignment and of both RDCs and pseudocontact shifts caused by paramagnetic alignment, and found that, despite the substantial differences, they contain a similar amount of information. Furthermore, using several synthetic tests we find that both sets of data are equally good towards recovering the major state(s) in conformational distributions.

Subject(s)

Nuclear Magnetic Resonance, Biomolecular/methods , Protein Conformation , Proteins/chemistry , Algorithms

4.

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Berlin, Konstantin; Koren, Sergey; Chin, Chen-Shan; Drake, James P; Landolin, Jane M; Phillippy, Adam M.

Nat Biotechnol ; 33(6): 623-30, 2015 Jun.

Article in English | MEDLINE | ID: mdl-26006009

ABSTRACT

Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
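As a rough illustration of the locality-sensitive hashing idea named in the abstract, the Python sketch below estimates the Jaccard similarity of two reads' k-mer sets from MinHash sketches. It is not MHAP's implementation: the function names, the k-mer size, the sketch size, and the salted BLAKE2 hash are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer for a given seed (illustrative choice)."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Sketch a sequence as the minimum hash of its k-mers under each seed."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


if __name__ == "__main__":
    # Hypothetical reads sharing a common region.
    read_a = "ACGTACGTTAGCCGATAGGCTAACGTTAGC"
    read_b = "TTAGCCGATAGGCTAACGTTAGCAAGGTCC"
    print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Comparing two fixed-size sketches costs time proportional to the sketch size rather than to the read length, which is what makes this kind of estimate attractive for screening candidate overlaps among many long reads.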

Subject(s)

Genome, Fungal , Genome, Human , Genome, Insect , Genome, Plant , Sequence Analysis, DNA , Animals , Arabidopsis/genetics , Base Sequence , Chromosomes/genetics , Drosophila melanogaster/genetics , Heterochromatin , High-Throughput Nucleotide Sequencing/methods , Humans , Saccharomyces cerevisiae/genetics , Sequence Alignment

5.

Hierarchical O(N) computation of small-angle scattering profiles and their associated derivatives.

Berlin, Konstantin; Gumerov, Nail A; Fushman, David; Duraiswami, Ramani.

J Appl Crystallogr ; 47(Pt 2): 755-761, 2014 Apr 01.

Article in English | MEDLINE | ID: mdl-24701198

ABSTRACT

The need for fast approximate algorithms for Debye summation arises in computations performed in crystallography, small/wide-angle X-ray scattering and small-angle neutron scattering. When integrated into structure refinement protocols these algorithms can provide a significant speedup over direct all-atom-to-all-atom computation. However, these protocols often employ an iterative gradient-based optimization procedure, which then requires derivatives of the profile with respect to atomic coordinates. This article presents an accurate, O(N) cost algorithm for the computation of scattering profile derivatives. The results reported here show orders of magnitude improvement in computational efficiency, while maintaining the prescribed accuracy. This opens the possibility to efficiently integrate small-angle scattering data into the structure determination and refinement of macromolecular systems.

6.

Recovering a representative conformational ensemble from underdetermined macromolecular structural data.

Berlin, Konstantin; Castañeda, Carlos A; Schneidman-Duhovny, Dina; Sali, Andrej; Nava-Tudela, Alfredo; Fushman, David.

J Am Chem Soc ; 135(44): 16595-609, 2013 Nov 06.

Article in English | MEDLINE | ID: mdl-24093873

ABSTRACT

Structural analysis of proteins and nucleic acids is complicated by their inherent flexibility, conferred, for example, by linkers between their contiguous domains. Therefore, the macromolecule needs to be represented by an ensemble of conformations instead of a single conformation. Determining this ensemble is challenging because the experimental data are a convoluted average of contributions from multiple conformations. As the number of the ensemble degrees of freedom generally greatly exceeds the number of independent observables, directly deconvolving experimental data into a representative ensemble is an ill-posed problem. Recent developments in sparse approximations and compressive sensing have demonstrated that useful information can be recovered from underdetermined (ill-posed) systems of linear equations by using sparsity regularization. Inspired by these advances, we designed the Sparse Ensemble Selection (SES) method for recovering multiple conformations from a limited number of observations. SES is more general and accurate than previously published minimum-ensemble methods, and we use it to obtain representative conformational ensembles of Lys48-linked diubiquitin, characterized by the residual dipolar coupling data measured at several pH conditions. These representative ensembles are validated against NMR chemical shift perturbation data and compared to maximum-entropy results. The SES method reproduced and quantified the previously observed pH dependence of the major conformation of Lys48-linked diubiquitin, and revealed lesser-populated conformations that are preorganized for binding known diubiquitin receptors, thus providing insights into possible mechanisms of receptor recognition by polyubiquitin. SES is applicable to any experimental observables that can be expressed as a weighted linear combination of data for individual states.

Subject(s)

Ubiquitins/chemistry , Entropy , Humans , Hydrogen-Ion Concentration , Lysine/chemistry , Macromolecular Substances/chemistry , Models, Molecular , Protein Conformation

7.

Deriving quantitative dynamics information for proteins and RNAs using ROTDIF with a graphical user interface.

Berlin, Konstantin; Longhini, Andrew; Dayie, T Kwaku; Fushman, David.

J Biomol NMR ; 57(4): 333-52, 2013 Dec.

Article in English | MEDLINE | ID: mdl-24170368

ABSTRACT

To facilitate rigorous analysis of molecular motions in proteins, DNA, and RNA, we present a new version of ROTDIF, a program for determining the overall rotational diffusion tensor from single- or multiple-field nuclear magnetic resonance relaxation data. We introduce four major features that expand the program's versatility and usability. The first feature is the ability to analyze, separately or together, ¹³C and/or ¹⁵N relaxation data collected at a single or multiple fields. A significant improvement in the accuracy compared to direct analysis of R2/R1 ratios, especially critical for analysis of ¹³C relaxation data, is achieved by subtracting high-frequency contributions to relaxation rates. The second new feature is an improved method for computing the rotational diffusion tensor in the presence of biased errors, such as large conformational exchange contributions, that significantly enhances the accuracy of the computation. The third new feature is the integration of the domain alignment and docking module for relaxation-based structure determination of multi-domain systems. Finally, to improve accessibility to all the program features, we introduced a graphical user interface that simplifies and speeds up the analysis of the data. Written in Java, the new ROTDIF can run on virtually any computer platform. In addition, the new ROTDIF achieves an order of magnitude speedup over the previous version by implementing a more efficient deterministic minimization algorithm. We not only demonstrate the improvement in accuracy and speed of the new algorithm for synthetic and experimental ¹³C and ¹⁵N relaxation data for several proteins and nucleic acids, but also show that careful analysis required especially for characterizing RNA dynamics allowed us to uncover subtle conformational changes in RNA as a function of temperature that were opaque to previous analysis.

Subject(s)

Nuclear Magnetic Resonance, Biomolecular/methods , Proteins/chemistry , RNA/chemistry , Software , Algorithms , Computational Biology/methods , Isotopes , Least-Squares Analysis , Molecular Dynamics Simulation , Proteins/metabolism , RNA/metabolism , User-Computer Interface

8.

A hierarchical algorithm for fast Debye summation with applications to small angle scattering.

Gumerov, Nail A; Berlin, Konstantin; Fushman, David; Duraiswami, Ramani.

J Comput Chem ; 33(25): 1981-96, 2012 Sep 30.

Article in English | MEDLINE | ID: mdl-22707386

ABSTRACT

Debye summation, which involves the summation of sinc functions of distances between all pairs of atoms in three-dimensional space, arises in computations performed in crystallography, small/wide angle X-ray scattering (SAXS/WAXS), and small angle neutron scattering (SANS). Direct evaluation of Debye summation has quadratic complexity, which results in a computational bottleneck when determining crystal properties, or running structure refinement protocols that involve SAXS or SANS, even for moderately sized molecules. We present a fast approximation algorithm that efficiently computes the summation to any prescribed accuracy ε in linear time. The algorithm is similar to the fast multipole method (FMM), and is based on a hierarchical spatial decomposition of the molecule coupled with local harmonic expansions and translation of these expansions. An even more efficient implementation is possible when the scattering profile is all that is required, as in small angle scattering reconstruction (SAS) of macromolecules. We examine the relationship of the proposed algorithm to existing approximate methods for profile computations, and show that these methods may result in inaccurate profile computations, unless an error-bound derived in this article is used. Our theoretical and computational results show orders of magnitude improvement in computational complexity over existing methods, while maintaining prescribed accuracy.
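For orientation, the direct quadratic-cost evaluation that the abstract contrasts with its hierarchical method can be sketched in a few lines of Python. This is the naive baseline only, not the authors' fast algorithm. The function name `debye_intensity` and the use of constant per-atom weights in place of q-dependent scattering factors are simplifying assumptions.

```python
import math


def debye_intensity(coords, weights, q):
    """Direct O(N^2) Debye sum: sum over i, j of w_i * w_j * sin(q r_ij) / (q r_ij).

    coords  : list of (x, y, z) atom positions
    weights : list of per-atom weights (assumed constant in q for simplicity)
    q       : magnitude of the scattering vector
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        # Self term: r_ii = 0 and sinc(0) = 1.
        total += weights[i] * weights[i]
        for j in range(i + 1, n):
            x = q * math.dist(coords[i], coords[j])
            sinc = 1.0 if x == 0.0 else math.sin(x) / x
            # Each unordered pair appears twice in the double sum.
            total += 2.0 * weights[i] * weights[j] * sinc
    return total


if __name__ == "__main__":
    # Hypothetical three-atom example, evaluated at a few q values.
    atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
    w = [1.0, 1.0, 1.0]
    for q in (0.0, 0.5, 1.0):
        print(q, debye_intensity(atoms, w, q))
```

The nested loop visits every pair of atoms once for each q value, which is the quadratic cost that motivates the linear-time approximation described in the abstract.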

Subject(s)

Algorithms , Computer Simulation , Neutron Diffraction , Scattering, Small Angle , X-Ray Diffraction

9.

Fast approximations of the rotational diffusion tensor and their application to structural assembly of molecular complexes.

Berlin, Konstantin; O'Leary, Dianne P; Fushman, David.

Proteins ; 79(7): 2268-81, 2011 Jul.

Article in English | MEDLINE | ID: mdl-21604302

ABSTRACT

We present and evaluate a rigid-body, deterministic, molecular docking method, called ELMDOCK, that relies solely on the three-dimensional structure of the individual components and the overall rotational diffusion tensor of the complex, obtained from nuclear spin-relaxation measurements. We also introduce a docking method, called ELMPATIDOCK, derived from ELMDOCK and based on the new concept of combining the shape-related restraints from rotational diffusion with those from residual dipolar couplings, along with ambiguous contact/interface-related restraints obtained from chemical shift perturbations. ELMDOCK and ELMPATIDOCK use two novel approximations of the molecular rotational diffusion tensor that allow computationally efficient docking. We show that these approximations are accurate enough to properly dock the two components of a complex without the need to recompute the diffusion tensor at each iteration step. We analyze the accuracy, robustness, and efficiency of these methods using synthetic relaxation data for a large variety of protein-protein complexes. We also test our method on three protein systems for which the structure of the complex and experimental relaxation data are available, and analyze the effect of flexible unstructured tails on the outcome of docking. Additionally, we describe a method for integrating the new approximation methods into the existing docking approaches that use the rotational diffusion tensor as a restraint. The results show that the proposed docking method is robust against experimental errors in the relaxation data or structural rearrangements upon complex formation and is computationally more efficient than current methods. The developed approximations are accurate enough to be used in structure refinement protocols.

Subject(s)

Computational Biology/methods , Protein Binding , Proteins/chemistry , Software , Algorithms , Binding Sites , Databases, Protein , Diffusion , Models, Molecular , Proteins/metabolism , Rotation

10.

Structural assembly of molecular complexes based on residual dipolar couplings.

Berlin, Konstantin; O'Leary, Dianne P; Fushman, David.

J Am Chem Soc ; 132(26): 8961-72, 2010 Jul 07.

Article in English | MEDLINE | ID: mdl-20550109

ABSTRACT

We present and evaluate a rigid-body molecular docking method, called PATIDOCK, that relies solely on the three-dimensional structure of the individual components and the experimentally derived residual dipolar couplings (RDCs) for the complex. We show that, given an accurate ab initio predictor of the alignment tensor from a protein structure, it is possible to accurately assemble a protein-protein complex by utilizing the RDCs' sensitivity to molecular shape to guide the docking. The proposed docking method is robust against experimental errors in the RDCs and computationally efficient. We analyze the accuracy and efficiency of this method using experimental or synthetic RDC data for several proteins, as well as synthetic data for a large variety of protein-protein complexes. We also test our method on two protein systems for which the structure of the complex and steric-alignment data are available (Lys48-linked diubiquitin and a complex of ubiquitin and a ubiquitin-associated domain) and analyze the effect of flexible unstructured tails on the outcome of docking. The results demonstrate that it is fundamentally possible to assemble a protein-protein complex solely on the basis of experimental RDC data and the prediction of the alignment tensor from 3D structures. Thus, despite the purely angular nature of RDCs, they can be converted into intermolecular distance/translational constraints. Additionally, we show a method for combining RDCs with other experimental data, such as ambiguous constraints from interface mapping, to further improve structure characterization of protein complexes.

Subject(s)

Models, Molecular , Proteins/chemistry , Proteins/metabolism , Animals , Bacterial Proteins/chemistry , Bacterial Proteins/metabolism , Carrier Proteins/chemistry , Carrier Proteins/metabolism , Humans , Protein Structure, Tertiary , Quantum Theory , Ubiquitin/chemistry , Ubiquitin/metabolism

11.

Improvement and analysis of computational methods for prediction of residual dipolar couplings.

Berlin, Konstantin; O'Leary, Dianne P; Fushman, David.

J Magn Reson ; 201(1): 25-33, 2009 Nov.

Article in English | MEDLINE | ID: mdl-19700353

ABSTRACT

We describe a new, computationally efficient method for computing the molecular alignment tensor based on the molecular shape. The increase in speed is achieved by re-expressing the problem as one of numerical integration, rather than a simple uniform sampling (as in the PALES method), and by using a convex hull rather than a detailed representation of the surface of a molecule. This method is applicable to bicelles, PEG/hexanol, and other alignment media that can be modeled by steric restrictions introduced by a planar barrier. This method is used to further explore and compare various representations of protein shape by an equivalent ellipsoid. We also examine the accuracy of the alignment tensor and residual dipolar couplings (RDC) prediction using various ab initio methods. We separately quantify the inaccuracy in RDC prediction caused by the inaccuracy in the orientation and in the magnitude of the alignment tensor, concluding that orientation accuracy is much more important in accurate prediction of RDCs.

Subject(s)

Nuclear Magnetic Resonance, Biomolecular/methods , Proteins/chemistry , Signal Processing, Computer-Assisted , Algorithms , Bacterial Proteins/chemistry , Carrier Proteins/chemistry , Forecasting , Molecular Conformation , Nostoc/chemistry , Reproducibility of Results
