Search | VHL Regional Portal

1.

Flanked Block-Interchange Distance on Strings.

Li, Tiantian; Jiang, Haitao; Zhu, Binhai; Wang, Lusheng; Zhu, Daming.

IEEE/ACM Trans Comput Biol Bioinform ; 21(2): 301-311, 2024.

Article in English | MEDLINE | ID: mdl-38194376

ABSTRACT

Rearrangement sorting problems impact profoundly in measuring genome similarities and tracing historic scenarios of species. However, recent studies on genome rearrangement mechanisms disclosed a statistically significant evidence, repeats are situated at the ends of rearrangement relevant segments and stay unchanged before and after rearrangements.To reflect the principle behind this evidence, we propose flanked block-interchange, an operation on strings that exchanges two substrings flanked by identical left and right symbols in a string. The flanked block-interchange distance problem is formulated as finding a shortest sequence of flanked block-interchanges to transform a string into the other. We propose a sufficient and necessary condition for deciding whether two strings can be transformed into each other by flanked block-interchanges. This condition is linear time verifiable. Under this condition for two strings, we present a [Formula: see text]-approximation algorithm for the flanked block-interchange distance problem where each symbol occurs at most k times in a string and a polynomial algorithm for this problem where each symbol occurs at most twice in a string. We show that the problem of flanked block-interchange distance is NP-hard at last.

Subject(s)

Gene Rearrangement , Genome , Algorithms

2.

Maximum Stacking Base Pairs: Hardness and Approximation by Nonlinear Linear Programming-Rounding.

Liu, Lixin; Jiang, Haitao; Liu, Peiqiang; Zhu, Binhai; Zhu, Daming.

J Comput Biol ; 27(2): 200-211, 2020 02.

Article in English | MEDLINE | ID: mdl-31905005

ABSTRACT

Maximum stacking base pairs is a fundamental combinatorial problem from ribonucleic acid (RNA) secondary structure prediction under the energy model. The basic maximum stacking base pairs problem can be described as: given an RNA sequence, find a maximum number of base pairs such that each chosen base pair has at least one parallel and adjacent partner (i.e., they form a stacking). This problem is NP-hard, no matter whether the candidate base pairs follow the biology principle or are given explicitly as input. This article investigates a restricted version of this problem where the base pairs are given as input and each base is associated with at most k (a constant) base pairs. We show that this restricted version is still APX-hard, even if the base pairs are weighted. Moreover, by a nonlinear LP-rounding method, we present an approximation algorithm with a factor [Formula: see text]. Applying our algorithms on the simulated data, the actual approximation factor is in fact much better than this theoretical bound.

Subject(s)

Computational Biology/methods , RNA/chemistry , Algorithms , Base Pairing , Models, Molecular , Programming, Linear

3.

Approaching the One-Sided Exemplar Adjacency Number Problem.

Qingge, Letu; Smith, Killian; Jungst, Sean; Wang, Baihui; Yang, Qing; Zhu, Binhai.

IEEE/ACM Trans Comput Biol Bioinform ; 17(6): 1946-1954, 2020.

Article in English | MEDLINE | ID: mdl-31056506

ABSTRACT

The one-sided Exemplar Adjacency Number (EAN) is a known problem for computing the exemplar similarity between a generic linear genome G with gene duplications and an exemplar genome H (over the same set of n gene families). In this problem, we need to compute an exemplar genome G, which is a permutation obtained from G, such that the number of common adjacencies between G and H is maximized. Unfortunately, the problem is not only NP-hard but also NP-hard to approximate. In this paper, we approach the problem by relaxing the constraint such that a sub-permutation G+ obtained from G does not have to include all the gene families, but still needs to have a length at least k. Hence G+ is called a pseudo-exemplar genome. Then, a slightly more general problem (One-sided EAN+) is defined: compute a pseudo-exemplar genome G+ from G such that the number of common adjacencies between H and G+ is maximized. Certainly One-sided EAN+ contains One-sided EAN as a special case; moreover, it presents some flexibility in designing algorithms. First, we relax and formulate the One-sided EAN+ problem as the maximum independent set (MIS) on a colored interval graph and hence reduce the appearance of each gene to at most two times. We show that this new relaxation is still NP-complete, though a simple factor-2 approximation algorithm can be designed; moreover, we also prove that the problem cannot be approximated within 2-Îµ by a local search technique. We then show that this relaxed version is fixed-parameter tractable (FPT). Second, to ensure that each gene appears in G+ at most once, we use integer linear programming (ILP) to solve this problem. Finally, we implement our algorithm and compare it with the up-to-date software GREDU, with simulated signed and unsigned genomes. It turns out that our algorithm is more stable and can process genomes of length up to 12,000 for signed genomes (while GREDU can falter on such a large signed genome and it cannot handle unsigned genomes at all).

Subject(s)

Genome/genetics , Genomics/methods , Algorithms , Computer Simulation , Gene Duplication/genetics , Programming, Linear , Software

4.

A 2-approximation algorithm for the contig-based genomic scaffold filling problem.

Jiang, Haitao; Qingge, Letu; Zhu, Daming; Zhu, Binhai.

J Bioinform Comput Biol ; 16(6): 1850022, 2018 Dec.

Article in English | MEDLINE | ID: mdl-30616473

ABSTRACT

The genomic scaffold filling problem has attracted a lot of attention recently. The problem is on filling an incomplete sequence (scaffold) I into I' , with respect to a complete reference genome G , such that the number of common/shared adjacencies between G and I' is maximized. The problem is NP-complete, and admits a constant-factor approximation. However, the sequence input I is not quite practical and does not fit most of the real datasets (where a scaffold is more often given as a list of contigs). In this paper, we revisit the genomic scaffold filling problem by considering this important case when a scaffold S is given, the missing genes can only be inserted in between the contigs, and the objective is to maximize the number of common adjacencies between G and the filled genome S' . For this problem, we present a simple NP-completeness proof, we then present a factor-2 approximation algorithm.

Subject(s)

Algorithms , Genomics/methods , Contig Mapping/methods , Genome

5.

Filling a Protein Scaffold With a Reference.

Qingge, Letu; Liu, Xiaowen; Zhong, Farong; Zhu, Binhai.

IEEE Trans Nanobioscience ; 16(2): 123-130, 2017 03.

Article in English | MEDLINE | ID: mdl-28207401

ABSTRACT

In mass spectrometry-based de novo protein sequencing, it is hard to complete the sequence of the whole protein. Motivated by this, we study the (one-sided) problem of filling a protein scaffold S with some missing amino acids, given a sequence of contigs none of which is allowed to be altered, with respect to a complete reference protein P of length n , such that the BLOSUM62 score between P and the filled sequence S' is maximized. We show that this problem is polynomial-time solvable in O(n26) time. We also consider the case when the contigs are not of high quality and they are concatenated into an (incomplete) sequence I , where the missing amino acids can be inserted anywhere in I to obtain I' , such that the BLOSUM62 score between P and I' is maximized. We show that this problem is polynomial-time solvable in O(n22) time. Due to the high time complexity, both of these algorithms are impractical, we hence present several algorithms based on greedy and local search, trying to solve the problems practically. The empirical results, based on some antibody and mammalian proteins, show that the algorithms can fill protein scaffolds with high quality, provided that a good pair of scaffold and reference are given.

Subject(s)

Computational Biology/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Animals , Cattle , Mass Spectrometry

6.

Characterization of Proteoforms with Unknown Post-translational Modifications Using the MIScore.

Kou, Qiang; Zhu, Binhai; Wu, Si; Ansong, Charles; Tolic, Nikola; Pasa-Tolic, Ljiljana; Liu, Xiaowen.

J Proteome Res ; 15(8): 2422-32, 2016 08 05.

Article in English | MEDLINE | ID: mdl-27291504

ABSTRACT

Various proteoforms may be generated from a single gene due to primary structure alterations (PSAs) such as genetic variations, alternative splicing, and post-translational modifications (PTMs). Top-down mass spectrometry is capable of analyzing intact proteins and identifying patterns of multiple PSAs, making it the method of choice for studying complex proteoforms. In top-down proteomics, proteoform identification is often performed by searching tandem mass spectra against a protein sequence database that contains only one reference protein sequence for each gene or transcript variant in a proteome. Because of the incompleteness of the protein database, an identified proteoform may contain unknown PSAs compared with the reference sequence. Proteoform characterization is to identify and localize PSAs in a proteoform. Although many software tools have been proposed for proteoform identification by top-down mass spectrometry, the characterization of proteoforms in identified proteoform-spectrum matches still relies mainly on manual annotation. We propose to use the Modification Identification Score (MIScore), which is based on Bayesian models, to automatically identify and localize PTMs in proteoforms. Experiments showed that the MIScore is accurate in identifying and localizing one or two modifications.

Subject(s)

Protein Processing, Post-Translational , Proteome/analysis , Proteomics/methods , Bayes Theorem , Escherichia coli/chemistry , Salmonella typhimurium/chemistry , Software , Tandem Mass Spectrometry

7.

Isomorphism and similarity for 2-generation pedigrees.

Jiang, Haitao; Lin, Guohui; Tong, Weitian; Zhu, Daming; Zhu, Binhai.

BMC Bioinformatics ; 16 Suppl 5: S7, 2015.

Article in English | MEDLINE | ID: mdl-25860335

ABSTRACT

We consider the emerging problem of comparing the similarity between (unlabeled) pedigrees. More specifically, we focus on the simplest pedigrees, namely, the 2-generation pedigrees. We show that the isomorphism testing for two 2-generation pedigrees is GI-hard. If the 2-generation pedigrees are monogamous (i.e., each individual at level-1 can mate with exactly one partner) then the isomorphism testing problem can be solved in polynomial time. We then consider the problem by relaxing it into an NP-complete decomposition problem which can be formulated as the Minimum Common Integer Pair Partition (MCIPP) problem, which we show to be FPT by exploiting a property of the optimal solution. While there is still some difficulty to overcome, this lays down a solid foundation for this research.

Subject(s)

Algorithms , Computational Biology/methods , Computer Simulation , Pedigree , Female , Humans , Male

8.

An improved approximation algorithm for scaffold filling to maximize the common adjacencies.

Liu, Nan; Jiang, Haitao; Zhu, Daming; Zhu, Binhai.

IEEE/ACM Trans Comput Biol Bioinform ; 10(4): 905-13, 2013.

Article in English | MEDLINE | ID: mdl-24334385

ABSTRACT

Scaffold filling is a new combinatorial optimization problem in genome sequencing. The one-sided scaffold filling problem can be described as given an incomplete genome I and a complete (reference) genome G, fill the missing genes into I such that the number of common (string) adjacencies between the resulting genome I' and G is maximized. This problem is NP-complete for genome with duplicated genes and the best known approximation factor is 1.33, which uses a greedy strategy. In this paper, we prove a better lower bound of the optimal solution, and devise a new algorithm by exploiting the maximum matching method and a local improvement technique, which improves the approximation factor to 1.25. For genome with gene repetitions, this is the only known NP-complete problem which admits an approximation with a small constant factor (less than 1.5).

Subject(s)

Algorithms , Genomics/methods , Sequence Analysis, DNA/methods , Genome

9.

Protein chain pair simplification under the discrete Fréchet distance.

Wylie, Tim; Zhu, Binhai.

IEEE/ACM Trans Comput Biol Bioinform ; 10(6): 1372-83, 2013.

Article in English | MEDLINE | ID: mdl-24407296

ABSTRACT

For protein structure alignment and comparison, a lot of work has been done using RMSD as the distance measure, which has drawbacks under certain circumstances. Thus, the discrete Fréchet distance was recently applied to the problem of protein (backbone) structure alignment and comparison with promising results. For this problem, visualization is also important because protein chain backbones can have as many as 500-600 $(\alpha)$-carbon atoms, which constitute the vertices in the comparison. Even with an excellent alignment, the similarity of two polygonal chains can be difficult to visualize unless the chains are nearly identical. Thus, the chain pair simplification problem (CPS-3F) was proposed in 2008 to simultaneously simplify both chains with respect to each other under the discrete Fréchet distance. The complexity of CPS-3F is unknown, so heuristic methods have been developed. Here, we define a variation of CPS-3F, called the constrained CPS-3F problem ($({\rm CPS\hbox{-}3F}^+)$), and prove that it is polynomially solvable by presenting a dynamic programming solution, which we then prove is a factor-2 approximation for CPS-3F. We then compare the $({\rm CPS\hbox{-}3F}^+)$ solutions with previous empirical results, and further demonstrate some of the benefits of the simplified comparisons. Chain pair simplification based on the Hausdorff distance (CPS-2H) is known to be NP-complete, and here we prove that the constrained version ($(\rm CPS\hbox{-}2H^+)$) is also NP-complete. Finally, we discuss future work and implications along with a software library implementation, named the Fréchet-based Protein Alignment & Comparison Toolkit (FPACT).

Subject(s)

Computational Biology/methods , Proteins/chemistry , Algorithms , Carbon/chemistry , Computer Simulation , Protein Conformation , Sequence Alignment/methods , Software , Structural Homology, Protein

10.

Comparing pedigree graphs.

Kirkpatrick, Bonnie; Reshef, Yakir; Finucane, Hilary; Jiang, Haitao; Zhu, Binhai; Karp, Richard M.

J Comput Biol ; 19(9): 998-1014, 2012 Sep.

Article in English | MEDLINE | ID: mdl-22897201

ABSTRACT

Pedigree graphs, or family trees, are typically constructed by an expensive process of examining genealogical records to determine which pairs of individuals are parent and child. New methods to automate this process take as input genetic data from a set of extant individuals and reconstruct ancestral individuals. There is a great need to evaluate the quality of these methods by comparing the estimated pedigree to the true pedigree. In this article, we consider two main pedigree comparison problems. The first is the pedigree isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. The second is the pedigree edit distance problem, for which we present (1) several algorithms that are fast and exact in various special cases, and (2) a general, randomized heuristic algorithm. In the negative direction, we first prove that the pedigree isomorphism problem is as hard as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and NP-hard on leaf-labeled pedigrees. We use simulated pedigrees to compare our edit-distance algorithms to each other as well as to a branch-and-bound algorithm that always finds an optimal solution.

Subject(s)

Algorithms , Computer Simulation , Models, Genetic , Pedigree , Artificial Intelligence , Humans

11.

Scaffold filling under the breakpoint and related distances.

Jiang, Haitao; Zheng, Chunfang; Sankoff, David; Zhu, Binhai.

IEEE/ACM Trans Comput Biol Bioinform ; 9(4): 1220-9, 2012.

Article in English | MEDLINE | ID: mdl-22529329

ABSTRACT

Motivated by the trend of genome sequencing without completing the sequence of the whole genomes, a problem on filling an incomplete multichromosomal genome (or scaffold) I with respect to a complete target genome G was studied. The objective is to minimize the resulting genomic distance between I' and G, where I' is the corresponding filled scaffold. We call this problem the onesided scaffold filling problem. In this paper, we conduct a systematic study for the scaffold filling problem under the breakpoint distance and its variants, for both unichromosomal and multichromosomal genomes (with and without gene repetitions). When the input genome contains no gene repetition (i.e., is a fragment of a permutation), we show that the two-sided scaffold filling problem (i.e., G is also incomplete) is polynomially solvable for unichromosomal genomes under the breakpoint distance and for multichromosomal genomes under the genomic (or DCJ--Double-Cut-and-Join) distance. However, when the input genome contains some repeated genes, even the one-sided scaffold filling problem becomes NP-complete when the similarity measure is the maximum number of adjacencies between two sequences. For this problem, we also present efficient constant-factor approximation algorithms: factor-2 for the general case and factor 1.33 for the one-sided case.

Subject(s)

Algorithms , Genomics/methods , Genes , Genome , Sequence Analysis, DNA

12.

Algorithms for sorting unsigned linear genomes by the DCJ operations.

Jiang, Haitao; Zhu, Binhai; Zhu, Daming.

Bioinformatics ; 27(3): 311-6, 2011 Feb 01.

Article in English | MEDLINE | ID: mdl-21134895

ABSTRACT

MOTIVATION: The double cut and join operation (abbreviated as DCJ) has been extensively used for genomic rearrangement. Although the DCJ distance between signed genomes with both linear and circular (uni- and multi-) chromosomes is well studied, the only known result for the NP-complete unsigned DCJ distance problem is an approximation algorithm for unsigned linear unichromosomal genomes. In this article, we study the problem of computing the DCJ distance on two unsigned linear multichromosomal genomes (abbreviated as UDCJ). RESULTS: We devise a 1.5-approximation algorithm for UDCJ by exploiting the distance formula for signed genomes. In addition, we show that UDCJ admits a weak kernel of size 2k and hence an FPT algorithm running in O(2(2k)n) time.

Subject(s)

Algorithms , Genome , Genomics/methods , Software

13.

On the tractability of maximal strip recovery.

Wang, Lusheng; Zhu, Binhai.

J Comput Biol ; 17(7): 907-14, 2010 Jul.

Article in English | MEDLINE | ID: mdl-20632870

ABSTRACT

Given two genomic maps G and H represented by a sequence of n gene markers, a strip (syntenic block) is a sequence of distinct markers of length at least two which appear as subsequences in the input maps, either directly or in reversed and negated form. The problem Maximal Strip Recovery (MSR) is to find two subsequences G' and H' of G and H, respectively, such that the total length of disjoint strips in G' and H' is maximized (or, conversely, the number of markers hence deleted, is minimized). Previously, several heuristic algorithms which work well in practice, have been proposed. Theoretically, a factor-4 polynomial-time approximation is known for the MSR problem. Moreover, several close variants of MSR, MSR-d (with d > 2 input maps), MSR-DU (with marker duplications) and MSR-WT (with markers weighted) have been proved to be NP-complete. Before this work, the complexity of the original MSR problem was left open. In this article, we solve the open problem by showing that MSR is NP-complete, using a polynomial time reduction from One-in-Three 3SAT. We also present some fixed-parameter tractable algorithms for the (complement of) MSR problem and its variants. Let k be the minimum number of markers deleted in an optimal solution. The running time of our algorithms are O(2(3.61k)n + n(2)) for MSR, [Formula: see text] for MSR-d, and O(2(7.22k)n + n(2)) for MSR-DU, respectively.

Subject(s)

Gene Rearrangement , Genomics , Models, Genetic , Genome , Haplotypes

14.

Linear time probabilistic algorithms for the singular haplotype reconstruction problem from SNP fragments.

Chen, Zhixiang; Fu, Bin; Schweller, Robert; Yang, Boting; Zhao, Zhiyu; Zhu, Binhai.

J Comput Biol ; 15(5): 535-46, 2008 Jun.

Article in English | MEDLINE | ID: mdl-18549306

ABSTRACT

In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem--the incompleteness and inconsistency that occurred in the DNA sequencing process to generate the input haplotype fragments, and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in linear time in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration.

Subject(s)

Algorithms , Models, Genetic , Polymorphism, Single Nucleotide , Animals , Haplotypes , Humans , Probability

15.

Protein structure-structure alignment with discrete Fréchet distance.

Jiang, Minghui; Xu, Ying; Zhu, Binhai.

J Bioinform Comput Biol ; 6(1): 51-64, 2008 Feb.

Article in English | MEDLINE | ID: mdl-18324745

ABSTRACT

Matching two geometric objects in two-dimensional (2D) and three-dimensional (3D) spaces is a central problem in computer vision, pattern recognition, and protein structure prediction. In particular, the problem of aligning two polygonal chains under translation and rotation to minimize their distance has been studied using various distance measures. It is well known that the Hausdorff distance is useful for matching two point sets, and that the Fréchet distance is a superior measure for matching two polygonal chains. The discrete Fréchet distance closely approximates the (continuous) Fréchet distance, and is a natural measure for the geometric similarity of the folded 3D structures of biomolecules such as proteins. In this paper, we present new algorithms for matching two polygonal chains in two dimensions to minimize their discrete Fréchet distance under translation and rotation, and an effective heuristic for matching two polygonal chains in three dimensions. We also describe our empirical results on the application of the discrete Fréchet distance to protein structure-structure alignment.

Subject(s)

Models, Chemical , Models, Molecular , Proteins/chemistry , Proteins/ultrastructure , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , Molecular Sequence Data , Protein Conformation

16.

Protein local structure alignment under the discrete Fréchet distance.

Zhu, Binhai.

J Comput Biol ; 14(10): 1343-51, 2007 Dec.

Article in English | MEDLINE | ID: mdl-18052775

ABSTRACT

Protein structure alignment is a fundamental problem in computational and structural biology. While there has been lots of experimental/heuristic methods and empirical results, very few results are known regarding the algorithmic/complexity aspects of the problem, especially on protein local structure alignment. A well-known measure to characterize the similarity of two polygonal chains is the famous Fréchet distance, and with the application of protein-related research, a related discrete Fréchet distance has been used recently. In this paper, following the recent work of Jiang et al. we investigate the protein local structural alignment problem using bounded discrete Fréchet distance. Given m proteins (or protein backbones, which are 3D polygonal chains), each of length O(n), our main results are summarized as follows: * If the number of proteins, m, is not part of the input, then the problem is NP-complete; moreover, under bounded discrete Fréchet distance it is NP-hard to approximate the maximum size common local structure within a factor of n(1-epsilon). These results hold both when all the proteins are static and when translation/rotation are allowed. * If the number of proteins, m, is a constant, then there is a polynomial time solution for the problem.

Subject(s)

Computational Biology/methods , Proteins/chemistry , Structural Homology, Protein

17.

Protein folding on the hexagonal lattice in the HP model.

Jiang, Minghui; Zhu, Binhai.

J Bioinform Comput Biol ; 3(1): 19-34, 2005 Feb.

Article in English | MEDLINE | ID: mdl-15751110

ABSTRACT

In this paper, we introduce the 2D hexagonal lattice as a biologically meaningful alternative to the standard square lattice for the study of protein folding in the HP model. We show that the hexagonal lattice alleviates the "sharp turn" problem and models certain aspects of the protein secondary structure more realistically. We present a 1/6-approximation and a clustering heuristic for protein folding on the hexagonal lattice. In addition to these two algorithms, we also implement a Monte Carlo Metropolis algorithm and a branch-and-bound partial enumeration algorithm, and conduct experiments to compare their effectiveness.

Subject(s)

Algorithms , Crystallography/methods , Models, Chemical , Models, Molecular , Proteins/analysis , Proteins/chemistry , Sequence Analysis, Protein/methods , Computer Simulation , Protein Conformation , Protein Folding , Protein Structure, Secondary , Structure-Activity Relationship

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL