ABSTRACT
The most common way to calculate the rearrangement distance between two genomes is to use the size of a minimum length sequence of rearrangements that transforms one of the two given genomes into the other, where the genomes are represented as permutations using only their gene order, based on the assumption that genomes have the same gene content. With the advance of research in genome rearrangements, new works extended the classical models by either considering genomes with different gene content (unbalanced genomes) or adding more genomic characteristics to the mathematical representation of the genomes, such as the distribution of intergenic region sizes. In this work, we study the Reversal, Transposition, and Indel (Insertion and Deletion) Distance using intergenic information, which allows comparing unbalanced genomes, because indels are included in the rearrangement model (i.e., the set of possible rearrangements allowed when we compute the distance). For the particular case of transpositions and indels on unbalanced genomes, we present a 4-approximation algorithm, improving a previous 4.5-approximation. This algorithm is extended so as to deal with gene orientation and to maintain the 4-approximation factor for the Reversal, Transposition, and Indel Distance on unbalanced genomes. Furthermore, we evaluate the proposed algorithms using experiments on simulated data.
Subject(s)
Gene Rearrangement , Models, Genetic , Genome/genetics , Genomics , INDEL Mutation , Algorithms
ABSTRACT
BACKGROUND: In the comparative genomics field, one of the goals is to estimate a sequence of genetic changes capable of transforming a genome into another. Genome rearrangement events are mutations that can alter the genetic content or the arrangement of elements from the genome. Reversal and transposition are two of the most studied genome rearrangement events. A reversal inverts a segment of a genome while a transposition swaps two consecutive segments. Initial studies in the area considered only the order of the genes. Recent works have incorporated other genetic information in the model, in particular the size of intergenic regions, which are structures between each pair of genes and in the extremities of a linear genome. RESULTS AND CONCLUSIONS: In this work, we investigate the SORTING BY INTERGENIC REVERSALS AND TRANSPOSITIONS problem on genomes sharing the same set of genes, considering the cases where the orientation of genes is known and unknown. We also explore a variant of the problem that generalizes the transposition event. As a result, we present an approximation algorithm that guarantees an approximation factor of 4 for both cases considering the reversal and transposition (classic definition) events, an improvement from the 4.5-approximation previously known for the scenario where the orientation of the genes is unknown. We also present a 3-approximation algorithm by incorporating the generalized transposition event, and we propose a greedy strategy to improve the performance of the algorithms. We performed practical tests on simulated data, which indicated that the algorithms, in both cases, tend to perform better than the best-known algorithms for the problem. Lastly, we conducted experiments using real genomes to demonstrate the applicability of the algorithms.
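The two events described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm; it only applies a single event under one common formalization of the intergenic model, which is an assumption here: a linear genome is a list of n genes plus a list of n+1 intergenic region sizes, and each event cuts inside intergenic regions, so the total number of intergenic nucleotides is preserved. The function names, 0-based indices, and cut parameters x, y, z are hypothetical choices for illustration.

```python
def intergenic_reversal(genes, inter, i, j, x, y, signed=False):
    """Reverse genes[i..j] (inclusive), cutting inter[i] after x nucleotides
    and inter[j + 1] after y nucleotides. inter[k] is the region just before
    genes[k]; inter[len(genes)] follows the last gene. (Assumed model.)"""
    assert 0 <= i <= j < len(genes)
    assert 0 <= x <= inter[i] and 0 <= y <= inter[j + 1]
    seg = genes[i:j + 1][::-1]
    if signed:
        seg = [-g for g in seg]  # a reversal also flips gene orientation
    new_genes = genes[:i] + seg + genes[j + 1:]
    new_inter = (inter[:i]
                 + [x + y]                        # left cut pieces joined
                 + inter[i + 1:j + 1][::-1]       # inner regions reversed
                 + [(inter[i] - x) + (inter[j + 1] - y)]
                 + inter[j + 2:])
    return new_genes, new_inter


def intergenic_transposition(genes, inter, i, j, k, x, y, z):
    """Swap the consecutive segments genes[i:j] and genes[j:k], cutting
    inter[i], inter[j], inter[k] after x, y, z nucleotides. (Assumed model.)"""
    assert 0 <= i < j < k <= len(genes)
    assert 0 <= x <= inter[i] and 0 <= y <= inter[j] and 0 <= z <= inter[k]
    new_genes = genes[:i] + genes[j:k] + genes[i:j] + genes[k:]
    new_inter = (inter[:i]
                 + [x + (inter[j] - y)] + inter[j + 1:k]
                 + [z + (inter[i] - x)] + inter[i + 1:j]
                 + [y + (inter[k] - z)] + inter[k + 1:])
    return new_genes, new_inter


genes, inter = [1, 3, 2, 4], [2, 5, 1, 3, 0]
print(intergenic_reversal(genes, inter, 1, 2, 2, 1))
print(intergenic_transposition(genes, inter, 1, 2, 3, 0, 1, 3))
```

In this toy genome both events put the genes in the order 1, 2, 3, 4 but leave different intergenic size distributions, which is why a model that tracks intergenic sizes can need more events than one that tracks gene order alone.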
ABSTRACT
Problems in the genome rearrangement field are often formulated in terms of pairwise genome comparison: given two genomes [Formula: see text] and [Formula: see text], find the minimum number of genome rearrangements that may have occurred during the evolutionary process. This broad definition lacks at least two important considerations: the first being which features are extracted from genomes to create a useful mathematical model, and the second being which types of genome rearrangement events should be represented. Regarding the first consideration, seminal works in the genome rearrangement field solely used gene order to represent genomes as permutations of integer numbers, neglecting many important aspects like gene duplication, intergenic regions, and complex interactions between genes. Regarding the second consideration, some rearrangement events, such as reversals and transpositions, are widely studied. In this paper, we shed light on the first consideration and create a model that takes into account gene order and the number of nucleotides in intergenic regions. In addition, we consider events of reversals, transpositions, and indels (insertions and deletions) of genomic material. We present a 4-approximation algorithm for reversals and indels, a [Formula: see text]-approximation algorithm for transpositions and indels, and a 6-approximation for reversals, transpositions, and indels.
Subject(s)
Genome , Models, Genetic , Algorithms , DNA, Intergenic/genetics , Gene Rearrangement , Genomics
ABSTRACT
The rearrangement distance is a method to compare genomes of different species. Such distance is the number of rearrangement events necessary to transform one genome into another. Two commonly studied events are the transposition, which exchanges two consecutive blocks of the genome, and the reversal, which reverses a block of the genome. When dealing with such problems, seminal works represented genomes as sequences of genes without repetition. More realistic models started to consider gene repetition or the presence of intergenic regions, which are sequences of nucleotides between genes and in the extremities of the genome. This work explores the transposition and reversal events applied in a genome representation considering both gene repetition and intergenic regions. We define two problems called Minimum Common Intergenic String Partition and Reverse Minimum Common Intergenic String Partition. Using a relation with these two problems, we show a [Formula: see text]-approximation for the Intergenic Transposition Distance, the Intergenic Reversal Distance, and the Intergenic Reversal and Transposition Distance problems, where k is the maximum number of copies of a gene in the genomes. Our practical experiments on simulated genomes show that the use of partitions improves the estimates for the distances.
ABSTRACT
Aim: GDF15 levels are a biomarker for metformin use. We performed the functional annotation of noncoding genome-wide association study (GWAS) SNPs for GDF15 levels and the Genotype-Tissue Expression (GTEx)-expression quantitative trait loci (eQTLs) for GDF15 expression within metformin-activated enhancers around GDF15. Materials & methods: These enhancers were identified using chromatin immunoprecipitation followed by sequencing data for active (H3K27ac) and silenced (H3K27me3) histone marks on human hepatocytes treated with metformin, Encyclopedia of DNA Elements data and cis-regulatory elements assignment tools. Results: The GWAS lead SNP rs888663, the SNP rs62122429 associated with GDF15 levels in the Outcome Reduction with Initial Glargine Intervention trial, and the GTEx-expression quantitative trait locus rs4808791 for GDF15 expression in whole blood are located in a metformin-activated enhancer upstream of GDF15 and tightly linked in Europeans and East Asians. Conclusion: Noncoding variation within a metformin-activated enhancer may increase GDF15 expression and help to predict GDF15 levels.
Subject(s)
Genome-Wide Association Study/methods , Growth Differentiation Factor 15/biosynthesis , Growth Differentiation Factor 15/genetics , Metformin/pharmacology , Polymorphism, Single Nucleotide/genetics , Cell Line , Hepatocytes/drug effects , Hepatocytes/metabolism , Humans , Hypoglycemic Agents/pharmacology , Polymorphism, Single Nucleotide/drug effects
ABSTRACT
During the evolutionary process, genomes are affected by various genome rearrangements, that is, events that modify large stretches of the genetic material. In the literature, a large number of models have been proposed to estimate the number of events that occurred during evolution; most of them represent a genome as an ordered sequence of genes, and, in particular, disregard the genetic material between consecutive genes. However, recent studies showed that taking into account the genetic material between consecutive genes can enhance evolutionary distance estimations. Reversal and transposition are genome rearrangements that have been widely studied in the literature. A reversal inverts a (contiguous) segment of the genome, while a transposition swaps the positions of two consecutive segments. Genomes also undergo nonconservative events (events that alter the amount of genetic material) such as insertions and deletions, in which genetic material from intergenic regions of the genome is inserted or deleted, respectively. In this article, we study a genome rearrangement model that considers both gene order and sizes of intergenic regions. We investigate the reversal distance, and also the reversal and transposition distance between two genomes in two scenarios: with and without nonconservative events. We show that these problems are NP-hard and we present constant ratio approximation algorithms for all of them. More precisely, we provide a 4-approximation algorithm for the reversal distance, both in the conservative and nonconservative versions. For the reversal and transposition distance, we provide a 4.5-approximation algorithm, both in the conservative and nonconservative versions. We also perform experimental tests to verify the behavior of our algorithms, as well as to compare the practical and theoretical results. 
We finally extend our study to scenarios in which events have different costs, and we present constant ratio approximation algorithms for each scenario.
ABSTRACT
BACKGROUND: The evolutionary distance between two genomes can be estimated by computing a minimum length sequence of operations, called genome rearrangements, that transform one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most of the studies in the genome rearrangement literature consist in shaping biological scenarios into mathematical models. Examples include allowing different genome rearrangement operations at the same time, adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), and considering that a rearrangement implies a cost depending on its length rather than a unit cost. Most of the works, however, have overlooked some important features inside genomes, such as the presence of sequences of nucleotides between genes, called intergenic regions. RESULTS AND CONCLUSIONS: In this work, we investigate the problem of computing the distance between two genomes, taking into account both gene order and intergenic sizes. The genome rearrangement operations we consider here are constrained types of reversals and transpositions, called super short reversals (SSRs) and super short transpositions (SSTs), which affect up to two (consecutive) genes. We denote by super short operations (SSOs) any SSR or SST. For SSRs, SSTs, or SSOs, we show 3-approximation algorithms when the orientation of the genes is not considered, and for either SSRs or SSOs, we show 5-approximation algorithms when the orientation is considered. We also show that these algorithms improve their approximation factors when the input permutation has a higher number of inversions, where the approximation factor decreases from 3 to either 2 or 1.5, and from 5 to either 3 or 2.
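The "number of inversions" mentioned in this abstract can be made concrete with a short sketch. It assumes the usual definition for an unsigned permutation, namely the number of pairs of positions whose elements appear out of order; whether the paper uses exactly this definition is an assumption. The function name `count_inversions` is hypothetical. As a well-known background fact, when only gene order is considered, the number of inversions equals the minimum number of swaps of adjacent elements needed to sort the permutation; the intergenic constraints studied in the paper are not modeled here.

```python
def count_inversions(perm):
    """Count pairs (a, b) with a < b and perm[a] > perm[b].

    Quadratic-time version kept simple for illustration; a merge-sort
    based count would run in O(n log n).
    """
    n = len(perm)
    return sum(1 for a in range(n) for b in range(a + 1, n) if perm[a] > perm[b])


for p in ([1, 2, 3], [3, 1, 2], [4, 3, 2, 1]):
    print(p, count_inversions(p))
```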
ABSTRACT
This study was undertaken to characterize the alpha subgroup of the proteobacteria causing the huanglongbing (HLB) disease of citrus from three different ecological zones of Kenya, namely the Lower highlands (LH2, LH3; 1800-1900 m above sea level), Upper midlands (UM3, UM4; 1390-1475 m), and Lower midlands (LM5, LM4, LM3; 1290-1340-1390 m), by isolating and sequencing DNA encoding the L10 and L12 ribosomal proteins and the intergenic region. A 716-basepair DNA fragment was amplified and sequenced and consisted of 536 basepairs of DNA encoding the L10 protein, 44 basepairs of DNA intergenic region and 136 basepairs of DNA that partially encodes the L12 protein. Sequences of rpL10/L12 protein genes from Kenyan strains were 98 percent and 81 percent similar to the South African 'Candidatus Liberibacter africanus strain Nelspruit' and the Asian 'Candidatus Liberibacter asiaticus' strains, respectively. The intergenic rDNA sequence of the Kenyan strains from UM and LM showed 84 percent similarity with 'Candidatus L. africanus strain Nelspruit' and 50 percent similarity with the 'Candidatus L. asiaticus' strain. However, the LH strain had an 11-basepair deletion, while the LM4 strain had a 5-basepair deletion in the intergenic region compared to 'Candidatus L. africanus strain Nelspruit'. The L10 amino acid sequence was 100 percent homologous among HLB bacteria obtained from the agro-ecological zones in Kenya, and the L10 protein sequence was also homologous to 'Candidatus L. africanus strain Nelspruit'. Nevertheless, the L10 amino acid sequences of 'Candidatus L. asiaticus' and 'Candidatus L. africanus subsp. capensis' differed from the Kenyan strains by 18.36 percent and 11.82 percent, respectively. Phylogenetic analysis of both the L10/L12 rDNA sequences and the L10 amino acid sequences clustered the Kenyan strains of the 'Candidatus Liberibacter' species with members of the alpha subdivision of proteobacteria.