2.
Oncotarget ; 12(10): 1011-1023, 2021 May 11.
Article in English | MEDLINE | ID: mdl-34012513

ABSTRACT

Non-invasive clinical diagnostics of bladder cancer is feasible via a set of chemically distinct molecules, including macromolecular tumor markers such as polypeptides and nucleic acids. In terms of tumor-related aberrant gene expression, RNA transcripts are the primary indicator of tumor-specific gene expression, because polypeptides and their metabolic products occur subsequently. Thus, in the case of bladder cancer, urine RNA represents an early, potentially useful diagnostic marker. Here we describe a systematic deep transcriptome analysis of representative pools of urine RNA collected from healthy donors versus bladder cancer patients according to established SOPs. This analysis revealed RNA marker candidates reflecting coding sequences, non-coding sequences, and circular RNAs. Next, we designed and validated PCR amplicons for a set of novel marker candidates and tested them in human bladder cancer cell lines. We identified linear and circular transcripts of the S100 Calcium Binding Protein 6 (S100A6) and Translocation Associated Membrane Protein 1 (TRAM1) as highly promising potential tumor markers. This work strongly suggests exploiting urine RNAs as diagnostic markers of bladder cancer and proposes specific novel markers. Further, this study describes an entry into the tumor biology of bladder cancer and the development of gene-targeted therapeutic drugs.

3.
Methods Mol Biol ; 2242: 77-89, 2021.
Article in English | MEDLINE | ID: mdl-33961219

ABSTRACT

By tracking pathogen outbreaks using whole genome sequencing, medical microbiology is currently being transformed into genomic epidemiology. This change in technology is leading to the rapid accumulation of large samples of closely related genome sequences. Summarizing such samples into phylogenies can be computationally challenging. Our program andi quickly computes accurate pairwise distances between up to thousands of bacterial genomes. Working under the UNIX command line, we show how andi can be used to transform genomes to phylogenies with support values ready to be printed or integrated into documents.


Subject(s)
DNA, Bacterial/genetics , Escherichia coli/genetics , Genome, Bacterial , Genomics , Phylogeny , Shigella/genetics , Databases, Genetic , Research Design , Software Design , Workflow
4.
Bioinformatics ; 37(15): 2081-2087, 2021 Aug 09.
Article in English | MEDLINE | ID: mdl-33515232

ABSTRACT

MOTIVATION: Unique marker sequences are highly sought after in molecular diagnostics. Nevertheless, there are only a few programs available to search for marker sequences, compared to the many programs for similarity search. We therefore wrote the program Fur for Finding Unique genomic Regions. RESULTS: Fur takes as input a sample of target sequences and a sample of closely related neighbors. It returns the regions present in all targets and absent from all neighbors. The recently published program genmap can also be used for this purpose and we compared it to fur. When analyzing a sample of 33 genomes representing the major phylogroups of E. coli, fur was 40 times faster than genmap but used three times more memory. On the other hand, genmap yielded three times more markers, but they were less accurate when tested in silico on a sample of 237 E. coli genomes. We also designed phylogroup-specific PCR primers based on the markers proposed by genmap and fur, and tested them by analyzing their virtual amplicons in GenBank. Finally, we used fur to design primers specific to a Lactobacillus species, and found excellent sensitivity and specificity in vitro. AVAILABILITY AND IMPLEMENTATION: Fur sources and documentation are available from https://github.com/evolbioinf/fur. The compiled software is posted as a docker container at https://hub.docker.com/r/haubold/fox. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
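The idea of "present in all targets and absent from all neighbors" can be illustrated with a toy k-mer sketch. This is not Fur's algorithm, which the abstract does not describe in detail; the function names, the fixed word length `k`, and the example sequences are assumptions made for illustration, and reverse complements are ignored.

```python
def kmers(seq, k):
    """Return the set of all words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_target_kmers(targets, neighbors, k):
    """Words found in every target sequence and in no neighbor sequence."""
    shared = set.intersection(*(kmers(t, k) for t in targets))
    seen_in_neighbors = set().union(*(kmers(n, k) for n in neighbors))
    return shared - seen_in_neighbors


# Hypothetical toy data: two targets, two neighbors.
targets = ["ACGTACGGA", "TTACGGAAC"]
neighbors = ["ACGTACGT", "GGAACC"]
markers = unique_target_kmers(targets, neighbors, k=4)
print(sorted(markers))
```

In this toy input the word TACG is shared by both targets but also occurs in a neighbor, so it is excluded from the result.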

5.
Bioinformatics ; 36(7): 2040-2046, 2020 Apr 01.
Article in English | MEDLINE | ID: mdl-31790149

ABSTRACT

MOTIVATION: Tracking disease outbreaks by whole-genome sequencing leads to the collection of large samples of closely related sequences. Five years ago, we published a method to accurately compute all pairwise distances for such samples by indexing each sequence. Since indexing is slow, we now ask whether it is possible to achieve similar accuracy when indexing only a single sequence. RESULTS: We have implemented this idea in the program phylonium and show that it is as accurate as its predecessor and roughly 100 times faster when applied to all 2678 Escherichia coli genomes contained in ENSEMBL. One of the best published programs for rapidly computing pairwise distances, mash, analyzes the same dataset four times faster but, with default settings, it is less accurate than phylonium. AVAILABILITY AND IMPLEMENTATION: Phylonium runs under the UNIX command line; its C++ sources and documentation are available from github.com/evolbioinf/phylonium. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Software , Algorithms , Genome , Sequence Analysis, DNA
6.
G3 (Bethesda) ; 10(1): 211-223, 2020 Jan 07.
Article in English | MEDLINE | ID: mdl-31699776

ABSTRACT

With up to millions of nearly neutral polymorphisms now being routinely sampled in population-genomic surveys, it is possible to estimate the site-frequency spectrum of such sites with high precision. Each frequency class reflects a mixture of potentially unique demographic histories, which can be revealed using theory for the probability distributions of the starting and ending points of branch segments over all possible coalescence trees. Such distributions are completely independent of past population history, which only influences the segment lengths, providing the basis for estimating average population sizes separating tree-wide coalescence events. The history of population-size change experienced by a sample of polymorphisms can then be dissected in a model-flexible fashion, and extension of this theory allows estimation of the mean and full distribution of long-term effective population sizes and ages of alleles of specific frequencies. Here, we outline the basic theory underlying the conceptual approach, develop and test an efficient statistical procedure for parameter estimation, and apply this to multiple population-genomic datasets for the microcrustacean Daphnia pulex.
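For readers unfamiliar with the site-frequency spectrum, a minimal sketch of how it is tabulated from a sample follows. The input encoding (strings of '0' for the ancestral and '1' for the derived allele) and the function name are assumptions for illustration; the estimation procedure developed in the paper is not reproduced here.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Count segregating sites by their number of derived alleles.

    haplotypes: list of equal-length strings of '0' (ancestral) and
    '1' (derived). Returns a list whose entry k-1 is the number of
    sites carrying exactly k derived alleles, for k = 1 .. n-1.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    counts = Counter(
        sum(h[i] == "1" for h in haplotypes) for i in range(n_sites)
    )
    return [counts.get(k, 0) for k in range(1, n)]


# Hypothetical sample of four haplotypes over five sites.
sample = ["01001", "01100", "00101", "01000"]
sfs = site_frequency_spectrum(sample)
print(sfs)
```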


Subject(s)
Biomass , Models, Genetic , Polymorphism, Single Nucleotide , Animals , Daphnia/genetics , Daphnia/growth & development
7.
Bioinformatics ; 35(11): 1813-1819, 2019 Jun 01.
Article in English | MEDLINE | ID: mdl-30395202

ABSTRACT

MOTIVATION: Unique sequence regions are associated with genetic function in vertebrate genomes. However, measuring uniqueness, or absence of long repeats, along a genome is conceptually and computationally difficult. Here we use a variant of the Lempel-Ziv complexity, the match complexity, Cm, and augment it by deriving its null distribution for random sequences. We then apply Cm to the human and mouse genomes to investigate the relationship between sequence complexity and function. RESULTS: We implemented Cm in the program macle and show through simulation that the newly derived null distribution of Cm is accurate. This allows us to delineate high-complexity regions in the human and mouse genomes. Using our program macle2go, we find that these regions are twofold enriched for genes. Moreover, the genes contained in these regions are more than 10-fold enriched for developmental functions. AVAILABILITY AND IMPLEMENTATION: Source code for macle and macle2go is available from www.github.com/evolbioinf/macle and www.github.com/evolbioinf/macle2go, respectively; Cm browser tracks from guanine.evolbio.mpg.de/complexity. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
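As background on Lempel-Ziv-style complexity, the sketch below performs a naive greedy factorization in which each factor is the longest prefix of the remaining text that already occurs to its left, extended by one new character. Fewer factors indicate more repetition. This is a textbook-style illustration only; it is not the match complexity Cm defined in the paper, and the function name is an assumption.

```python
def lz_factors(seq):
    """Greedy Lempel-Ziv-style factorization (naive, for short strings).

    Each factor is the longest prefix of the unread text that occurs
    in the text already read, plus one further character if available.
    """
    factors = []
    i = 0
    while i < len(seq):
        length = 0
        # Extend the match while the candidate occurs in the text to the left.
        while i + length < len(seq) and seq[i:i + length + 1] in seq[:i]:
            length += 1
        factors.append(seq[i:i + length + 1])
        i += length + 1
    return factors


repetitive = lz_factors("AAAAAAAA")
mixed = lz_factors("ACGTTGCA")
print(len(repetitive), len(mixed))
```

The repetitive string decomposes into fewer factors than the mixed string of the same length, which is the intuition behind using such counts as a measure of uniqueness.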


Subject(s)
Genome , Genomics , Animals , Genes, Developmental , Humans , Mammals , Mice , Software
8.
Bioinformatics ; 32(16): 2554-5, 2016 Aug 15.
Article in English | MEDLINE | ID: mdl-27153632

ABSTRACT

MOTIVATION: In many organisms, including humans, recombination clusters within recombination hotspots. The standard method for de novo detection of recombinants at hotspots is sperm typing. This relies on allele-specific PCR at single nucleotide polymorphisms. Designing allele-specific primers by hand is time-consuming. We have therefore written a package to support hotspot detection and analysis. RESULTS: hotspot consists of four programs: asp looks up SNPs and designs allele-specific primers; aso constructs allele-specific oligos for mapping recombinants; xov implements a maximum-likelihood method for estimating the crossover rate; six, finally, simulates typing data. AVAILABILITY AND IMPLEMENTATION: hotspot is written in C. Sources are freely available under the GNU General Public License from http://github.com/evolbioinf/hotspot/ CONTACT: haubold@evolbio.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Recombination, Genetic , Software , Spermatozoa , Alleles , Humans , Likelihood Functions , Male
9.
Life (Basel) ; 6(1), 2016 Mar 07.
Article in English | MEDLINE | ID: mdl-26959064

ABSTRACT

We have recently developed a distance metric for efficiently estimating the number of substitutions per site between unaligned genome sequences. These substitution rates are called "anchor distances" and can be used for phylogeny reconstruction. Most phylogenies come with bootstrap support values, which are computed by resampling with replacement columns of homologous residues from the original alignment. Unfortunately, this method cannot be applied to anchor distances, as they are based on approximate pairwise local alignments rather than the full multiple sequence alignment necessary for the classical bootstrap. We explore two alternatives: pairwise bootstrap and quartet analysis, which we compare to classical bootstrap. With simulated sequences and 53 human primate mitochondrial genomes, pairwise bootstrap gives better results than quartet analysis. However, when applied to 29 E. coli genomes, quartet analysis comes closer to the classical bootstrap.
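The classical bootstrap mentioned above, resampling alignment columns with replacement, can be sketched as follows. The toy alignment, the use of a plain mismatch fraction as the distance, and the function names are assumptions for illustration; the pairwise bootstrap and quartet analysis explored in the paper are not shown.

```python
import random


def p_distance(a, b):
    """Fraction of mismatching positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in cols) for row in alignment]


# Hypothetical alignment of three sequences, ten columns each.
alignment = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTTCGAAC"]
rng = random.Random(1)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
distances = [p_distance(r[0], r[1]) for r in replicates]
print(min(distances), max(distances))
```

The spread of `distances` across replicates indicates how strongly the estimate depends on which columns happen to be sampled, which is what support values summarize.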

10.
PLoS One ; 10(8): e0133988, 2015.
Article in English | MEDLINE | ID: mdl-26267652

ABSTRACT

Wild house mice form social hierarchies with aggressive males defending territories, in which females, young mice and submissive adult males share nests. In contrast, socially excluded males are barred from breeding groups, have numerous bite wounds and patches of thinning fur. Since their feeding times are often disrupted, we investigated whether social exclusion leads to changes in epigenetic marks of metabolic genes in liver tissue. We used chromatin immunoprecipitation and quantitative PCR to measure enrichment of two activating histone marks at 15 candidate loci. The epigenetic profiles of healthy males sampled from nest boxes differed significantly from the profiles of ostracized males caught outside of nests and showing bite wounds indicative of social exclusion. Enrichment of histone-3 lysine-4 trimethylation (H3K4me3) changed significantly at genes Cyp4a14, Gapdh, Nr3c1, Pck1, Ppara, and Sqle. Changes at histone-3 lysine-27 acetylation (H3K27ac) marks were detected at genes Fasn, Nr3c1, and Plin5. A principal components analysis separated the socialized from the ostracized mice. This was independent of body weight for the H3K4me3 mark, and partially dependent for H3K27ac. There was no separation, however, between healthy males that had been sampled from two different nests. A hierarchical cluster analysis also separated the two phenotypes, which was independent of body weight for both markers. Our study shows that a period of social exclusion during adult life leads to quantitative changes in histone modification patterns in mouse liver tissue. Similar epigenetic changes might occur during the development of stress-induced metabolic disorders in humans.


Subject(s)
DNA Methylation/genetics , Hierarchy, Social , Histones/genetics , Liver/metabolism , Metabolic Diseases/genetics , Animals , Chromatin Immunoprecipitation , Epigenesis, Genetic/genetics , Feeding Behavior/physiology , Female , Histones/metabolism , Humans , Lysine/genetics , Male , Metabolic Diseases/pathology , Mice , Promoter Regions, Genetic
11.
Bioinformatics ; 31(8): 1169-75, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25504847

ABSTRACT

MOTIVATION: A standard approach to classifying sets of genomes is to calculate their pairwise distances. This is difficult for large samples. We have therefore developed an algorithm for rapidly computing the evolutionary distances between closely related genomes. RESULTS: Our distance measure is based on ungapped local alignments that we anchor through pairs of maximal unique matches of a minimum length. These exact matches can be looked up efficiently using enhanced suffix arrays and our implementation requires approximately only 1 s and 45 MB RAM/Mbase analysed. The pairing of matches distinguishes non-homologous from homologous regions leading to accurate distance estimation. We show this by analysing simulated data and genome samples ranging from 29 Escherichia coli/Shigella genomes to 3085 genomes of Streptococcus pneumoniae. AVAILABILITY AND IMPLEMENTATION: We have implemented the computation of anchor distances in the multithreaded UNIX command-line program andi for ANchor DIstances. C sources and documentation are posted at http://github.com/evolbioinf/andi/ CONTACT: haubold@evolbio.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Biological Evolution , Genome , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Animals , Databases, Genetic , Humans , Phylogeny
12.
Genetics ; 198(1): 269-81, 2014 Sep.
Article in English | MEDLINE | ID: mdl-24948778

ABSTRACT

Although the analysis of linkage disequilibrium (LD) plays a central role in many areas of population genetics, the sampling variance of LD is known to be very large with high sensitivity to numbers of nucleotide sites and individuals sampled. Here we show that a genome-wide analysis of the distribution of heterozygous sites within a single diploid genome can yield highly informative patterns of LD as a function of physical distance. The proposed statistic, the correlation of zygosity, is closely related to the conventional population-level measure of LD, but is agnostic with respect to allele frequencies and hence likely less prone to outlier artifacts. Application of the method to several vertebrate species leads to the conclusion that >80% of recombination events are typically resolved by gene-conversion-like processes unaccompanied by crossovers, with the average lengths of conversion patches being on the order of one to several kilobases. Thus, contrary to common assumptions, the recombination rate between sites does not scale linearly with distance, often even up to distances of 100 kb. In addition, the amount of LD between sites separated by <200 bp is uniformly much greater than can be explained by the conventional neutral model, possibly because of the nonindependent origin of mutations within this spatial scale. These results raise questions about the application of conventional population-genetic interpretations to LD on short spatial scales and also about the use of spatial patterns of LD to infer demographic histories.
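To make the notion of a correlation between heterozygous sites concrete, the sketch below computes a plain Pearson correlation between heterozygosity indicators separated by a fixed distance along one genome. It is a naive illustration under assumed inputs (a 0/1 list with 1 marking a heterozygous site) and should not be read as the estimator used in the paper.

```python
def zygosity_correlation(het, d):
    """Pearson correlation between heterozygosity indicators d sites apart.

    het: list of 0/1 values along a single diploid genome.
    d:   positive distance in sites; both shifted series must vary.
    """
    x = het[:-d]
    y = het[d:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / (vx * vy) ** 0.5


# Hypothetical periodic pattern: two heterozygous sites, two homozygous.
het = [1, 1, 0, 0] * 5
same_phase = zygosity_correlation(het, 4)
opposite_phase = zygosity_correlation(het, 2)
print(same_phase, opposite_phase)
```

Evaluating the function over a range of `d` gives a curve of correlation against physical distance, the kind of pattern the abstract refers to.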


Subject(s)
Genome, Human , Linkage Disequilibrium , Models, Genetic , Animals , Gene Conversion , Gene Frequency , Heterozygote , Humans
13.
PLoS One ; 9(5): e97568, 2014.
Article in English | MEDLINE | ID: mdl-24849289

ABSTRACT

In mammals, exposure to toxic or disease-causing environments can change epigenetic marks that are inherited independently of the intrauterine environment. Such inheritance of molecular phenotypes may be adaptive. However, studies demonstrating molecular evidence for epigenetic inheritance have so far relied on extreme treatments, and are confined to inbred animals. We therefore investigated whether epigenomic changes could be detected after a non-drastic change in the environment of an outbred organism. We kept two populations of wild-caught house mice (Mus musculus domesticus) for several generations in semi-natural enclosures on either standard diet and light cycle, or on an energy-enriched diet with longer daylight to simulate summer. As an epigenetic marker for active chromatin, we quantified genome-wide histone-3 lysine-4 trimethylation (H3K4me3) from liver samples by chromatin immunoprecipitation and high-throughput sequencing as well as by quantitative polymerase chain reaction. The treatment caused a significant increase of H3K4me3 at metabolic genes such as lipid and cholesterol regulators, monooxygenases, and a bile acid transporter. In addition, genes involved in immune processes, cell cycle, and transcription and translation processes were also differently marked. When we transferred young mice of both populations to cages and bred them under standard conditions, most of the H3K4me3 differences were lost. The few loci with stable H3K4me3 changes did not cluster in metabolic functional categories. This is, to our knowledge, the first quantitative study of an epigenetic marker in an outbred mammalian organism. We demonstrate genome-wide epigenetic plasticity in response to a realistic environmental stimulus. In contrast to disease models, the bulk of the epigenomic changes we observed were not heritable.


Subject(s)
Animals, Wild/genetics , Epigenesis, Genetic , Genomics , Histones/chemistry , Histones/metabolism , Liver/metabolism , Lysine/metabolism , Animals , Animals, Wild/metabolism , Female , Male , Methylation , Mice , Protein Stability
14.
Brief Bioinform ; 15(3): 407-18, 2014 May.
Article in English | MEDLINE | ID: mdl-24291823

ABSTRACT

Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on comparative data, today usually DNA sequences. These have become so plentiful that alignment-free sequence comparison is of growing importance in the race between scientists and sequencing machines. In phylogenetics, efficient distance computation is the major contribution of alignment-free methods. A distance measure should reflect the number of substitutions per site, which underlies classical alignment-based phylogeny reconstruction. Alignment-free distance measures are either based on word counts or on match lengths, and I apply examples of both approaches to simulated and real data to assess their accuracy and efficiency. While phylogeny reconstruction is based on the number of substitutions, in population genetics, the distribution of mutations along a sequence is also considered. This distribution can be explored by match lengths, thus opening the prospect of alignment-free population genomics.
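The gap between an observed mismatch fraction and the number of substitutions per site can be illustrated with the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). The abstract does not name a substitution model, so the choice of Jukes-Cantor here is an assumption made purely for illustration.

```python
import math


def jukes_cantor(p):
    """Substitutions per site from mismatch fraction p (Jukes-Cantor model)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# The correction grows with p because multiple hits at a site are hidden.
for p in (0.01, 0.1, 0.3):
    print(p, round(jukes_cantor(p), 4))
```

For small p the corrected distance is close to p itself, which is why closely related genomes are the easiest case for any distance estimator.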


Subject(s)
Genetics, Population/methods , Phylogeny , Sequence Analysis, DNA/methods , Animals , Computational Biology/methods , Evolution, Molecular , Genetics, Population/statistics & numerical data , Genome, Mitochondrial , Humans , Models, Genetic , Mutation , Recombination, Genetic , Selection, Genetic , Sequence Alignment , Sequence Analysis, DNA/statistics & numerical data
15.
Bioinformatics ; 29(24): 3121-7, 2013 Dec 15.
Article in English | MEDLINE | ID: mdl-24064419

ABSTRACT

MOTIVATION: "Why recombination?" is one of the central questions in biology. This has led to a host of methods for quantifying recombination from sequence data. These methods are usually based on aligned DNA sequences. Here, we propose an efficient alignment-free alternative. RESULTS: Our method is based on the distribution of match lengths, which we look up using enhanced suffix arrays. By eliminating the alignment step, the test becomes fast enough for application to whole bacterial genomes. Using simulations we show that our test has similar power to established tests when applied to long pairs of sequences. When applied to 58 genomes of Escherichia coli, we pick up the strongest recombination signal from a 125 kb horizontal gene transfer engineered 20 years ago. AVAILABILITY AND IMPLEMENTATION: We have implemented our method in the command-line program rush. Its C sources and documentation are available under the GNU General Public License from http://guanine.evolbio.mpg.de/rush/.
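What a match length is can be shown with a naive substring search: for each query position, the length of the longest exact match starting there that occurs anywhere in the subject. This sketch replaces the enhanced suffix arrays named in the abstract with a brute-force lookup, so it is only suitable for very short sequences; the function name and example are assumptions.

```python
def match_lengths(query, subject):
    """Longest exact match starting at each query position (naive search)."""
    lengths = []
    for i in range(len(query)):
        length = 0
        # Grow the candidate while it still occurs somewhere in subject.
        while i + length < len(query) and query[i:i + length + 1] in subject:
            length += 1
        lengths.append(length)
    return lengths


# Hypothetical toy sequences.
lengths = match_lengths("ACGTT", "TACGA")
print(lengths)
```

Long matches suggest close homology at that position and short ones suggest divergence, so the distribution of these lengths along a genome carries the signal that alignment-free tests draw on.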


Subject(s)
Algorithms , Computational Biology , Genome, Bacterial , Recombination, Genetic , Sequence Alignment/methods , Computer Simulation , Escherichia coli/genetics , Phylogeny
16.
PLoS Pathog ; 9(7): e1003503, 2013.
Article in English | MEDLINE | ID: mdl-23935484

ABSTRACT

The origins of crop diseases are linked to domestication of plants. Most crops were domesticated centuries, even millennia, ago, thus limiting opportunity to understand the concomitant emergence of disease. Kiwifruit (Actinidia spp.) is an exception: domestication began in the 1930s, with outbreaks of canker disease caused by P. syringae pv. actinidiae (Psa) first recorded in the 1980s. Based on SNP analyses of two circularized and 34 draft genomes, we show that Psa is comprised of distinct clades exhibiting negligible within-clade diversity, consistent with disease arising by independent samplings from a source population. Three clades correspond to their geographical source of isolation; a fourth, encompassing the Psa-V lineage responsible for the 2008 outbreak, is now globally distributed. Psa has an overall clonal population structure; however, genomes carry a marked signature of within-pathovar recombination. SNP analysis of Psa-V reveals hundreds of polymorphisms; however, most reside within PPHGI-1-like conjugative elements whose evolution is unlinked to the core genome. Removal of SNPs due to recombination yields an uninformative (star-like) phylogeny consistent with diversification of Psa-V from a single clone within the last ten years. Growth assays provide evidence of cultivar specificity, with rapid systemic movement of Psa-V in Actinidia chinensis. Genomic comparisons show a dynamic genome with evidence of positive selection on type III effectors and other candidate virulence genes. Each clade has highly varied complements of accessory genes encoding effectors and toxins with evidence of gain and loss via multiple genetic routes. Genes with orthologs in vascular pathogens were found exclusively within Psa-V. Our analyses capture a pathogen in the early stages of emergence from a predicted source population associated with wild Actinidia species. In addition to candidate genes as targets for resistance breeding programs, our findings highlight the importance of the source population as a reservoir of new disease.


Subject(s)
Actinidia/microbiology , Bacterial Proteins/genetics , Genome, Bacterial , Plant Diseases/microbiology , Pseudomonas syringae/genetics , Actinidia/growth & development , Bacterial Proteins/chemistry , Bacterial Proteins/metabolism , Crops, Agricultural/growth & development , Crops, Agricultural/microbiology , Fruit/growth & development , Fruit/microbiology , Genomic Islands , Italy , Japan , New Zealand , Phylogeny , Plant Diseases/etiology , Plant Shoots/growth & development , Plant Shoots/microbiology , Polymorphism, Single Nucleotide , Pseudomonas syringae/growth & development , Pseudomonas syringae/isolation & purification , Pseudomonas syringae/pathogenicity , Recombination, Genetic , Republic of Korea , Species Specificity , Virulence
17.
G3 (Bethesda) ; 2(8): 883-9, 2012 Aug.
Article in English | MEDLINE | ID: mdl-22908037

ABSTRACT

Comparative sequencing contributes critically to the functional annotation of genomes. One prerequisite for successful analysis of the increasingly abundant comparative sequencing data is the availability of efficient computational tools. We present here a strategy for comparing unaligned genomes based on a coalescent approach combined with advanced algorithms for indexing sequences. These algorithms are particularly efficient when analyzing large genomes, as their run time ideally grows only linearly with sequence length. Using this approach, we have derived and implemented a maximum-likelihood estimator of the average number of mismatches per site between two closely related sequences, π. By allowing for fluctuating coalescent times, we are able to improve a previously published alignment-free estimator of π. We show through simulation that our new estimator is fast and accurate even with moderate recombination (ρ ≤ π). To demonstrate its applicability to real data, we compare the unaligned genomes of Drosophila persimilis and D. pseudoobscura. In agreement with previous studies, our sliding window analysis locates the global divergence minimum between these two genomes to the pericentromeric region of chromosome 3.


Subject(s)
Drosophila/genetics , Genetic Variation , Metagenomics , Algorithms , Animals , Drosophila/classification , Genome , Phylogeny , Sequence Alignment , Sequence Analysis, DNA
18.
Nat Commun ; 3: 919, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22735447

ABSTRACT

Under neutrality, polymorphisms are maintained through the balance between mutation and drift. Under selection, a variety of mechanisms may be involved in the maintenance of polymorphisms, for example, sexual selection or host-parasite coevolution on the population level or heterozygote advantage in diploid individuals. Here we address the emergence of polymorphisms in a population of interacting haploid individuals. In our model, each mutation generates a new evolutionary game characterized by a payoff matrix with an additional row and an additional column. Hence, in general, the fitness of new mutations is frequency-dependent rather than constant. This dynamical process is distinct from the sequential fixation of advantageous traits and naturally leads to the emergence of polymorphisms under selection. It causes substantially higher diversity than observed under the established models of neutral or frequency-independent selection. Our framework allows for the coexistence of an arbitrary number of types, but predicts an intermediate average diversity.
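The statement that each mutation adds a row and a column to the payoff matrix, making fitness frequency-dependent, can be sketched in a few lines. The payoff values below are arbitrary, and how new payoffs are drawn in the paper's model is not stated in the abstract, so the numbers and function names are assumptions for illustration.

```python
def fitness(payoff, freqs):
    """Frequency-dependent fitness of each type: f_i = sum_j a_ij * x_j."""
    return [sum(a * x for a, x in zip(row, freqs)) for row in payoff]


def add_mutant(payoff, new_row, new_col):
    """Extend an n x n payoff matrix to (n+1) x (n+1).

    new_row: payoffs of the mutant against types 0..n (length n+1).
    new_col: payoffs of resident types 0..n-1 against the mutant (length n).
    """
    extended = [row + [c] for row, c in zip(payoff, new_col)]
    extended.append(list(new_row))
    return extended


# Hypothetical two-type game, then a third type arising by mutation.
payoff = [[1.0, 0.5], [1.5, 1.0]]
payoff3 = add_mutant(payoff, new_row=[2.0, 0.2, 1.0], new_col=[0.3, 1.8])
rare = fitness(payoff3, [0.5, 0.49, 0.01])
common = fitness(payoff3, [0.1, 0.1, 0.8])
print(rare[2], common[2])
```

The mutant's fitness differs between the two frequency vectors, which is the sense in which the fitness of a new mutation is frequency-dependent rather than constant.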


Subject(s)
Biological Evolution , Polymorphism, Genetic/genetics , Computational Biology , Models, Theoretical , Mutation , Selection, Genetic
19.
PLoS One ; 6(5): e18155, 2011.
Article in English | MEDLINE | ID: mdl-21637331

ABSTRACT

Understanding the processes and conditions under which populations diverge to give rise to distinct species is a central question in evolutionary biology. Since recently diverged populations have high levels of shared polymorphisms, it is challenging to distinguish between recent divergence with no (or very low) inter-population gene flow and older splitting events with subsequent gene flow. Recently published methods to infer speciation parameters under the isolation-migration framework are based on summarizing polymorphism data at multiple loci in two species using the joint site-frequency spectrum (JSFS). We have developed two improvements of these methods based on a more extensive use of the JSFS classes of polymorphisms for species with high intra-locus recombination rates. First, using a likelihood-based method, we demonstrate that taking into account low-frequency polymorphisms shared between species significantly improves the joint estimation of the divergence time and gene flow between species. Second, we introduce a local linear regression algorithm that considerably reduces the computational time and allows for the estimation of unequal rates of gene flow between species. We also investigate which summary statistics from the JSFS allow the greatest estimation accuracy for divergence time and migration rates for low (around 10) and high (around 100) numbers of loci. Focusing on cases with low numbers of loci and high intra-locus recombination rates, we show that our methods for the estimation of divergence time and migration rates are more precise than existing approaches.
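A minimal sketch of how a joint site-frequency spectrum is tabulated from two samples is given below. The 0/1 haplotype encoding, the toy data, and the function name are assumptions for illustration; the likelihood and regression methods of the paper are not reproduced.

```python
from collections import Counter


def joint_sfs(pop1, pop2):
    """Count sites by (derived count in pop1, derived count in pop2).

    pop1, pop2: lists of equal-length strings of '0' (ancestral) and
    '1' (derived), covering the same sites in the same order.
    """
    jsfs = Counter()
    for i in range(len(pop1[0])):
        c1 = sum(h[i] == "1" for h in pop1)
        c2 = sum(h[i] == "1" for h in pop2)
        jsfs[(c1, c2)] += 1
    return jsfs


# Hypothetical samples of three haplotypes per species over four sites.
pop1 = ["0101", "0100", "0001"]
pop2 = ["0110", "0010", "0111"]
jsfs = joint_sfs(pop1, pop2)
print(sorted(jsfs.items()))
```

Entries with both counts strictly between zero and the sample size are shared polymorphisms, the class whose low-frequency members the abstract singles out as informative.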


Subject(s)
Genetic Speciation , Models, Genetic , Databases, Genetic , Emigration and Immigration , Gene Flow/genetics , Genetic Loci/genetics , Mutation/genetics , Recombination, Genetic/genetics
20.
Bioinformatics ; 27(11): 1466-72, 2011 Jun 01.
Article in English | MEDLINE | ID: mdl-21471011

ABSTRACT

MOTIVATION: Bacterial and viral genomes are often affected by horizontal gene transfer observable as abrupt switching in local homology. In addition to the resulting mosaic genome structure, they frequently contain regions not found in close relatives, which may play a role in virulence mechanisms. Due to this connection to medical microbiology, there are numerous methods available to detect horizontal gene transfer. However, these are usually aimed at individual genes and viral genomes rather than the much larger bacterial genomes. Here, we propose an efficient alignment-free approach to describe the mosaic structure of viral and bacterial genomes, including their unique regions. RESULTS: Our method is based on the lengths of exact matches between pairs of sequences. Long matches indicate close homology, short matches more distant homology or none at all. These exact match lengths can be looked up efficiently using an enhanced suffix array. Our program implementing this approach, alfy (ALignment-Free local homologY), efficiently and accurately detects the recombination break points in simulated DNA sequences and among recombinant HIV-1 strains. We also apply alfy to Escherichia coli genomes where we detect new evidence for the hypothesis that strains pathogenic in poultry can infect humans. AVAILABILITY: alfy is written in standard C and its source code is available under the GNU General Public License from http://guanine.evolbio.mpg.de/alfy/. The software package also includes documentation and example data.


Subject(s)
Genome, Bacterial , Genome, Viral , Sequence Analysis, DNA , Sequence Homology, Nucleic Acid , Escherichia coli/genetics , Gene Transfer, Horizontal , Genomics/methods , HIV-1/genetics , Humans , Software