Search | VHL Regional Portal

1.

Panacus: fast and exact pangenome growth and core size estimation.

Parmigiani, Luca; Garrison, Erik; Stoye, Jens; Marschall, Tobias; Doerr, Daniel.

bioRxiv ; 2024 Jun 12.

Article in English | MEDLINE | ID: mdl-38915671

ABSTRACT

Motivation: Using a single linear reference genome poses a limitation to exploring the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics to overcome these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence. Presently, there is a lack of scalable software able to perform key tasks on pangenomes, such as quantifying universally shared sequence across genomes (the core genome) and measuring the extent of genomic variability as a function of sample size (pangenome growth). Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour. Availability: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.
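As an illustration of the two quantities named in this abstract, the following minimal Python sketch computes expected pangenome growth and core size from a toy presence table. It uses a standard closed-form expectation over all subsets of k genomes; this is an assumption made for illustration and is not taken from the Panacus source code. The names `coverage_histogram`, `growth_and_core`, and the toy data are hypothetical.

```python
from collections import Counter
from math import comb


def coverage_histogram(genomes):
    """Map c -> number of items (e.g. graph nodes) present in exactly c genomes."""
    counts = Counter()
    for items in genomes.values():
        counts.update(set(items))
    return Counter(counts.values())


def growth_and_core(genomes):
    """Expected union size (growth) and intersection size (core) over all
    subsets of k genomes, for k = 1..n. Illustrative closed form only."""
    n = len(genomes)
    hist = coverage_histogram(genomes)
    growth, core = [], []
    for k in range(1, n + 1):
        total = comb(n, k)
        # An item seen in c genomes is missed by comb(n - c, k) of the k-subsets.
        growth.append(sum(h * (1 - comb(n - c, k) / total) for c, h in hist.items()))
        # It is shared by all members of comb(c, k) of the k-subsets.
        core.append(sum(h * comb(c, k) / total for c, h in hist.items()))
    return growth, core


# Hypothetical toy input: three genomes described by the node IDs they traverse.
toy = {"A": {1, 2, 3}, "B": {1, 2, 4}, "C": {1, 5}}
growth, core = growth_and_core(toy)
print(growth, core)
```

Working from the coverage histogram rather than enumerating subsets keeps the cost linear in the number of items for each k, which is why this kind of formulation is attractive for large graphs.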

2.

Family-Free Genome Comparison.

Braga, Marilia D V; Doerr, Daniel; Rubert, Diego P; Stoye, Jens.

Methods Mol Biol ; 2802: 57-72, 2024.

Article in English | MEDLINE | ID: mdl-38819556

ABSTRACT

The comparison of large-scale genome structures across distinct species offers valuable insights into the species' phylogeny, genome organization, and gene associations. In this chapter, we review the family-free genome comparison tool FFGC that, relying on built-in interfaces with a sequence comparison tool (either BLAST+ or DIAMOND) and with an ILP solver (either CPLEX or Gurobi), provides several methods for analyses that do not require prior classification of genes across the studied genomes. Taking annotated genome sequences as input, FFGC is a complete workflow for genome comparison allowing not only the computation of measures of similarity and dissimilarity but also the inference of gene families, simultaneously based on sequence similarities and large-scale genomic features.

Subject(s)

Genomics , Phylogeny , Software , Genomics/methods , Genome , Computational Biology/methods , Humans

3.

AGO, a Framework for the Reconstruction of Ancestral Syntenies and Gene Orders.

Cribbie, Evan P; Doerr, Daniel; Chauve, Cedric.

Methods Mol Biol ; 2802: 247-265, 2024.

Article in English | MEDLINE | ID: mdl-38819563

ABSTRACT

Reconstructing ancestral gene orders from the genome data of extant species is an important problem in comparative and evolutionary genomics. In a phylogenomics setting that accounts for gene family evolution through gene duplication and gene loss, the reconstruction of ancestral gene orders involves several steps, including multiple sequence alignment, the inference of reconciled gene trees, and the inference of ancestral syntenies and gene adjacencies. For each of the steps of such a process, several methods can be used and implemented using a growing corpus of, often parameterized, tools; in practice, interfacing such tools into an ancestral gene order reconstruction pipeline is far from trivial. This chapter introduces AGO, a Python-based framework for creating ancestral gene order reconstruction pipelines that interface and parameterize different bioinformatics tools. The authors illustrate the features of AGO by reconstructing ancestral gene orders for the X chromosome of three ancestral Anopheles species using three different pipelines. AGO is freely available at https://github.com/cchauve/AGO-pipeline.

Subject(s)

Evolution, Molecular , Gene Order , Genomics , Phylogeny , Software , Animals , Genomics/methods , Computational Biology/methods , Synteny/genetics , Anopheles/genetics , X Chromosome/genetics , Sequence Alignment/methods

4.

Training an automated circulating tumor cell classifier when the true classification is uncertain.

Nanou, Afroditi; Stoecklein, Nikolas H; Doerr, Daniel; Driemel, Christiane; Terstappen, Leon W M M; Coumans, Frank A W.

PNAS Nexus ; 3(2): pgae048, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38371418

ABSTRACT

Circulating tumor cell (CTC) and tumor-derived extracellular vesicle (tdEV) loads are prognostic factors of survival in patients with carcinoma. The current method of CTC enumeration relies on operator review and, unfortunately, has moderate interoperator agreement (Fleiss' kappa 0.60) due to difficulties in classifying CTC-like events. We compared operator review, ACCEPT automated image processing, and the refined output of a deep-learning algorithm to identify CTC and tdEV for the prediction of survival in patients with metastatic and nonmetastatic cancers. Operator review is only defined for CTC. Refinement was performed using automatic contrast maximization of events detected in cancer and in benign samples (CM-CTC). We used 418 samples from benign diseases, 6,293 from nonmetastatic breast, 2,408 from metastatic breast, and 698 from metastatic prostate cancer to train, test, optimize, and evaluate CTC and tdEV enumeration. For CTC identification, the CM-CTC performed best on metastatic/nonmetastatic breast cancer, respectively, with a hazard ratio (HR) for overall survival of 2.6/2.1 vs. 2.4/1.4 for operator CTC and 1.2/0.8 for ACCEPT-CTC. For tdEV identification, CM-tdEV performed best with an HR of 1.6/2.9 vs. 1.5/1.0 with ACCEPT-tdEV. In conclusion, contrast maximization is effective even though it does not utilize domain knowledge.

5.

Correction: Constructing founder sets under allelic and non-allelic homologous recombination.

Bonnet, Konstantinn; Marschall, Tobias; Doerr, Daniel.

Algorithms Mol Biol ; 18(1): 20, 2023 Dec 06.

Article in English | MEDLINE | ID: mdl-38057863

6.

Constructing founder sets under allelic and non-allelic homologous recombination.

Bonnet, Konstantinn; Marschall, Tobias; Doerr, Daniel.

Algorithms Mol Biol ; 18(1): 15, 2023 Sep 29.

Article in English | MEDLINE | ID: mdl-37775806

ABSTRACT

Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, then called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements, including deletion, duplication, and inversion, and underlies many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR. In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, describe exact methods to characterize the number of recombinations, and demonstrate scalability to problem instances arising in practice.

7.

A draft human pangenome reference.

Liao, Wen-Wei; Asri, Mobin; Ebler, Jana; Doerr, Daniel; Haukness, Marina; Hickey, Glenn; Lu, Shuangjia; Lucas, Julian K; Monlong, Jean; Abel, Haley J; Buonaiuto, Silvia; Chang, Xian H; Cheng, Haoyu; Chu, Justin; Colonna, Vincenza; Eizenga, Jordan M; Feng, Xiaowen; Fischer, Christian; Fulton, Robert S; Garg, Shilpa; Groza, Cristian; Guarracino, Andrea; Harvey, William T; Heumos, Simon; Howe, Kerstin; Jain, Miten; Lu, Tsung-Yu; Markello, Charles; Martin, Fergal J; Mitchell, Matthew W; Munson, Katherine M; Mwaniki, Moses Njagi; Novak, Adam M; Olsen, Hugh E; Pesout, Trevor; Porubsky, David; Prins, Pjotr; Sibbesen, Jonas A; Sirén, Jouni; Tomlinson, Chad; Villani, Flavia; Vollger, Mitchell R; Antonacci-Fulton, Lucinda L; Baid, Gunjan; Baker, Carl A; Belyaeva, Anastasiya; Billis, Konstantinos; Carroll, Andrew; Chang, Pi-Chuan; Cody, Sarah.

Nature ; 617(7960): 312-324, 2023 05.

Article in English | MEDLINE | ID: mdl-37165242

ABSTRACT

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.

Subject(s)

Genome, Human , Genomics , Humans , Diploidy , Genome, Human/genetics , Haplotypes/genetics , Sequence Analysis, DNA , Genomics/standards , Reference Standards , Cohort Studies , Alleles , Genetic Variation

8.

Small parsimony for natural genomes in the DCJ-indel model.

Doerr, Daniel; Chauve, Cedric.

J Bioinform Comput Biol ; 19(6): 2140009, 2021 12.

Article in English | MEDLINE | ID: mdl-34806948

ABSTRACT

The Small Parsimony Problem (SPP) aims at finding the gene orders at internal nodes of a given phylogenetic tree such that the overall genome rearrangement distance along the tree branches is minimized. This problem is intractable in most genome rearrangement models, especially when gene duplication and loss are considered. In this work, we describe an Integer Linear Program algorithm to solve the SPP for natural genomes, i.e. genomes that contain conserved, unique, and duplicated markers. The evolutionary model that we consider is the DCJ-indel model that includes the Double-Cut and Join rearrangement operation and the insertion and deletion of genome segments. We evaluate our algorithm on simulated data and show that it is able to reconstruct very efficiently and accurately ancestral gene orders in a very comprehensive evolutionary model.

Subject(s)

Genome , Models, Genetic , Algorithms , Biological Evolution , Evolution, Molecular , Gene Rearrangement , Phylogeny

9.

The potential of family-free rearrangements towards gene orthology inference.

Rubert, Diego P; Doerr, Daniel; Braga, Marília D V.

J Bioinform Comput Biol ; 19(6): 2140014, 2021 12.

Article in English | MEDLINE | ID: mdl-34775922

ABSTRACT

Recently, we proposed an efficient ILP formulation [Rubert DP, Martinez FV, Braga MDV, Natural family-free genomic distance, Algorithms Mol Biol 16:4, 2021] for exactly computing the rearrangement distance of two genomes in a family-free setting. In such a setting, neither prior classification of genes into families, nor further restrictions on the genomes are imposed. Given two genomes, the mentioned ILP computes an optimal matching of the genes taking into account simultaneously local mutations, given by gene similarities, and large-scale genome rearrangements. Here, we explore the potential of using this ILP for inferring groups of orthologs across several species. More precisely, given a set of genomes, our method first computes all pairwise optimal gene matchings, which are then integrated into gene families in the second step. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities. It can be downloaded from gitlab.ub.uni-bielefeld.de/gi/FFGC. We obtained promising results with experiments on both simulated and real data.

Subject(s)

Genome , Models, Genetic , Algorithms , Gene Rearrangement , Genomics , Humans
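The second step described in the abstract above, integrating pairwise gene matchings into gene families, can be sketched as follows. The abstract does not say how the integration is done; this sketch assumes the simplest option, taking connected components of the matching graph with a union-find structure. The function name `families_from_matchings` and the gene labels are hypothetical and do not come from FFGC.

```python
def families_from_matchings(matchings):
    """Group genes into families as connected components of matched pairs.

    `matchings` is an iterable of (gene_a, gene_b) pairs pooled from all
    pairwise genome comparisons. Assumption: a family is one connected
    component; the published pipeline may integrate matchings differently.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in matchings:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for gene in parent:
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical matched pairs from three pairwise comparisons (genomes A, B, C).
pairs = [("A:g1", "B:g1"), ("B:g1", "C:g2"), ("A:g2", "C:g1")]
print(families_from_matchings(pairs))
```

Connected components are permissive: one spurious matching merges two families, so a real pipeline would likely filter or weight matchings first.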

10.

Computing the Rearrangement Distance of Natural Genomes.

Bohnenkämper, Leonard; Braga, Marília D V; Doerr, Daniel; Stoye, Jens.

J Comput Biol ; 28(4): 410-431, 2021 04.

Article in English | MEDLINE | ID: mdl-33393848

ABSTRACT

The computation of genomic distances has been a very active field of computational comparative genomics over the past 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double cut and join distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao et al. relax this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they give an integer linear programming (ILP) solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes that have equal numbers of duplicates of any marker. Therefore, it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed. In this article, we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao et al. to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few 10,000 markers, which we demonstrate on simulated and real data.

Subject(s)

Computational Biology , Gene Rearrangement/genetics , Genome/genetics , Genomics , Algorithms , Models, Genetic , Programming, Linear

11.

Active and repressed biosynthetic gene clusters have spatially distinct chromosome states.

Nützmann, Hans-Wilhelm; Doerr, Daniel; Ramírez-Colmenero, América; Sotelo-Fonseca, Jesús Emiliano; Wegel, Eva; Di Stefano, Marco; Wingett, Steven W; Fraser, Peter; Hurst, Laurence; Fernandez-Valverde, Selene L; Osbourn, Anne.

Proc Natl Acad Sci U S A ; 117(24): 13800-13809, 2020 06 16.

Article in English | MEDLINE | ID: mdl-32493747

ABSTRACT

While colocalization within a bacterial operon enables coexpression of the constituent genes, the mechanistic logic of clustering of nonhomologous monocistronic genes in eukaryotes is not immediately obvious. Biosynthetic gene clusters that encode pathways for specialized metabolites are an exception to the classical eukaryote rule of random gene location and provide paradigmatic exemplars with which to understand eukaryotic cluster dynamics and regulation. Here, using 3C, Hi-C, and Capture Hi-C (CHi-C) organ-specific chromosome conformation capture techniques along with high-resolution microscopy, we investigate how chromosome topology relates to transcriptional activity of clustered biosynthetic pathway genes in Arabidopsis thaliana. Our analyses reveal that biosynthetic gene clusters are embedded in local hot spots of 3D contacts that segregate cluster regions from the surrounding chromosome environment. The spatial conformation of these cluster-associated domains differs between transcriptionally active and silenced clusters. We further show that silenced clusters associate with heterochromatic chromosomal domains toward the periphery of the nucleus, while transcriptionally active clusters relocate away from the nuclear periphery. Examination of chromosome structure at unrelated clusters in maize, rice, and tomato indicates that integration of clustered pathway genes into distinct topological domains is a common feature in plant genomes. Our results shed light on the potential mechanisms that constrain coexpression within clusters of nonhomologous eukaryotic genes and suggest that gene clustering in the one-dimensional chromosome is accompanied by compartmentalization of the 3D chromosome.

Subject(s)

Arabidopsis/genetics , Chromosomes, Plant/genetics , Multigene Family , Plant Proteins/genetics , Solanum lycopersicum/genetics , Zea mays/genetics , Arabidopsis/metabolism , Chromosomes, Plant/metabolism , Genome, Plant , Solanum lycopersicum/metabolism , Oryza/genetics , Oryza/metabolism , Plant Proteins/metabolism , Zea mays/metabolism

12.

Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants.

Rubert, Diego P; Martinez, Fábio V; Stoye, Jens; Doerr, Daniel.

BMC Genomics ; 21(Suppl 2): 273, 2020 Apr 16.

Article in English | MEDLINE | ID: mdl-32299356

ABSTRACT

BACKGROUND: Computationally inferred ancestral genomes play an important role in many areas of genome research. We present an improved workflow for the reconstruction from highly diverged genomes such as those of plants. RESULTS: Our work relies on an established workflow in the reconstruction of ancestral plants, but improves several steps of this process. Instead of using gene annotations for inferring the genome content of the ancestral sequence, we identify genomic markers through a process called genome segmentation. This enables us to reconstruct the ancestral genome from hundreds of thousands of markers rather than the tens of thousands of annotated genes. We also introduce the concept of local genome rearrangement, through which we refine syntenic blocks before they are used in the reconstruction of contiguous ancestral regions. With the enhanced workflow at hand, we reconstruct the ancestral genome of eudicots, a major sub-clade of flowering plants, using whole genome sequences of five modern plants. CONCLUSIONS: Our reconstructed genome is highly detailed, yet its layout agrees well with that reported in Badouin et al. (2017). Using local genome rearrangement, not only the marker-based, but also the gene-based reconstruction of the eudicot ancestor exhibited increased genome content, evidencing the power of this novel concept.

Subject(s)

Chromosome Mapping/methods , Genomics/methods , Magnoliopsida/genetics , Computer Simulation , Evolution, Molecular , Gene Order , Genome, Plant , Models, Genetic , Phylogeny , Synteny/genetics

13.

Horizontal Gene Transfer Phylogenetics: A Random Walk Approach.

Sevillya, Gur; Doerr, Daniel; Lerner, Yael; Stoye, Jens; Steel, Mike; Snir, Sagi.

Mol Biol Evol ; 37(5): 1470-1479, 2020 05 01.

Article in English | MEDLINE | ID: mdl-31845962

ABSTRACT

The dramatic decrease in time and cost for generating genetic sequence data has opened up vast opportunities in molecular systematics, one of which is the ability to decipher the evolutionary history of strains of a species. Under this fine systematic resolution, the standard markers are too crude to provide a phylogenetic signal. Nevertheless, among prokaryotes, genome dynamics in the form of horizontal gene transfer (HGT) between organisms and gene loss seem to provide far richer information by affecting both gene order and gene content. The "synteny index" (SI) between a pair of genomes combines these latter two factors, allowing comparison of genomes with unequal gene content, together with order considerations of their common genes. Although this approach is useful for classifying close relatives, no rigorous statistical modeling for it has been suggested. Such modeling is valuable, as it allows observed measures to be transformed into estimates of time periods during evolution, yielding the "additivity" of the measure. To the best of our knowledge, there is no other additivity proof for other gene order/content measures under HGT. Here, we provide a first statistical model and analysis for the SI measure. We model the "gene neighborhood" as a "birth-death-immigration" process affected by the HGT activity over the genome, and analytically relate the HGT rate and time to the expected SI. This model is asymptotic and thus provides accurate results, assuming infinite size genomes. Therefore, we also developed a heuristic model following an "exponential decay" function, accounting for biologically realistic values, which performed well in simulations. Applying this model to 1,133 prokaryotes partitioned into 39 clusters at the rank of genus shows that the average number of genome dynamics events per gene at the phylogenetic depth of genus is around half, with significant variability between genera. This result extends and confirms similar results obtained for individual genera in different manners.

Subject(s)

Gene Transfer, Horizontal , Genetic Techniques , Models, Genetic , Synteny , Genome, Microbial , Phylogeny
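The synteny index described in the abstract above is not defined there in detail. The sketch below assumes one common formulation: for each gene shared by two genomes, take its k nearest neighbors on each side in both genomes, count the neighbors in common, divide by 2k, and average over all shared genes. Circular genomes with unique gene labels and more than 2k genes are further assumptions, and the names `neighborhood` and `synteny_index` are hypothetical.

```python
def neighborhood(genome, idx, k):
    """The k genes on each side of position idx in a circular gene order."""
    n = len(genome)
    return {genome[(idx + d) % n] for d in range(-k, k + 1) if d != 0}


def synteny_index(g1, g2, k):
    """Average fraction of shared k-neighbors over genes common to g1 and g2.

    Assumes unique gene labels and circular genomes longer than 2k genes.
    """
    pos1 = {gene: i for i, gene in enumerate(g1)}
    pos2 = {gene: i for i, gene in enumerate(g2)}
    common = pos1.keys() & pos2.keys()
    if not common:
        return 0.0
    total = 0.0
    for gene in common:
        shared = neighborhood(g1, pos1[gene], k) & neighborhood(g2, pos2[gene], k)
        total += len(shared) / (2 * k)
    return total / len(common)


# Toy genomes: the second has gene "e" replaced by a transferred gene "x".
g1 = list("abcdefgh")
g2 = list("abcdxfgh")
print(synteny_index(g1, g2, k=2))
```

A single replaced gene lowers the index only for the genes within distance k of it, which is how the measure reflects both gene content and local gene order.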

14.

GraphTeams: a method for discovering spatial gene clusters in Hi-C sequencing data.

Schulz, Tizian; Stoye, Jens; Doerr, Daniel.

BMC Genomics ; 19(Suppl 5): 308, 2018 May 08.

Article in English | MEDLINE | ID: mdl-29745835

ABSTRACT

BACKGROUND: Hi-C sequencing offers novel, cost-effective means to study the spatial conformation of chromosomes. We use data obtained from Hi-C experiments to provide new evidence for the existence of spatial gene clusters. These are sets of genes with associated functionality that exhibit close proximity to each other in the spatial conformation of chromosomes across several related species. RESULTS: We present the first gene cluster model capable of handling spatial data. Our model generalizes a popular computational model for gene cluster prediction, called δ-teams, from sequences to graphs. Following previous lines of research, we subsequently extend our model to allow for several vertices being associated with the same label. The model, called δ-teams with families, is particularly suitable for our application as it enables handling of gene duplicates. We develop algorithmic solutions for both models. We implemented the algorithm for discovering δ-teams with families and integrated it into a fully automated workflow for discovering gene clusters in Hi-C data, called GraphTeams. We applied it to human and mouse data to find intra- and interchromosomal gene cluster candidates. The results include intrachromosomal clusters that seem to exhibit a closer proximity in space than on their chromosomal DNA sequence. We further discovered interchromosomal gene clusters that contain genes from different chromosomes within the human genome, but are located on a single chromosome in mouse. CONCLUSIONS: By identifying δ-teams with families, we provide a flexible model to discover gene cluster candidates in Hi-C data. Our analysis of Hi-C data from human and mouse reveals several known gene clusters (thus validating our approach), but also a few sparsely studied or possibly unknown gene cluster candidates that could be the source of further experimental investigations.

Subject(s)

Algorithms , Chromosomes/chemistry , Computer Graphics , High-Throughput Nucleotide Sequencing/methods , Multigene Family , Sequence Analysis, DNA/methods , Animals , Cluster Analysis , Genomics , Humans , Mice

15.

Sequence-Based Synteny Analysis of Multiple Large Genomes.

Doerr, Daniel; Moret, Bernard M E.

Methods Mol Biol ; 1704: 317-329, 2018.

Article in English | MEDLINE | ID: mdl-29277871

ABSTRACT

Current methods for synteny analysis provide only limited support to study large genomes at the sequence level. In this chapter, we describe a pipeline based on existing tools that, applied in a suitable fashion, enables synteny analysis of large genomic datasets. We give a hands-on description of each step of the pipeline using four avian genomes for data. We also provide integration scripts that simplify the conversion and setup of data between the different tools in the pipeline.

Subject(s)

Birds/genetics , Genome , Software , Synteny , Algorithms , Animals , Birds/classification , Computational Biology , Genetic Markers , Genomics/methods , Sequence Analysis, DNA

16.

Family-Free Genome Comparison.

Doerr, Daniel; Feijão, Pedro; Stoye, Jens.

Methods Mol Biol ; 1704: 331-342, 2018.

Article in English | MEDLINE | ID: mdl-29277872

ABSTRACT

The comparison of genome structures across distinct species offers valuable insights into the species' phylogeny, genome organization, and gene associations. In this chapter, we review the family-free genome comparison tool FFGC which provides several methods for gene order analyses that do not require prior knowledge of evolutionary relationships between the genes across the studied genomes. Moreover, the tool features a complete workflow for genome comparison, requiring nothing but annotated genome sequences as input.

Subject(s)

Evolution, Molecular , Gene Order , Genome , Software , Computational Biology , Models, Genetic , Molecular Sequence Annotation , Phylogeny

17.

Comparative scaffolding and gap filling of ancient bacterial genomes applied to two ancient Yersinia pestis genomes.

Luhmann, Nina; Doerr, Daniel; Chauve, Cedric.

Microb Genom ; 3(9): e000123, 2017 09.

Article in English | MEDLINE | ID: mdl-29114402

ABSTRACT

Yersinia pestis is the causative agent of the bubonic plague, a disease responsible for several dramatic historical pandemics. Progress in ancient DNA (aDNA) sequencing rendered possible the sequencing of whole genomes of important human pathogens, including the ancient Y. pestis strains responsible for outbreaks of the bubonic plague in London in the 14th century and in Marseille in the 18th century, among others. However, aDNA sequencing data are still characterized by short reads and non-uniform coverage, so assembling ancient pathogen genomes remains challenging and often prevents a detailed study of genome rearrangements. It has recently been shown that comparative scaffolding approaches can improve the assembly of ancient Y. pestis genomes at a chromosome level. In the present work, we address the last step of genome assembly, the gap-filling stage. We describe an optimization-based method AGapEs (ancestral gap estimation) to fill in inter-contig gaps using a combination of a template obtained from related extant genomes and aDNA reads. We show how this approach can be used to refine comparative scaffolding by selecting contig adjacencies supported by a mix of unassembled aDNA reads and comparative signal. We applied our method to two Y. pestis data sets from the London and Marseille outbreaks, for which we obtained highly improved genome assemblies for both genomes, comprised of, respectively, five and six scaffolds with 95% of the assemblies supported by ancient reads. We analysed the genome evolution between both ancient genomes in terms of genome rearrangements, and observed a high level of synteny conservation between these strains.

Subject(s)

Contig Mapping/methods , DNA, Ancient , Genome, Bacterial , Plague/microbiology , Yersinia pestis/genetics , DNA, Bacterial , Evolution, Molecular , France/epidemiology , History, 18th Century , History, Medieval , Humans , London/epidemiology , Pandemics/history , Phylogeny , Plague/epidemiology , Plague/history

18.

New Genome Similarity Measures based on Conserved Gene Adjacencies.

Doerr, Daniel; Kowada, Luis Antonio B; Araujo, Eloi; Deshpande, Shachi; Dantas, Simone; Moret, Bernard M E; Stoye, Jens.

J Comput Biol ; 24(6): 616-634, 2017 Jun.

Article in English | MEDLINE | ID: mdl-28590847

ABSTRACT

Many important questions in molecular biology, evolution, and biomedicine can be addressed by comparative genomic approaches. One of the basic tasks when comparing genomes is the definition of measures of similarity (or dissimilarity) between two genomes, for example, to elucidate the phylogenetic relationships between species. The power of different genome comparison methods varies with the underlying formal model of a genome. The simplest models impose the strong restriction that each genome under study must contain the same genes, each in exactly one copy. More realistic models allow several copies of a gene in a genome. One speaks of gene families, and comparative genomic methods that allow this kind of input are called gene family-based. The most powerful (but also most complex) models avoid this preprocessing of the input data and instead integrate the family assignment within the comparative analysis. Such methods are called gene family-free. In this article, we study an intermediate approach between family-based and family-free genomic similarity measures. Introducing this simpler model, called gene connections, we focus on the combinatorial aspects of gene family-free genome comparison. While in most cases the computational costs are the same as in the general family-free case, we also find an instance where the gene connections model has lower complexity. Within the gene connections model, we define three variants of genomic similarity measures that differ in expressive power. We give polynomial-time algorithms for two of them, while we show NP-hardness for the third, most powerful one. We also generalize the measures and algorithms to make them more robust against recent local disruptions in gene order. Our theoretical findings are supported by experimental results, proving the applicability and performance of our newly defined similarity measures.

Subject(s)

Algorithms , Computational Biology/methods , Gene Order , Genes, Plant , Genome, Plant , Genomics/methods , Models, Genetic , Multigene Family , Phylogeny

19.

The gene family-free median of three.

Doerr, Daniel; Balaban, Metin; Feijão, Pedro; Chauve, Cedric.

Algorithms Mol Biol ; 12: 14, 2017.

Article in English | MEDLINE | ID: mdl-28559921

ABSTRACT

BACKGROUND: The gene family-free framework for comparative genomics aims at providing methods for gene order analysis that do not require prior gene family assignment, but work directly on a sequence similarity graph. We study two problems related to the breakpoint median of three genomes, which asks for the construction of a fourth genome that minimizes the sum of breakpoint distances to the input genomes. METHODS: We present a model for constructing a median of three genomes in this family-free setting, based on maximizing an objective function that generalizes the classical breakpoint distance by integrating sequence similarity in the score of a gene adjacency. We study its computational complexity and we describe an integer linear program (ILP) for its exact solution. We further discuss a related problem called family-free adjacencies for k genomes for the special case of [Formula: see text] and present an ILP for its solution. However, for this problem, the computation of exact solutions remains intractable for sufficiently large instances. We then proceed to describe a heuristic method, FFAdj-AM, which performs well in practice. RESULTS: The developed methods compute accurate positional orthologs for genomes comparable in size of bacterial genomes on simulated data and genomic data acquired from the OMA orthology database. In particular, FFAdj-AM performs equally or better when compared to the well-established gene family prediction tool MultiMSOAR. CONCLUSIONS: We study the computational complexity of a new family-free model and present algorithms for its solution. With FFAdj-AM, we propose an appealing alternative to established tools for identifying higher confidence positional orthologs.
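The classical breakpoint median that the abstract above generalizes can be illustrated with a short Python sketch. It assumes the simplest textbook setting: unsigned, single-chromosome linear genomes over the same set of unique genes, ignoring telomeres. It does not implement the family-free model or FFAdj-AM, and the names `adjacencies`, `breakpoint_distance`, and `median_score` are hypothetical.

```python
def adjacencies(genome):
    """Unordered pairs of neighboring genes in a linear, unsigned gene order."""
    return {frozenset(pair) for pair in zip(genome, genome[1:])}


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not adjacencies of g2.

    Assumes both genomes contain the same unique genes, so the count is
    symmetric; telomeres are ignored in this simplified variant.
    """
    return len(adjacencies(g1) - adjacencies(g2))


def median_score(candidate, genomes):
    """Objective of the breakpoint median: summed distance to the inputs."""
    return sum(breakpoint_distance(candidate, g) for g in genomes)


# Three toy input genomes and one candidate median.
inputs = [[1, 2, 3, 4, 5], [1, 2, 4, 3, 5], [5, 4, 3, 2, 1]]
print(median_score([1, 2, 3, 4, 5], inputs))
```

A median solver would search over candidate gene orders to minimize `median_score`; the family-free variant described in the abstract instead maximizes a score that also weights adjacencies by sequence similarity.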

20.

Orthology detection combining clustering and synteny for very large datasets.

Lechner, Marcus; Hernandez-Rosales, Maribel; Doerr, Daniel; Wieseke, Nicolas; Thévenin, Annelyse; Stoye, Jens; Hartmann, Roland K; Prohaska, Sonja J; Stadler, Peter F.

PLoS One ; 9(8): e105015, 2014.

Article in English | MEDLINE | ID: mdl-25137074

ABSTRACT

The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.

Subject(s)

Models, Genetic , Software , Synteny , Bacterial Proteins/genetics , Cluster Analysis , Computer Simulation , Datasets as Topic , Genes, Bacterial
