Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 33
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Genome Biol Evol ; 12(1): 3586-3598, 2020 01 01.
Artigo em Inglês | MEDLINE | ID: mdl-31774499

RESUMO

Plant mitogenomes can be difficult to assemble because they are structurally dynamic and prone to intergenomic DNA transfers, leading to the unusual situation where an organelle genome is far outnumbered by its nuclear counterparts. As a result, comparative mitogenome studies are in their infancy and some key aspects of genome evolution are still known mainly from pregenomic, qualitative methods. To help address these limitations, we combined machine learning and in silico enrichment of mitochondrial-like long reads to assemble the bacterial-sized mitogenome of Norway spruce (Pinaceae: Picea abies). We conducted comparative analyses of repeat abundance, intergenomic transfers, substitution and rearrangement rates, and estimated repeat-by-repeat homologous recombination rates. Prompted by our discovery of highly recombinogenic small repeats in P. abies, we assessed the genomic support for the prevailing hypothesis that intramolecular recombination is predominantly driven by repeat length, with larger repeats facilitating DNA exchange more readily. Overall, we found mixed support for this view: Recombination dynamics were heterogeneous across vascular plants and highly active small repeats (ca. 200 bp) were present in about one-third of studied mitogenomes. As in previous studies, we did not observe any robust relationships among commonly studied genome attributes, but we identify variation in recombination rates as a underinvestigated source of plant mitogenome diversity.


Assuntos
Genoma Mitocondrial , Picea/genética , Recombinação Genética , Simulação por Computador , Cycadopsida/genética , DNA de Plantas/química , Genes de Plantas , Variação Genética , Sequências Repetitivas de Ácido Nucleico , Máquina de Vetores de Suporte
2.
Mol Phylogenet Evol ; 139: 106558, 2019 10.
Artigo em Inglês | MEDLINE | ID: mdl-31288106

RESUMO

The oomycetes are filamentous eukaryotic microorganisms, distinct from true fungi, many of which act as crop or fish pathogens that cause devastating losses in agriculture and aquaculture. Chitin is present in all true fungi, but it occurs in only small amounts in some Saprolegniomycetes and it is absent in Peronosporomycetes. However, the growth of several oomycetes is severely impacted by competitive chitin synthase (CHS) inhibitors. Here, we shed light on the diversity, evolution and function of oomycete CHS proteins. We show by phylogenetic analysis of 93 putative CHSs from 48 highly diverse oomycetes, including the early diverging Eurychasma dicksonii, that all available oomycete genomes contain at least one putative CHS gene. All gene products contain conserved CHS motifs essential for enzymatic activity and form two Peronosporomycete-specific and six Saprolegniale-specific clades. Proteins of all clades, except one, contain an N-terminal microtubule interacting and trafficking (MIT) domain as predicted by protein domain databases or manual analysis, which is supported by homology modelling and comparison of conserved structural features from sequence logos. We identified at least three groups of CHSs conserved among all oomycete lineages and used phylogenetic reconciliation analysis to infer the dynamic evolution of CHSs in oomycetes. The evolutionary aspects of CHS diversity in modern-day oomycetes are discussed. In addition, we observed hyphal tip rupture in Phytophthora infestans upon treatment with the CHS inhibitor nikkomycin Z. Combining data on phylogeny, gene expression, and response to CHS inhibitors, we propose the association of different CHS clades with certain developmental stages.


Assuntos
Quitina Sintase/genética , Evolução Molecular , Variação Genética , Oomicetos/enzimologia , Oomicetos/genética , Sequência de Aminoácidos , Quitina Sintase/química , Sequência Conservada/genética , Funções Verossimilhança , Filogenia , Domínios Proteicos
3.
Bioinformatics ; 34(21): 3646-3652, 2018 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-29762653

RESUMO

Motivation: A reconciliation is an annotation of the nodes of a gene tree with evolutionary events-for example, speciation, gene duplication, transfer, loss, etc.-along with a mapping onto a species tree. Many algorithms and software produce or use reconciliations but often using different reconciliation formats, regarding the type of events considered or whether the species tree is dated or not. This complicates the comparison and communication between different programs. Results: Here, we gather a consortium of software developers in gene tree species tree reconciliation to propose and endorse a format that aims to promote an integrative-albeit flexible-specification of phylogenetic reconciliations. This format, named recPhyloXML, is accompanied by several tools such as a reconciled tree visualizer and conversion utilities. Availability and implementation: http://phylariane.univ-lyon1.fr/recphyloxml/.


Assuntos
Evolução Molecular , Duplicação Gênica , Algoritmos , Filogenia , Software
4.
BMC Bioinformatics ; 18(1): 97, 2017 Feb 10.
Artigo em Inglês | MEDLINE | ID: mdl-28187712

RESUMO

BACKGROUND: MCMC-based methods are important for Bayesian inference of phylogeny and related parameters. Although being computationally expensive, MCMC yields estimates of posterior distributions that are useful for estimating parameter values and are easy to use in subsequent analysis. There are, however, sometimes practical difficulties with MCMC, relating to convergence assessment and determining burn-in, especially in large-scale analyses. Currently, multiple software are required to perform, e.g., convergence, mixing and interactive exploration of both continuous and tree parameters. RESULTS: We have written a software called VMCMC to simplify post-processing of MCMC traces with, for example, automatic burn-in estimation. VMCMC can also be used both as a GUI-based application, supporting interactive exploration, and as a command-line tool suitable for automated pipelines. CONCLUSIONS: VMCMC is a free software available under the New BSD License. Executable jar files, tutorial manual and source code can be downloaded from https://bitbucket.org/rhali/visualmcmc/ .


Assuntos
Software , Cadeias de Markov , Método de Monte Carlo , Filogenia
5.
J Comput Biol ; 24(6): 581-589, 2017 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-27681236

RESUMO

Reads from paired-end and mate-pair libraries are often utilized to find structural variation in genomes, and one common approach is to use their fragment length for detection. After aligning read pairs to the reference, read pair distances are analyzed for statistically significant deviations. However, previously proposed methods are based on a simplified model of observed fragment lengths that does not agree with data. We show how this model limits statistical analysis of identifying variants and propose a new model by adapting a model we have previously introduced for contig scaffolding, which agrees with data. From this model, we derive an improved null hypothesis that when applied in the variant caller CLEVER, reduces the number of false positives and corrects a bias that contributes to more deletion calls than insertion calls. We advise developers of variant callers with statistical fragment length-based methods to adapt the concepts in our proposed model and null hypothesis.


Assuntos
Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Metagenômica/métodos , Análise de Sequência de DNA/métodos , Software , Viés , Genoma Humano , Variação Estrutural do Genoma , Humanos , Modelos Genéticos
6.
BMC Evol Biol ; 16(1): 120, 2016 06 04.
Artigo em Inglês | MEDLINE | ID: mdl-27260514

RESUMO

BACKGROUND: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity. RESULTS: In this study, we validate GenFamClust by comparing it to well known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms on homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs. CONCLUSIONS: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline to infer homologs and gene families than other commonly used homology and gene-family inference methods.


Assuntos
Algoritmos , Homologia de Sequência do Ácido Nucleico , Sintenia , Animais , Análise por Conglomerados , Bases de Dados Genéticas , Fungos/genética , Humanos , Camundongos , Filogenia , Especificidade da Espécie , Estatística como Assunto
7.
Bioinformatics ; 32(13): 1925-32, 2016 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-27153683

RESUMO

MOTIVATION: Scaffolding is often an essential step in a genome assembly process, in which contigs are ordered and oriented using read pairs from a combination of paired-end libraries and longer-range mate-pair libraries. Although a simple idea, scaffolding is unfortunately hard to get right in practice. One source of problems is so-called PE-contamination in mate-pair libraries, in which a non-negligible fraction of the read pairs get the wrong orientation and a much smaller insert size than what is expected. This contamination has been discussed before, in relation to integrated scaffolders, but solutions rely on the orientation being observable, e.g. by finding the junction adapter sequence in the reads. This is not always possible, making orientation and insert size of a read pair stochastic. To our knowledge, there is neither previous work on modeling PE-contamination, nor a study on the effect PE-contamination has on scaffolding quality. RESULTS: We have addressed PE-contamination in an update to our scaffolder BESST. We formulate the problem as an integer linear program which is solved using an efficient heuristic. The new method shows significant improvement over both integrated and stand-alone scaffolders in our experiments. The impact of modeling PE-contamination is quantified by comparing with the previous BESST model. We also show how other scaffolders are vulnerable to PE-contaminated libraries, resulting in an increased number of misassemblies, more conservative scaffolding and inflated assembly sizes. AVAILABILITY AND IMPLEMENTATION: The model is implemented in BESST. Source code and usage instructions are found at https://github.com/ksahlin/BESST BESST can also be downloaded using PyPI. CONTACT: ksahlin@kth.se SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biblioteca Gênica , Genômica/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Bactérias/genética , Humanos , Programação Linear , Software
8.
BMC Bioinformatics ; 17(Suppl 14): 431, 2016 Nov 11.
Artigo em Inglês | MEDLINE | ID: mdl-28185583

RESUMO

BACKGROUND: Lateral gene transfer (LGT) is an evolutionary process that has an important role in biology. It challenges the traditional binary tree-like evolution of species and is attracting increasing attention of the molecular biologists due to its involvement in antibiotic resistance. A number of attempts have been made to model LGT in the presence of gene duplication and loss, but reliably placing LGT events in the species tree has remained a challenge. RESULTS: In this paper, we propose probabilistic methods that samples reconciliations of the gene tree with a dated species tree and computes maximum a posteriori probabilities. The MCMC-based method uses the probabilistic model DLTRS, that integrates LGT, gene duplication, gene loss, and sequence evolution under a relaxed molecular clock for substitution rates. We can estimate posterior distributions on gene trees and, in contrast to previous work, the actual placement of potential LGT, which can be used to, e.g., identify "highways" of LGT. CONCLUSIONS: Based on a simulation study, we conclude that the method is able to infer the true LGT events on gene tree and reconcile it to the correct edges on the species tree in most cases. Applied to two biological datasets, containing gene families from Cyanobacteria and Molicutes, we find potential LGTs highways that corroborate other studies as well as previously undetected examples.


Assuntos
Transferência Genética Horizontal/genética , Modelos Genéticos , Evolução Biológica , Entomoplasmataceae/classificação , Entomoplasmataceae/genética , Filogenia
9.
BMC Genomics ; 16 Suppl 10: S12, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-26449131

RESUMO

Over the last decade, methods have been developed for the reconstruction of gene trees that take into account the species tree. Many of these methods have been based on the probabilistic duplication-loss model, which describes how a gene-tree evolves over a species-tree with respect to duplication and losses, as well as extension of this model, e.g., the DLRS (Duplication, Loss, Rate and Sequence evolution) model that also includes sequence evolution under relaxed molecular clock. A disjoint, almost as recent, and very important line of research has been focused on non protein-coding, but yet, functional DNA. For instance, DNA sequences being pseudogenes in the sense that they are not translated, may still be transcribed and the thereby produced RNA may be functional.


Assuntos
DNA/genética , Evolução Molecular , Filogenia , Pseudogenes/genética , Duplicação Gênica
10.
BMC Bioinformatics ; 15: 281, 2014 Aug 15.
Artigo em Inglês | MEDLINE | ID: mdl-25128196

RESUMO

BACKGROUND: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features.We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance. RESULTS: We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST fares favorably to the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when library insert size distribution is wide. CONCLUSION: We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.


Assuntos
Algoritmos , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Biblioteca Gênica , Humanos , Reprodutibilidade dos Testes
11.
J Forensic Sci ; 59(4): 898-908, 2014 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-24814664

RESUMO

The discriminatory power of the noncoding control region (CR) of domestic dog mitochondrial DNA alone is relatively low. The extent to which the discriminatory power could be increased by analyzing additional highly variable coding regions of the mitochondrial genome (mtGenome) was therefore investigated. Genetic variability across the mtGenome was evaluated by phylogenetic analysis, and the three most variable ~1 kb coding regions identified. We then sampled 100 Swedish dogs to represent breeds in accordance with their frequency in the Swedish population. A previously published dataset of 59 dog mtGenomes collected in the United States was also analyzed. Inclusion of the three coding regions increased the exclusion capacity considerably for the Swedish sample, from 0.920 for the CR alone to 0.964 for all four regions. The number of mtDNA types among all 159 dogs increased from 41 to 72, the four most frequent CR haplotypes being resolved into 22 different haplotypes.


Assuntos
DNA Mitocondrial/genética , Cães/genética , Análise de Sequência de DNA , Animais , Bases de Dados de Ácidos Nucleicos , Genoma , Haplótipos , Região de Controle de Locus Gênico , Reação em Cadeia da Polimerase
12.
Syst Biol ; 63(3): 409-20, 2014 May.
Artigo em Inglês | MEDLINE | ID: mdl-24562812

RESUMO

Lateral gene transfer (LGT)--which transfers DNA between two non-vertically related individuals belonging to the same or different species--is recognized as a major force in prokaryotic evolution, and evidence of its impact on eukaryotic evolution is ever increasing. LGT has attracted much public attention for its potential to transfer pathogenic elements and antibiotic resistance in bacteria, and to transfer pesticide resistance from genetically modified crops to other plants. In a wider perspective, there is a growing body of studies highlighting the role of LGT in enabling organisms to occupy new niches or adapt to environmental changes. The challenge LGT poses to the standard tree-based conception of evolution is also being debated. Studies of LGT have, however, been severely limited by a lack of computational tools. The best currently available LGT algorithms are parsimony-based phylogenetic methods, which require a pre-computed gene tree and cannot choose between sometimes wildly differing most parsimonious solutions. Moreover, in many studies, simple heuristics are applied that can only handle putative orthologs and completely disregard gene duplications (GDs). Consequently, proposed LGT among specific gene families, and the rate of LGT in general, remain debated. We present a Bayesian Markov-chain Monte Carlo-based method that integrates GD, gene loss, LGT, and sequence evolution, and apply the method in a genome-wide analysis of two groups of bacteria: Mollicutes and Cyanobacteria. Our analyses show that although the LGT rate between distant species is high, the net combined rate of duplication and close-species LGT is on average higher. We also show that the common practice of disregarding reconcilability in gene tree inference overestimates the number of LGT and duplication events.


Assuntos
Classificação/métodos , Transferência Genética Horizontal , Teorema de Bayes , Cianobactérias/classificação , Cianobactérias/genética , Evolução Molecular , Modelos Teóricos , Filogenia , Tenericutes/classificação , Tenericutes/genética
13.
BMC Bioinformatics ; 14: 334, 2013 Nov 20.
Artigo em Inglês | MEDLINE | ID: mdl-24255987

RESUMO

BACKGROUND: Distance methods are ubiquitous tools in phylogenetics. Their primary purpose may be to reconstruct evolutionary history, but they are also used as components in bioinformatic pipelines. However, poor computational efficiency has been a constraint on the applicability of distance methods on very large problem instances. RESULTS: We present fastphylo, a software package containing implementations of efficient algorithms for two common problems in phylogenetics: estimating DNA/protein sequence distances and reconstructing a phylogeny from a distance matrix. We compare fastphylo with other neighbor joining based methods and report the results in terms of speed and memory efficiency. CONCLUSIONS: Fastphylo is a fast, memory efficient, and easy to use software suite. Due to its modular architecture, fastphylo is a flexible tool for many phylogenetic studies.


Assuntos
Biologia Computacional/instrumentação , Biologia Computacional/métodos , Filogenia , Algoritmos , Sequência de Aminoácidos , Evolução Biológica , Idioma , Memória , Família Multigênica , Software
14.
PLoS One ; 8(11): e79012, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-24244403

RESUMO

Genetic markers, defined as variable regions of DNA, can be utilized for distinguishing individuals or populations. As long as markers are independent, it is easy to combine the information they provide. For nonrecombinant sequences like mtDNA, choosing the right set of markers for forensic applications can be difficult and requires careful consideration. In particular, one wants to maximize the utility of the markers. Until now, this has mainly been done by hand. We propose an algorithm that finds the most informative subset of a set of markers. The algorithm uses a depth first search combined with a branch-and-bound approach. Since the worst case complexity is exponential, we also propose some data-reduction techniques and a heuristic. We implemented the algorithm and applied it to two forensic caseworks using mitochondrial DNA, which resulted in marker sets with significantly improved haplotypic diversity compared to previous suggestions. Additionally, we evaluated the quality of the estimation with an artificial dataset of mtDNA. The heuristic is shown to provide extensive speedup at little cost in accuracy.


Assuntos
Algoritmos , DNA Mitocondrial/genética , Marcadores Genéticos , Haplótipos/genética , Modelos Genéticos , Bases de Dados de Ácidos Nucleicos , Genética Forense/métodos , Ligação Genética , Humanos
15.
BMC Bioinformatics ; 14 Suppl 7: S6, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-23815503

RESUMO

BACKGROUND: In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through reads' alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph that allows an optimal resolution of local problematic regions. RESULTS: GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show how GAM-NGS is a tool able to output an improved reliable set of sequences. GAM-NGS is also a very efficient tool able to merge assemblies using substantially less computational resources than comparable tools. In order to achieve such goals, GAM-NGS avoids global alignment between contigs, making its strategy unique among other assembly reconciliation tools. CONCLUSIONS: The difficulty to obtain correct and reliable assemblies using a single assembler is forcing the introduction of new algorithms able to enhance de novo assemblies. GAM-NGS is a tool able to merge two or more assemblies in order to improve contiguity and correctness. It can be used on all NGS-based assembly projects and it shows its full potential with multi-library Illumina-based projects. With more than 20 available assemblers it is hard to select the best tool. In this context we propose a tool that improves assemblies (and, as a by-product, perhaps even assemblers) by merging them and selecting the generating that is most likely to be correct.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA/métodos , Algoritmos , Cromossomos/genética , Genoma Bacteriano , Genoma Humano , Humanos , Rhodobacter sphaeroides/genética , Software , Staphylococcus aureus/genética
16.
BMC Bioinformatics ; 14: 209, 2013 Jun 27.
Artigo em Inglês | MEDLINE | ID: mdl-23803001

RESUMO

BACKGROUND: PrIME-GenPhyloData is a suite of tools for creating realistic simulated phylogenetic trees, in particular for families of homologous genes. It supports generation of trees based on a birth-death process and--perhaps more interestingly--also supports generation of gene family trees guided by a known (synthetic or biological) species tree while accounting for events such as gene duplication, gene loss, and lateral gene transfer (LGT). The suite also supports a wide range of branch rate models enabling relaxation of the molecular clock. RESULT: Simulated data created with PrIME-GenPhyloData can be used for benchmarking phylogenetic approaches, or for characterizing models or model parameters with respect to biological data. CONCLUSION: The concept of tree-in-tree evolution can also be used to model, for instance, biogeography or host-parasite co-evolution.


Assuntos
Duplicação Gênica/genética , Família Multigênica/genética , Filogenia , Relógios Biológicos/genética , Simulação por Computador , Evolução Molecular , Técnicas de Transferência de Genes , Humanos , Modelos Biológicos , Especificidade da Espécie
17.
Nature ; 497(7451): 579-84, 2013 May 30.
Artigo em Inglês | MEDLINE | ID: mdl-23698360

RESUMO

Conifers have dominated forests for more than 200 million years and are of huge ecological and economic importance. Here we present the draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), the first available for any gymnosperm. The number of well-supported genes (28,354) is similar to the >100 times smaller genome of Arabidopsis thaliana, and there is no evidence of a recent whole-genome duplication in the gymnosperm lineage. Instead, the large genome size seems to result from the slow and steady accumulation of a diverse set of long-terminal repeat transposable elements, possibly owing to the lack of an efficient elimination mechanism. Comparative sequencing of Pinus sylvestris, Abies sibirica, Juniperus communis, Taxus baccata and Gnetum gnemon reveals that the transposable element diversity is shared among extant conifers. Expression of 24-nucleotide small RNAs, previously implicated in transposable element silencing, is tissue-specific and much lower than in other plants. We further identify numerous long (>10,000 base pairs) introns, gene-like fragments, uncharacterized long non-coding RNAs and short RNAs. This opens up new genomic avenues for conifer forestry and breeding.


Assuntos
Evolução Molecular , Genoma de Planta/genética , Picea/genética , Sequência Conservada/genética , Elementos de DNA Transponíveis/genética , Inativação Gênica , Genes de Plantas/genética , Genômica , Internet , Íntrons/genética , Fenótipo , RNA não Traduzido/genética , Análise de Sequência de DNA , Sequências Repetidas Terminais/genética , Transcrição Gênica/genética
18.
BMC Bioinformatics ; 14 Suppl 15: S12, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-24564516

RESUMO

BACKGROUND: Clustering sequences into families has long been an important step in characterization of genes and proteins. There are many algorithms developed for this purpose, most of which are based on either direct similarity between gene pairs or some sort of network structure, where weights on edges of constructed graphs are based on similarity. However, conserved synteny is an important signal that can help distinguish homology and it has not been utilized to its fullest potential. RESULTS: Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationship and merge known homologs into groups of gene families. GenFamClust identifies homologs in a more informed and accurate manner as compared to similarity based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets consisting of fully sequenced genomes of eukaryotes and synthetic data. CONCLUSIONS: The results obtained from both datasets confirm that synteny helps determine homology and GenFamClust improves on Neighborhood Correlation method. The accuracy as well as the definition of synteny scores is the most valuable contribution of GenFamClust.


Assuntos
Sequência de Bases , Homologia de Sequência do Ácido Nucleico , Sintenia , Algoritmos , Animais , Mapeamento Cromossômico , Análise por Conglomerados , Humanos , Camundongos , Proteínas/genética
19.
Bioinformatics ; 28(22): 2994-5, 2012 Nov 15.
Artigo em Inglês | MEDLINE | ID: mdl-22982573

RESUMO

SUMMARY: PrIME-DLRS (or colloquially: 'Delirious') is a phylogenetic software tool to simultaneously infer and reconcile a gene tree given a species tree. It accounts for duplication and loss events, a relaxed molecular clock and is intended for the study of homologous gene families, for example in a comparative genomics setting involving multiple species. PrIME-DLRS uses a Bayesian MCMC framework, where the input is a known species tree with divergence times and a multiple sequence alignment, and the output is a posterior distribution over gene trees and model parameters. AVAILABILITY AND IMPLEMENTATION: PrIME-DLRS is available for Java SE 6+ under the New BSD License, and JAR files and source code can be downloaded from http://code.google.com/p/jprime/. There is also a slightly older C++ version available as a binary package for Ubuntu, with download instructions at http://prime.sbc.su.se. The C++ source code is available upon request. CONTACT: joel.sjostrand@scilifelab.se or jens.lagergren@scilifelab.se. SUPPLEMENTARY INFORMATION: PrIME-DLRS is based on a sound probabilistic model (Åkerborg et al., 2009) and has been thoroughly validated on synthetic and biological datasets (Supplementary Material online).


Assuntos
Evolução Molecular , Filogenia , Software , Algoritmos , Animais , Teorema de Bayes , Modelos Estatísticos , Linguagens de Programação , Alinhamento de Sequência
20.
Bioinformatics ; 28(17): 2215-22, 2012 Sep 01.
Artigo em Inglês | MEDLINE | ID: mdl-22923455

RESUMO

MOTIVATION: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read-pairs. Scaffolding provides estimates about the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead in subsequent analysis, it is important to provide unbiased estimation of contig distance. RESULTS: In this article, we show that state-of-the-art programs for scaffolding are using an incorrect model of gap size estimation. We discuss why current maximum likelihood estimators are biased and describe what different cases of bias we are facing. Furthermore, we provide a model for the distribution of reads that span a gap and derive the maximum likelihood equation for the gap length. We motivate why this estimate is sound and show empirically that it outperforms gap estimators in popular scaffolding programs. Our results have consequences both for scaffolding software, structural variation detection and for library insert-size estimation as is commonly performed by read aligners. AVAILABILITY: A reference implementation is provided at https://github.com/SciLifeLab/gapest. SUPPLEMENTARY INFORMATION: Supplementary data are availible at Bioinformatics online.


Assuntos
Algoritmos , Mapeamento de Sequências Contíguas/métodos , Genômica/métodos , Modelos Genéticos , Biblioteca Gênica , Funções Verossimilhança , Probabilidade , Análise de Regressão , Análise de Sequência de DNA/métodos , Software
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...