Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 19 de 19
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Bioinformatics ; 39(39 Suppl 1): i194-i203, 2023 06 30.
Artigo em Inglês | MEDLINE | ID: mdl-37387128

RESUMO

MOTIVATION: Recent methods for selective sweep detection cast the problem as a classification task and use summary statistics as features to capture region characteristics that are indicative of a selective sweep, thereby being sensitive to confounding factors. Furthermore, they are not designed to perform whole-genome scans or to estimate the extent of the genomic region that was affected by positive selection; both are required for identifying candidate genes and the time and strength of selection. RESULTS: We present ASDEC (https://github.com/pephco/ASDEC), a neural-network-based framework that can scan whole genomes for selective sweeps. ASDEC achieves similar classification performance to other convolutional neural network-based classifiers that rely on summary statistics, but it is trained 10× faster and classifies genomic regions 5× faster by inferring region characteristics from the raw sequence data directly. Deploying ASDEC for genomic scans achieved up to 15.2× higher sensitivity, 19.4× higher success rates, and 4× higher detection accuracy than state-of-the-art methods. We used ASDEC to scan human chromosome 1 of the Yoruba population (1000Genomes project), identifying nine known candidate genes.


Assuntos
Genômica , Redes Neurais de Computação , Humanos
2.
BMC Plant Biol ; 23(1): 323, 2023 Jun 16.
Artigo em Inglês | MEDLINE | ID: mdl-37328739

RESUMO

BACKGROUND: During domestication and subsequent improvement plants were subjected to intensive positive selection for desirable traits. Identification of selection targets is important with respect to the future targeted broadening of diversity in breeding programmes. Rye (Secale cereale L.) is a cereal that is closely related to wheat, and it is an important crop in Central, Eastern and Northern Europe. The aim of the study was (i) to identify diverse groups of rye accessions based on high-density, genome-wide analysis of genetic diversity within a set of 478 rye accessions, covering a full spectrum of diversity within the genus, from wild accessions to inbred lines used in hybrid breeding, and (ii) to identify selective sweeps in the established groups of cultivated rye germplasm and putative candidate genes targeted by selection. RESULTS: Population structure and genetic diversity analyses based on high-quality SNP (DArTseq) markers revealed the presence of three complexes in the Secale genus: S. sylvestre, S. strictum and S. cereale/vavilovii, a relatively narrow diversity of S. sylvestre, very high diversity of S. strictum, and signatures of strong positive selection in S. vavilovii. Within cultivated ryes we detected the presence of genetic clusters and the influence of improvement status on the clustering. Rye landraces represent a reservoir of variation for breeding, and especially a distinct group of landraces from Turkey should be of special interest as a source of untapped variation. Selective sweep detection in cultivated accessions identified 133 outlier positions within 13 sweep regions and 170 putative candidate genes related, among others, to response to various environmental stimuli (such as pathogens, drought, cold), plant fertility and reproduction (pollen sperm cell differentiation, pollen maturation, pollen tube growth), and plant growth and biomass production. CONCLUSIONS: Our study provides valuable information for efficient management of rye germplasm collections, which can help to ensure proper safeguarding of their genetic potential and provides numerous novel candidate genes targeted by selection in cultivated rye for further functional characterisation and allelic diversity studies.


Assuntos
Melhoramento Vegetal , Secale , Secale/genética , Sementes , Fenótipo , Citoplasma
3.
Sci Rep ; 11(1): 23773, 2021 12 10.
Artigo em Inglês | MEDLINE | ID: mdl-34893626

RESUMO

Previous molecular characterization studies conducted in Canadian wheat cultivars shed some light on the impact of plant breeding on genetic diversity, but the number of varieties and markers used was small. Here, we used 28,798 markers of the wheat 90K single nucleotide polymorphisms to (a) assess the extent of genetic diversity, relationship, population structure, and divergence among 174 historical and modern Canadian spring wheat varieties registered from 1905 to 2018 and 22 unregistered lines (hereinafter referred to as cultivars), and (b) identify genomic regions that had undergone selection. About 91% of the pairs of cultivars differed by 20-40% of the scored alleles, but only 7% of the pairs had kinship coefficients of < 0.250, suggesting the presence of a high proportion of redundancy in allelic composition. Although the 196 cultivars represented eight wheat classes, our results from phylogenetic, principal component, and the model-based population structure analyses revealed three groups, with no clear structure among most wheat classes, breeding programs, and breeding periods. FST statistics computed among different categorical variables showed little genetic differentiation (< 0.05) among breeding periods and breeding programs, but a diverse level of genetic differentiation among wheat classes and predicted groups. Diversity indices were the highest and lowest among cultivars registered from 1970 to 1980 and from 2011 to 2018, respectively. Using two outlier detection methods, we identified from 524 to 2314 SNPs and 41 selective sweeps of which some are close to genes with known phenotype, including plant height, photoperiodism, vernalization, gluten strength, and disease resistance.


Assuntos
Variação Genética , Melhoramento Vegetal , Polimorfismo de Nucleotídeo Único , Seleção Genética , Triticum/genética , Alelos , Canadá , Evolução Molecular , Ligação Genética , Marcadores Genéticos , Genética Populacional , Genoma de Planta , Genótipo , Desequilíbrio de Ligação , Fenótipo , Locos de Características Quantitativas , Triticum/classificação
4.
Mol Ecol Resour ; 21(7): 2580-2587, 2021 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-34062051

RESUMO

Software tools for linkage disequilibrium (LD) analyses are designed to calculate LD among all genetic variants in a single region. Since compute and memory requirements grow quadratically with the distance between variants, using these tools for long-range LD calculations leads to long execution times and increased allocation of memory resources. Furthermore, widely used tools do not fully utilize the computational resources of modern processors and/or graphics processing cards, limiting future large-scale analyses on thousands of samples. We present quickLD, a stand-alone and open-source software that computes several LD-related statistics, including the commonly used r2 . quickLD calculates pairwise LD between genetic variants in a single region or in arbitrarily distant regions with negligible memory requirements. Moreover, quickLD achieves up to 95% and 97% of the theoretical peak performance of a CPU and a GPU, respectively, enabling 21.5× faster processing than current state-of-the-art software on a multicore processor and 49.5× faster processing when the aggregate processing power of a multicore CPU and a GPU is harnessed. quickLD can also be used in studies of selection, recombination, genetic drift, inbreeding and gene flow. The software is available at https://github.com/pephco/quickLD.


Assuntos
Algoritmos , Software , Ligação Genética , Desequilíbrio de Ligação
5.
Life (Basel) ; 11(2)2021 Feb 07.
Artigo em Inglês | MEDLINE | ID: mdl-33562321

RESUMO

Full-genome-sequence computational analyses of the SARS-coronavirus (CoV)-2 genomes allow us to understand the evolutionary events and adaptability mechanisms. We used population genetics analyses on human SARS-CoV-2 genomes available on 2 April 2020 to infer the mutation rate and plausible recombination events between the Betacoronavirus genomes in nonhuman hosts that may have contributed to the evolution of SARS-CoV-2. Furthermore, we localized the targets of recent and strong, positive selection during the first pandemic wave. The genomic regions that appear to be under positive selection are largely co-localized with regions in which recombination from nonhuman hosts took place. Our results suggest that the pangolin coronavirus genome may have contributed to the SARS-CoV-2 genome by recombination with the bat coronavirus genome. However, we find evidence for additional recombination events that involve coronavirus genomes from other hosts, i.e., hedgehogs and sparrows. We further infer that recombination may have recently occurred within human hosts. Finally, we estimate the parameters of a demographic scenario involving an exponential growth of the size of the SARS-CoV-2 populations that have infected European, Asian, and Northern American cohorts, and we demonstrate that a rapid exponential growth in population size from the first wave can support the observed polymorphism patterns in SARS-CoV-2 genomes.

7.
Mol Ecol Resour ; 20(6): 1597-1609, 2020 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-32639602

RESUMO

In recent years, genome-scan methods have been extensively used to detect local signatures of selection and introgression. Most of these methods are either designed for one or the other case, which may impair the study of combined cases. Here, we introduce a series of versatile genome-scan methods applicable for both cases, the detection of selection and introgression. The proposed approaches are based on nonparametric k-nearest neighbour (kNN) techniques, while incorporating pairwise Fixation Index (FST ) and pairwise nucleotide differences (dxy ) as features. We benchmark our methods using a wide range of simulation scenarios, with varying parameters, such as recombination rates, population background histories, selection strengths, the proportion of introgression and the time of gene flow. We find that kNN-based methods perform remarkably well compared with the state-of-the-art. Finally, we demonstrate how to perform kNN-based genome scans on real-world genomic data using the population genomics R-package popgenome.


Assuntos
Simulação por Computador , Genoma , Genômica , Modelos Genéticos , Fluxo Gênico , Genética Populacional , Metagenômica , Polimorfismo de Nucleotídeo Único , Seleção Genética
8.
Methods Mol Biol ; 2090: 87-123, 2020.
Artigo em Inglês | MEDLINE | ID: mdl-31975165

RESUMO

High-throughput genomic sequencing allows to disentangle the evolutionary forces acting in populations. Among evolutionary forces, positive selection has received a lot of attention because it is related to the adaptation of populations in their environments, both biotic and abiotic. Positive selection, also known as Darwinian selection, occurs when an allele is favored by natural selection. The frequency of the favored allele increases in the population and, due to genetic hitchhiking, neighboring linked variation diminishes, creating so-called selective sweeps. Such a process leaves traces in genomes that can be detected in a future time point. Detecting traces of positive selection in genomes is achieved by searching for signatures introduced by selective sweeps, such as regions of reduced variation, a specific shift of the site frequency spectrum, and particular linkage disequilibrium (LD) patterns in the region. A variety of approaches can be used for detecting selective sweeps, ranging from simple implementations that compute summary statistics to more advanced statistical approaches, e.g., Bayesian approaches, maximum-likelihood-based methods, and machine learning methods. In this chapter, we discuss selective sweep detection methodologies on the basis of their capacity to analyze whole genomes or just subgenomic regions, and on the specific polymorphism patterns they exploit as selective sweep signatures. We also summarize the results of comparisons among five open-source software releases (SweeD, SweepFinder, SweepFinder2, OmegaPlus, and RAiSD) regarding sensitivity, specificity, and execution times. Furthermore, we test and discuss machine learning methods and present a thorough performance analysis. In equilibrium neutral models or mild bottlenecks, most methods are able to detect selective sweeps accurately. Methods and tools that rely on linkage disequilibrium (LD) rather than single SNPs exhibit higher true positive rates than the site frequency spectrum (SFS)-based methods under the model of a single sweep or recurrent hitchhiking. However, their false positive rate is elevated when a misspecified demographic model is used to build the distribution of the statistic under the null hypothesis. Both LD and SFS-based approaches suffer from decreased accuracy on localizing the true target of selection in bottleneck scenarios. Furthermore, we present an extensive analysis of the effects of gene flow on selective sweep detection, a problem that has been understudied in selective sweep literature.


Assuntos
Genética Populacional/métodos , Seleção Genética , Teorema de Bayes , Bases de Dados Genéticas , Evolução Molecular , Fluxo Gênico , Frequência do Gene , Genoma Humano , Humanos , Funções Verossimilhança , Desequilíbrio de Ligação , Aprendizado de Máquina , Modelos Genéticos , Software
9.
Sci Rep ; 9(1): 13490, 2019 09 17.
Artigo em Inglês | MEDLINE | ID: mdl-31530852

RESUMO

Little is known on maize germplasm adapted to the African highland agro-ecologies. In this study, we analyzed high-density genotyping by sequencing (GBS) data of 298 African highland adapted maize inbred lines to (i) assess the extent of genetic purity, genetic relatedness, and population structure, and (ii) identify genomic regions that have undergone selection (selective sweeps) in response to adaptation to highland environments. Nearly 91% of the pairs of inbred lines differed by 30-36% of the scored alleles, but only 32% of the pairs of the inbred lines had relative kinship coefficient <0.050, which suggests the presence of substantial redundancy in allelic composition that may be due to repeated use of fewer genetic backgrounds (source germplasm) during line development. Results from different genetic relatedness and population structure analyses revealed three different groups, which generally agrees with pedigree information and breeding history, but less so by heterotic groups and endosperm modification. We identified 944 single nucleotide polymorphic (SNP) markers that fell within 22 selective sweeps that harbored 265 protein-coding candidate genes of which some of the candidate genes had known functions. Details of the candidate genes with known functions and differences in nucleotide diversity among groups predicted based on multivariate methods have been discussed.


Assuntos
Variação Genética , Endogamia , Melhoramento Vegetal , Zea mays/genética , Mapeamento Cromossômico , Evolução Molecular , Marcadores Genéticos , Genótipo , Filogenia , Polimorfismo de Nucleotídeo Único , Seleção Genética , Estresse Fisiológico
10.
Theor Appl Genet ; 132(4): 1145-1158, 2019 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-30578434

RESUMO

KEY MESSAGE: The extent of molecular diversity parameters across three rice species was compared using large germplasm collection genotyped with genomewide SNPs and SNPs that fell within selective sweep regions. Previous studies conducted on limited number of accessions have reported very low genetic variation in African rice (Oryza glaberrima Steud.) as compared to its wild progenitor (O. barthii A. Chev.) and to Asian rice (O. sativa L.). Here, we characterized a large collection of African rice and compared its molecular diversity indices and population structure with the two other species using genomewide single nucleotide polymorphisms (SNPs) and SNPs that mapped within selective sweeps. A total of 3245 samples representing African rice (2358), Asian rice (772) and O. barthii (115) were genotyped with 26,073 physically mapped SNPs. Using all SNPs, the level of marker polymorphism, average genetic distance and nucleotide diversity in African rice accounted for 59.1%, 63.2% and 37.1% of that of O. barthii, respectively. SNP polymorphism and overall nucleotide diversity of the African rice accounted for 20.1-32.1 and 16.3-37.3% of that of the Asian rice, respectively. We identified 780 SNPs that fell within 37 candidate selective sweeps in African rice, which were distributed across all 12 rice chromosomes. Nucleotide diversity of the African rice estimated from the 780 SNPs was 8.3 × 10-4, which is not only 20-fold smaller than the value estimated from all genomewide SNPs (π = 1.6 × 10-2), but also accounted for just 4.1%, 0.9% and 2.1% of that of O. barthii, lowland Asian rice and upland Asian rice, respectively. The genotype data generated for a large collection of rice accessions conserved at the AfricaRice genebank will be highly useful for the global rice community and promote germplasm use.


Assuntos
Variação Genética , Genética Populacional , Oryza/genética , Ásia , Cromossomos de Plantas/genética , Estudos de Associação Genética , Marcadores Genéticos , Filogenia , Mapeamento Físico do Cromossomo , Polimorfismo de Nucleotídeo Único/genética , Análise de Componente Principal
11.
Commun Biol ; 1: 79, 2018.
Artigo em Inglês | MEDLINE | ID: mdl-30271960

RESUMO

Selective sweeps leave distinct signatures locally in genomes, enabling the detection of loci that have undergone recent positive selection. Multiple signatures of a selective sweep are known, yet each neutrality test only identifies a single signature. We present RAiSD (Raised Accuracy in Sweep Detection), an open-source software that implements a novel, to our knowledge, and parameter-free detection mechanism that relies on multiple signatures of a selective sweep via the enumeration of SNP vectors. RAiSD achieves higher sensitivity and accuracy than the current state of the art, while the computational complexity is greatly reduced, allowing up to 1000 times faster processing than widely used tools, and negligible memory requirements.

12.
Mol Biol Evol ; 34(10): 2704-2715, 2017 10 01.
Artigo em Inglês | MEDLINE | ID: mdl-28957509

RESUMO

One of the most abundant proteins in human saliva, mucin-7, is encoded by the MUC7 gene, which harbors copy number variable subexonic repeats (PTS-repeats) that affect the size and glycosylation potential of this protein. We recently documented the adaptive evolution of MUC7 subexonic copy number variation among primates. Yet, the evolution of MUC7 genetic variation in humans remained unexplored. Here, we found that PTS-repeat copy number variation has evolved recurrently in the human lineage, thereby generating multiple haplotypic backgrounds carrying five or six PTS-repeat copy number alleles. Contrary to previous studies, we found no associations between the copy number of PTS-repeats and protection against asthma. Instead, we revealed a significant association of MUC7 haplotypic variation with the composition of the oral microbiome. Furthermore, based on in-depth simulations, we conclude that a divergent MUC7 haplotype likely originated in an unknown African hominin population and introgressed into ancestors of modern Africans.


Assuntos
Hominidae/genética , Mucinas/genética , Proteínas e Peptídeos Salivares/genética , Alelos , Animais , Asma/genética , Variações do Número de Cópias de DNA/genética , Evolução Molecular , Éxons/genética , Variação Genética , Glicosilação , Haplótipos/genética , Humanos , Microbiota/genética , Filogenia , Saliva
13.
J Biol Res (Thessalon) ; 24: 7, 2017 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-28405579

RESUMO

Positive selection occurs when an allele is favored by natural selection. The frequency of the favored allele increases in the population and due to genetic hitchhiking the neighboring linked variation diminishes, creating so-called selective sweeps. Detecting traces of positive selection in genomes is achieved by searching for signatures introduced by selective sweeps, such as regions of reduced variation, a specific shift of the site frequency spectrum, and particular LD patterns in the region. A variety of methods and tools can be used for detecting sweeps, ranging from simple implementations that compute summary statistics such as Tajima's D, to more advanced statistical approaches that use combinations of statistics, maximum likelihood, machine learning etc. In this survey, we present and discuss summary statistics and software tools, and classify them based on the selective sweep signature they detect, i.e., SFS-based vs. LD-based, as well as their capacity to analyze whole genomes or just subgenomic regions. Additionally, we summarize the results of comparisons among four open-source software releases (SweeD, SweepFinder, SweepFinder2, and OmegaPlus) regarding sensitivity, specificity, and execution times. In equilibrium neutral models or mild bottlenecks, both SFS- and LD-based methods are able to detect selective sweeps accurately. Methods and tools that rely on LD exhibit higher true positive rates than SFS-based ones under the model of a single sweep or recurrent hitchhiking. However, their false positive rate is elevated when a misspecified demographic model is used to represent the null hypothesis. When the correct (or similar to the correct) demographic model is used instead, the false positive rates are considerably reduced. The accuracy of detecting the true target of selection is decreased in bottleneck scenarios. In terms of execution time, LD-based methods are typically faster than SFS-based methods, due to the nature of required arithmetic.

14.
Gigascience ; 5: 7, 2016.
Artigo em Inglês | MEDLINE | ID: mdl-26862394

RESUMO

BACKGROUND: Linkage disequilibrium is defined as the non-random associations of alleles at different loci, and it occurs when genotypes at the two loci depend on each other. The model of genetic hitchhiking predicts that strong positive selection affects the patterns of linkage disequilibrium around the site of a beneficial allele, resulting in specific motifs of correlation between neutral polymorphisms that surround the fixed beneficial allele. Increased levels of linkage disequilibrium are observed on the same side of a beneficial allele, and diminish between sites on different sides of a beneficial mutation. This specific pattern of linkage disequilibrium occurs more frequently when positive selection has acted on the population rather than under various neutral models. Thus, detecting such patterns could accurately reveal targets of positive selection along a recombining chromosome or a genome. Calculating linkage disequilibria in whole genomes is computationally expensive because allele correlations need to be evaluated for millions of pairs of sites. To analyze large datasets efficiently, algorithmic implementations used in modern population genetics need to exploit multiple cores of current workstations in a scalable way. However, population genomic datasets come in various types and shapes while typically showing SNP density heterogeneity, which makes the implementation of generally scalable parallel algorithms a challenging task. FINDINGS: Here we present a series of four parallelization strategies targeting shared-memory systems for the computationally intensive problem of detecting genomic regions that have contributed to the past adaptation of the species, also referred to as regions that have undergone a selective sweep, based on linkage disequilibrium patterns. We provide a thorough performance evaluation of the proposed parallel algorithms for computing linkage disequilibrium, and outline the benefits of each approach. Furthermore, we compare the accuracy of our open-source sweep-detection software OmegaPlus, which implements all four parallelization strategies presented here, with a variety of neutrality tests. CONCLUSIONS: The computational demands of selective sweep detection algorithms depend greatly on the SNP density heterogeneity and the data representation. Choosing the right parallel algorithm for the analysis can lead to significant processing time reduction and major energy savings. However, determining which parallel algorithm will execute more efficiently on a specific processor architecture and number of available cores for a particular dataset is not straightforward.


Assuntos
Algoritmos , Biologia Computacional/métodos , Desequilíbrio de Ligação , Seleção Genética , Alelos , Cromossomos Humanos Par 1/genética , Simulação por Computador , Frequência do Gene , Genética Populacional/métodos , Genoma Humano/genética , Genótipo , Humanos , Modelos Genéticos , Mutação , Polimorfismo de Nucleotídeo Único , Reprodutibilidade dos Testes , Análise de Sequência de DNA
15.
Mol Biol Evol ; 30(9): 2224-34, 2013 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-23777627

RESUMO

The advent of modern DNA sequencing technology is the driving force in obtaining complete intra-specific genomes that can be used to detect loci that have been subject to positive selection in the recent past. Based on selective sweep theory, beneficial loci can be detected by examining the single nucleotide polymorphism patterns in intraspecific genome alignments. In the last decade, a plethora of algorithms for identifying selective sweeps have been developed. However, the majority of these algorithms have not been designed for analyzing whole-genome data. We present SweeD (Sweep Detector), an open-source tool for the rapid detection of selective sweeps in whole genomes. It analyzes site frequency spectra and represents a substantial extension of the widely used SweepFinder program. The sequential version of SweeD is up to 22 times faster than SweepFinder and, more importantly, is able to analyze thousands of sequences. We also provide a parallel implementation of SweeD for multi-core processors. Furthermore, we implemented a checkpointing mechanism that allows to deploy SweeD on cluster systems with queue execution time restrictions, as well as to resume long-running analyses after processor failures. In addition, the user can specify various demographic models via the command-line to calculate their theoretically expected site frequency spectra. Therefore, (in contrast to SweepFinder) the neutral site frequencies can optionally be directly calculated from a given demographic model. We show that an increase of sample size results in more precise detection of positive selection. Thus, the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection. We validate SweeD via simulations and by scanning the first chromosome from the 1000 human Genomes project for selective sweeps. We compare SweeD results with results from a linkage-disequilibrium-based approach and identify common outliers.


Assuntos
Cromossomos Humanos Par 1 , Genética Populacional/estatística & dados numéricos , Genoma Humano , Polimorfismo de Nucleotídeo Único , Seleção Genética , Software , Algoritmos , Fluxo Gênico , Frequência do Gene , Humanos , Funções Verossimilhança , Desequilíbrio de Ligação , Modelos Genéticos , Mutação , Tamanho da Amostra
16.
BMC Bioinformatics ; 14 Suppl 11: S4, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-24564250

RESUMO

BACKGROUND: A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read--the seed--an important problem is to find the best possible alignment between a substring of the reference sequence succeeding and the remaining suffix of low quality of the read--extend. The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggest that aligning (parts of) those reads with a single gap is in fact desirable. RESULTS: In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also provides the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment. CONCLUSIONS: We present libgapmis, a library for extending pairwise short-read alignments. We show that libgapmis is better-suited and more efficient than existing algorithms for this task. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any short-read alignment pipeline. The open-source code of libgapmis is available at http://www.exelixis-lab.org/gapmis.


Assuntos
Análise de Sequência de DNA/métodos , Algoritmos , Sequência de Bases , Éxons , Genoma , Linguagens de Programação , Alinhamento de Sequência , Software
17.
Comput Struct Biotechnol J ; 6: e201303001, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-24688709

RESUMO

Automated DNA sequencers generate chromatograms that contain raw sequencing data. They also generate data that translates the chromatograms into molecular sequences of A, C, G, T, or N (undetermined) characters. Since chromatogram translation programs frequently introduce errors, a manual inspection of the generated sequence data is required. As sequence numbers and lengths increase, visual inspection and manual correction of chromatograms and corresponding sequences on a per-peak and per-nucleotide basis becomes an error-prone, time-consuming, and tedious process. Here, we introduce ChromatoGate (CG), an open-source software that accelerates and partially automates the inspection of chromatograms and the detection of sequencing errors for bidirectional sequencing runs. To provide users full control over the error correction process, a fully automated error correction algorithm has not been implemented. Initially, the program scans a given multiple sequence alignment (MSA) for potential sequencing errors, assuming that each polymorphic site in the alignment may be attributed to a sequencing error with a certain probability. The guided MSA assembly procedure in ChromatoGate detects chromatogram peaks of all characters in an alignment that lead to polymorphic sites, given a user-defined threshold. The threshold value represents the sensitivity of the sequencing error detection mechanism. After this pre-filtering, the user only needs to inspect a small number of peaks in every chromatogram to correct sequencing errors. Finally, we show that correcting sequencing errors is important, because population genetic and phylogenetic inferences can be misled by MSAs with uncorrected mis-calls. Our experiments indicate that estimates of population mutation rates can be affected two- to three-fold by uncorrected errors.

18.
BMC Bioinformatics ; 13: 196, 2012 Aug 09.
Artigo em Inglês | MEDLINE | ID: mdl-22876807

RESUMO

BACKGROUND: Aligning short DNA reads to a reference sequence alignment is a prerequisite for detecting their biological origin and analyzing them in a phylogenetic context. With the PaPaRa tool we introduced a dedicated dynamic programming algorithm for simultaneously aligning short reads to reference alignments and corresponding evolutionary reference trees. The algorithm aligns short reads to phylogenetic profiles that correspond to the branches of such a reference tree. The algorithm needs to perform an immense number of pairwise alignments. Therefore, we explore vector intrinsics and GPUs to accelerate the PaPaRa alignment kernel. RESULTS: We optimized and parallelized PaPaRa on CPUs and GPUs. Via SSE 4.1 SIMD (Single Instruction, Multiple Data) intrinsics for x86 SIMD architectures and multi-threading, we obtained a 9-fold acceleration on a single core as well as linear speedups with respect to the number of cores. The peak CPU performance amounts to 18.1 GCUPS (Giga Cell Updates per Second) using all four physical cores on an Intel i7 2600 CPU running at 3.4 GHz. The average CPU performance (averaged over all test runs) is 12.33 GCUPS. We also used OpenCL to execute PaPaRa on a GPU SIMT (Single Instruction, Multiple Threads) architecture. A NVIDIA GeForce 560 GPU delivered peak and average performance of 22.1 and 18.4 GCUPS respectively. Finally, we combined the SIMD and SIMT implementations into a hybrid CPU-GPU system that achieved an accumulated peak performance of 33.8 GCUPS. CONCLUSIONS: This accelerated version of PaPaRa (available at http://www.exelixis-lab.org/software.html) provides a significant performance improvement that allows for analyzing larger datasets in less time. We observe that state-of-the-art SIMD and SIMT architectures deliver comparable performance for this dynamic programming kernel when the "competing programmer approach" is deployed. Finally, we show that overall performance can be substantially increased by designing a hybrid CPU-GPU system with appropriate load distribution mechanisms.


Assuntos
Filogenia , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Software , Algoritmos
19.
Bioinformatics ; 26(12): i132-9, 2010 Jun 15.
Artigo em Inglês | MEDLINE | ID: mdl-20529898

RESUMO

MOTIVATION: The current molecular data explosion poses new challenges for large-scale phylogenomic analyses that can comprise hundreds or even thousands of genes. A property that characterizes phylogenomic datasets is that they tend to be gappy, i.e. can contain taxa with (many and disparate) missing genes. In current phylogenomic analyses, this type of alignment gappyness that is induced by missing data frequently exceeds 90%. We present and implement a generally applicable mechanism that allows for reducing memory footprints of likelihood-based [maximum likelihood (ML) or Bayesian] phylogenomic analyses proportional to the amount of missing data in the alignment. We also introduce a set of algorithmic rules to efficiently conduct tree searches via subtree pruning and re-grafting moves using this mechanism. RESULTS: On a large phylogenomic DNA dataset with 2177 taxa, 68 genes and a gappyness of 90%, we achieve a memory footprint reduction from 9 GB down to 1 GB, a speedup for optimizing ML model parameters of 11, and accelerate the Subtree Pruning Regrafting tree search phase by factor 16. Thus, our approach can be deployed to improve efficiency for the two most important resources, CPU time and memory, by up to one order of magnitude. AVAILABILITY: Current open-source version of RAxML v7.2.6 available at http://wwwkramer.in.tum.de/exelixis/software.html.


Assuntos
Genômica/métodos , Filogenia , Alinhamento de Sequência/métodos , Algoritmos , Evolução Molecular , Funções Verossimilhança
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...