Results 1 - 20 of 109
1.
Genes (Basel) ; 12(11)2021 11 18.
Article in English | MEDLINE | ID: mdl-34828415

ABSTRACT

Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.
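
As a rough illustration of the position weight matrix idea mentioned in this abstract, the sketch below builds a log-odds matrix from a gap-free block of aligned sequences and uses it to find where a new sequence fits best. It is not the author's method: the uniform background, the pseudocount of 1, and the function names `build_pwm`, `pwm_score`, and `best_offset` are all assumptions made for the example.

```python
import math
from collections import Counter


def build_pwm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PWM from gap-free, equal-length aligned sequences.

    Assumes a uniform background frequency (an illustrative choice).
    """
    n_cols = len(aligned[0])
    background = 1.0 / len(alphabet)
    total = len(aligned) + pseudocount * len(alphabet)
    pwm = []
    for col in range(n_cols):
        counts = Counter(seq[col] for seq in aligned)
        pwm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pwm


def pwm_score(pwm, segment):
    """Sum the per-column log-odds for a segment as wide as the PWM."""
    return sum(column[ch] for column, ch in zip(pwm, segment))


def best_offset(pwm, seq):
    """Return (offset, score) of the window in seq that best matches the PWM."""
    width = len(pwm)
    scores = [pwm_score(pwm, seq[i:i + width])
              for i in range(len(seq) - width + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    block = ["ACGT", "ACGT", "ACGA", "ACGT"]
    print(best_offset(build_pwm(block), "TTACGTTT"))
```

Because every aligned sequence contributes to each column's counts, the score reflects the whole alignment rather than any single pairwise comparison, which is the advantage the abstract highlights.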


Subjects
Automation, Laboratory/methods; Genomics/methods; Sequence Alignment/methods; Algorithms; Animals; Automation, Laboratory/standards; Genomics/standards; Humans; Phylogeny; Sensitivity and Specificity; Sequence Alignment/standards; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/standards; Sequence Analysis, Protein/methods; Sequence Analysis, Protein/standards
2.
Nature ; 587(7833): 246-251, 2020 11.
Article in English | MEDLINE | ID: mdl-33177663

ABSTRACT

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies [1-3]. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database [4] increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies [5] are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus [6], a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.


Subjects
Genome/genetics; Genomics/methods; Sequence Alignment/methods; Software; Vertebrates/genetics; Amnion; Animals; Computer Simulation; Genomics/standards; Haplotypes; Humans; Quality Control; Sequence Alignment/standards; Software/standards
3.
Gigascience ; 9(2)2020 02 01.
Article in English | MEDLINE | ID: mdl-32025702

ABSTRACT

BACKGROUND: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. RESULTS: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. CONCLUSIONS: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
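
The Mash distance used in this abstract as a proxy for nucleotide divergence can be sketched as follows. Mash itself estimates the Jaccard index from MinHash sketches; this example instead computes the exact Jaccard index over full k-mer sets, which is only practical for short sequences, and then applies the published Mash formula D = -(1/k) ln(2j / (1 + j)). The function names and the small default k are assumptions for illustration.

```python
import math


def kmer_set(seq, k):
    """Return the set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(a, b, k=5):
    """Mash-style distance from an exact (not MinHash-estimated) Jaccard index."""
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    union = set_a | set_b
    if not union:
        raise ValueError("both sequences are shorter than k")
    jaccard = len(set_a & set_b) / len(union)
    if jaccard == 0:
        return 1.0  # no shared k-mers: report the maximum distance
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


if __name__ == "__main__":
    print(mash_distance("ACGTACGTACGA", "ACGTACGTACGT"))
```

A larger distance between the reads' source genome and the chosen reference corresponds to the situation in which the abstract reports reduced sensitivity and precision.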


Subjects
Genome, Bacterial; Genomics/standards; Polymorphism, Single Nucleotide; Software/standards; Escherichia coli/genetics; Genomics/methods; Genotyping Techniques/methods; Genotyping Techniques/standards; Mycobacterium tuberculosis/genetics; Recombination, Genetic; Sequence Alignment/methods; Sequence Alignment/standards
4.
Gigascience ; 8(7)2019 07 01.
Article in English | MEDLINE | ID: mdl-31289836

ABSTRACT

BACKGROUND: Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. Similar sequence content can confound the mapping of short next-generation sequencing reads to a reference genome. It is therefore possible that the presence of both sex chromosomes in a reference genome can cause technical artifacts in genomic data and affect downstream analyses and applications. Understanding this problem is critical for medical genomics and population genomic inference. RESULTS: Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We find that sequence homology affects read mapping on the sex chromosomes and this has downstream effects on variant calling. However, we show that XYalign can correct mismapping, resulting in more accurate variant calling. We also show how metrics output by XYalign can be used to identify XX and XY individuals across diverse sequencing experiments, including low- and high-coverage whole-genome sequencing, and exome sequencing. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other uses including the identification of aneuploidy on the autosomes. XYalign is available open source under the GNU General Public License (version 3). CONCLUSIONS: Sex chromosome sequence homology causes the mismapping of short reads, which in turn affects downstream analyses. XYalign provides a reproducible framework to correct mismapping and improve variant calling on the sex chromosomes.


Subjects
Chromosomes, Human, X/genetics; Chromosomes, Human, Y/genetics; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Sequence Homology, Nucleic Acid; Artifacts; Contig Mapping/methods; Contig Mapping/standards; Female; High-Throughput Nucleotide Sequencing/standards; Humans; Male; Sequence Alignment/methods; Sequence Alignment/standards; Sequence Analysis, DNA/standards
5.
Gigascience ; 8(7)2019 07 01.
Article in English | MEDLINE | ID: mdl-31251324

ABSTRACT

Biclustering is a technique of discovering local similarities within data. For many years the complexity of the methods and parallelization issues limited its application to big data problems. With the development of novel scalable methods, biclustering has finally started to close this gap. In this paper we discuss the caveats of biclustering and present its current challenges and guidelines for practitioners. We also try to explain why biclustering may soon become one of the standards for big data analytics.


Subjects
Big Data; Genomics/methods; Sequence Analysis, DNA/methods; Cluster Analysis; Data Mining/methods; Genome, Human; Genomics/standards; Humans; Sequence Alignment/methods; Sequence Alignment/standards; Sequence Analysis, DNA/standards; Software
6.
BMC Vet Res ; 15(1): 135, 2019 May 08.
Article in English | MEDLINE | ID: mdl-31068211

ABSTRACT

BACKGROUND: Porcine reproductive and respiratory syndrome (PRRS) is a major threat to the swine industry. It is caused by the PRRS virus (PRRSV). Determination and comparison of the nucleotide sequences of PRRSV strains provides useful information in support of control initiatives or epidemiological studies on transmission patterns. The alignment of sequences is the first step in analyzing sequence data, with multiple algorithms being available, but little is known on the impact of this methodological choice. Here, a study was conducted to evaluate the impact of different alignment algorithms on the resulting aligned sequence dataset and on practical issues when applied to a large field database of PRRSV open reading frame (ORF) 5 sequences collected in Quebec, Canada, from 2010 to 2014. Five multiple sequence alignment programs were compared: Clustal W, Clustal Omega, Muscle, T-Coffee and MAFFT. RESULTS: The resulting alignments showed very similar results in terms of average pairwise genetic similarity, proportion of pairwise comparisons having ≥97.5% genetic similarity and sum of pairs (SP) score, except for T-Coffee where increased length of aligned datasets as well as limitation to handle large datasets were observed. CONCLUSIONS: Based on efficiency at minimizing the number of gaps in different dataset sizes with default open gap values as well as the capability to handle a large number of sequences in a timely manner, the use of Clustal Omega might be recommended for the management of PRRSV extensive database for both research and surveillance purposes.
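
The sum of pairs (SP) score used in this comparison can be sketched as below: every pair of residues in every column is scored and the scores are summed. The abstract does not give the scoring scheme, so the values here (match +1, mismatch -1, residue against gap -2, gap against gap ignored) and the function name `sum_of_pairs` are assumptions chosen only to make the example concrete.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum pairwise column scores over an alignment given as equal-length rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # two gaps carry no information
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```

With a gap penalty in the score, alignments that introduce fewer gaps for the same residues score higher, which relates to the gap-minimization criterion cited in the conclusions.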


Subjects
Algorithms; Genetic Variation; Porcine respiratory and reproductive syndrome virus/genetics; Sequence Alignment/methods; Sequence Alignment/standards
7.
Genes (Basel) ; 10(2)2019 01 22.
Article in English | MEDLINE | ID: mdl-30678245

ABSTRACT

Phylogenetic trees are essential for understanding evolution and are usually constructed from multiple sequence alignments, which carry a heavy computational burden and require sophisticated parameter tuning. Recently, alignment-free methods based on k-mer profiles or common substrings have provided alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., patterns that are frequent across whole datasets. To improve on this, we propose an alignment-free algorithm based on sequential pattern mining, in which each sequence is converted into a binary representation of the sequential patterns found among the sequences. The phylogenetic tree is then constructed by clustering a distance matrix calculated from the pattern vectors. To increase accuracy for highly divergent sequences, we weight patterns and filter out redundant sub-patterns. Results on both simulated and real data demonstrate that our method outperforms other alignment-free methods, especially for large sequence sets with low similarity.


Subjects
Phylogeny; Sequence Alignment/methods; Software; Sequence Alignment/standards
8.
Syst Biol ; 68(3): 396-411, 2019 05 01.
Article in English | MEDLINE | ID: mdl-30329135

ABSTRACT

The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.].


Subjects
Classification/methods; Databases, Protein; Models, Statistical; Sequence Alignment/standards; Computer Simulation; Datasets as Topic
9.
PLoS Comput Biol ; 14(11): e1006547, 2018 11.
Article in English | MEDLINE | ID: mdl-30383764

ABSTRACT

Protein or DNA motifs are sequence regions which possess biological importance. These regions are often highly conserved among homologous sequences. The generation of multiple sequence alignments (MSAs) with a correct alignment of the conserved sequence motifs is still difficult to achieve, due to the fact that the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm in order to address this shortcoming. This method can incorporate explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments we confirm that a good compromise can be found that improves the matching of motif regions while not significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments we find there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and how the motif-aware alignment method can be employed to alleviate these problems.


Subjects
Amino Acid Motifs; DNA/chemistry; Proteins/chemistry; Sequence Alignment/standards; Algorithms; Amino Acid Sequence; Conserved Sequence; HIV-1/chemistry; Sequence Homology, Amino Acid; env Gene Products, Human Immunodeficiency Virus/chemistry
10.
J Comput Biol ; 25(8): 841-849, 2018 08.
Article in English | MEDLINE | ID: mdl-30084692

ABSTRACT

The comparison and assessment of similarity across metagenomes are still an open problem. Uncultivated samples suffer from high variability, thus making it difficult for heuristic sequence comparison methods to find precise matches in reference databases. Finer methods are required to provide higher accuracy and certainty, although these come at the expense of larger computation times. Therefore, in this work, we present our software for the highly parallel, fine-grained pairwise alignment of metagenomes. First, an analysis of the computational limitations of performing coarse-grained global alignments in a parallel manner is described, and a solution is discussed and employed by our proposal. Second, we show that our development is competitive with state-of-the-art software in terms of speed and consumption of resources, while achieving more accurate results. In addition, the parallel scheme adopted is tested, showing a performance of up to 98% efficiency while using up to 64 cores. Sequential optimizations are also tested and show a speedup of 9× over our previous proposal.
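
To put the efficiency figure quoted in this abstract in concrete terms, the sketch below uses the conventional definitions: speedup is serial time divided by parallel time, and parallel efficiency is speedup divided by the number of cores. Whether the authors measured efficiency in exactly this way is an assumption, and the function names are illustrative.

```python
def parallel_efficiency(serial_time, parallel_time, cores):
    """Efficiency = (serial_time / parallel_time) / cores."""
    speedup = serial_time / parallel_time
    return speedup / cores


def implied_speedup(efficiency, cores):
    """Speedup implied by a given efficiency on a given number of cores."""
    return efficiency * cores


if __name__ == "__main__":
    # 98% efficiency on 64 cores corresponds to a speedup of about 62.7x.
    print(implied_speedup(0.98, 64))
```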


Subjects
Computational Biology/methods; Metagenome; Metagenomics/methods; Metagenomics/standards; Sequence Alignment/standards; Software; Algorithms; Humans
11.
J Comput Biol ; 25(10): 1106-1119, 2018 10.
Article in English | MEDLINE | ID: mdl-29993269

ABSTRACT

The Smith-Waterman (SW) algorithm explores all the possible alignments between two or more sequences and as a result it returns the optimal local alignment. However, the computational cost of this algorithm is very high, and the exponential growth of computation makes SW unrealistic for searching similarities in large sets of sequences. Fortunately, the dynamic programming kernel of the SW algorithm involves mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply polyhedral compilation techniques to optimize the studied SW dense array code. In this article, we present an approach to generate efficient SW implementations for two and three sequences by using the transitive closure of a dependence graph and loop skewing. Generated programs are represented with parallel tiled loop nests, which expose significantly higher performance than that of programs obtained with closely related compilers. The approach is able to tile all loops of original loop nests as opposed to well-known affine transformation techniques. Furthermore, it allows for code optimization of three-sequence alignment. Such a code cannot be generated by means of state-of-the-art automatic optimizing compilers. We demonstrate that an under-approximation of transitive closure (instead of exact transitive closure) can be used to generate valid parallel tiled code. This considerably reduces the computational complexity of the approach. Generated codes were run on cores of a modern Intel multiprocessor and they expose high speedup and good scalability on this platform.
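
For readers unfamiliar with the dynamic programming kernel this abstract refers to, here is a plain, unoptimized two-sequence Smith-Waterman sketch with a linear gap penalty. It shows the doubly nested loop whose dependences (each cell needs its upper, left, and upper-left neighbours) are what tiling and skewing transformations must respect. The scoring values are assumptions, and none of the paper's polyhedral optimizations are reproduced.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the optimal local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: each cell depends on its upper, left, and diagonal neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "GGACGTGG"))
```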


Subjects
Algorithms; Computational Biology/methods; Sequence Alignment/methods; Sequence Alignment/standards; Humans; Software
12.
Genes Genomics ; 40(2): 189-197, 2018.
Article in English | MEDLINE | ID: mdl-29568413

ABSTRACT

Alongside the rapid advancement of Next-Generation Sequencing (NGS) technology, clinical panel sequencing is being used increasingly in clinical studies and tests. However, the tools used in NGS data analysis have not been comparatively evaluated for their performance on panel sequencing. This study aimed to evaluate the tools used in the alignment process, the first procedure in bioinformatics analysis, by comparing tools that have been widely used with ones that have been introduced recently. Using accumulated panel sequencing data, detected variant lists were cataloged and inserted into simulated reads produced from the reference genome (hg19). The numbers of unmapped and misaligned reads, the mapping quality distribution, and the runtime were measured as standards for comparison. The most widely used tools, Bowtie2 and BWA-MEM, performed well, with AUCs of 0.9984 and 0.9970, respectively. Kart, which had a superior runtime and fewer misaligned reads, also achieved a high AUC (0.9723). This method of selecting and optimizing tools appropriate for panel sequencing can be applied in fields requiring error minimization, such as clinical applications and liquid biopsy studies.


Subjects
Computer Simulation; Genome, Human; High-Throughput Nucleotide Sequencing/methods; Sequence Alignment/methods; Software; Genomics/methods; Genomics/standards; High-Throughput Nucleotide Sequencing/standards; Humans; Sequence Alignment/standards; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/standards
13.
Mitochondrial DNA A DNA Mapp Seq Anal ; 29(7): 1128-1138, 2018 10.
Article in English | MEDLINE | ID: mdl-29338473

ABSTRACT

Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on the comparison of single DNA sequences, or a concatenation of a number of these. However, with the advent of next-generation DNA sequencing technologies, the approaches that consider large genomic data sets are of growing importance for the elucidation of evolutionary relationships among species. Among these approaches, the assembly and alignment-free methods which allow an efficient distance computation and phylogeny reconstruction are of great importance. However, it is not yet clear under what quality conditions and abundance of genomic data such methods are able to infer phylogenies accurately. In the present study we assess the method originally proposed by Fan et al. for whole genome data, in the elucidation of Tomatoes' chloroplast phylogenetics using short read sequences. We find that this assembly and alignment-free method is capable of reproducing previous results under conditions of high coverage, given that low frequency k-mers (i.e. error prone data) are effectively filtered out. Finally, we present a complete chloroplast phylogeny for the best data quality candidates of the recently published 360 tomato genomes.


Subjects
DNA Barcoding, Taxonomic/methods; DNA, Chloroplast/genetics; Phylogeny; Sequence Alignment/methods; Solanum lycopersicum/genetics; DNA Barcoding, Taxonomic/standards; Solanum lycopersicum/classification; Sequence Alignment/standards
14.
Genomics ; 110(5): 263-273, 2018 09.
Article in English | MEDLINE | ID: mdl-29180261

ABSTRACT

Several proteins and genes are members of families that share a common evolutionary origin. Sequence comparison has therefore become an important process for outlining evolutionary relationships and recognizing conserved patterns. The current work critically investigates the role of k-mers in the composition vector method for comparing genome sequences. Generally, composition vector methods are applied with different choices of k to compare genome sequences. For some values of k the results are satisfactory, but for others they are not. In the proposed work, the standard composition vector method is carried out using 3-mers. In addition, a special type of information-based similarity index is used as a distance measure. The work establishes that the use of 3-mers together with the information-based similarity index provides satisfactory results in all cases, especially for the comparison of whole genome sequences. These choices provide a sort of unified approach to the comparison of genome sequences.
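
A minimal sketch of the k-mer composition idea with k = 3 follows: each sequence is turned into a vector of 64 normalized 3-mer frequencies, and two vectors are then compared. The abstract does not define its information-based similarity index, so plain cosine distance is swapped in here as a stand-in, and the function names are assumptions; this is not the paper's measure.

```python
import math
from collections import Counter
from itertools import product


def kmer_vector(seq, k=3, alphabet="ACGT"):
    """Normalized frequencies of all len(alphabet)**k k-mers, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def cosine_distance(u, v):
    """1 - cosine similarity; a stand-in for the paper's similarity index."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


if __name__ == "__main__":
    x = kmer_vector("ACGTACGTACGTTTGA")
    y = kmer_vector("ACGTACGAACGTTTGA")
    print(cosine_distance(x, y))
```

Because the vector length depends only on k and not on sequence length, the same comparison applies to whole genomes without any alignment step.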


Subjects
Algorithms; Genomics/methods; Sequence Alignment/methods; Animals; Humans; Sequence Alignment/standards
15.
Arch Med Sadowej Kryminol ; 68(4): 242-258, 2018.
Article in English | MEDLINE | ID: mdl-31025842

ABSTRACT

Although mitochondrial DNA (mtDNA) testing has been used in forensic genetics only since the mid-1990s, forensic DNA laboratories have been recently increasing the range of mtDNA sequencing, employing new analytical approaches and methods of data analysis. Therefore, it seems fitting to gather and systematize existing recommendations in the field of mtDNA analysis for forensic purposes, and formulate a set of interpretative guidelines which are especially relevant in view of recent developments in the forensic casework. The starting point is the recommendations of the International Society for Forensic Genetics (ISFG) which, in the opinion of the Polish Speaking Working Group of the ISFG (ISFG-PL), should be followed by all Polish laboratories conducting forensic testing.


Subjects
DNA Fingerprinting/standards; DNA, Mitochondrial/genetics; Forensic Genetics/standards; Sequence Analysis, DNA/standards; Forensic Genetics/methods; Humans; Poland; Sequence Alignment/standards; Societies, Scientific
16.
Gigascience ; 6(11): 1-6, 2017 11 01.
Article in English | MEDLINE | ID: mdl-29048539

ABSTRACT

The BAM and CRAM formats provide a supplementary linear index that facilitates rapid access to sequence alignments in arbitrary genomic regions. Comparing consecutive entries in a BAM or CRAM index allows one to infer the number of alignment records per genomic region for use as an effective proxy of sequence depth in each genomic region. Based on these properties, we have developed indexcov, an efficient estimator of whole-genome sequencing coverage to rapidly identify samples with aberrant coverage profiles, reveal large-scale chromosomal anomalies, recognize potential batch effects, and infer the sex of a sample. Indexcov is available at https://github.com/brentp/goleft under the MIT license.
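
The index-based depth proxy described in this abstract can be illustrated for the BAM case. In the BAM linear index, each entry corresponds to a 16,384 bp window and holds a 64-bit virtual file offset whose upper 48 bits give the position of a compressed block in the file. The sketch below takes a list of such offsets that has already been read from an index (the parsing itself is omitted, and the input values are made up), measures how many compressed bytes separate consecutive windows, and scales by the median. It is an assumption-laden illustration of the principle, not indexcov's implementation.

```python
from statistics import median

TILE_BP = 16_384  # width of one BAM linear-index window, per the BAM specification


def depth_proxy(virtual_offsets):
    """Return (window_start_bp, relative_depth) pairs from linear-index offsets.

    Relative depth is the compressed byte span of a window divided by the
    median span of all non-empty windows.
    """
    # Upper 48 bits of a virtual offset = start of the compressed block.
    compressed = [v >> 16 for v in virtual_offsets]
    spans = [nxt - cur for cur, nxt in zip(compressed, compressed[1:])]
    nonzero = [s for s in spans if s > 0]
    scale = median(nonzero) if nonzero else 1
    return [(i * TILE_BP, s / scale) for i, s in enumerate(spans)]


if __name__ == "__main__":
    # Hypothetical offsets: the third window spans twice as many bytes.
    offsets = [x << 16 for x in (0, 100, 200, 400, 500)]
    for start, rel in depth_proxy(offsets):
        print(start, rel)
```

A window whose relative value sits well above or below 1 for a long stretch would be a candidate for the large-scale chromosomal anomalies the abstract mentions.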


Subjects
Sequence Alignment/standards; Software/standards; Whole Genome Sequencing/standards; Genome, Human; Humans; Quality Control; Reproducibility of Results; Sequence Alignment/methods; Whole Genome Sequencing/methods
17.
Sci Rep ; 7(1): 10963, 2017 09 08.
Article in English | MEDLINE | ID: mdl-28887485

ABSTRACT

Complementary to reference-based variant detection, recent studies revealed that many novel variants could be detected with de novo assembled genomes. To evaluate the effect of read coverage and the accuracy of assembly-based variant calling, we simulated short reads containing more than 3 million single nucleotide variants (SNVs) from the whole human genome and compared the efficiency of SNV calling between the assembly-based and alignment-based calling approaches. We assessed the quality of the assembled contigs and found that a minimum of 30X coverage of short reads was needed to ensure reliable SNV calling and to generate assembled contigs with good coverage of the genome and genes. In addition, we observed that the assembly-based approach had much lower recall and precision than the alignment-based approach, which recovered 99% of imputed SNVs. We observed similar results with experimental reads for NA24385, an individual whose germline variants were well characterized. Although it adds value for SNV detection, the assembly-based approach carries a great risk of false discovery of novel SNVs. Further improvements to de novo assembly algorithms are needed to ensure good genome completeness, with resolved haplotypes and high fidelity of the assembled sequences.


Subjects
Contig Mapping/methods; Genome-Wide Association Study/methods; Polymorphism, Single Nucleotide; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Algorithms; Contig Mapping/standards; Genome-Wide Association Study/standards; Humans; Sequence Alignment/standards; Sequence Analysis, DNA/standards
18.
Gigascience ; 6(7): 1-8, 2017 07 01.
Article in English | MEDLINE | ID: mdl-28531267

ABSTRACT

The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.


Subjects
Contig Mapping/methods; Human Genome Project; Sequence Alignment/methods; Whole Genome Sequencing/methods; Algorithms; Contig Mapping/standards; Humans; Reference Standards; Sequence Alignment/standards; Whole Genome Sequencing/standards
19.
Genet Mol Res ; 16(2)2017 Apr 20.
Article in English | MEDLINE | ID: mdl-28437554

ABSTRACT

Molecular identification is very useful in cases where morphology-based species identification is not possible. Examples for its application in cetaceans include the identification of carcasses of stranded animals in advanced state of decomposition and body parts that are illegally traded. One DNA region that is often used for molecular identification is the Folmer region of the mitochondrial gene cytochrome c oxidase subunit I (COI) (locus 48 to 705 bp). This locus has been used for the identification of several animal species, including whales and dolphins. The goal of the present study was to evaluate the usefulness of another region of COI, the E3-I5 (locus 685 to locus 1179; 495 bp) as a marker for identification of cetaceans from northeastern Canada and northeastern Brazil. The identification markers were successfully obtained for seven cetacean species after performing percent identity and Basic Local Alignment Search Tool analyses. The obtained markers are now publicly available and are useful for the identification of the endangered blue whale (Balaenoptera musculus), common minke whale (B. acutorostrata), vulnerable sperm whale (Physeter macrocephalus), harbor porpoise (Phocoena phocoena), common bottlenose dolphin (Tursiops truncatus), Guiana dolphin (Sotalia guianensis), and melon-headed whale (Peponocephala electra).


Subjects
Cetacea/genetics; DNA Barcoding, Taxonomic/standards; Electron Transport Complex IV/genetics; Sequence Alignment/standards; Animals; Cetacea/classification; DNA Barcoding, Taxonomic/methods; Endangered Species; Genetic Markers; Reference Standards; Sequence Alignment/methods
20.
G3 (Bethesda) ; 7(5): 1405-1416, 2017 05 05.
Article in English | MEDLINE | ID: mdl-28235826

ABSTRACT

Comparing genomes of closely related genotypes from populations with distinct demographic histories can help reveal the impact of effective population size on genome evolution. For this purpose, we present a high quality genome assembly of Daphnia pulex (PA42), and compare this with the first sequenced genome of this species (TCO), which was derived from an isolate from a population with >90% reduction in nucleotide diversity. PA42 has numerous similarities to TCO at the gene level, with an average amino acid sequence identity of 98.8% and >60% of orthologous proteins identical. Nonetheless, there is a highly elevated number of genes in the TCO genome annotation, with ∼7000 excess genes appearing to be false positives. This view is supported by the high GC content, lack of introns, and short length of these suspicious gene annotations. Consistent with the view that reduced effective population size can facilitate the accumulation of slightly deleterious genomic features, we observe more proliferation of transposable elements (TEs) and a higher frequency of gained introns in the TCO genome.


Subjects
Daphnia/genetics; Whole Genome Sequencing/methods; Animals; DNA Transposable Elements; Introns; Molecular Sequence Annotation/methods; Molecular Sequence Annotation/standards; Reference Standards; Sensitivity and Specificity; Sequence Alignment/methods; Sequence Alignment/standards; Whole Genome Sequencing/standards