Search | VHL Regional Portal

1.

Post-Alignment Adjustment and Its Automation.

Xia, Xuhua.

Genes (Basel) ; 12(11)2021 11 18.

Article in English | MEDLINE | ID: mdl-34828415

ABSTRACT

Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.
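The abstract above names the position weight matrix (PWM) as a refinement tool that draws on every sequence in the alignment. The following is a minimal sketch of the general PWM idea only, not the author's implementation: the function names `build_pwm` and `score`, the uniform background frequencies, and the pseudocount of 1 are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_pwm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build log2-odds scores per column from equal-length aligned sequences.

    Assumptions (not from the abstract): uniform background frequencies,
    a fixed pseudocount, and gap characters simply ignored when counting.
    """
    background = 1.0 / len(alphabet)
    pwm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pwm.append({
            sym: math.log2(((counts[sym] + pseudocount) / total) / background)
            for sym in alphabet
        })
    return pwm


def score(pwm, segment, offset=0):
    """Sum the column scores of `segment` placed at column `offset`."""
    return sum(pwm[offset + i].get(sym, 0.0) for i, sym in enumerate(segment))


alignment = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = build_pwm(alignment)
print(score(pwm, "ACGT"), score(pwm, "GTAC"))
```

A segment that agrees with the column consensus scores higher than one that does not, which is the kind of signal a refinement step could use to choose between alternative placements of a poorly aligned segment.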

Subject(s)

Automation, Laboratory/methods , Genomics/methods , Sequence Alignment/methods , Algorithms , Animals , Automation, Laboratory/standards , Genomics/standards , Humans , Phylogeny , Sensitivity and Specificity , Sequence Alignment/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/standards

2.

Progressive Cactus is a multiple-genome aligner for the thousand-genome era.

Armstrong, Joel; Hickey, Glenn; Diekhans, Mark; Fiddes, Ian T; Novak, Adam M; Deran, Alden; Fang, Qi; Xie, Duo; Feng, Shaohong; Stiller, Josefin; Genereux, Diane; Johnson, Jeremy; Marinescu, Voichita Dana; Alföldi, Jessica; Harris, Robert S; Lindblad-Toh, Kerstin; Haussler, David; Karlsson, Elinor; Jarvis, Erich D; Zhang, Guojie; Paten, Benedict.

Nature ; 587(7833): 246-251, 2020 11.

Article in English | MEDLINE | ID: mdl-33177663

ABSTRACT

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies [1-3]. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database [4] increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies [5] are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus [6], a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.

Subject(s)

Genome/genetics , Genomics/methods , Sequence Alignment/methods , Software , Vertebrates/genetics , Amnion , Animals , Computer Simulation , Genomics/standards , Haplotypes , Humans , Quality Control , Sequence Alignment/standards , Software/standards

3.

Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines.

Bush, Stephen J; Foster, Dona; Eyre, David W; Clark, Emily L; De Maio, Nicola; Shaw, Liam P; Stoesser, Nicole; Peto, Tim E A; Crook, Derrick W; Walker, A Sarah.

Gigascience ; 9(2)2020 02 01.

Article in English | MEDLINE | ID: mdl-32025702

ABSTRACT

BACKGROUND: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. RESULTS: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. CONCLUSIONS: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
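The abstract above compares pipelines by sensitivity and precision. As a hedged illustration of how those two quantities are commonly computed from a call set and a truth set, here is a small sketch; the function name and the `(contig, position, ref, alt)` tuple representation are assumptions for this example and are not taken from the study.

```python
def sensitivity_precision(called, truth):
    """Return (sensitivity, precision) for two collections of SNP calls.

    Each call is assumed to be a hashable tuple such as
    (contig, position, ref_base, alt_base); this layout is illustrative.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(called) if called else 0.0
    return sensitivity, precision


truth = {("chr", 10, "A", "G"), ("chr", 20, "C", "T"),
         ("chr", 30, "G", "A"), ("chr", 40, "T", "C")}
called = {("chr", 10, "A", "G"), ("chr", 20, "C", "T"),
          ("chr", 30, "G", "A"), ("chr", 55, "A", "C")}
print(sensitivity_precision(called, truth))
```

With simulated reads the truth set is known by construction, which is why simulation is a convenient way to measure both values at once.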

Subject(s)

Genome, Bacterial , Genomics/standards , Polymorphism, Single Nucleotide , Software/standards , Escherichia coli/genetics , Genomics/methods , Genotyping Techniques/methods , Genotyping Techniques/standards , Mycobacterium tuberculosis/genetics , Recombination, Genetic , Sequence Alignment/methods , Sequence Alignment/standards

4.

Identifying, understanding, and correcting technical artifacts on the sex chromosomes in next-generation sequencing data.

Webster, Timothy H; Couse, Madeline; Grande, Bruno M; Karlins, Eric; Phung, Tanya N; Richmond, Phillip A; Whitford, Whitney; Wilson, Melissa A.

Gigascience ; 8(7)2019 07 01.

Article in English | MEDLINE | ID: mdl-31289836

ABSTRACT

BACKGROUND: Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. Similar sequence content can confound the mapping of short next-generation sequencing reads to a reference genome. It is therefore possible that the presence of both sex chromosomes in a reference genome can cause technical artifacts in genomic data and affect downstream analyses and applications. Understanding this problem is critical for medical genomics and population genomic inference. RESULTS: Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We find that sequence homology affects read mapping on the sex chromosomes and this has downstream effects on variant calling. However, we show that XYalign can correct mismapping, resulting in more accurate variant calling. We also show how metrics output by XYalign can be used to identify XX and XY individuals across diverse sequencing experiments, including low- and high-coverage whole-genome sequencing, and exome sequencing. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other uses including the identification of aneuploidy on the autosomes. XYalign is available open source under the GNU General Public License (version 3). CONCLUSIONS: Sex chromosome sequence homology causes the mismapping of short reads, which in turn affects downstream analyses. XYalign provides a reproducible framework to correct mismapping and improve variant calling on the sex chromosomes.

Subject(s)

Chromosomes, Human, X/genetics , Chromosomes, Human, Y/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Sequence Homology, Nucleic Acid , Artifacts , Contig Mapping/methods , Contig Mapping/standards , Female , High-Throughput Nucleotide Sequencing/standards , Humans , Male , Sequence Alignment/methods , Sequence Alignment/standards , Sequence Analysis, DNA/standards

5.

Scalable biclustering - the future of big data exploration?

Orzechowski, Patryk; Boryczko, Krzysztof; Moore, Jason H.

Gigascience ; 8(7)2019 07 01.

Article in English | MEDLINE | ID: mdl-31251324

ABSTRACT

Biclustering is a technique of discovering local similarities within data. For many years the complexity of the methods and parallelization issues limited its application to big data problems. With the development of novel scalable methods, biclustering has finally started to close this gap. In this paper we discuss the caveats of biclustering and present its current challenges and guidelines for practitioners. We also try to explain why biclustering may soon become one of the standards for big data analytics.

Subject(s)

Big Data , Genomics/methods , Sequence Analysis, DNA/methods , Cluster Analysis , Data Mining/methods , Genome, Human , Genomics/standards , Humans , Sequence Alignment/methods , Sequence Alignment/standards , Sequence Analysis, DNA/standards , Software

6.

Impact of alignment algorithm on the estimation of pairwise genetic similarity of porcine reproductive and respiratory syndrome virus (PRRSV).

Lambert, Marie-Ève; Arsenault, Julie; Delisle, Benjamin; Audet, Pascal; Poljak, Zvonimir; D'Allaire, Sylvie.

BMC Vet Res ; 15(1): 135, 2019 May 08.

Article in English | MEDLINE | ID: mdl-31068211

ABSTRACT

BACKGROUND: Porcine reproductive and respiratory syndrome (PRRS) is a major threat to the swine industry. It is caused by the PRRS virus (PRRSV). Determination and comparison of the nucleotide sequences of PRRSV strains provides useful information in support of control initiatives or epidemiological studies on transmission patterns. The alignment of sequences is the first step in analyzing sequence data, with multiple algorithms being available, but little is known on the impact of this methodological choice. Here, a study was conducted to evaluate the impact of different alignment algorithms on the resulting aligned sequence dataset and on practical issues when applied to a large field database of PRRSV open reading frame (ORF) 5 sequences collected in Quebec, Canada, from 2010 to 2014. Five multiple sequence alignment programs were compared: Clustal W, Clustal Omega, Muscle, T-Coffee and MAFFT. RESULTS: The resulting alignments showed very similar results in terms of average pairwise genetic similarity, proportion of pairwise comparisons having ≥97.5% genetic similarity and sum of pairs (SP) score, except for T-Coffee where increased length of aligned datasets as well as limitation to handle large datasets were observed. CONCLUSIONS: Based on efficiency at minimizing the number of gaps in different dataset sizes with default open gap values as well as the capability to handle a large number of sequences in a timely manner, the use of Clustal Omega might be recommended for the management of PRRSV extensive database for both research and surveillance purposes.
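The abstract above reports average pairwise genetic similarity and the proportion of pairs at or above 97.5% similarity. A minimal sketch of how such summaries could be derived from an existing alignment follows. It assumes similarity means percent identity over columns where neither sequence has a gap; the study may define it differently, and the function names are illustrative.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        matches += (x == y)
    return 100.0 * matches / compared if compared else 0.0


def summarize(alignment, threshold=97.5):
    """Return (mean pairwise identity, share of pairs at or above threshold)."""
    values = [pairwise_identity(a, b) for a, b in combinations(alignment, 2)]
    mean = sum(values) / len(values)
    share = sum(v >= threshold for v in values) / len(values)
    return mean, share


print(summarize(["ACGT", "ACGT", "ACGA"]))
```

Because the gap-handling rule changes the denominator, two aligners that place gaps differently can yield different similarity values for the same pair of sequences.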

Subject(s)

Algorithms , Genetic Variation , Porcine respiratory and reproductive syndrome virus/genetics , Sequence Alignment/methods , Sequence Alignment/standards

7.

PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction.

Kang, Yongyong; Yang, Xiaofei; Lin, Jiadong; Ye, Kai.

Genes (Basel) ; 10(2)2019 01 22.

Article in English | MEDLINE | ID: mdl-30678245

ABSTRACT

A phylogenetic tree is essential for understanding evolution and is usually constructed through multiple sequence alignment, which suffers from a heavy computational burden and requires sophisticated parameter tuning. Recently, alignment-free methods based on k-mer profiles or common substrings have provided alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., frequent patterns over all datasets. To make further improvement, we propose an alignment-free algorithm based on sequential pattern mining, where each sequence is converted into a binary representation of sequential patterns among sequences. The phylogenetic tree is then constructed by clustering a distance matrix calculated from the pattern vectors. To increase accuracy for highly divergent sequences, we consider pattern weights and filter redundant sub-patterns. Both simulated and real data demonstrate that our method outperforms other alignment-free methods, especially for large sequence sets with low similarity.

Subject(s)

Phylogeny , Sequence Alignment/methods , Software , Sequence Alignment/standards

8.

Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets.

Nute, Michael; Saleh, Ehsan; Warnow, Tandy.

Syst Biol ; 68(3): 396-411, 2019 05 01.

Article in English | MEDLINE | ID: mdl-30329135

ABSTRACT

The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.].

Subject(s)

Classification/methods , Databases, Protein , Models, Statistical , Sequence Alignment/standards , Computer Simulation , Datasets as Topic

9.

Motif-Aware PRALINE: Improving the alignment of motif regions.

Dijkstra, Maurits; Bawono, Punto; Abeln, Sanne; Feenstra, K Anton; Fokkink, Wan; Heringa, Jaap.

PLoS Comput Biol ; 14(11): e1006547, 2018 11.

Article in English | MEDLINE | ID: mdl-30383764

ABSTRACT

Protein or DNA motifs are sequence regions which possess biological importance. These regions are often highly conserved among homologous sequences. The generation of multiple sequence alignments (MSAs) with a correct alignment of the conserved sequence motifs is still difficult to achieve, due to the fact that the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm in order to address this shortcoming. This method can incorporate explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments we confirm that a good compromise can be found that improves the matching of motif regions while not significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments we find there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and how the motif-aware alignment method can be employed to alleviate these problems.

Subject(s)

Amino Acid Motifs , DNA/chemistry , Proteins/chemistry , Sequence Alignment/standards , Algorithms , Amino Acid Sequence , Conserved Sequence , HIV-1/chemistry , Sequence Homology, Amino Acid , env Gene Products, Human Immunodeficiency Virus/chemistry

10.

Precise and Parallel Pairwise Metagenomic Comparisons.

Pérez-Wohlfeil, Esteban; Trelles, Oswaldo.

J Comput Biol ; 25(8): 841-849, 2018 08.

Article in English | MEDLINE | ID: mdl-30084692

ABSTRACT

The comparison and assessment of similarity across metagenomes remain an open problem. Uncultivated samples suffer from high variability, thus making it difficult for heuristic sequence comparison methods to find precise matches in reference databases. Finer methods are required to provide higher accuracy and certainty, although these come at the expense of larger computation times. Therefore, in this work, we present our software for the highly parallel, fine-grained pairwise alignment of metagenomes. First, an analysis of the computational limitations of performing coarse-grained global alignments in a parallel manner is described, and a solution is discussed and employed by our proposal. Second, we show that our development is competitive with state-of-the-art software in terms of speed and consumption of resources, while achieving more accurate results. In addition, the parallel scheme adopted is tested, showing up to 98% efficiency while using up to 64 cores. Sequential optimizations are also tested and show a speedup of 9× over our previous proposal.

Subject(s)

Computational Biology/methods , Metagenome , Metagenomics/methods , Metagenomics/standards , Sequence Alignment/standards , Software , Algorithms , Humans

11.

Parallel Tiled Codes Implementing the Smith-Waterman Alignment Algorithm for Two and Three Sequences.

Palkowski, Marek; Bielecki, Wlodzimierz.

J Comput Biol ; 25(10): 1106-1119, 2018 10.

Article in English | MEDLINE | ID: mdl-29993269

ABSTRACT

The Smith-Waterman (SW) algorithm explores all the possible alignments between two or more sequences and as a result it returns the optimal local alignment. However, the computational cost of this algorithm is very high, and the exponential growth of computation makes SW unrealistic for searching similarities in large sets of sequences. Fortunately, the dynamic programming kernel of the SW algorithm involves mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply polyhedral compilation techniques to optimize the studied SW dense array code. In this article, we present an approach to generate efficient SW implementations for two and three sequences by using the transitive closure of a dependence graph and loop skewing. Generated programs are represented with parallel tiled loop nests, which expose significantly higher performance than that of programs obtained with closely related compilers. The approach is able to tile all loops of original loop nests as opposed to well-known affine transformation techniques. Furthermore, it allows for code optimization of three-sequence alignment. Such a code cannot be generated by means of state-of-the-art automatic optimizing compilers. We demonstrate that an under-approximation of transitive closure (instead of exact transitive closure) can be used to generate valid parallel tiled code. This considerably reduces the computational complexity of the approach. Generated codes were run on cores of a modern Intel multiprocessor and they expose high speedup and good scalability on this platform.
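For orientation, the dynamic programming kernel mentioned in the abstract above can be sketched for the two-sequence case. This is a plain, sequential textbook version with a linear gap penalty, not the parallel tiled code the article generates; the scoring values and the function name are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Each cell depends on its upper-left, upper, and left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTT", "GGACGGG"))
```

The doubly nested loop with its three neighbouring dependencies is the regular structure that loop tiling and skewing techniques operate on.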

Subject(s)

Algorithms , Computational Biology/methods , Sequence Alignment/methods , Sequence Alignment/standards , Humans , Software

12.

Performance evaluation method for read mapping tool in clinical panel sequencing.

Lee, Hojun; Lee, Ki-Wook; Lee, Taeseob; Park, Donghyun; Chung, Jongsuk; Lee, Chung; Park, Woong-Yang; Son, Dae-Soon.

Genes Genomics ; 40(2): 189-197, 2018.

Article in English | MEDLINE | ID: mdl-29568413

ABSTRACT

In addition to the rapid advancement in Next-Generation Sequencing (NGS) technology, clinical panel sequencing is being used increasingly in clinical studies and tests. However, the performance of tools used in NGS data analysis has not been comparatively evaluated for panel sequencing. This study aimed to evaluate the tools used in the alignment process, the first procedure in bioinformatics analysis, by comparing tools that have been widely used with ones that have been introduced recently. With the accumulated panel sequencing data, detected variant lists were cataloged and inserted into simulated reads produced from the reference genome (h19). The numbers of unmapped and misaligned reads, the mapping quality distribution, and the runtime were measured as standards for comparison. The most widely used tools, Bowtie2 and BWA-MEM, performed well, with AUCs of 0.9984 and 0.9970, respectively. Kart, which had a superior runtime and fewer misaligned reads, also achieved a high AUC (0.9723). Such a method for selecting and optimizing tools appropriate for panel sequencing can be used in fields requiring error minimization, such as clinical applications and liquid biopsy studies.

Subject(s)

Computer Simulation , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Software , Genomics/methods , Genomics/standards , High-Throughput Nucleotide Sequencing/standards , Humans , Sequence Alignment/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards

13.

Phylogenomics of tomato chloroplasts using assembly and alignment-free method.

Amado Cattáneo, Raúl Martin; Diambra, Luis; McCarthy, Andrés Norman.

Mitochondrial DNA A DNA Mapp Seq Anal ; 29(7): 1128-1138, 2018 10.

Article in English | MEDLINE | ID: mdl-29338473

ABSTRACT

Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on the comparison of single DNA sequences, or a concatenation of a number of these. However, with the advent of next-generation DNA sequencing technologies, approaches that consider large genomic data sets are of growing importance for the elucidation of evolutionary relationships among species. Among these approaches, assembly and alignment-free methods, which allow efficient distance computation and phylogeny reconstruction, are of great importance. However, it is not yet clear under what conditions of quality and abundance of genomic data such methods are able to infer phylogenies accurately. In the present study we assess the method originally proposed by Fan et al. for whole genome data, in the elucidation of tomato chloroplast phylogenetics using short read sequences. We find that this assembly and alignment-free method is capable of reproducing previous results under conditions of high coverage, given that low frequency k-mers (i.e. error prone data) are effectively filtered out. Finally, we present a complete chloroplast phylogeny for the best data quality candidates of the recently published 360 tomato genomes.

Subject(s)

DNA Barcoding, Taxonomic/methods , DNA, Chloroplast/genetics , Phylogeny , Sequence Alignment/methods , Solanum lycopersicum/genetics , DNA Barcoding, Taxonomic/standards , Solanum lycopersicum/classification , Sequence Alignment/standards

14.

Optimal choice of k-mer in composition vector method for genome sequence comparison.

Das, Subhram; Deb, Tamal; Dey, Nilanjan; Ashour, Amira S; Bhattacharya, D K; Tibarewala, D N.

Genomics ; 110(5): 263-273, 2018 09.

Article in English | MEDLINE | ID: mdl-29180261

ABSTRACT

Several proteins and genes are members of families that share a common evolutionary history. To outline evolutionary relationships and recognize conserved patterns, sequence comparison is emerging as a key process. The current work critically investigates the role of k-mers in the composition vector method for comparing genome sequences. Generally, composition vector methods using k-mers are applied with different choices of k to compare genome sequences. For some values of k the results are satisfactory, but for others they are not. In the proposed work, the standard composition vector method is carried out using a 3-mer string length. In addition, a special type of information-based similarity index is used as a distance measure. The work establishes that using 3-mers with the information-based similarity index provides satisfactory results in all cases, especially for the comparison of whole genome sequences. These selections provide a sort of unified approach to the comparison of genome sequences.
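To make the k-mer idea in the abstract above concrete, here is a simplified sketch that builds a 3-mer frequency vector for each sequence and compares two vectors. Two caveats apply: it uses raw k-mer frequencies rather than the full composition vector formulation, and it substitutes cosine distance for the article's information-based similarity index, which the abstract does not specify. All names are illustrative.

```python
import math
from collections import Counter
from itertools import product


def composition_vector(seq, k=3, alphabet="ACGT"):
    """Return the k-mer frequency vector of `seq` in a fixed k-mer order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # k-mers containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def cosine_distance(u, v):
    """1 - cosine similarity; a stand-in for the article's distance measure."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


u = composition_vector("ACGTACGTACGT")
v = composition_vector("ACGTACGAACGT")
w = composition_vector("GGGGCCCCGGGG")
print(cosine_distance(u, v), cosine_distance(u, w))
```

With k = 3 and a four-letter alphabet each vector has 4^3 = 64 entries, so the cost of comparing two genomes is independent of their length once the vectors are built.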

Subject(s)

Algorithms , Genomics/methods , Sequence Alignment/methods , Animals , Humans , Sequence Alignment/standards

15.

Recommendations of the Polish Speaking Working Group of the International Society for Forensic Genetics for forensic mitochondrial DNA testing.

Grzybowski, Tomasz; Pawlowski, Ryszard; Kupiec, Tomasz; Branicki, Wojciech; Jacewicz, Renata.

Arch Med Sadowej Kryminol ; 68(4): 242-258, 2018.

Article in English | MEDLINE | ID: mdl-31025842

ABSTRACT

Although mitochondrial DNA (mtDNA) testing has been used in forensic genetics only since the mid-1990s, forensic DNA laboratories have been recently increasing the range of mtDNA sequencing, employing new analytical approaches and methods of data analysis. Therefore, it seems fitting to gather and systematize existing recommendations in the field of mtDNA analysis for forensic purposes, and formulate a set of interpretative guidelines which are especially relevant in view of recent developments in the forensic casework. The starting point is the recommendations of the International Society for Forensic Genetics (ISFG) which, in the opinion of the Polish Speaking Working Group of the ISFG (ISFG-PL), should be followed by all Polish laboratories conducting forensic testing.

Subject(s)

DNA Fingerprinting/standards , DNA, Mitochondrial/genetics , Forensic Genetics/standards , Sequence Analysis, DNA/standards , Forensic Genetics/methods , Humans , Poland , Sequence Alignment/standards , Societies, Scientific

16.

Indexcov: fast coverage quality control for whole-genome sequencing.

Pedersen, Brent S; Collins, Ryan L; Talkowski, Michael E; Quinlan, Aaron R.

Gigascience ; 6(11): 1-6, 2017 11 01.

Article in English | MEDLINE | ID: mdl-29048539

ABSTRACT

The BAM and CRAM formats provide a supplementary linear index that facilitates rapid access to sequence alignments in arbitrary genomic regions. Comparing consecutive entries in a BAM or CRAM index allows one to infer the number of alignment records per genomic region for use as an effective proxy of sequence depth in each genomic region. Based on these properties, we have developed indexcov, an efficient estimator of whole-genome sequencing coverage to rapidly identify samples with aberrant coverage profiles, reveal large-scale chromosomal anomalies, recognize potential batch effects, and infer the sex of a sample. Indexcov is available at https://github.com/brentp/goleft under the MIT license.

Subject(s)

Sequence Alignment/standards , Software/standards , Whole Genome Sequencing/standards , Genome, Human , Humans , Quality Control , Reproducibility of Results , Sequence Alignment/methods , Whole Genome Sequencing/methods

17.

Direct comparison of performance of single nucleotide variant calling in human genome with alignment-based and assembly-based approaches.

Wu, Leihong; Yavas, Gokhan; Hong, Huixiao; Tong, Weida; Xiao, Wenming.

Sci Rep ; 7(1): 10963, 2017 09 08.

Article in English | MEDLINE | ID: mdl-28887485

ABSTRACT

Complementary to reference-based variant detection, recent studies revealed that many novel variants could be detected with de novo assembled genomes. To evaluate the effect of read coverage and the accuracy of assembly-based variant calling, we simulated short reads containing more than 3 million single nucleotide variants (SNVs) from the whole human genome and compared the efficiency of SNV calling between the assembly-based and alignment-based calling approaches. We assessed the quality of the assembled contigs and found that a minimum of 30X coverage of short reads was needed to ensure reliable SNV calling and to generate assembled contigs with good coverage of the genome and genes. In addition, we observed that the assembly-based approach had a much lower recall rate and precision compared with the alignment-based approach, which would recover 99% of imputed SNVs. We observed similar results with experimental reads for NA24385, an individual whose germline variants were well characterized. Although it adds value for SNV detection, the assembly-based approach carries a great risk of false discovery of novel SNVs. Further improvement of de novo assembly algorithms is needed to ensure good genome completeness with resolved haplotypes and high fidelity of assembled sequences.

Subject(s)

Contig Mapping/methods , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Contig Mapping/standards , Genome-Wide Association Study/standards , Humans , Sequence Alignment/standards , Sequence Analysis, DNA/standards

18.

Alignment of 1000 Genomes Project reads to reference assembly GRCh38.

Zheng-Bradley, Xiangqun; Streeter, Ian; Fairley, Susan; Richardson, David; Clarke, Laura; Flicek, Paul.

Gigascience ; 6(7): 1-8, 2017 07 01.

Article in English | MEDLINE | ID: mdl-28531267

ABSTRACT

The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.

Subject(s)

Contig Mapping/methods , Human Genome Project , Sequence Alignment/methods , Whole Genome Sequencing/methods , Algorithms , Contig Mapping/standards , Humans , Reference Standards , Sequence Alignment/standards , Whole Genome Sequencing/standards

19.

Molecular identification of cetaceans from the West Atlantic using the E3-I5 region of COI.

Falcão, L H O; Campos, A S; Freitas, J E P; Furtado-Neto, M A A; Faria, V V.

Genet Mol Res ; 16(2)2017 Apr 20.

Article in English | MEDLINE | ID: mdl-28437554

ABSTRACT

Molecular identification is very useful in cases where morphology-based species identification is not possible. Examples for its application in cetaceans include the identification of carcasses of stranded animals in advanced state of decomposition and body parts that are illegally traded. One DNA region that is often used for molecular identification is the Folmer region of the mitochondrial gene cytochrome c oxidase subunit I (COI) (locus 48 to 705 bp). This locus has been used for the identification of several animal species, including whales and dolphins. The goal of the present study was to evaluate the usefulness of another region of COI, the E3-I5 (locus 685 to locus 1179; 495 bp) as a marker for identification of cetaceans from northeastern Canada and northeastern Brazil. The identification markers were successfully obtained for seven cetacean species after performing percent identity and Basic Local Alignment Search Tool analyses. The obtained markers are now publicly available and are useful for the identification of the endangered blue whale (Balaenoptera musculus), common minke whale (B. acutorostrata), vulnerable sperm whale (Physeter macrocephalus), harbor porpoise (Phocoena phocoena), common bottlenose dolphin (Tursiops truncatus), Guiana dolphin (Sotalia guianensis), and melon-headed whale (Peponocephala electra).

Subject(s)

Cetacea/genetics , DNA Barcoding, Taxonomic/standards , Electron Transport Complex IV/genetics , Sequence Alignment/standards , Animals , Cetacea/classification , DNA Barcoding, Taxonomic/methods , Endangered Species , Genetic Markers , Reference Standards , Sequence Alignment/methods

20.

A New Reference Genome Assembly for the Microcrustacean Daphnia pulex.

Ye, Zhiqiang; Xu, Sen; Spitze, Ken; Asselman, Jana; Jiang, Xiaoqian; Ackerman, Matthew S; Lopez, Jacqueline; Harker, Brent; Raborn, R Taylor; Thomas, W Kelley; Ramsdell, Jordan; Pfrender, Michael E; Lynch, Michael.

G3 (Bethesda) ; 7(5): 1405-1416, 2017 05 05.

Article in English | MEDLINE | ID: mdl-28235826

ABSTRACT

Comparing genomes of closely related genotypes from populations with distinct demographic histories can help reveal the impact of effective population size on genome evolution. For this purpose, we present a high quality genome assembly of Daphnia pulex (PA42), and compare this with the first sequenced genome of this species (TCO), which was derived from an isolate from a population with >90% reduction in nucleotide diversity. PA42 has numerous similarities to TCO at the gene level, with an average amino acid sequence identity of 98.8 and >60% of orthologous proteins identical. Nonetheless, there is a highly elevated number of genes in the TCO genome annotation, with ~7000 excess genes appearing to be false positives. This view is supported by the high GC content, lack of introns, and short length of these suspicious gene annotations. Consistent with the view that reduced effective population size can facilitate the accumulation of slightly deleterious genomic features, we observe more proliferation of transposable elements (TEs) and a higher frequency of gained introns in the TCO genome.

Subject(s)

Daphnia/genetics , Whole Genome Sequencing/methods , Animals , DNA Transposable Elements , Introns , Molecular Sequence Annotation/methods , Molecular Sequence Annotation/standards , Reference Standards , Sensitivity and Specificity , Sequence Alignment/methods , Sequence Alignment/standards , Whole Genome Sequencing/standards
