Search | VHL Regional Portal (test)

Heterozygous genome assembly via binary classification of homologous sequence.

Bodily, Paul M; Fujimoto, M; Ortega, Cameron; Okuda, Nozomu; Price, Jared C; Clement, Mark J; Snell, Quinn.

BMC Bioinformatics ; 16 Suppl 7: S5, 2015.

Article in English | MEDLINE | ID: mdl-25952609

ABSTRACT

BACKGROUND: Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly, owing to the violation of assumptions during both the contigging and scaffolding phases. Effective tools to overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective solution to obtaining haplotype-specific contigs; however, effective algorithms for scaffolding such contigs are lacking. METHODS: We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences as found in "bubble" structures in scaffold graphs. Machine learning classification is used to then classify sequences in partial bubbles as homologous or non-homologous sequences prior to reconstructing haplotype-specific scaffolds. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups. RESULTS: We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that machine learning classification trained on these homolotig pairs can be used effectively for identifying homologous sequences elsewhere in the data with high precision (assuming error-free reads). CONCLUSION: More work is required to comparatively analyze this approach on real data with various parameters and classifiers against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder supply validity to the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.

Subjects

Contig Mapping/methods; Diploidy; Genome, Human; Heterozygote; Sequence Analysis, DNA/methods; Sequence Homology; Software; Algorithms; Artificial Intelligence; High-Throughput Nucleotide Sequencing; Humans
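The abstract above describes identifying homologous sequences through "bubble" structures in a scaffold graph. The following is a minimal sketch of that idea, not ScaffoldScaffolder's implementation: it assumes a plain directed graph given as a list of contig-to-contig edges, and the function name `find_simple_bubbles` and the contig labels are hypothetical. Real scaffold graphs also carry contig orientation and paired-read link weights, which this sketch ignores.

```python
from collections import defaultdict


def find_simple_bubbles(edges):
    """Return (source, branch_a, branch_b, sink) tuples for simple bubbles.

    A simple bubble here is a source contig with exactly two successors,
    each of which has the source as its only predecessor and the same
    single sink as its only successor.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    bubbles = []
    # sorted() copies the keys, so the defaultdicts may grow safely below.
    for source in sorted(succ):
        branches = sorted(succ[source])
        if len(branches) != 2:
            continue
        a, b = branches
        # Each branch must hang only off the source ...
        if pred[a] != {source} or pred[b] != {source}:
            continue
        # ... and both branches must rejoin at one shared sink.
        if len(succ[a]) != 1 or succ[a] != succ[b]:
            continue
        (sink,) = succ[a]
        bubbles.append((source, a, b, sink))
    return bubbles


# Hypothetical scaffold graph: c1 forks into c2 and c3, which rejoin at c4.
edges = [("c1", "c2"), ("c1", "c3"), ("c2", "c4"), ("c3", "c4"), ("c4", "c5")]
print(find_simple_bubbles(edges))  # [('c1', 'c2', 'c3', 'c4')]
```

In this toy graph, c2 and c3 would be the candidate pair of homologous contigs (the "homolotigs" of the abstract). Partial bubbles, where only one end of the fork is observed, are the case the abstract hands to a classifier, and they are not handled here.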

Effects of error-correction of heterozygous next-generation sequencing data.

Fujimoto, M; Bodily, Paul M; Okuda, Nozomu; Clement, Mark J; Snell, Quinn.

BMC Bioinformatics ; 15 Suppl 7: S3, 2014.

Article in English | MEDLINE | ID: mdl-25077414

ABSTRACT

BACKGROUND: Error correction is an important step in increasing the quality of next-generation sequencing data for downstream analysis and use. Polymorphic datasets are a challenge for many bioinformatic software packages that are designed for or assume homozygosity of an input dataset. This assumption ignores the true genomic composition of many organisms that are diploid or polyploid. In this survey, two different error correction packages, Quake and ECHO, are examined to see how they perform on next-generation sequence data from heterozygous genomes. RESULTS: Quake and ECHO perform well and were able to correct many errors found within the data. However, errors that occur at heterozygous positions had unique trends. Errors at these positions were sometimes corrected incorrectly, introducing errors into the dataset with the possibility of creating a chimeric read. Quake was much less likely to create chimeric reads. Quake's read trimming removed a large portion of the original data and often left reads with few heterozygous markers. ECHO resulted in more chimeric reads and introduced more errors than Quake but preserved heterozygous markers. CONCLUSIONS: These findings suggest that Quake and ECHO both have strengths and weaknesses when applied to heterozygous data. With the increased interest in haplotype specific analysis, new tools that are designed to be haplotype-aware are necessary that do not have the weaknesses of Quake and ECHO.

Subjects

Genomics/methods; Heterozygote; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Diploidy; Genome; Haplotypes; Humans
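The abstract above notes that tools assuming homozygosity can mishandle heterozygous positions. The toy example below illustrates one way this can happen, using a generic k-mer count cutoff. The cutoff rule is an assumption made for illustration and is not a description of how Quake or ECHO work internally; the sequences, function names, and the cutoff value are all hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def untrusted_kmers(counts, cutoff):
    """Return k-mers seen fewer than `cutoff` times (treated as suspect)."""
    return {kmer for kmer, n in counts.items() if n < cutoff}


# Hypothetical error-free reads from two haplotypes that differ at one
# heterozygous site (A versus G at index 7).
hap_a = "ACGTACGATTGCA"
hap_b = "ACGTACGGTTGCA"
reads = [hap_a] * 10 + [hap_b] * 10

counts = kmer_counts(reads, k=5)
# A cutoff chosen for the full depth of 20 sits above each allele's depth of 10.
flagged = untrusted_kmers(counts, cutoff=15)
print(len(flagged))  # 10
```

All ten flagged k-mers overlap the heterozygous site, five per allele, even though no read contains an error. A corrector that rewrote such k-mers toward a higher-count neighbor would be altering true alleles, which is consistent with the kind of miscorrection the abstract reports.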

Mspire-Simulator: LC-MS shotgun proteomic simulator for creating realistic gold standard data.

Noyce, Andrew B; Smith, Rob; Dalgleish, James; Taylor, Ryan M; Erb, K C; Okuda, Nozomu; Prince, John T.

J Proteome Res ; 12(12): 5742-9, 2013 Dec 06.

Article in English | MEDLINE | ID: mdl-24090032

ABSTRACT

The most important step in any quantitative proteomic pipeline is feature detection (aka peak picking). However, generating quality hand-annotated data sets to validate the algorithms, especially for lower abundance peaks, is nearly impossible. An alternative for creating gold standard data is to simulate it with features closely mimicking real data. We present Mspire-Simulator, a free, open-source shotgun proteomic simulator that goes beyond previous simulation attempts by generating LC-MS features with realistic m/z and intensity variance along with other noise components. It also includes machine-learned models for retention time and peak intensity prediction and a genetic algorithm to custom fit model parameters for experimental data sets. We show that these methods are applicable to data from three different mass spectrometers, including two fundamentally different types, and show visually and analytically that simulated peaks are nearly indistinguishable from actual data. Researchers can use simulated data to rigorously test quantitation software, and proteomic researchers may benefit from overlaying simulated data on actual data sets.

Subjects

Chromatography, Liquid/standards; Mass Spectrometry/standards; Models, Statistical; Proteins/analysis; Proteomics/statistics & numerical data; Software; Algorithms; Amino Acid Sequence; Animals; Cattle; Chromatography, Liquid/statistics & numerical data; Computer Simulation; Humans; Mass Spectrometry/statistics & numerical data; Molecular Sequence Data; Proteins/chemistry; Proteomics/methods; Reference Standards
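The abstract above describes simulating LC-MS features with m/z and intensity variance. As a rough sketch of what that can mean, the example below generates one feature as a series of scans. It assumes a Gaussian elution profile and normally distributed noise, which are simplifications chosen here and not the models used by Mspire-Simulator; the function name `simulate_feature` and every parameter and default value are hypothetical.

```python
import math
import random


def simulate_feature(mz, apex_rt, apex_intensity, rt_sigma=5.0,
                     mz_ppm_sd=5.0, intensity_cv=0.1, n_scans=21,
                     scan_interval=1.0, seed=0):
    """Return a list of (retention_time, observed_mz, intensity) points.

    Assumptions of this sketch: the ideal intensity follows a Gaussian in
    retention time, intensity noise is proportional to the ideal signal,
    and m/z jitter is normal with a standard deviation given in ppm.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    points = []
    for i in range(n_scans):
        # Centre the scans on the apex retention time.
        rt = apex_rt + (i - (n_scans - 1) / 2) * scan_interval
        ideal = apex_intensity * math.exp(-0.5 * ((rt - apex_rt) / rt_sigma) ** 2)
        # Clamp at zero because a detector cannot report negative intensity.
        intensity = max(0.0, rng.gauss(ideal, intensity_cv * ideal))
        observed_mz = rng.gauss(mz, mz * mz_ppm_sd * 1e-6)
        points.append((rt, observed_mz, intensity))
    return points


# Hypothetical feature at m/z 500 eluting around 100 s.
feature = simulate_feature(mz=500.0, apex_rt=100.0, apex_intensity=1e6)
for rt, observed_mz, intensity in feature[:3]:
    print(f"{rt:.1f}\t{observed_mz:.4f}\t{intensity:.0f}")
```

Because the true m/z, apex, and intensity are known for every simulated point, output of this kind can serve as ground truth when checking a peak-picking routine, which is the use the abstract proposes for simulated data.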
