Results 1 - 4 of 4
1.
Comput Biol Chem ; 35(3): 199-209, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21704267

ABSTRACT

Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features instead of a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost the performance of taxonomic classifiers. This work proposes three different filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler divergence, mutual information, and distance information, (2) a text mining technique, TF-IDF, and (3) minimum redundancy maximum relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6-mer mRMR method performs well, especially at the phylum level. If the total number of features is very large, feature selection becomes difficult because a small subset of features that captures a majority of the data variance is less likely to exist. Therefore, we conclude that there is a trade-off between feature set size and feature selection method to optimize classification performance. For larger feature set sizes, TF-IDF works better at finer taxonomic resolutions, while mRMR performs best of all methods for N=6 at all taxonomic levels.
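
The k-mer features and the TF-IDF ranking mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy sequences, and the plain TF-IDF weighting (each taxon treated as one "document") are assumptions made for illustration only.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any window with a non-ACGT symbol."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def tfidf_scores(class_seqs, k):
    """Score each k-mer per class with a plain TF-IDF weighting.

    class_seqs maps a (hypothetical) taxon label to one training sequence,
    so each taxon plays the role of a document.
    """
    profiles = {label: kmer_counts(seq, k) for label, seq in class_seqs.items()}
    n_classes = len(profiles)
    # Document frequency: in how many taxa does each k-mer occur at all?
    doc_freq = Counter()
    for counts in profiles.values():
        doc_freq.update(counts.keys())
    scores = {}
    for label, counts in profiles.items():
        total = sum(counts.values())
        scores[label] = {
            kmer: (c / total) * math.log(n_classes / doc_freq[kmer])
            for kmer, c in counts.items()
        }
    return scores


def top_features(scores, n):
    """Return the union of the n highest-scoring k-mers from every class."""
    selected = set()
    for per_class in scores.values():
        ranked = sorted(per_class, key=lambda kmer: (-per_class[kmer], kmer))
        selected.update(ranked[:n])
    return sorted(selected)


# Toy data, invented for this example.
toy = {"taxonA": "AAAAAAACGT", "taxonB": "CCCCCCACGT"}
selected = top_features(tfidf_scores(toy, 2), 1)
```

In this toy case the k-mers shared by both taxa receive a score of zero, so only the taxon-specific k-mers survive selection. The selected subset would then be passed to a classifier such as a support vector machine in place of the full k-mer vector.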


Subjects
Computational Biology , Metagenome/genetics , Algorithms , Bacteria/genetics , Computer Simulation , Data Mining , Databases, Genetic , Sequence Analysis, DNA
2.
Curr Genomics ; 10(7): 493-510, 2009 Nov.
Article in English | MEDLINE | ID: mdl-20436876

ABSTRACT

Traditionally, studies in microbial genomics have focused on single genomes from cultured species, thereby limiting their focus to the small percentage of species that can be cultured outside their natural environment. Fortunately, recent advances in high-throughput sequencing and computational analyses have ushered in the new field of metagenomics, which aims to decode the genomes of microbes from natural communities without the need for cultivation. Although metagenomic studies have yielded a great deal of insight into bacterial diversity and coding capacity, several computational challenges remain due to the massive size and complexity of metagenomic sequence data. This paper reviews current tools and techniques that address challenges in (1) genomic fragment annotation, (2) phylogenetic reconstruction, (3) functional classification of samples, and (4) interpretation of complementary metaproteomics and metametabolomics data. It also surveys important applications of metagenomic studies, including microbial forensics and the roles of microbial communities in shaping human health and soil ecology.

3.
Article in English | MEDLINE | ID: mdl-19163534

ABSTRACT

In recent years, oligo microarrays, more commonly known as DNA chips, have had a major impact on disease diagnosis, drug discovery, and gene identification. Microarrays contain N-mer DNA fragments, or oligos, in a series of 'wells' placed across the chip, where each well contains thousands of copies of the same fragment and acts as a probe that detects the amount of a specific fragment. A recent use of microarrays is the identification of genomes, such as those of pathogens. In current techniques, probes that detect unique gene regions of particular species are selected for placement on the microarray, on the assumption that if one gene unique to a pathogen species can be detected, then the pathogen can be classified. This approach is useful, but the technology relies on finding gene sequences that are divergent enough to serve as a genomic identifier and that are robust to cross-hybridization. In our work, we present a method for choosing the most distinctive probes between two organisms. We accomplish this by choosing the oligo probes that maximize the level of divergence between the genomes, calculated by three different information-theoretic measures. We show results for 12-mer and 25-mer oligo pathogen probe sets and show that our method chooses probes that are less likely to cross-hybridize.
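
The idea of ranking oligos by how much they contribute to the divergence between two genomes can be sketched as follows. The abstract does not name its three measures, so the use of Kullback-Leibler divergence here is an assumption, as are the pseudocount smoothing, the function names, and the toy sequences. The sketch enumerates all 4^k possible k-mers, which is practical only for small k and not for the 12-mer and 25-mer probes discussed above.

```python
import math
from collections import Counter
from itertools import product


def kmer_distribution(seq, k, pseudocount=1.0):
    """Smoothed probability of every possible k-mer over the ACGT alphabet."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in alphabet) + pseudocount * len(alphabet)
    return {m: (counts[m] + pseudocount) / total for m in alphabet}


def kl_contributions(p, q):
    """Per-k-mer terms p(x) * log2(p(x) / q(x)); their sum is D_KL(p || q)."""
    return {m: p[m] * math.log2(p[m] / q[m]) for m in p}


def rank_probes(target, background, k, n):
    """Return the n k-mers that contribute most to D_KL(target || background)."""
    p = kmer_distribution(target, k)
    q = kmer_distribution(background, k)
    terms = kl_contributions(p, q)
    return sorted(terms, key=lambda m: (-terms[m], m))[:n]


# Toy sequences, invented for this example.
best = rank_probes("ACACACACAC", "GTGTGTGTGT", 2, 2)
```

A k-mer scores highly when it is frequent in the target genome and rare in the background genome, which is the property a probe needs in order to avoid cross-hybridization with the other organism.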


Subjects
Computational Biology/methods , Genomics , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Computational Biology/statistics & numerical data , DNA/chemistry , DNA, Complementary/genetics , Expressed Sequence Tags , Genes, Bacterial , Genetic Variation/genetics , Genome , Genome, Bacterial , Models, Statistical , Models, Theoretical , Nucleic Acid Hybridization , Oligonucleotide Array Sequence Analysis/instrumentation , Oligonucleotides/chemistry
4.
Adv Bioinformatics ; 2008: 205969, 2008.
Article in English | MEDLINE | ID: mdl-19956701

ABSTRACT

A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the massive amounts of metagenomic data, that is, all DNA present in an environmental sample. A major obstacle in metagenomics is the difficulty of obtaining accurate results from technology that yields short reads. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the source genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach scales to identifying any fragment from hundreds of genomes. It also performs quite well at the strain, species, and genus levels and achieves strain resolution despite classifying ubiquitous genomic fragments (gene and nongene regions). Cross-validation analysis demonstrates that species-level accuracy reaches 90% for highly represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
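
A naive Bayes classifier over N-mer frequency profiles, as described in this abstract, can be sketched in a few lines. This is an illustrative toy and not the published NBC tool: the function names, the add-one pseudocount, the uniform prior over genomes, and the toy genomes are all assumptions, and sequences are assumed to contain only A, C, G, and T.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count overlapping N-mers in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train(genomes, n, pseudocount=1.0):
    """Build a smoothed log-probability N-mer profile for each genome label."""
    vocab_size = 4 ** n  # assumes a pure ACGT alphabet
    model = {}
    for label, seq in genomes.items():
        counts = nmer_counts(seq, n)
        total = sum(counts.values()) + pseudocount * vocab_size
        model[label] = {
            "log_probs": {
                m: math.log((c + pseudocount) / total) for m, c in counts.items()
            },
            # Log-probability assigned to any N-mer never seen in this genome.
            "log_unseen": math.log(pseudocount / total),
        }
    return model


def classify(model, fragment, n):
    """Return the label with the highest summed log-likelihood (uniform prior)."""
    fragment_counts = nmer_counts(fragment, n)
    scores = {}
    for label, profile in model.items():
        scores[label] = sum(
            profile["log_probs"].get(m, profile["log_unseen"]) * c
            for m, c in fragment_counts.items()
        )
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


# Toy genomes, invented for this example.
genomes = {"genomeA": "ACACACACACAC", "genomeB": "GTGTGTGTGTGT"}
model = train(genomes, 2)
label = classify(model, "CACAC", 2)
```

Summing log-probabilities rather than multiplying probabilities avoids numerical underflow, which matters once fragments and N-mer vocabularies grow beyond toy sizes.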
