Results 1 - 18 of 18
1.
Microbiome Res Rep ; 3(2): 25, 2024.
Article in English | MEDLINE | ID: mdl-38841411

ABSTRACT

Objectives: This study introduces MetaBIDx, a computational method designed to enhance species prediction in metagenomic environments. The method addresses the challenge of accurate species identification in complex microbiomes, which is due to the large number of generated reads and the ever-expanding number of bacterial genomes. Bacterial identification is essential for disease diagnosis and tracing outbreaks associated with microbial infections. Methods: MetaBIDx utilizes a modified Bloom filter for efficient indexing of reference genomes and incorporates a novel strategy for reducing false positives by clustering species based on their genomic coverages by identified reads. The approach was evaluated and compared with several well-established tools across various datasets. Precision, recall, and F1-score were used to quantify the accuracy of species prediction. Results: MetaBIDx demonstrated superior performance compared to other tools, especially in terms of precision and F1-score. The application of clustering based on approximate coverages significantly improved precision in species identification, effectively minimizing false positives. We further demonstrated that other methods can also benefit from our approach to removing false positives by clustering species based on approximate coverages. Conclusion: With a novel approach to reducing false positives and the effective use of a modified Bloom filter to index species, MetaBIDx represents an advancement in metagenomic analysis. The findings suggest that the proposed approach could also benefit other metagenomic tools, indicating its potential for broader application in the field. The study lays the groundwork for future improvements in computational efficiency and the expansion of microbial databases.
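The false-positive reduction step described above can be pictured with a small sketch. The abstract does not say which clustering algorithm MetaBIDx uses, so the one-dimensional two-means split below is an assumption made purely for illustration, and the function name `keep_high_coverage` and the coverage values are hypothetical.

```python
def keep_high_coverage(coverage, max_iterations=50):
    """Split species into two coverage clusters and keep the higher one.

    `coverage` maps a species name to its approximate genomic coverage.
    Illustrative 1-D two-means; not the clustering used by MetaBIDx.
    """
    if not coverage:
        return set()
    low, high = min(coverage.values()), max(coverage.values())
    if low == high:
        # Only one coverage level: nothing to separate, keep everything.
        return set(coverage)
    for _ in range(max_iterations):
        # Assign every species to the nearer of the two cluster centres.
        high_group = [v for v in coverage.values() if abs(v - high) < abs(v - low)]
        low_group = [v for v in coverage.values() if abs(v - high) >= abs(v - low)]
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if (new_low, new_high) == (low, high):
            break  # centres stopped moving
        low, high = new_low, new_high
    return {s for s, v in coverage.items() if abs(v - high) < abs(v - low)}


# Hypothetical coverages: C and D are barely covered by reads.
predicted = {"A": 0.92, "B": 0.85, "C": 0.03, "D": 0.05, "E": 0.78}
print(sorted(keep_high_coverage(predicted)))
```

Species that land in the low-coverage cluster are treated as likely false positives and dropped; any real pipeline would need to decide how to handle samples in which all species are genuinely present at similar coverage.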

2.
Front Big Data ; 5: 1018356, 2022.
Article in English | MEDLINE | ID: mdl-36466712

ABSTRACT

Classifying or identifying bacteria in metagenomic samples is an important problem in the analysis of metagenomic data. This task can be computationally expensive since microbial communities usually consist of hundreds to thousands of environmental microbial species. We proposed a new method for representing bacteria in a microbial community using genomic signatures of those bacteria. With respect to the microbial community, the genomic signatures of each bacterium are unique to that bacterium; they do not exist in other bacteria in the community. Further, since the genomic signatures of a bacterium are much smaller than its genome size, the approach allows for a compressed representation of the microbial community. This approach uses a modified Bloom filter to store short k-mers with hash values that are unique to each bacterium. We show that most bacteria in many microbiomes can be represented uniquely using the proposed genomic signatures. This approach paves the way toward new methods for classifying bacteria in metagenomic samples.
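The idea of a genomic signature can be sketched as follows. This is a minimal illustration that uses exact Python sets in place of the modified Bloom filter named in the abstract, and it compares k-mers directly rather than their hash values; the function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of all substrings of length k in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def genomic_signatures(genomes, k):
    """Map each genome name to the k-mers found in that genome only."""
    owners = defaultdict(set)  # k-mer -> names of genomes containing it
    for name, sequence in genomes.items():
        for kmer in kmers(sequence, k):
            owners[kmer].add(name)
    signatures = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:  # unique within this community
            signatures[next(iter(names))].add(kmer)
    return signatures


community = {"g1": "ACGTAC", "g2": "CGTACC"}
print(genomic_signatures(community, k=3))
```

Because a signature is defined relative to the community, adding a genome can shrink the signatures of the others; a Bloom filter trades the exactness of these sets for a much smaller memory footprint.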

3.
Genes (Basel) ; 11(8), 2020 08 17.
Article in English | MEDLINE | ID: mdl-32824429

ABSTRACT

Most current approaches to metagenomic classification employ short next-generation sequencing (NGS) reads that are present in metagenomic samples to identify unique genomic regions. NGS reads, however, might not be long enough to differentiate similar genomes. This suggests a potential for using longer reads to improve classification performance. Presently, longer reads tend to have a higher rate of sequencing errors. Thus, given the pros and cons, it remains unclear which type of read is better for metagenomic classification. We compared two taxonomic classification protocols: a traditional assembly-free protocol and a novel assembly-based protocol. The novel assembly-based protocol consists of assembling short reads into longer reads, which are subsequently classified by a traditional taxonomic classifier. We discovered that most classifiers made fewer predictions with longer reads and that they achieved higher classification performance on synthetic metagenomic data. Generally, we observed a significant increase in precision with similar recall rates. On real data, we observed similar characteristics, which suggest that the classifiers might likewise achieve higher precision with similar recall when given longer reads. We have shown a noticeable difference in performance between assembly-based and assembly-free taxonomic classification. This finding strongly suggests that classifying species in metagenomic environments can be achieved with higher overall performance simply by assembling short reads. Further, it also suggests that long-read technologies might be better for species classification.


Subject(s)
DNA Barcoding, Taxonomic , Metagenome , Metagenomics , Computational Biology , DNA Barcoding, Taxonomic/methods , Metagenomics/methods , Reproducibility of Results , Workflow
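The observation in this entry, that fewer predictions can raise precision while recall stays the same, follows directly from how the two metrics are defined. The sketch below uses the standard set-based definitions; the species labels are made up for illustration and are not data from the study.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for a set of predicted species."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


truth = {"a", "b", "e"}
many_predictions = {"a", "b", "c", "d"}  # two correct, two spurious
fewer_predictions = {"a", "b"}           # same correct calls, no spurious ones

print(precision_recall(many_predictions, truth))
print(precision_recall(fewer_predictions, truth))
```

Both prediction sets recover the same two true species, so recall is unchanged, while removing the spurious calls raises precision.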
4.
Bioinformatics ; 35(21): 4411-4412, 2019 11 01.
Article in English | MEDLINE | ID: mdl-31038667

ABSTRACT

SUMMARY: Although heteroplasmy has been studied extensively in animal systems, there is a lack of tools for analyzing, exploring and visualizing heteroplasmy at the genome-wide level in other taxonomic systems. We introduce icHET, which is a computational workflow that produces an interactive visualization that facilitates the exploration, analysis and discovery of heteroplasmy across multiple genomic samples. icHET works on short reads from multiple samples from any organism with an organellar reference genome (mitochondrial or plastid) and a nuclear reference genome. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/vtphan/HeteroplasmyWorkflow. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Software , Animals , Genome , Workflow
5.
Bioinformatics ; 34(17): 2918-2926, 2018 09 01.
Article in English | MEDLINE | ID: mdl-29590294

ABSTRACT

Motivation: The detection of genomic variants has great significance in genomics, bioinformatics, biomedical research and its applications. However, despite a lot of effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to be able to detect such types of variants accurately. However, Indels, especially those close to each other, are still hard to detect accurately. Results: We introduce a novel approach that leverages known variant information, e.g. provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve sensitivity of detecting variants, especially close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm, which can take into account known variant information, is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low coverage data. We showed that compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method was able to call these close-by Indels at a 15-20% higher sensitivity than other methods at low coverage, and still get 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and ExAC database. Our findings suggest that by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost.
Availability and implementation: Implementation can be found in our public code repository https://github.com/namsyvo/IVC. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , INDEL Mutation , Algorithms , Genome, Human , Genomics/methods , Humans
6.
J Bioinform Comput Biol ; 15(3): 1740001, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28345370

ABSTRACT

Determining abundances of microbial genomes in metagenomic samples is an important problem in analyzing metagenomic data. Although homology-based methods are popular, they have been shown to be computationally expensive due to the alignment of tens of millions of reads from metagenomic samples to reference genomes of hundreds to thousands of environmental microbial species. We introduce an efficient alignment-free approach to estimate abundances of microbial genomes in metagenomic samples. The approach is based on solving linear and quadratic programs, which are represented by genome-specific markers (GSM). We compared our method against popular alignment-free and homology-based methods. Without contamination, our method was more accurate than other alignment-free methods while being much faster than a homology-based method. In more realistic settings where samples were contaminated with human DNA, our method was the most accurate method in predicting abundance at varying levels of contamination. We achieve higher accuracy than both alignment-free and homology-based methods.


Subject(s)
Metagenomics/methods , Microbial Consortia/genetics , Sequence Analysis, DNA/methods , Databases, Genetic , Genetic Markers , Genome
7.
BMC Bioinformatics ; 18(Suppl 14): 499, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297282

ABSTRACT

BACKGROUND: Quantification and identification of microbial genomes based on next-generation sequencing data is a challenging problem in metagenomics. Although current methods have mostly focused on analyzing bacteria whose genomes have been sequenced, such analyses are complicated by the presence of unknown bacteria or bacteria whose genomes have not been sequenced. RESULTS: We propose a method for detecting unknown bacteria in environmental samples. Our approach is unique in its utilization of short reads only from 16S rRNA genes, not from entire genomes. We show that short reads from 16S rRNA genes retain sufficient information for detecting unknown bacteria in oral microbial communities. CONCLUSION: In our experimentation with bacterial genomes from the Human Oral Microbiome Database, we found that this method made accurate and robust predictions at different read coverages and percentages of unknown bacteria. Advantages of this approach include not only a reduction in experimental and computational costs but also a potentially high accuracy across environmental samples due to the strong conservation of the 16S rRNA gene.


Subject(s)
Bacteria/genetics , Bacteria/isolation & purification , Microbiota/genetics , RNA, Ribosomal, 16S/genetics , Algorithms , Genetic Markers , Genome, Bacterial , High-Throughput Nucleotide Sequencing , Humans , Metagenome , Sequence Analysis, DNA/methods
8.
BMC Bioinformatics ; 17(Suppl 13): 349, 2016 Oct 06.
Article in English | MEDLINE | ID: mdl-27766935

ABSTRACT

Efforts such as the International HapMap Project and the 1000 Genomes Project resulted in a catalog of millions of single-nucleotide and insertion/deletion (INDEL) variants of the human population. Viewed as a reference of existing variants, this resource commonly serves as a gold standard for studying and developing methods to detect genetic variants. Our analysis revealed that this reference contained thousands of INDELs that were constructed in a biased manner. This bias occurred at the level of aligning short reads to reference genomes to detect variants. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that these aligners could be divided into groups whose alignments yielded INDELs that agreed strongly or disagreed strongly with reported INDELs. This finding suggests that the agreement or disagreement between the aligners' called INDEL and the reported INDEL is merely a result of the arbitrary selection of one of the optimal alignments. The existence of bias in INDEL calling might have a serious influence on downstream analyses. As such, our finding suggests that this phenomenon should be further addressed.


Subject(s)
Genome, Human , INDEL Mutation , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Alleles , Data Accuracy , Genomics/methods , Humans , Polymorphism, Genetic
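The source of the bias described in this entry, several equally optimal alignments for one INDEL, is easy to reproduce on a toy example. The sketch below is not the authors' analysis; it simply enumerates every position at which a deletion of a given length turns a reference into the observed read. The function name and sequences are hypothetical.

```python
def equivalent_deletions(reference, read, length):
    """Return every start index at which deleting `length` bases
    from `reference` yields exactly `read`."""
    return [
        start
        for start in range(len(reference) - length + 1)
        if reference[:start] + reference[start + length:] == read
    ]


# A one-base deletion inside a run of three A's.
reference = "GCAAAT"
read = "GCAAT"
print(equivalent_deletions(reference, read, length=1))
```

All reported positions describe the same read with the same alignment score, so an aligner that always takes the leftmost placement and one that always takes the rightmost will report different INDEL coordinates for the same underlying variant.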
9.
BMC Bioinformatics ; 16 Suppl 17: S3, 2015.
Article in English | MEDLINE | ID: mdl-26678826

ABSTRACT

BACKGROUND: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship. RESULTS: We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation to accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly to alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes. CONCLUSIONS: We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes.


Subject(s)
Genome , Sequence Alignment/methods , Animals , Base Sequence , Humans , Sequence Analysis, DNA , Software , Time Factors
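The length-sensitive measure highlighted in this entry, the rate of distinct substrings of length k, can be sketched in a few lines. The normalisation by the total number of length-k windows is an assumption about how the rate is defined, and this set-based version is not the linear-time computation the abstract refers to; it is only meant to make the quantity concrete.

```python
def distinct_kmer_rate(genome, k):
    """Fraction of length-k windows in `genome` that are distinct substrings."""
    windows = len(genome) - k + 1
    if k <= 0 or windows <= 0:
        raise ValueError("k must be between 1 and len(genome)")
    distinct = {genome[i:i + k] for i in range(windows)}
    return len(distinct) / windows


print(distinct_kmer_rate("ACACACAC", 2))  # highly repetitive sequence
print(distinct_kmer_rate("ACGTTGCA", 2))  # every 2-mer occurs once
```

A rate near 1 means almost every read-length window is unique, so reads can be placed unambiguously; a low rate signals repeats that make alignment harder.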
10.
BMC Bioinformatics ; 15 Suppl 11: S2, 2014.
Article in English | MEDLINE | ID: mdl-25350806

ABSTRACT

BACKGROUND: The analysis of gene expression has played an important role in medical and bioinformatics research. Although it is known that a large number of samples is needed to determine the patterns of gene expression accurately, practical designs of gene expression studies occasionally have insufficient numbers of samples, making it difficult to ascertain true response patterns of variantly expressed genes. RESULTS: We describe an approach to cope with the challenge of predicting true orders of gene response to treatments. We show that true patterns of gene response must be orderable sets. In experiments with few samples, we modify the conventional pairwise comparison tests and increase the significance level α intelligently to deduce orderable patterns, which are most likely true orders of gene response. Additionally, motivated by the fact that a gene can be involved in multiple biological functions, our method further resamples experimental replicates and predicts multiple response patterns for each gene. CONCLUSIONS: This method can be useful in designing cost-effective experiments with small sample sizes. Patterns of highly-variantly expressed genes can be predicted by varying α intelligently. Furthermore, clusters are labeled meaningfully with patterns that describe precisely how genes in such clusters respond to treatments.


Subject(s)
Gene Expression Profiling/methods , Animals , Cluster Analysis , Gene Regulatory Networks , Rats, Sprague-Dawley , Sample Size , Transcription Factors/metabolism
11.
BMC Genomics ; 15 Suppl 5: S2, 2014.
Article in English | MEDLINE | ID: mdl-25081493

ABSTRACT

BACKGROUND: The alignment of short reads generated by next-generation sequencers to genomes is an important problem in many biomedical and bioinformatics applications. Although many proposed methods work very well on narrow ranges of read lengths, they tend to suffer in performance and alignment quality for reads outside of these ranges. RESULTS: We introduce RandAL, a novel method that aligns DNA sequences to reference genomes. Our approach utilizes two FM indices to facilitate efficient bidirectional searching, a pruning heuristic to speed up the computing of edit distances, and most importantly, a randomized strategy that enables effective estimation of key parameters. Extensive comparisons showed that RandAL outperformed popular aligners in most instances and was unique in its consistent and accurate performance over a wide range of read lengths and error rates. The software package is publicly available at https://github.com/namsyvo/RandAL. CONCLUSIONS: RandAL promises to align effectively and accurately short reads that come from a variety of technologies with different read lengths and rates of sequencing error.


Subject(s)
Algorithms , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Computational Biology , Genome , High-Throughput Nucleotide Sequencing/methods , Software
12.
BMC Bioinformatics ; 12 Suppl 10: S19, 2011 Oct 18.
Article in English | MEDLINE | ID: mdl-22165960

ABSTRACT

BACKGROUND: Identification of transcription factors (TFs) responsible for modulation of differentially expressed genes is a key step in deducing gene regulatory pathways. Most current methods identify TFs by searching for presence of DNA binding motifs in the promoter regions of co-regulated genes. However, this strategy may not always be useful as presence of a motif does not necessarily imply a regulatory role. Conversely, motif presence may not be required for a TF to regulate a set of genes. Therefore, it is imperative to include functional (biochemical and molecular) associations, such as those found in the biomedical literature, into algorithms for identification of putative regulatory TFs that might be explicitly or implicitly linked to the genes under investigation. RESULTS: In this study, we present a Latent Semantic Indexing (LSI) based text mining approach for identification and ranking of putative regulatory TFs from microarray derived differentially expressed genes (DEGs). Two LSI models were built using different term weighting schemes to devise pair-wise similarities between 21,027 mouse genes annotated in the Entrez Gene repository. Amongst these genes, 433 were designated TFs in the TRANSFAC database. The LSI derived TF-to-gene similarities were used to calculate TF literature enrichment p-values and rank the TFs for a given set of genes. We evaluated our approach using five different publicly available microarray datasets focusing on TFs Rel, Stat6, Ddit3, Stat5 and Nfic. In addition, for each of the datasets, we constructed gold standard TFs known to be functionally relevant to the study in question. Receiver Operating Characteristics (ROC) curves showed that the log-entropy LSI model outperformed the tf-normal LSI model and a benchmark co-occurrence based method for four out of five datasets, as well as motif searching approaches, in identifying putative TFs. 
CONCLUSIONS: Our results suggest that our LSI based text mining approach can complement existing approaches used in systems biology research to decipher gene regulatory networks by providing putative lists of ranked TFs that might be explicitly or implicitly associated with sets of DEGs derived from microarray experiments. In addition, unlike motif searching approaches, LSI based approaches can reveal TFs that may indirectly regulate genes.


Subject(s)
Algorithms , Data Mining/methods , Gene Regulatory Networks , Oligonucleotide Array Sequence Analysis , Transcription Factors/isolation & purification , Amino Acid Motifs , Animals , Humans , Mice , PubMed , Systems Biology , Transcription Factors/chemistry , Transcription Factors/genetics , Transcription Factors/metabolism
13.
Int J Data Min Bioinform ; 4(4): 377-94, 2010.
Article in English | MEDLINE | ID: mdl-20815138

ABSTRACT

Hidden stops are the nucleotide triples TAA, TAG and TGA that appear in the second and third reading frames of a protein-coding gene. Recent studies suggested an important role for hidden stops in preventing misreading of mRNA. We study the problem of designing protein-encoding genes with a large number of hidden stops under several biological constraints. With simple constraints, redesigned genes have a provably maximal number of hidden stops. With more complex constraints, redesigned genes still have many more hidden stops than wild-type genes. We showed that redesigned genes have a distinct positional advantage in assisting early termination of frame-shifts.


Subject(s)
Genes, Synthetic , Base Sequence , Codon, Terminator , Open Reading Frames , RNA, Messenger/metabolism
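The definition of a hidden stop in this entry translates directly into a counting routine. The sketch below assumes the gene is given as a DNA string whose coding frame starts at index 0, so the second and third reading frames start at indices 1 and 2; the function name and example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_hidden_stops(gene):
    """Count stop codons in the two off-frame reading frames of `gene`."""
    count = 0
    for offset in (1, 2):  # second and third reading frames
        for start in range(offset, len(gene) - 2, 3):
            if gene[start:start + 3] in STOP_CODONS:
                count += 1
    return count


# In frame: ATG CTG ATA GCC (no stop codon).
# Frame +1 reads TGC TGA TAG, which contains two hidden stops.
print(count_hidden_stops("ATGCTGATAGCC"))
```

A design procedure of the kind the abstract describes would search over synonymous codon choices to raise this count while leaving the encoded protein unchanged.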
14.
Int J Bioinform Res Appl ; 6(1): 21-36, 2010.
Article in English | MEDLINE | ID: mdl-20110207

ABSTRACT

We propose a novel method to estimate editing efficiency by adenosine deaminases that act on RNA (ADARs). The method employs the notion of stability of secondary structure in the vicinity of edited sites during transcription. Such an analysis of 'dynamic' structural motifs of RNA is important because as a pre-spliced RNA is being transcribed and elongated, its entire structure, and thus its local structures, may change drastically. Our simulation showed that the stability of structures in the vicinity of edited sites correlates moderately highly with editing efficiency of edited sites recently established in laboratory experiments.


Subject(s)
Adenosine Deaminase/chemistry , RNA Editing , RNA/chemistry , Adenosine , Adenosine Deaminase/metabolism , Base Sequence , Computer Simulation , Molecular Sequence Data , Nucleic Acid Conformation
15.
J Bioinform Comput Biol ; 7(1): 135-56, 2009 Feb.
Article in English | MEDLINE | ID: mdl-19226664

ABSTRACT

Post hoc assignment of patterns determined by all pairwise comparisons in microarray experiments with multiple treatments has been proven to be useful in assessing treatment effects. We propose the usage of transitive directed acyclic graphs (tDAG) as the representation of these patterns and show that such representation can be useful in clustering treatment effects, annotating existing clustering methods, and analyzing sample sizes. Advantages of this approach include: (1) unique and descriptive meaning of each cluster in terms of how genes respond to all pairs of treatments; (2) insensitivity of the observed patterns to the number of genes analyzed; and (3) a combinatorial perspective to address the sample size problem by observing the rate of contractible tDAG as the number of replicates increases. The advantages and overall utility of the method in elaborating drug structure activity relationships are exemplified in a controlled study with real and simulated data.


Subject(s)
Algorithms , Artificial Intelligence , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Data Display
16.
Carcinogenesis ; 30(3): 480-6, 2009 Mar.
Article in English | MEDLINE | ID: mdl-19126641

ABSTRACT

3H-1,2-dithiole-3-thione (D3T) and its analogues 4-methyl-5-pyrazinyl-3H-1,2-dithiole-3-thione (OLT) and 5-tert-butyl-3H-1,2-dithiole-3-thione (TBD) are chemopreventive agents that block or diminish early stages of carcinogenesis by inducing activities of detoxication enzymes. While OLT has been used in clinical trials, TBD has been shown to be more efficacious and possibly less toxic than OLT in animals. Here, we utilize a robust and high-resolution chemical genomics procedure to examine the pharmacological structure-activity relationships of these compounds in livers of male rats by microarray analyses. We identified 226 differentially expressed genes that were common to all treatments. Functional analysis identified the relation of these genes to glutathione metabolism and the nuclear factor, erythroid derived 2-related factor 2 pathway (Nrf2) that is known to regulate many of the protective actions of dithiolethiones. OLT and TBD were shown to have similar efficacies and both were weaker than D3T. In addition, we identified 40 genes whose responses were common to OLT and TBD, yet distinct from D3T. As inhibition of cytochrome P450 (CYP) has been associated with the effects of OLT on CYP expression, we determined the half maximal inhibitory concentration (IC50) values for inhibition of CYP1A2. The rank order of inhibitor potency was OLT >> TBD >> D3T, with IC50 values estimated as 0.2, 12.8 and >100 µM, respectively. Functional analysis revealed that OLT and TBD, in addition to their effects on CYP, modulate liver lipid metabolism, especially fatty acids. Together, these findings provide new insight into the actions of clinically relevant and lead dithiolethione analogues.


Subject(s)
Anticarcinogenic Agents , Gene Expression Profiling , Heterocyclic Compounds, 1-Ring , Thiones , Thiophenes , Animals , Male , Rats , Anticarcinogenic Agents/pharmacology , Cytochrome P-450 CYP1A2/metabolism , Genomics , Glutathione/metabolism , Heterocyclic Compounds, 1-Ring/pharmacology , Liver/drug effects , Liver/metabolism , Multigene Family , Oligonucleotide Array Sequence Analysis , Pyrazines , Rats, Inbred F344 , Structure-Activity Relationship , Thiones/pharmacology , Thiophenes/pharmacology , NF-E2-Related Factor 2/metabolism
17.
Bioinformatics ; 24(24): 2930-1, 2008 Dec 15.
Article in English | MEDLINE | ID: mdl-19017656

ABSTRACT

MOTIVATION: Motif Tool Manager is a web-based framework for comparing and combining different approaches to discover novel DNA motifs. It comes with a set of five well-known approaches to motif discovery. It provides an easy mechanism for adding new motif finding tools to the framework through a web-interface and a minimal setup of the tools on the server. Users can execute the tools through the web-based framework and compare results from such executions. The framework provides a basic mechanism for identifying the most similar motif candidates found by a majority of the motif finding tools. AVAILABILITY: http://cetus.cs.memphis.edu/motif


Subject(s)
Sequence Analysis, DNA/methods , Software , Algorithms , DNA/chemistry , Internet
18.
Int J Comput Biol Drug Des ; 1(2): 174-84, 2008.
Article in English | MEDLINE | ID: mdl-20058488

ABSTRACT

Proper management of bioinformatics data and tools is crucial because the amount of data is enormous, the type of data varies, and there are often different approaches (and consequently tools) for solving a particular problem. While specialised systems exist to serve specific needs, such systems are difficult to adapt and require large resource commitments for development and maintenance. We propose a system called Bioinformatics Tools and Data Management System (BioTDMS) that uses open-source technologies to provide a platform for managing both data and tools. We present case studies that show some potential applications of this system.


Subject(s)
Algorithms , Computational Biology/methods , Research Design , Information Management , Internet , Software