1.
BMC Bioinformatics ; 19(Suppl 13): 377, 2019 Feb 04.
Article in English | MEDLINE | ID: mdl-30717665

ABSTRACT

BACKGROUND: Estimating the parameters that describe the ecology of viruses, particularly those that are novel, can be made possible using metagenomic approaches. However, the best-performing existing methods require databases to first estimate an average genome length of a viral community before being able to estimate other parameters, such as viral richness. Although this approach has been widely used, it can adversely skew results since the majority of viruses are yet to be catalogued in databases. RESULTS: In this paper, we present ENVirT, a method for estimating the richness of novel viral mixtures, and for the first time we also show that it is possible to simultaneously estimate the average genome length without a priori information. This is shown to be a significant improvement over database-dependent methods, since we can now robustly analyze samples that may include novel viral types under-represented in current databases. We demonstrate that the viral richness estimates produced by ENVirT are several orders of magnitude higher in accuracy than the estimates produced by existing methods named PHACCS and CatchAll when benchmarked against simulated data. We repeated the analysis of 20 metavirome samples using ENVirT, which produced results in close agreement with complementary in vitro analyses. CONCLUSIONS: These insights were previously not captured by existing computational methods. As such, ENVirT is shown to be an essential tool for enhancing our understanding of novel viral populations.


Subject(s)
Algorithms , Ecological and Environmental Phenomena , Metagenomics , Computer Simulation , Fermented Foods , Gastrointestinal Microbiome , Genome, Viral , Humans , Lakes/virology , Time Factors , Viruses/genetics
2.
BMC Genomics ; 16 Suppl 12: S12, 2015.
Article in English | MEDLINE | ID: mdl-26680279

ABSTRACT

BACKGROUND: Mass Spectrometry (MS) is a ubiquitous analytical tool in biological research and is used to measure the mass-to-charge ratio of bio-molecules. Peak detection is the essential first step in MS data analysis. Precise estimation of peak parameters such as peak summit location and peak area is critical to identify underlying bio-molecules and to estimate their abundances accurately. We propose a new method to detect and quantify peaks in mass spectra. It uses dual-tree complex wavelet transformation along with Stein's unbiased risk estimator for spectra smoothing. Then, a new method, based on the modified Asymmetric Pseudo-Voigt (mAPV) model and hierarchical particle swarm optimization, is used for peak parameter estimation. RESULTS: Using simulated data, we demonstrated the benefit of using the mAPV model over Gaussian, Lorentz and Bi-Gaussian functions for MS peak modelling. The proposed mAPV model achieved the best fitting accuracy for asymmetric peaks, with lower percentage errors in peak summit location estimation, which were 0.17% to 4.46% less than that of the other models. It also outperformed the other models in peak area estimation, delivering lower percentage errors, which were about 0.7% less than its closest competitor, the Bi-Gaussian model. In addition, using data generated from a MALDI-TOF computer model, we showed that the proposed overall algorithm outperformed the existing methods mainly in terms of sensitivity. It achieved a sensitivity of 85%, compared to 77% and 71% for the two benchmark algorithms, a continuous wavelet transformation based method and Cromwell, respectively. CONCLUSIONS: The proposed algorithm is particularly useful for peak detection and parameter estimation in MS data with overlapping peak distributions and asymmetric peaks. The algorithm is implemented using MATLAB and the source code is freely available at http://mapv.sourceforge.net.
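As a rough illustration of the kind of peak model this abstract compares, the sketch below evaluates a pseudo-Voigt profile (a weighted sum of a Gaussian and a Lorentzian sharing one width) and makes it asymmetric by using a different width on each side of the summit. This is not the authors' mAPV formulation, which the abstract does not define; the function name, its parameters, and the split-width form of asymmetry are all assumptions made for illustration.

```python
import math


def asymmetric_pseudo_voigt(x, summit, height, fwhm_left, fwhm_right, eta):
    """Evaluate a split-width pseudo-Voigt peak at position x.

    eta is the Lorentzian fraction (0 = pure Gaussian, 1 = pure Lorentzian).
    fwhm_left / fwhm_right are the full widths at half maximum used to the
    left and right of the summit (a hypothetical way to model asymmetry).
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    fwhm = fwhm_left if x < summit else fwhm_right
    u = (x - summit) / fwhm
    # Both components are scaled to equal 1 at the summit and 0.5 at |u| = 0.5.
    gaussian = math.exp(-4.0 * math.log(2.0) * u * u)
    lorentzian = 1.0 / (1.0 + 4.0 * u * u)
    return height * (eta * lorentzian + (1.0 - eta) * gaussian)
```

With this parameterisation the profile drops to half its height at `summit - fwhm_left / 2` and at `summit + fwhm_right / 2`, whatever the value of `eta`, which makes the two widths easy to interpret when fitting.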


Subject(s)
Computational Biology/methods , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods , Algorithms , Computer Simulation
3.
BMC Bioinformatics ; 16 Suppl 18: S3, 2015.
Article in English | MEDLINE | ID: mdl-26678073

ABSTRACT

BACKGROUND: Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation ride on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations where the strains are highly genetically related. RESULTS: On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2%-13% and F-score by 1%-9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0%-7% and 0%-5% respectively. CONCLUSIONS: The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors. AVAILABILITY: http://sourceforge.net/projects/viquas/.


Subject(s)
Metagenomics , Algorithms , Benchmarking , High-Throughput Nucleotide Sequencing , Internet , User-Computer Interface
4.
Bioinformatics ; 31(19): 3198-206, 2015 Oct 01.
Article in English | MEDLINE | ID: mdl-26063840

ABSTRACT

MOTIVATION: Matrix Assisted Laser Desorption Ionization-Imaging Mass Spectrometry (MALDI-IMS) in 'omics' data acquisition generates detailed information about the spatial distribution of molecules in a given biological sample. Various data processing methods have been developed for exploring the resultant high volume data. However, most of these methods process data in the spectral domain and do not make the most of the important spatial information available through this technology. Therefore, we propose a novel streamlined data analysis pipeline specifically developed for MALDI-IMS data utilizing significant spatial information for identifying hidden significant molecular distribution patterns in these complex datasets. METHODS: The proposed unsupervised algorithm uses Sliding Window Normalization (SWN) and a new spatial distribution based peak picking method developed based on Gray level Co-Occurrence (GCO) matrices followed by clustering of biomolecules. We also use gist descriptors and an improved version of GCO matrices to extract features from molecular images and minimum medoid distance to automatically estimate the number of possible groups. RESULTS: We evaluated our algorithm using a new MALDI-IMS metabolomics dataset of a plant (Eucalypt) leaf. The algorithm revealed hidden significant molecular distribution patterns in the dataset, which the current Component Analysis and Segmentation Map based approaches failed to extract. We further demonstrate the performance of our peak picking method over other traditional approaches by using a publicly available MALDI-IMS proteomics dataset of a rat brain. Although SWN did not show any significant improvement as compared with using no normalization, the visual assessment showed an improvement as compared to using the median normalization. AVAILABILITY AND IMPLEMENTATION: The source code and sample data are freely available at http://exims.sourceforge.net/. 
CONTACT: awgcdw@student.unimelb.edu.au or chalini_w@live.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
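The gray level co-occurrence matrices mentioned in the METHODS can be illustrated with a minimal sketch. It counts, for a small quantised ion image, how often one gray level sits next to another at a fixed offset. It shows only the generic textbook construction, not the improved variant or the peak picking rule used by the authors; the function name and the toy image are assumptions.

```python
from collections import Counter


def cooccurrence_matrix(image, levels, offset=(0, 1)):
    """Count how often gray level i occurs next to gray level j.

    image  : list of rows, each a list of integer gray levels in [0, levels)
    offset : (row step, column step) to the neighbour; (0, 1) means the
             pixel immediately to the right
    """
    d_row, d_col = offset
    counts = Counter()
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            rr, cc = r + d_row, c + d_col
            # Skip neighbours that fall outside the image.
            if 0 <= rr < len(image) and 0 <= cc < len(image[rr]):
                counts[(value, image[rr][cc])] += 1
    return [[counts[(i, j)] for j in range(levels)] for i in range(levels)]


# Toy 3x3 "ion image" quantised to three gray levels (hypothetical data).
ion_image = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 2],
]
glcm = cooccurrence_matrix(ion_image, levels=3)
```

A matrix with its mass concentrated near the diagonal indicates a spatially smooth image, while a matrix spread evenly over all cells is what pure noise tends to produce. That contrast is the kind of spatial information a spectral-domain method does not see.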


Subject(s)
Algorithms , Brain/metabolism , Eucalyptus/chemistry , Metabolomics/methods , Plant Leaves/metabolism , Proteomics/methods , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods , Animals , Rats
5.
Bioinformatics ; 31(6): 886-96, 2015 Mar 15.
Article in English | MEDLINE | ID: mdl-25398613

ABSTRACT

MOTIVATION: The combined effect of a high replication rate and the low fidelity of the viral polymerase in most RNA viruses and some DNA viruses results in the formation of a viral quasispecies. Uncovering information about quasispecies populations significantly benefits the study of disease progression, antiviral drug design, vaccine design and viral pathogenesis. We present a new analysis pipeline called ViQuaS for viral quasispecies spectrum reconstruction using short next-generation sequencing reads. ViQuaS is based on a novel reference-assisted de novo assembly algorithm for constructing local haplotypes. A significantly extended version of an existing global strain reconstruction algorithm is also used. RESULTS: Benchmarking results showed that ViQuaS outperformed three other previously published methods named ShoRAH, QuRe and PredictHaplo, with improvements of at least 3.1-53.9% in recall, 0-12.1% in precision and 0-38.2% in F-score in terms of strain sequence assembly and improvements of at least 0.006-0.143 in KL-divergence and 0.001-0.035 in root mean-squared error in terms of strain frequency estimation, over the next-best algorithm under various simulation settings. We also applied ViQuaS on a real read set derived from an in vitro human immunodeficiency virus (HIV)-1 population, two independent datasets of foot-and-mouth-disease virus derived from the same biological sample and a real HIV-1 dataset and demonstrated better results than other methods available.
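The benchmarking measures named in the RESULTS (recall, precision, F-score, KL-divergence and root mean-squared error) can be sketched as follows. The abstract does not say how a reconstructed strain is matched to a true strain or in which direction the divergence is taken, so exact sequence matching and D(true || estimated) are assumptions here, as are all function names.

```python
import math


def precision_recall_f(reconstructed, truth):
    """Precision, recall and F-score of reconstructed strain sequences,
    counting a strain as correct only on an exact sequence match."""
    reconstructed, truth = set(reconstructed), set(truth)
    hits = len(reconstructed & truth)
    precision = hits / len(reconstructed) if reconstructed else 0.0
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats.

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rmse(p, q):
    """Root mean-squared error between two equally long frequency vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p))
```

The first function scores strain sequence assembly; the other two score strain frequency estimation, where `p` would hold the true strain frequencies and `q` the estimated ones in the same strain order.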


Subject(s)
Algorithms , Foot-and-Mouth Disease Virus/genetics , HIV-1/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , Foot-and-Mouth Disease Virus/classification , HIV-1/classification , Humans
6.
BMC Genomics ; 15: 732, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25167919

ABSTRACT

BACKGROUND: Using whole exome sequencing to predict aberrations in tumours is a cost effective alternative to whole genome sequencing; however, it is predominantly used for variant detection and infrequently utilised for detection of somatic copy number variation. RESULTS: We propose a new method to infer copy number and genotypes using whole exome data from paired tumour/normal samples. Our algorithm uses two Hidden Markov Models to predict copy number and genotypes and computationally resolves polyploidy/aneuploidy, normal cell contamination and signal baseline shift. Our method explicitly detects chromosome arm level events, which are commonly found in tumour samples. The methods are combined into a package named ADTEx (Aberration Detection in Tumour Exome). We applied our algorithm to a cohort of 17 in-house generated and 18 TCGA paired ovarian cancer/normal exomes and evaluated the performance by comparing against the copy number variations and genotypes predicted using Affymetrix SNP 6.0 data of the same samples. Further, we carried out a comparison study to show that ADTEx outperformed its competitors in terms of precision and F-measure. CONCLUSIONS: Our proposed method, ADTEx, uses both depth of coverage ratios and B allele frequencies calculated from whole exome sequencing data, to predict copy number variations along with their genotypes. ADTEx is implemented as a user friendly software package using Python and R statistical language. Source code and sample data are freely available under GNU license (GPLv3) at http://adtex.sourceforge.net/.
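The two input signals named in the CONCLUSIONS, depth of coverage ratios and B allele frequencies, can be computed from read counts roughly as below. This is a minimal sketch and not ADTEx code: the function names are hypothetical, the alternate allele is treated as the B allele, and total sequenced depth is used as a simple normaliser.

```python
def b_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads at a SNP that carry the B (here: alternate) allele."""
    total = ref_reads + alt_reads
    if total == 0:
        return None  # no coverage, so the frequency is undefined
    return alt_reads / total


def coverage_ratio(tumour_depth, normal_depth, tumour_total, normal_total):
    """Tumour/normal depth ratio for one exon, scaled by total sequenced depth."""
    if normal_depth == 0 or tumour_total == 0 or normal_total == 0:
        return None
    return (tumour_depth / tumour_total) / (normal_depth / normal_total)
```

At a SNP that is heterozygous in the normal sample the B allele frequency is expected to be near 0.5, so a tumour value far from 0.5 at the same site points to allelic imbalance. Combining that with the coverage ratio helps separate, for example, a copy-neutral loss of heterozygosity from a single-copy deletion.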


Subject(s)
DNA Copy Number Variations , Exome , Genotype , Neoplasms/genetics , Algorithms , Chromosome Aberrations , Computational Biology/methods , Female , Genomics/methods , Genotyping Techniques , High-Throughput Nucleotide Sequencing , Humans , Loss of Heterozygosity , Ovarian Neoplasms/genetics , Polymorphism, Single Nucleotide , Polyploidy , Reproducibility of Results , Sensitivity and Specificity
7.
PLoS One ; 9(4): e95217, 2014.
Article in English | MEDLINE | ID: mdl-24752294

ABSTRACT

Targeted resequencing by massively parallel sequencing has become an effective and affordable way to survey small to large portions of the genome for genetic variation. Despite the rapid development in open source software for analysis of such data, the practical implementation of these tools through construction of sequencing analysis pipelines still remains a challenging and laborious activity, and a major hurdle for many small research and clinical laboratories. We developed TREVA (Targeted REsequencing Virtual Appliance), making pre-built pipelines immediately available as a virtual appliance. Based on virtual machine technologies, TREVA is a solution for rapid and efficient deployment of complex bioinformatics pipelines to laboratories of all sizes, enabling reproducible results. The analyses that are supported in TREVA include: somatic and germline single-nucleotide and insertion/deletion variant calling, copy number analysis, and cohort-based analyses such as pathway and significantly mutated genes analyses. TREVA is flexible and easy to use, and can be customised by Linux-based extensions if required. TREVA can also be deployed on the cloud (cloud computing), enabling instant access without investment overheads for additional hardware. TREVA is available at http://bioinformatics.petermac.org/treva/.


Subject(s)
Computational Biology/methods , Exome/genetics , Genome, Human/genetics , Sequence Analysis, DNA , User-Computer Interface , Animals , Humans , Melanoma/genetics , Mice , Mutation/genetics
8.
BMC Bioinformatics ; 14 Suppl 2: S2, 2013.
Article in English | MEDLINE | ID: mdl-23368785

ABSTRACT

BACKGROUND: One of the main types of genetic variations in cancer is Copy Number Variations (CNV). Whole exome sequencing (WES) is a popular alternative to whole genome sequencing (WGS) to study disease specific genomic variations. However, finding CNV in cancer samples using WES data has not been fully explored. RESULTS: We present a new method, called CoNVEX, to estimate copy number variation in whole exome sequencing data. It uses the ratio of tumour and matched normal average read depths at each exonic region to predict the copy gain or loss. The useful signal produced by WES data is hindered by the intrinsic noise present in the data itself. This limits its capacity to be used as a highly reliable CNV detection source. Here, we propose a method that uses a discrete wavelet transform (DWT) to reduce noise. The identification of copy number gains/losses of each targeted region is performed by a Hidden Markov Model (HMM). CONCLUSION: HMM is frequently used to identify CNV in data produced by various technologies including Array Comparative Genomic Hybridization (aCGH) and WGS. Here, we propose an HMM to detect CNV in cancer exome data. We used modified data from the 1000 Genomes project to evaluate the performance of the proposed method. Using these data we have shown that CoNVEX outperforms the existing methods significantly in terms of precision. Overall, CoNVEX achieved a sensitivity of more than 92% and a precision of more than 50%.
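The HMM step described above, which labels each targeted region as a gain or loss, can be sketched with a three-state model decoded by the Viterbi algorithm. Everything numeric here is an assumption chosen for illustration (state means, noise level, transition probabilities); the abstract does not give CoNVEX's parameters, and the wavelet denoising step is not shown.

```python
import math

STATES = ("loss", "neutral", "gain")
# Hypothetical emission means for the log2 tumour/normal depth ratio.
MEANS = {"loss": -1.0, "neutral": 0.0, "gain": 0.58}
SIGMA = 0.3       # assumed common noise level
STAY_PROB = 0.9   # assumed probability of remaining in the same state


def log_emission(state, x):
    """Log density of a Gaussian emission centred on the state's mean."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2.0 * math.pi))


def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY_PROB)
    return math.log((1.0 - STAY_PROB) / (len(STATES) - 1))


def viterbi(log_ratios):
    """Return the most likely state path for a sequence of log2 ratios."""
    if not log_ratios:
        return []
    # Uniform prior over the starting state.
    score = {s: -math.log(len(STATES)) + log_emission(s, log_ratios[0])
             for s in STATES}
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES,
                            key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev]
                              + log_transition(best_prev, cur)
                              + log_emission(cur, x))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Because leaving a state is penalised, an isolated noisy exon tends to be absorbed into its neighbours' state rather than called as its own event, which is the usual reason for preferring an HMM over per-region thresholds.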


Subject(s)
DNA Copy Number Variations , Exome , Neoplasms/genetics , Comparative Genomic Hybridization , Exons , Genomics/methods , Humans , Markov Chains , Models, Statistical
9.
Bioinformatics ; 28(10): 1307-13, 2012 May 15.
Article in English | MEDLINE | ID: mdl-22474122

ABSTRACT

MOTIVATION: In light of the increasing adoption of targeted resequencing (TR) as a cost-effective strategy to identify disease-causing variants, a robust method for copy number variation (CNV) analysis is needed to maximize the value of this promising technology. RESULTS: We present a method for CNV detection for TR data, including whole-exome capture data. Our method calls copy number gains and losses for each target region based on normalized depth of coverage. Our key strategies include the use of base-level log-ratios to remove GC-content bias, correction for an imbalanced library size effect on log-ratios, and the estimation of log-ratio variations via binning and interpolation. Our methods are made available via CONTRA (COpy Number Targeted Resequencing Analysis), a software package that takes standard alignment formats (BAM/SAM) and outputs in variant call format (VCF4.0), for easy integration with other next-generation sequencing analysis packages. We assessed our methods using samples from seven different target enrichment assays, and evaluated our results using simulated data and real germline data with known CNV genotypes.
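A minimal sketch of a depth-of-coverage log-ratio with a library size adjustment is given below. It works per region rather than per base, approximates library size by the summed depth, and adds a pseudocount, all of which are simplifying assumptions; it is not the CONTRA implementation and omits the GC-content handling and the binning-based variance estimate.

```python
import math


def region_log_ratios(test_depth, control_depth, pseudocount=0.5):
    """Per-region log2 ratio of test vs control depth, scaled for library size.

    test_depth / control_depth: dicts mapping region name -> mean read depth.
    Library size is approximated by the summed depth over all regions.
    """
    test_total = sum(test_depth.values())
    control_total = sum(control_depth.values())
    if test_total == 0 or control_total == 0:
        raise ValueError("both samples need non-zero total depth")
    # Rescale the test sample so both libraries have the same total depth.
    scale = control_total / test_total
    ratios = {}
    for region, control in control_depth.items():
        test = test_depth.get(region, 0.0) * scale
        ratios[region] = math.log2((test + pseudocount) / (control + pseudocount))
    return ratios
```

Without the rescaling, a test sample sequenced twice as deeply as its control would show an apparent gain in every region; after it, only regions that deviate from the sample-wide trend have log-ratios away from zero.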


Subject(s)
DNA Copy Number Variations , Exome , Sequence Analysis, DNA , Animals , Computer Simulation , HapMap Project , Humans , Mice , Software
10.
Nucleic Acids Res ; 40(5): e34, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22180538

ABSTRACT

An approach to infer the unknown microbial population structure within a metagenome is to cluster nucleotide sequences based on common patterns in base composition, otherwise referred to as binning. When functional roles are assigned to the identified populations, a deeper understanding of microbial communities can be attained, more so than gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers, which uses a novel transformation of the oligonucleotide frequency-derived error gradient and GC content to generate coarse groups at the first tier of clustering; and tetranucleotide frequency to refine these groups at the secondary clustering tier. The proposed method has a demonstrated improvement over PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks that were used for evaluation in this study. The proposed method is then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, with the inferred population structure validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the proposed method was further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample, which was previously assumed to be too complex for binning prior to functional analysis.


Subject(s)
Metagenome , Metagenomics/methods , Animals , Base Composition , Biofilms , Bone and Bones/microbiology , Cluster Analysis , Genomic Library , Geologic Sediments/microbiology , Oligochaeta/microbiology , Sequence Analysis, DNA , Sewage/microbiology
11.
Genomics ; 96(2): 92-101, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20417269

ABSTRACT

The second codon of a transcript, besides encoding for an amino acid, is now known to also have multiple molecular functions and is involved in translation efficiency and protein turn-over and maturation processing. These multiple purposes therefore make the selection constraints on this codon's composition more complex. To examine the biological significance of various permutations of the second codon, we conducted a systematic survey of second codon composition from 442 selected genomes across three domains. The amino acid bias of the second codon is associated with specific protein functions. The most common amino acids (S, A, K and T) are significantly avoided in Cell Envelope-related genes but preferred in Translation or Energy Metabolism-related genes, suggesting that the function of a gene product is a significant factor influencing the composition of the second codon.
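The basic measurement behind such a survey, tabulating which amino acid the second codon encodes across a set of genes, can be sketched as follows. The standard genetic code is assumed for every genome, which is a simplification, and the function names and example sequences are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def second_codon_amino_acid(cds):
    """Translate the codon that follows the start codon of a coding sequence."""
    cds = cds.upper().replace("U", "T")
    if len(cds) < 6:
        raise ValueError("coding sequence is shorter than two codons")
    return CODON_TABLE.get(cds[3:6], "X")  # X marks codons with ambiguous bases


def second_codon_usage(coding_sequences):
    """Relative frequency of each amino acid at the second codon position."""
    counts = Counter(second_codon_amino_acid(s) for s in coding_sequences)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}
```

Running `second_codon_usage` separately on genes grouped by functional category, and comparing the resulting tables, is one simple way to look for the kind of function-dependent bias the abstract reports.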


Subject(s)
Amino Acids/genetics , Codon/physiology , Genome/genetics , Proteins/physiology , Selection, Genetic , Archaea/genetics , Bacteria/genetics , Base Composition , Codon/genetics , Eukaryota/genetics , Mutation/genetics , Proteins/genetics , Sequence Analysis, Protein
12.
BMC Genomics ; 10 Suppl 3: S10, 2009 Dec 03.
Article in English | MEDLINE | ID: mdl-19958473

ABSTRACT

BACKGROUND: The characterisation, or binning, of metagenome fragments is an important first step to further downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences. The one-dimensional signal is essentially a compact representation of higher dimensional feature spaces of greater complexity and is intended to improve on the tetranucleotide frequency feature space preferred by current compositional binning methods. RESULTS: We compare the fidelity of OFDEG against tetranucleotide frequency in both an unsupervised and semi-supervised setting on simulated metagenome benchmark data. Four tests were conducted using assembler output of Arachne and phrap, and for each, performance was evaluated on contigs which are greater than or equal to 8 kbp in length and contigs which are composed of at least 10 reads. Using G-C content in conjunction with OFDEG gave an average accuracy of 96.75% (semi-supervised) and 95.19% (unsupervised), versus 94.25% (semi-supervised) and 82.35% (unsupervised) for tetranucleotide frequency. CONCLUSION: We have presented an observation of an alternative characteristic of DNA sequences. The proposed feature representation has proven to be more beneficial than the existing tetranucleotide frequency space to the metagenome binning problem. We do note, however, that our observation of OFDEG deserves further analysis and investigation. Unsupervised clustering revealed OFDEG related features performed better than standard tetranucleotide frequency in representing a relevant organism specific signal. Further improvement in binning accuracy is given by semi-supervised classification using OFDEG.
The emphasis on a feature-driven, bottom-up approach to the problem of binning reveals promising avenues for future development of techniques to characterise short environmental sequences without bias toward cultivable organisms.


Subject(s)
Gene Frequency , Metagenome , Oligonucleotides/genetics , Base Sequence , Genomics , Oligonucleotides/classification , Sequence Analysis, DNA
13.
BMC Bioinformatics ; 9: 215, 2008 Apr 28.
Article in English | MEDLINE | ID: mdl-18442374

ABSTRACT

BACKGROUND: In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most of the current methods of binning, such as BLAST, k-mer and PhyloPythia, involve assigning sequence fragments by comparing sequence similarity or sequence composition with already-sequenced genomes that are still far from comprehensive. We propose a semi-supervised seeding method for binning that does not depend on knowledge of completed genomes. Instead, it extracts the flanking sequences of highly conserved 16S rRNA from the metagenome and uses them as seeds (labels) to assign other reads based on their compositional similarity. RESULTS: The proposed seeding method is implemented on an unsupervised Growing Self-Organising Map (GSOM), and called Seeded GSOM (S-GSOM). We compared it with four well-known semi-supervised learning methods in a preliminary test, separating random-length prokaryotic sequence fragments sampled from the NCBI genome database. We identified the flanking sequences of the highly conserved 16S rRNA as suitable seeds that could be used to group the sequence fragments according to their species. S-GSOM showed superior performance compared to the semi-supervised methods tested. Additionally, S-GSOM may also be used to visually identify some species that do not have seeds. The proposed method was then applied to simulated metagenomic datasets using two different confidence threshold settings and compared with PhyloPythia, k-mer and BLAST. At the reference taxonomic level Order, S-GSOM outperformed all k-mer and BLAST results and showed comparable results with PhyloPythia for each of the corresponding confidence settings, where S-GSOM performed better than PhyloPythia in the ≥ 10 reads datasets and comparable in the ≥ 8 kb benchmark tests. CONCLUSION: In the task of binning using semi-supervised learning methods, results indicate S-GSOM to be the best of the methods tested.
Most importantly, the proposed method does not require knowledge from known genomes and uses only very few labels (one per species is sufficient in most cases), which are extracted from the metagenome itself. These advantages make it a very attractive binning method. S-GSOM outperformed the binning methods that depend on already-sequenced genomes, and compares well to the current most advanced binning method, PhyloPythia.


Subject(s)
Database Management Systems , Pattern Recognition, Automated/methods , Phylogeny , RNA, Archaeal/classification , RNA, Bacterial/classification , Algorithms , Artificial Intelligence , Base Sequence , Confidence Intervals , Databases, Genetic , Genes, rRNA , Genomics/methods , Information Storage and Retrieval/methods , Information Storage and Retrieval/statistics & numerical data , Pattern Recognition, Automated/statistics & numerical data , RNA, Archaeal/analysis , RNA, Bacterial/analysis , Sample Size , Sequence Analysis, RNA , Species Specificity , Uncertainty
14.
BMC Evol Biol ; 8: 116, 2008 Apr 23.
Article in English | MEDLINE | ID: mdl-18430250

ABSTRACT

BACKGROUND: Genomes of lower organisms have been observed with a large amount of horizontal gene transfers, which cause difficulties in their evolutionary study. Bacteriophage genomes are a typical example. One recent approach that addresses this problem is the unsupervised clustering of genomes based on gene order and genome position, which helps to reveal species relationships that may not be apparent from traditional phylogenetic methods. RESULTS: We propose the use of an overlapping subspace clustering algorithm for such genome classification problems. The advantage of subspace clustering over traditional clustering is that it can associate clusters with gene arrangement patterns, preserving genomic information in the clusters produced. Additionally, overlapping capability is desirable for the discovery of multiple conserved patterns within a single genome, such as those acquired from different species via horizontal gene transfers. The proposed method involves a novel strategy to vectorize genomes based on their gene distribution. A number of existing subspace clustering and biclustering algorithms were evaluated to identify the best framework upon which to develop our algorithm; we extended a generic subspace clustering algorithm called HARP to incorporate overlapping capability. The proposed algorithm was assessed and applied on bacteriophage genomes. The phage grouping results are consistent overall with the Phage Proteomic Tree and showed common genomic characteristics among the TP901-like, Sfi21-like and sk1-like phage groups. Among 441 phage genomes, we identified four significantly conserved distribution patterns structured by the terminase, portal, integrase, holin and lysin genes. 
We also observed a subgroup of Sfi21-like phages comprising a distinctive divergent genome organization and identified nine new phage members to the Sfi21-like genus: Staphylococcus 71, phiPVL108, Listeria A118, 2389, Lactobacillus phi AT3, A2, Clostridium phi3626, Geobacillus GBSV1, and Listeria monocytogenes PSA. CONCLUSION: The method described in this paper can assist evolutionary study through objectively classifying genomes based on their resemblance in gene order, gene content and gene positions. The method is suitable for application to genomes with high genetic exchange and various conserved gene arrangement, as demonstrated through our application on phages.


Subject(s)
Cluster Analysis , Genome , Models, Genetic , Proteomics/methods , Algorithms , Bacteriophages/genetics , Gene Order , Genome, Viral , Pattern Recognition, Automated , Peptide Library
15.
J Biomed Biotechnol ; 2008: 513701, 2008.
Article in English | MEDLINE | ID: mdl-18288261

ABSTRACT

Metagenomic projects using whole-genome shotgun (WGS) sequencing produce many unassembled DNA sequences and small contigs. The step of clustering these sequences, based on biological and molecular features, is called binning. A reported strategy for binning that combines oligonucleotide frequency and self-organising maps (SOM) shows high potential. We improve this strategy by identifying suitable training features, implementing a better clustering algorithm, and defining quantitative measures for assessing results. We investigated the suitability of each of di-, tri-, tetra-, and pentanucleotide frequencies. The results show that dinucleotide frequency is not a sufficiently strong signature for binning 10 kb long DNA sequences, compared to the other three. Furthermore, we observed that increased order of oligonucleotide frequency may deteriorate the assignment result in some cases, which indicates the possible existence of optimal species-specific oligonucleotide frequency. We replaced SOM with growing self-organising map (GSOM) where comparable results are obtained while gaining 7%-15% speed improvement.
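The di-, tri-, tetra- and pentanucleotide frequencies compared in this abstract are feature vectors of length 4^k built by sliding a window of length k over a sequence. A minimal sketch follows; the function name is an assumption, and it does not merge reverse-complement k-mers or apply any normalisation beyond relative frequency, either of which a real binning pipeline may do.

```python
from collections import Counter
from itertools import product


def oligonucleotide_frequencies(sequence, k):
    """Return the relative frequency of every DNA k-mer in a fixed order.

    Windows containing characters other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

The vector length grows from 16 at k = 2 to 1024 at k = 5, so for a fixed 10 kb fragment each individual frequency is estimated from fewer observations as k increases. That trade-off is one plausible reading of why a higher order does not always improve assignment.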


Subject(s)
Algorithms , Artificial Intelligence , Chromosome Mapping/methods , Data Interpretation, Statistical , Pattern Recognition, Automated/methods , Sequence Analysis, DNA/methods , Cluster Analysis
16.
BMC Bioinformatics ; 8 Suppl 4: S6, 2007 May 22.
Article in English | MEDLINE | ID: mdl-17570149

ABSTRACT

BACKGROUND: Existing methods for whole-genome comparisons require prior knowledge of related species and provide little automation in the function prediction process. Bacteriophage genomes are an example that cannot be easily analyzed by these methods. This work addresses these shortcomings and aims to provide an automated prediction system of gene function. RESULTS: We have developed a novel system called SynFPS to perform gene function prediction over completed genomes. The prediction system is initialized by clustering a large collection of weakly related genomes into groups based on their resemblance in gene distribution. From each individual group, data are then extracted and used to train a Support Vector Machine that makes gene function predictions. Experiments were conducted with 9 different gene functions over 296 bacteriophage genomes. Cross validation results gave an average prediction accuracy of ~80%, which is comparable to other genomic-context based prediction methods. Functional predictions are also made on 3 uncharacterized genes and 12 genes that cannot be identified by sequence alignment. The software is publicly available at http://www.synteny.net/. CONCLUSION: The proposed system employs genomic context to predict gene function and detect gene correspondence in whole-genome comparisons. Although our experimental focus is on bacteriophages, the method may be extended to other microbial genomes as they share a number of similar characteristics with phage genomes such as gene order conservation.


Subject(s)
Artificial Intelligence , Bacteriophages/genetics , Chromosome Mapping/methods , Cluster Analysis , Genome, Viral/genetics , Multigene Family/genetics , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Discriminant Analysis , Molecular Sequence Data , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Homology, Nucleic Acid
17.
Bioinformatics ; 19(16): 2131-40, 2003 Nov 01.
Article in English | MEDLINE | ID: mdl-14594719

ABSTRACT

MOTIVATION: Current Self-Organizing Maps (SOMs) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. RESULTS: The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, therefore labelled as uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in data (cancerous and normal). AVAILABILITY: JAVA software of dynamic SOM tree algorithm is available upon request for academic use. SUPPLEMENTARY INFORMATION: A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf


Subject(s)
Algorithms , Biomarkers, Tumor/genetics , Colonic Neoplasms/classification , Colonic Neoplasms/genetics , Gene Expression Profiling/methods , Leukemia/classification , Leukemia/genetics , Oligonucleotide Array Sequence Analysis/methods , Cluster Analysis , Gene Expression Regulation, Neoplastic/genetics , Genetic Markers/genetics , Genetic Testing/methods , Humans , Reproducibility of Results , Sensitivity and Specificity
18.
Bioinformatics ; 18(8): 1084-90, 2002 Aug.
Article in English | MEDLINE | ID: mdl-12176831

ABSTRACT

MOTIVATION: This work attempts to improve the speed and flexibility of protein motif identification. The proposed algorithm is able to extract both rigid and flexible protein motifs. RESULTS: In this work, we present a new algorithm for extracting the consensus pattern, or motif, from a group of related protein sequences. This algorithm involves a statistical method to find short patterns with high frequency and then neural network training to optimize the final classification accuracies. Fuzzy logic is used to increase the flexibility of protein motifs. C2H2 Zinc Finger Protein and epidermal growth factor protein sequences are used to demonstrate the capability of the proposed algorithm in finding motifs. AVAILABILITY: This program is freely available for academic use by request.
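The first stage described in the RESULTS, finding short patterns that occur with high frequency across related sequences, might look roughly like the sketch below. It only counts exact substrings of one fixed length and is a stand-in for whatever statistical method the authors use; the neural network and fuzzy logic stages are not shown, and the function name and support threshold are assumptions.

```python
from collections import Counter


def frequent_patterns(sequences, length, min_support):
    """Find substrings of a given length that occur in many sequences.

    min_support is the minimum fraction of sequences that must contain the
    pattern at least once. Returns (pattern, support) pairs, best first.
    """
    containing = Counter()
    for seq in sequences:
        # A set, so each sequence votes at most once per pattern.
        seen = {seq[i:i + length] for i in range(len(seq) - length + 1)}
        containing.update(seen)
    n = len(sequences)
    hits = [(p, c / n) for p, c in containing.items() if c / n >= min_support]
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

Exact matching of this kind can only recover rigid patterns. Allowing variable gaps or residue classes between conserved positions is where a more flexible representation, such as the fuzzy logic mentioned in the abstract, would come in.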


Subject(s)
Algorithms , Amino Acid Motifs/genetics , Fuzzy Logic , Neural Networks, Computer , Sequence Analysis, Protein/methods , Databases, Protein , Epidermal Growth Factor/chemistry , Epidermal Growth Factor/genetics , False Positive Reactions , Pattern Recognition, Automated , Proteins/chemistry , Proteins/genetics , Reproducibility of Results , Sensitivity and Specificity , Zinc Fingers/genetics