Search | VHL Regional Portal

1.

Cellular adhesiveness and cellulolytic capacity in Anaerolineae revealed by omics-based genome interpretation.

Xia, Yu; Wang, Yubo; Wang, Yi; Chin, Francis Y L; Zhang, Tong.

Biotechnol Biofuels ; 9: 111, 2016.

Article in English | MEDLINE | ID: mdl-27222666

ABSTRACT

BACKGROUND: The Anaerolineae lineage of Chloroflexi had been identified as one of the core microbial populations in anaerobic digesters; however, the ecological role of the Anaerolineae remains uncertain due to the scarcity of isolates and annotated genome sequences. Our previous metatranscriptional analysis revealed this prevalent population that showed minimum involvement in the main pathways of cellulose hydrolysis and subsequent methanogenesis in the thermophilic cellulose fermentative consortium (TCF). RESULTS: In further pursuit, five high-quality curated draft genomes (>98 % completeness) of this population, including two affiliated with the inaccessible lineage of SBR1031, were retrieved by sequence-based multi-dimensional coverage binning. Comparative genomic analyses revealed versatile genetic capabilities for carbohydrate-based fermentative lifestyle including key genes catalyzing cellulose hydrolysis in Anaerolinea phylotypes. However, the low transcriptional activities of carbohydrate-active genes (CAGs) excluded cellulolytic capability as the selective advantage for their prevalence in the community. Instead, a substantially active type VI pili (Tfp) assembly was observed. Expression of the tight adherence protein on the Tfp indicated its function for cellular attachment which was further testified to be more likely related to cell aggregation other than cellulose surface adhesion. Meanwhile, this Tfp structure was found not contributing to syntrophic methanogenesis. Members of the SBR1031 encoded key genes for acetogenic dehydrogenation that may allow ethanol to be used as a carbon source. CONCLUSION: The common prevalence of Anaerolineae in anaerobic digesters should be originated from advantageous cellular adhesiveness enabled by Tfp assembly other than its potential as cellulose degrader or anaerobic syntrophs.

2.

misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads.

Zhu, Xiao; Leung, Henry C M; Wang, Rongjie; Chin, Francis Y L; Yiu, Siu Ming; Quan, Guangri; Li, Yajie; Zhang, Rui; Jiang, Qinghua; Liu, Bo; Dong, Yucui; Zhou, Guohui; Wang, Yadong.

BMC Bioinformatics ; 16: 386, 2015 Nov 16.

Article in English | MEDLINE | ID: mdl-26573684

ABSTRACT

BACKGROUND: Because of the short read length of high throughput sequencing data, assembly errors are introduced in genome assembly, which may have adverse impact to the downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with some similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and determining inconsistent features alone mis-assembled sequences. However, the former approach cannot distinguish real structural variations between the target genome and the reference genome while the latter approach could have many false positive detections (correctly assembled sequence being considered as mis-assembled sequence). RESULTS: We present misFinder, a tool that aims to identify the assembly errors with high accuracy in an unbiased way and correct these errors at their mis-assembled positions to improve the assembly accuracy for downstream analysis. It combines the information of reference (or close related reference) genome and aligned paired-end reads to the assembled sequence. Assembly errors and correct assemblies corresponding to structural variations can be detected by comparing the genome reference and assembled sequence. Different types of assembly errors can then be distinguished from the mis-assembled sequence by analyzing the aligned paired-end reads using multiple features derived from coverage and consistence of insert distance to obtain high confident error calls. CONCLUSIONS: We tested the performance of misFinder on both simulated and real paired-end reads data, and misFinder gave accurate error calls with only very few miscalls. And, we further compared misFinder with QUAST and REAPR. misFinder outperformed QUAST and REAPR by 1) identified more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguished the correct assemblies corresponding to structural variations from mis-assembled sequence. misFinder can be freely downloaded from https://github.com/hitbio/misFinder.

Subject(s)

Escherichia coli/genetics , High-Throughput Nucleotide Sequencing/methods , Schizosaccharomyces/genetics , Sequence Analysis, DNA/methods , Software , Computer Simulation

3.

Phylogeny-structured carbohydrate metabolism across microbiomes collected from different units in wastewater treatment process.

Xia, Yu; Chin, Francis Y L; Chao, Yuanqing; Zhang, Tong.

Biotechnol Biofuels ; 8: 172, 2015.

Article in English | MEDLINE | ID: mdl-26500698

ABSTRACT

BACKGROUND: With respect to global priority for bioenergy production from plant biomass, understanding the fundamental genetic associations underlying carbohydrate metabolisms is crucial for the development of effective biorefinery process. Compared with gut microbiome of ruminal animals and wood-feed insects, knowledge on carbohydrate metabolisms of engineered biosystems is limited. RESULTS: In this study, comparative metagenomics coupled with metabolic network analysis was carried out to study the inter-species cooperation and competition among carbohydrate-active microbes in typical units of wastewater treatment process including activated sludge and anaerobic digestion. For the first time, sludge metagenomes demonstrated rather diverse pool of carbohydrate-active genes (CAGs) comparable to that of rumen microbiota. Overall, the CAG composition correlated strongly with the microbial phylogenetic structure across sludge types. Gene-centric clustering analysis showed the carbohydrate pathways of sludge systems were shaped by different environmental factors, including dissolved oxygen and salinity, and the latter showed more determinative influence of phylogenetic composition. Eventually, the highly clustered co-occurrence network of CAGs and saccharolytic phenotypes, revealed three metabolic modules in which the prevalent populations of Actinomycetales, Clostridiales and Thermotogales, respectively, play significant roles as interaction hubs, while broad negative co-exclusion correlations observed between anaerobic and aerobic microbes, probably implicated roles of niche separation by dissolved oxygen in determining the microbial assembly. CONCLUSIONS: Sludge microbiomes encoding diverse pool of CAGs was another potential source for effective lignocellulosic biomass breakdown. But unlike gut microbiomes in which Clostridiales, Lactobacillales and Bacteroidales play a vital role, the carbohydrate metabolism of sludge systems is built on the inter-species cooperation and competition among Actinomycetales, Clostridiales and Thermotogales.

4.

A Simple Algorithm for Finding All k-Edge-Connected Components.

Wang, Tianhao; Zhang, Yong; Chin, Francis Y L; Ting, Hing-Fung; Tsin, Yung H; Poon, Sheung-Hung.

PLoS One ; 10(9): e0136264, 2015.

Article in English | MEDLINE | ID: mdl-26368134

ABSTRACT

The problem of finding k-edge-connected components is a fundamental problem in computer science. Given a graph G = (V, E), the problem is to partition the vertex set V into {V1, V2,, Vh}, where each Vi is maximized, such that for any two vertices x and y in Vi, there are k edge-disjoint paths connecting them. In this paper, we present an algorithm to solve this problem for all k. The algorithm preprocesses the input graph to construct an Auxiliary Graph to store information concerning edge-connectivity among every vertex pair in O(Fn) time, where F is the time complexity to find the maximum flow between two vertices in graph G and n = â£Vâ£. For any value of k, the k-edge-connected components can then be determined by traversing the auxiliary graph in O(n) time. The input graph can be a directed or undirected, simple graph or multigraph. Previous works on this problem mainly focus on fixed value of k.

Subject(s)

Algorithms , Informatics/methods

5.

Predicting drug-target interaction for new drugs using enhanced similarity measures and super-target clustering.

Shi, Jian-Yu; Yiu, Siu-Ming; Li, Yiming; Leung, Henry C M; Chin, Francis Y L.

Methods ; 83: 98-104, 2015 Jul 15.

Article in English | MEDLINE | ID: mdl-25957673

ABSTRACT

Predicting drug-target interaction using computational approaches is an important step in drug discovery and repositioning. To predict whether there will be an interaction between a drug and a target, most existing methods identify similar drugs and targets in the database. The prediction is then made based on the known interactions of these drugs and targets. This idea is promising. However, there are two shortcomings that have not yet been addressed appropriately. Firstly, most of the methods only use 2D chemical structures and protein sequences to measure the similarity of drugs and targets respectively. However, this information may not fully capture the characteristics determining whether a drug will interact with a target. Secondly, there are very few known interactions, i.e. many interactions are "missing" in the database. Existing approaches are biased towards known interactions and have no good solutions to handle possibly missing interactions which affect the accuracy of the prediction. In this paper, we enhance the similarity measures to include non-structural (and non-sequence-based) information and introduce the concept of a "super-target" to handle the problem of possibly missing interactions. Based on evaluations on real data, we show that our similarity measure is better than the existing measures and our approach is able to achieve higher accuracy than the two best existing algorithms, WNN-GIP and KBMF2K. Our approach is available at http://web.hku.hk/â¼liym1018/projects/drug/drug.html or http://www.bmlnwpu.org/us/tools/PredictingDTI_S2/METHODS.html.

Subject(s)

Cluster Analysis , Computational Biology/methods , Drug Discovery/methods , Genomics/methods , Algorithms , Artificial Intelligence , Humans , Pharmaceutical Preparations/chemistry

6.

IDBA-MTP: A Hybrid Metatranscriptomic Assembler Based on Protein Information.

Leung, Henry C M; Yiu, Siu-Ming; Chin, Francis Y L.

J Comput Biol ; 22(5): 367-76, 2015 May.

Article in English | MEDLINE | ID: mdl-25535824

ABSTRACT

Metatranscriptomic analysis provides information on how a microbial community reacts to environmental changes. Using next-generation sequencing (NGS) technology, biologists can study the microbe community by sampling short reads from a mixture of mRNAs (metatranscriptomic data). As most microbial genome sequences are unknown, it would seem that de novo assembly of the mRNAs is needed. However, NGS reads are short and mRNAs share many similar regions and differ tremendously in abundance levels, making de novo assembly challenging. The existing assembler, IDBA-MT, designed specifically for the assembly of metatranscriptomic data and performs well only on high-expressed mRNAs. This article introduces IDBA-MTP, which adopts a novel approach to metatranscriptomic assembly that makes use of the fact that there is a database of millions of known protein sequences associated with mRNAs. How to effectively use the protein information is nontrivial given the size of the database and given that different mRNAs might lead to proteins with similar functions (because different amino acids might have similar characteristics). IDBA-MTP employs a similarity measure between mRNAs and protein sequences, dynamic programming techniques, and seed-and-extend heuristics to tackle the problem effectively and efficiently. Experimental results show that IDBA-MTP outperforms existing assemblers by reconstructing 14% more mRNAs.

Subject(s)

Bacterial Proteins/chemistry , Contig Mapping/statistics & numerical data , Microbial Consortia/genetics , RNA, Messenger/chemistry , Software , Transcriptome , Algorithms , Bacterial Proteins/genetics , Contig Mapping/methods , Data Mining , High-Throughput Nucleotide Sequencing , Metagenomics/methods , Metagenomics/statistics & numerical data , Proteome/chemistry , Proteome/genetics , RNA, Bacterial/chemistry , RNA, Bacterial/genetics , RNA, Messenger/genetics , Sequence Analysis, DNA

7.

PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach.

Zhu, Xiao; Leung, Henry C M; Chin, Francis Y L; Yiu, Siu Ming; Quan, Guangri; Liu, Bo; Wang, Yadong.

PLoS One ; 9(12): e114253, 2014.

Article in English | MEDLINE | ID: mdl-25461763

ABSTRACT

Since the read lengths of high throughput sequencing (HTS) technologies are short, de novo assembly which plays significant roles in many applications remains a great challenge. Most of the state-of-the-art approaches base on de Bruijn graph strategy and overlap-layout strategy. However, these approaches which depend on k-mers or read overlaps do not fully utilize information of paired-end and single-end reads when resolving branches. Since they treat all single-end reads with overlapped length larger than a fix threshold equally, they fail to use the more confident long overlapped reads for assembling and mix up with the relative short overlapped reads. Moreover, these approaches have not been special designed for handling tandem repeats (repeats occur adjacently in the genome) and they usually break down the contigs near the tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach, which adopts greedy-like prediction strategy for assembling reads to contigs and scaffolds using paired-end reads and different read overlap size ranging from Omax to Omin to resolve the gaps and branches. By constructing a decision model using machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA will try to extend the contig by all feasible extensions and determine the correct extension by using look-ahead approach. Many difficult-resolved branches are due to tandem repeats which are close in the genome. PERGA detects such different copies of the repeats to resolve the branches to make the extension much longer and more accurate. We evaluated PERGA on both Illumina real and simulated datasets ranging from small bacterial genomes to large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at https://github.com/hitbio/PERGA.

Subject(s)

High-Throughput Nucleotide Sequencing , Support Vector Machine , Microsatellite Repeats

8.

Sequence assembly using next generation sequencing data--challenges and solutions.

Chin, Francis Y L; Leung, Henry C M; Yiu, S M.

Sci China Life Sci ; 57(11): 1140-8, 2014 Nov.

Article in English | MEDLINE | ID: mdl-25326069

ABSTRACT

Sequence assembling is an important step for bioinformatics study. With the help of next generation sequencing (NGS) technology, high throughput DNA fragment (reads) can be randomly sampled from DNA or RNA molecular sequence. However, as the positions of reads being sampled are unknown, assembling process is required for combining overlapped reads to reconstruct the original DNA or RNA sequence. Compared with traditional Sanger sequencing methods, although the throughput of NGS reads increases, the read length is shorter and the error rate is higher. It introduces several problems in assembling. Moreover, paired-end reads instead of single-end reads can be sampled which contain more information. The existing assemblers cannot fully utilize this information and fails to assemble longer contigs. In this article, we will revisit the major problems of assembling NGS reads on genomic, transcriptomic, metagenomic and metatranscriptomic data. We will also describe our IDBA package for solving these problems. IDBA package has adopted several novel ideas in assembling, including using multiple k, local assembling and progressive depth removal. Compared with existence assemblers, IDBA has better performance on many simulated and real sequencing datasets.

Subject(s)

Computational Biology/methods , DNA/chemistry , RNA/chemistry , Sequence Analysis, DNA/methods , Algorithms , Contig Mapping/methods , Escherichia coli/genetics , False Positive Reactions , Genome , Genome, Bacterial , Humans , Lactobacillus plantarum/genetics , Metagenomics , Software , Transcription, Genetic , Transcriptome

9.

DDGni: dynamic delay gene-network inference from high-temporal data using gapped local alignment.

Yalamanchili, Hari Krishna; Yan, Bin; Li, Mulin Jun; Qin, Jing; Zhao, Zhongying; Chin, Francis Y L; Wang, Junwen.

Bioinformatics ; 30(3): 377-83, 2014 Feb 01.

Article in English | MEDLINE | ID: mdl-24285602

ABSTRACT

MOTIVATION: Inferring gene-regulatory networks is very crucial in decoding various complex mechanisms in biological systems. Synthesis of a fully functional transcriptional factor/protein from DNA involves series of reactions, leading to a delay in gene regulation. The complexity increases with the dynamic delay induced by other small molecules involved in gene regulation, and noisy cellular environment. The dynamic delay in gene regulation is quite evident in high-temporal live cell lineage-imaging data. Although a number of gene-network-inference methods are proposed, most of them ignore the associated dynamic time delay. RESULTS: Here, we propose DDGni (dynamic delay gene-network inference), a novel gene-network-inference algorithm based on the gapped local alignment of gene-expression profiles. The local alignment can detect short-term gene regulations, that are usually overlooked by traditional correlation and mutual Information based methods. DDGni uses 'gaps' to handle the dynamic delay and non-uniform sampling frequency in high-temporal data, like live cell imaging data. Our algorithm is evaluated on synthetic and yeast cell cycle data, and Caenorhabditis elegans live cell imaging data against other prominent methods. The area under the curve of our method is significantly higher when compared to other methods on all three datasets. AVAILABILITY: The program, datasets and supplementary files are available at http://www.jjwanglab.org/DDGni/.

Subject(s)

Algorithms , Gene Expression Profiling/methods , Gene Regulatory Networks , Animals , Caenorhabditis elegans/genetics , Caenorhabditis elegans/metabolism , Gene Expression Regulation , Transcription Factors/metabolism , Yeasts/genetics , Yeasts/metabolism

10.

IDBA-MT: de novo assembler for metatranscriptomic data generated from next-generation sequencing technology.

Leung, Henry C M; Yiu, Siu-Ming; Parkinson, John; Chin, Francis Y L.

J Comput Biol ; 20(7): 540-50, 2013 Jul.

Article in English | MEDLINE | ID: mdl-23829653

ABSTRACT

High-throughput next-generation sequencing technology provides a great opportunity for analyzing metatranscriptomic data. However, the reads produced by these technologies are short and an assembling step is required to combine the short reads into longer contigs. As there are many repeat patterns in mRNAs from different genomes and the abundance ratio of mRNAs in a sample varies a lot, existing assemblers for genomic data, transcriptomic data, and metagenomic data do not work on metatranscriptomic data and produce chimeric contigs, that is, incorrect contigs formed by merging multiple mRNA sequences. To our best knowledge, there is no assembler designed for metatranscriptomic data. In this article, we introduce an assembler called IDBA-MT, which is designed for assembling reads from metatranscriptomic data. IDBA-MT produces much fewer chimeric contigs (reduce by 50% or more) when compared with existing assemblers such as Oases, IDBA-UD, and Trinity.

Subject(s)

Algorithms , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Metagenomics , Sequence Analysis, DNA , Software , Animals , Bacteria/genetics , Computer Simulation , Gastrointestinal Tract , Mice , RNA, Messenger/genetics , Real-Time Polymerase Chain Reaction , Repetitive Sequences, Nucleic Acid , Reverse Transcriptase Polymerase Chain Reaction

11.

IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels.

Peng, Yu; Leung, Henry C M; Yiu, Siu-Ming; Lv, Ming-Ju; Zhu, Xin-Guang; Chin, Francis Y L.

Bioinformatics ; 29(13): i326-34, 2013 Jul 01.

Article in English | MEDLINE | ID: mdl-23813001

ABSTRACT

MOTIVATION: RNA sequencing based on next-generation sequencing technology is effective for analyzing transcriptomes. Like de novo genome assembly, de novo transcriptome assembly does not rely on any reference genome or additional annotation information, but is more difficult. In particular, isoforms can have very uneven expression levels (e.g. 1:100), which make it very difficult to identify low-expressed isoforms. One challenge is to remove erroneous vertices/edges with high multiplicity (produced by high-expressed isoforms) in the de Bruijn graph without removing correct ones with not-so-high multiplicity from low-expressed isoforms. Failing to do so will result in the loss of low-expressed isoforms or having complicated subgraphs with transcripts of different genes mixed together due to erroneous vertices/edges. Contributions: Unlike existing tools, which remove erroneous vertices/edges with multiplicities lower than a global threshold, we use a probabilistic progressive approach to iteratively remove them with local thresholds. This enables us to decompose the graph into disconnected components, each containing a few genes, if not a single gene, while retaining many correct vertices/edges of low-expressed isoforms. Combined with existing techniques, IDBA-Tran is able to assemble both high-expressed and low-expressed transcripts and outperform existing assemblers in terms of sensitivity and specificity for both simulated and real data. AVAILABILITY: http://www.cs.hku.hk/~alse/idba_tran. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Algorithms , Computer Graphics , Genome , Oryza/genetics , Oryza/metabolism , Sensitivity and Specificity , Software

12.

MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample.

Wang, Yi; Leung, Henry C M; Yiu, S M; Chin, Francis Y L.

Bioinformatics ; 28(18): i356-i362, 2012 Sep 15.

Article in English | MEDLINE | ID: mdl-22962452

ABSTRACT

MOTIVATION: Metagenomic binning remains an important topic in metagenomic analysis. Existing unsupervised binning methods for next-generation sequencing (NGS) reads do not perform well on (i) samples with low-abundance species or (ii) samples (even with high abundance) when there are many extremely low-abundance species. These two problems are common for real metagenomic datasets. Binning methods that can solve these problems are desirable. RESULTS: We proposed a two-round binning method (MetaCluster 5.0) that aims at identifying both low-abundance and high-abundance species in the presence of a large amount of noise due to many extremely low-abundance species. In summary, MetaCluster 5.0 uses a filtering strategy to remove noise from the extremely low-abundance species. It separate reads of high-abundance species from those of low-abundance species in two different rounds. To overcome the issue of low coverage for low-abundance species, multiple w values are used to group reads with overlapping w-mers, whereas reads from high-abundance species are grouped with high confidence based on a large w and then binning expands to low-abundance species using a relaxed (shorter) w. Compared to the recent tools, TOSS and MetaCluster 4.0, MetaCluster 5.0 can find more species (especially those with low abundance of say 6× to 10×) and can achieve better sensitivity and specificity using less memory and running time. AVAILABILITY: http://i.cs.hku.hk/~alse/MetaCluster/ CONTACT: chin@cs.hku.hk.

Subject(s)

Metagenomics/methods , Software , Algorithms , Sensitivity and Specificity , Sequence Analysis, DNA/methods

13.

IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.

Peng, Yu; Leung, Henry C M; Yiu, S M; Chin, Francis Y L.

Bioinformatics ; 28(11): 1420-8, 2012 Jun 01.

Article in English | MEDLINE | ID: mdl-22495754

ABSTRACT

MOTIVATION: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depth of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing depths are even. These assemblers fail to construct correct long contigs. RESULTS: We introduce the IDBA-UD algorithm that is based on the de Bruijn graph approach for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques have been employed to tackle the problems. Instead of using a simple threshold, we use multiple depthrelative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step is conducted to correct reads of high-depth regions that can be aligned to highconfident contigs. Comparison of the performances of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) for different datasets, shows that IDBA-UD can reconstruct longer contigs with higher accuracy. AVAILABILITY: The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/~alse/idba_ud

Subject(s)

Algorithms , Metagenomics/methods , Sequence Analysis, DNA/methods , Single-Cell Analysis/methods , Bacteria/genetics , Genome , High-Throughput Nucleotide Sequencing

14.

MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species.

Wang, Yi; Leung, Henry C M; Yiu, S M; Chin, Francis Y L.

J Comput Biol ; 19(2): 241-9, 2012 Feb.

Article in English | MEDLINE | ID: mdl-22300323

ABSTRACT

Next-generation sequencing (NGS) technologies allow the sequencing of microbial communities directly from the environment without prior culturing. The output of environmental DNA sequencing consists of many reads from genomes of different unknown species, making the clustering together reads from the same (or similar) species (also known as binning) a crucial step. The difficulties of the binning problem are due to the following four factors: (1) the lack of reference genomes; (2) uneven abundance ratio of species; (3) short NGS reads; and (4) a large number of species (can be more than a hundred). None of the existing binning tools can handle all four factors. No tools, including both AbundanceBin and MetaCluster 3.0, have demonstrated reasonable performance on a sample with more than 20 species. In this article, we introduce MetaCluster 4.0, an unsupervised binning algorithm that can accurately (with about 80% precision and sensitivity in all cases and at least 90% in some cases) and efficiently bin short reads with varying abundance ratios and is able to handle datasets with 100 species. The novelty of MetaCluster 4.0 stems from solving a few important problems: how to divide reads into groups by a probabilistic approach, how to estimate the 4-mer distribution of each group, how to estimate the number of species, and how to modify MetaCluster 3.0 to handle a large number of species. We show that Meta Cluster 4.0 is effective for both simulated and real datasets. Supplementary Material is available at www.liebertonline.com/cmb.

Subject(s)

Algorithms , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA/methods , Software , Bacteria/genetics , Base Sequence , Cluster Analysis , Data Interpretation, Statistical , Genome, Bacterial , Models, Statistical

15.

Meta-IDBA: a de Novo assembler for metagenomic data.

Peng, Yu; Leung, Henry C M; Yiu, S M; Chin, Francis Y L.

Bioinformatics ; 27(13): i94-101, 2011 Jul 01.

Article in English | MEDLINE | ID: mdl-21685107

ABSTRACT

MOTIVATION: Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling of a set of mixed reads from different species to form contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there are no assemblers for assembling reads in metagenomic data without reference genome sequences. Moreover, the performances of these assemblers on metagenomic data are far from satisfactory, because of the existence of common regions in the genomes of subspecies and species, which make the assembly problem much more complicated. RESULTS: We introduce the Meta-IDBA algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species. There are two core steps in Meta-IDBA. It first tries to partition the de Bruijn graph into isolated components of different species based on an important observation. Then, for each component, it captures the slight variants of the genomes of subspecies from the same species by multiple alignments and represents the genome of one species, using a consensus sequence. Comparison of the performances of Meta-IDBA and existing assemblers, such as Velvet and Abyss for different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy. AVAILABILITY: Meta-IDBA toolkit is available at our website http://www.cs.hku.hk/~alse/metaidba. CONTACT: chin@cs.hku.hk.

Subject(s)

Algorithms , Metagenomics/methods , Software , Escherichia coli/classification , Escherichia coli/genetics , Escherichia coli/isolation & purification , Genome, Bacterial , Sequence Analysis, DNA/methods

16.

A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio.

Leung, Henry C M; Yiu, S M; Yang, Bin; Peng, Yu; Wang, Yi; Liu, Zhihua; Chen, Jingchi; Qin, Junjie; Li, Ruiqiang; Chin, Francis Y L.

Bioinformatics ; 27(11): 1489-95, 2011 Jun 01.

Article in English | MEDLINE | ID: mdl-21493653

ABSTRACT

MOTIVATION: With the rapid development of next-generation sequencing techniques, metagenomics, also known as environmental genomics, has emerged as an exciting research area that enables us to analyze the microbial environment in which we live. An important step for metagenomic data analysis is the identification and taxonomic characterization of DNA fragments (reads or contigs) resulting from sequencing a sample of mixed species. This step is referred to as 'binning'. Binning algorithms that are based on sequence similarity and sequence composition markers rely heavily on the reference genomes of known microorganisms or phylogenetic markers. Due to the limited availability of reference genomes and the bias and low availability of markers, these algorithms may not be applicable in all cases. Unsupervised binning algorithms which can handle fragments from unknown species provide an alternative approach. However, existing unsupervised binning algorithms only work on datasets either with balanced species abundance ratios or rather different abundance ratios, but not both. RESULTS: In this article, we present MetaCluster 3.0, an integrated binning method based on the unsupervised top--down separation and bottom--up merging strategy, which can bin metagenomic fragments of species with very balanced abundance ratios (say 1:1) to very different abundance ratios (e.g. 1:24) with consistently higher accuracy than existing methods. AVAILABILITY: MetaCluster 3.0 can be downloaded at http://i.cs.hku.hk/~alse/MetaCluster/.

Subject(s)

Algorithms , Metagenomics/methods , Sequence Analysis, DNA , Cluster Analysis

17.

Finding optimal threshold for correction error reads in DNA assembling.

Chin, Francis Y L; Leung, Henry C M; Li, Wei-Lin; Yiu, Siu-Ming.

BMC Bioinformatics ; 10 Suppl 1: S15, 2009 Jan 30.

Article in English | MEDLINE | ID: mdl-19208114

ABSTRACT

BACKGROUND: DNA assembling is the problem of determining the nucleotide sequence of a genome from its substrings, called reads. In the experiments, there may be some errors on the reads which affect the performance of the DNA assembly algorithms. Existing algorithms, e.g. ECINDEL and SRCorr, correct the error reads by considering the number of times each length-k substring of the reads appear in the input. They treat those length-k substrings appear at least M times as correct substring and correct the error reads based on these substrings. However, since the threshold M is chosen without any solid theoretical analysis, these algorithms cannot guarantee their performances on error correction. RESULTS: In this paper, we propose a method to calculate the probabilities of false positive and false negative when determining whether a length-k substring is correct using threshold M. Based on this optimal threshold M that minimizes the total errors (false positives and false negatives). Experimental results on both real data and simulated data showed that our calculation is correct and we can reduce the total error substrings by 77.6% and 65.1% when compared to ECINDEL and SRCorr respectively. CONCLUSION: We introduced a method to calculate the probability of false positives and false negatives of the length-k substring using different thresholds. Based on this calculation, we found the optimal threshold to minimize the total error of false positive plus false negative.

Subject(s)

Computational Biology/methods , DNA/chemistry , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Genome , Sequence Alignment

18.

Predicting protein complexes from PPI data: a core-attachment approach.

Leung, Henry C M; Xiang, Qian; Yiu, S M; Chin, Francis Y L.

J Comput Biol ; 16(2): 133-44, 2009 Feb.

Article in English | MEDLINE | ID: mdl-19193141

ABSTRACT

UNLABELLED: Protein complexes play a critical role in many biological processes. Identifying the component proteins in a protein complex is an important step in understanding the complex as well as the related biological activities. This paper addresses the problem of predicting protein complexes from the protein-protein interaction (PPI) network of one species using a computational approach. Most of the previous methods rely on the assumption that proteins within the same complex would have relatively more interactions. This translates into dense subgraphs in the PPI network. However, the existing software tools have limited success. Recently, Gavin et al. (2006) provided a detailed study on the organization of protein complexes and suggested that a complex consists of two parts: a core and an attachment. Based on this core-attachment concept, we developed a novel approach to identify complexes from the PPI network by identifying their cores and attachments separately. We evaluated the effectiveness of our proposed approach using three different datasets and compared the quality of our predicted complexes with three existing tools. The evaluation results show that we can predict many more complexes and with higher accuracy than these tools with an improvement of over 30%. To verify the cores we identified in each complex, we compared our cores with the mediators produced by Andreopoulos et al. (2007), which were claimed to be the cores, based on the benchmark result produced by Gavin et al. (2006). We found that the cores we produced are of much higher quality ranging from 10- to 30-fold more correctly predicted cores and with better accuracy. AVAILABILITY: (http://alse.cs.hku.hk/complexes/).

Subject(s)

Models, Theoretical , Multiprotein Complexes , Protein Interaction Mapping , Software , Markov Chains , Mathematics , Multiprotein Complexes/chemistry , Multiprotein Complexes/metabolism , Proteins/chemistry , Proteins/metabolism

19.

Discovering motifs with transcription factor domain knowledge.

Leung, Henry C M; Chin, Francis Y L; Chan, Bethany M Y.

Pac Symp Biocomput ; : 472-83, 2007.

Article in English | MEDLINE | ID: mdl-17990511

ABSTRACT

We introduce a new motif-discovery algorithm, DIMDom, which exploits two additional kinds of information not commonly exploited: (a) the characteristic pattern of binding site classes, where class is determined based on biological information about transcription factor domains and (b) posterior probabilities of these classes. We compared the performance of DIMDom with MEME on all the transcription factors of Drosophila with at least one known binding site in the TRANSFAC database and found that DOMDom outperformed MEME with 2.5 times the number of successes and 1.5 times in the accuracy in finding binding sties and motifs.

Subject(s)

Algorithms , Transcription Factors/chemistry , Transcription Factors/metabolism , Animals , Bayes Theorem , Binding Sites/genetics , Computational Biology , DNA/genetics , DNA/metabolism , Drosophila/genetics , Drosophila/metabolism , Models, Biological , Promoter Regions, Genetic , Protein Structure, Tertiary

20.

Finding linear motif pairs from protein interaction networks: a probabilistic approach.

Leung, Henry C M; Siu, M H; Yiu, S M; Chin, Francis Y L; Sung, Ken W K.

Comput Syst Bioinformatics Conf ; 6: 111-9, 2007.

Article in English | MEDLINE | ID: mdl-17951817

ABSTRACT

Finding motif pairs from a set of protein sequences based on the protein-protein interaction data is a challenging computational problem. Existing effective approaches usually rely on additional information such as some prior knowledge on protein groupings based on protein domains. In reality, this kind of knowledge is not always available. Novel approaches without using this knowledge is much desirable. Recently, Tan et al. proposed such an approach. However, there are two problems with their approach. The scoring function (using chi(2) testing) used in their approach is not adequate. Random motif pairs may have higher scores than the correct ones. Their approach is also not scalable. It may take days to process a set of 5000 protein sequences with about 20,000 interactions. In this paper, our contribution is two-fold. We first introduce a new scoring method, which is shown to be more accurate than the chi-score used in Tan et al. Then, we present two efficient algorithms, one exact algorithm and a heuristic version of it, to solve the problem of finding motif pairs. Based on experiments on real datasets, we show that our algorithms are efficient and can accurately locate the motif pairs. We have also evaluated the sensitivity and efficiency of our heuristics algorithm using simulated datasets, the results show that the algorithm is very efficient with reasonably high sensitivity.

Subject(s)

Models, Biological , Models, Chemical , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Sequence Analysis, Protein/methods , Signal Transduction/physiology , Amino Acid Motifs , Amino Acid Sequence , Binding Sites , Computer Simulation , Data Interpretation, Statistical , Models, Statistical , Molecular Sequence Data , Protein Binding

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL