Results 1 - 4 of 4
1.
Methods ; 102: 3-11, 2016 Jun 01.
Article in English | MEDLINE | ID: mdl-27012178

ABSTRACT

The study of metagenomics has benefited greatly from low-cost, high-throughput sequencing technologies, yet the tremendous amount of data generated makes analyses such as de novo assembly consume substantial computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note of Li et al. (2015) [1]), the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of giga base-pairs (bp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn graphs (SdBG), implemented on a graphics processing unit (GPU). The software has been well received by the assembly community, and there is interest in adapting the algorithms to integrate popular assembly practices so as to improve assembly quality, as well as in speeding up the software using better CPU-based algorithms (instead of the GPU). In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then present the new modules that upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0, compared with v0.1, shows a significant improvement, namely a 36% increase in assembly size and a 23% increase in N50. More interestingly, MEGAHIT v1.0 is no slower than before (even when running with the extra modules). This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 h, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. With its smaller memory footprint, MEGAHIT v1.0 can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at the website https://hku-bal.github.io/megabox.
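MEGAHIT's actual SdBG construction is a parallel, compressed (succinct) data structure and is not reproduced here; the following is only a minimal, plain (non-succinct) de Bruijn graph sketch in Python to illustrate the underlying idea the abstract refers to: nodes are (k-1)-mers and each k-mer observed in the reads contributes an edge. All names are illustrative.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Toy (non-succinct) de Bruijn graph: nodes are (k-1)-mers,
    edges are k-mers observed in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix (k-1)-mer -> suffix (k-1)-mer
    return graph

# Tiny usage example with k = 3
reads = ["ACGTAC", "CGTACG"]
for node, succs in sorted(de_bruijn_graph(reads, 3).items()):
    print(node, "->", sorted(succs))
```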


Subject(s)
Metagenome , Sequence Analysis/methods , Software , Algorithms , Datasets as Topic , Metagenomics/methods , Soil
2.
Article in English | MEDLINE | ID: mdl-26357315

ABSTRACT

In genome assembly graphs, motifs such as tips, bubbles, and cross links are studied in order to find sequencing errors and to understand the nature of the genome. The superbubble, a complex generalization of the bubble, was recently proposed as an important subgraph class for analyzing assembly graphs. Until now, only a quadratic-time detection algorithm was known. This paper gives an O(m log m)-time algorithm to solve this problem for a graph with m edges.
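As a rough illustration of what such an algorithm decides, the Python sketch below checks only the reachability and matching conditions of the superbubble definition for a single candidate entrance/exit pair (s, t); the acyclicity and minimality conditions, and the paper's O(m log m) machinery, are omitted. Running a check like this over all vertex pairs is what gives the earlier quadratic bound. All names here are illustrative, not the paper's code.

```python
from collections import deque

def _reach(adj, start, stop):
    """Vertices reachable from start without expanding past stop."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        if u == stop:
            continue  # do not walk beyond the exit vertex
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def is_superbubble_candidate(adj, radj, s, t):
    """Reachability + matching conditions for an (s, t) pair;
    acyclicity and minimality checks are omitted for brevity."""
    fwd = _reach(adj, s, t)   # reachable from s, stopping at t
    bwd = _reach(radj, t, s)  # co-reachable to t, stopping at s
    return t in fwd and fwd == bwd

# Simple bubble s -> {a, b} -> t (radj is the reversed graph)
adj  = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
radj = {"t": ["a", "b"], "a": ["s"], "b": ["s"]}
print(is_superbubble_candidate(adj, radj, "s", "t"))  # True
```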


Subject(s)
Algorithms , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Humans
3.
Bioinformatics ; 31(10): 1674-6, 2015 May 15.
Article in English | MEDLINE | ID: mdl-25609793

ABSTRACT

MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset of 252 Gbp in 44.1 and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MEGAHIT assembles the data as a whole, i.e. no pre-processing such as partitioning or normalization is needed. When compared with previous methods on assembling the soil data, MEGAHIT generated a threefold larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, a fourfold improvement.
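For readers unfamiliar with the N50 metric cited above, here is a minimal Python sketch (not part of MEGAHIT) computing it from a list of contig lengths: N50 is the largest length L such that contigs of length at least L cover at least half of the total assembly size.

```python
def n50(contig_lengths):
    """N50: largest length L such that contigs of length >= L
    account for at least half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

print(n50([100, 80, 60, 40, 20]))  # total = 300; 100 + 80 = 180 >= 150 -> 80
```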


Subject(s)
Metagenomics/methods , Algorithms , High-Throughput Nucleotide Sequencing , Software , Soil
4.
Algorithms Mol Biol ; 5: 7, 2010 Jan 04.
Article in English | MEDLINE | ID: mdl-20047663

ABSTRACT

BACKGROUND: Two biomolecular 3-D structures are said to be similar if the RMSD (root mean square deviation) between the two molecules' sequences of 3-D coordinates is less than or equal to some given constant bound. Tools for searching for similar structures in biomolecular 3-D structure databases are becoming increasingly important in the structural biology of the post-genomic era.

RESULTS: We consider an important, fundamental problem: reporting all substructures in a 3-D structure database of chain molecules (such as proteins) which are similar to a given query 3-D structure, with consideration of indels (i.e., insertions and deletions). This problem has long been believed to be very difficult, but its exact computational complexity was not known. In this paper, we first prove that the problem in unbounded dimensions is NP-hard. We then propose a new algorithm that dramatically improves the average-case time complexity in 3-D when the number of indels k is bounded by a constant. Our algorithm solves the above problem for a query of size m and a database of size N in average-case O(N) time, whereas the time complexity of the previously best algorithm was O(Nm^(k+1)).

CONCLUSIONS: Our results show that although the problem of searching for similar structures in a database based on the RMSD measure with indels is NP-hard in the case of unbounded dimensions, it can be solved in 3-D by a simple average-case linear-time algorithm when the number of indels is bounded by a constant.
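As a reference point for the similarity measure defined in the BACKGROUND, here is a minimal Python sketch of the bare RMSD between two equal-length coordinate sequences; it performs no superposition and no indel handling, both of which the paper's actual problem involves.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length
    sequences of 3-D coordinates (no superposition applied)."""
    assert len(coords_a) == len(coords_b) and coords_a
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Two toy 3-point "structures", the second shifted by one unit in y
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
print(rmsd(a, b))  # 1.0: every point is displaced by exactly 1 unit
```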
