Search | VHL Regional Portal

Clustering-Based Compression for Population DNA Sequences.

Cheng, Kin-On; Law, Ngai-Fong; Siu, Wan-Chi.

IEEE/ACM Trans Comput Biol Bioinform ; 16(1): 208-221, 2019.

Article in English | MEDLINE | ID: mdl-29028207

ABSTRACT

Due to advances in DNA sequencing techniques, the number of sequenced individual genomes has grown exponentially. Effective compression of such sequences is therefore highly desirable. In this work, we present a novel compression algorithm called Reference-based Compression algorithm using the concept of Clustering (RCC). The rationale behind RCC is the observation that substructures exist within population sequences. To exploit these substructures, k-means clustering is employed to partition the sequences into clusters for better compression. A reference sequence is then constructed for each cluster so that the sequences in that cluster can be compressed by referring to it. The reference sequence of each cluster is in turn compressed with reference to a sequence derived from all the reference sequences. Experiments show that RCC can further reduce the compressed size by up to 91.0 percent when compared with state-of-the-art compression approaches. There is a trade-off between compressed size and processing time. The current implementation in Matlab has a time complexity thousands of times higher than that of the existing algorithms implemented in C/C++. Further investigation is required to improve the processing time in the future.
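
The pipeline described in this abstract (cluster the sequences, build one reference per cluster, store each sequence as its differences from that reference) can be illustrated with a minimal sketch. It is not the authors' RCC implementation: it assumes equal-length, already aligned sequences, replaces k-means with a k-modes-style variant (Hamming distance and per-column majority vote), and uses a plain substitution list instead of a real entropy coder. All function names and the toy data are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def consensus(seqs):
    """Most common base in each column (ties resolved by Counter's ordering)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def assign(seqs, refs):
    """Index of the closest reference for every sequence."""
    return [min(range(len(refs)), key=lambda c: hamming(s, refs[c])) for s in seqs]


def cluster(seqs, k, max_iters=20):
    """k-modes-style clustering (an assumption standing in for k-means)."""
    # Farthest-point initialisation: start from the first sequence, then
    # repeatedly add the sequence farthest from the references chosen so far.
    refs = [seqs[0]]
    while len(refs) < k:
        refs.append(max(seqs, key=lambda s: min(hamming(s, r) for r in refs)))
    for _ in range(max_iters):
        labels = assign(seqs, refs)
        new_refs = []
        for c in range(k):
            members = [s for s, lab in zip(seqs, labels) if lab == c]
            new_refs.append(consensus(members) if members else refs[c])
        if new_refs == refs:  # references stopped changing
            break
        refs = new_refs
    return refs, assign(seqs, refs)


def encode(seq, ref):
    """Store a sequence as (position, base) substitutions against a reference."""
    return [(i, b) for i, (a, b) in enumerate(zip(ref, seq)) if a != b]


def decode(diffs, ref):
    """Rebuild a sequence from its reference and substitution list."""
    out = list(ref)
    for i, b in diffs:
        out[i] = b
    return "".join(out)


# Toy population with two obvious subgroups (synthetic data).
population = ["ACGTACGT", "ACGTACGA", "ACGTACGT",
              "TTGTCCGT", "TTGTCCGA", "TTGTCCGT"]
refs, labels = cluster(population, k=2)
encoded = [encode(s, refs[lab]) for s, lab in zip(population, labels)]

# Second level: each cluster reference is itself stored as differences
# from a sequence derived from all the references.
top_ref = consensus(refs)
encoded_refs = [encode(r, top_ref) for r in refs]
```

In this toy case, per-cluster references need far fewer substitutions than a single population-wide consensus would, which is the intuition behind exploiting substructure before reference-based coding.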

Subject(s)

DNA/genetics , Data Compression/methods , Databases, Genetic , Genomics/methods , Cluster Analysis , Humans , Sequence Analysis, DNA

Compression of Multiple DNA Sequences Using Intra-Sequence and Inter-Sequence Similarities.

Cheng, Kin-On; Wu, Paula; Law, Ngai-Fong; Siu, Wan-Chi.

IEEE/ACM Trans Comput Biol Bioinform ; 12(6): 1322-32, 2015.

Article in English | MEDLINE | ID: mdl-26671804

ABSTRACT

Traditionally, intra-sequence similarity is exploited for compressing a single DNA sequence. Recently, remarkable compression performance for individual DNA sequences from the same population has been achieved by encoding the differences from a nearly identical reference sequence. Nevertheless, there is a lack of general algorithms that also allow less similar reference sequences. In this work, we extend intra-sequence similarity to inter-sequence similarity, in that approximate matches of subsequences are found between the DNA sequence and a set of reference sequences. Hence, a set of nearly identical DNA sequences from the same population, or a set of partially similar DNA sequences such as chromosome sequences and DNA sequences of related species, can be compressed together. For practical compressors, the compressed size is usually influenced by the order in which the sequences are compressed. Fast search algorithms for the optimal compression order are thus developed for multiple-sequence compression. Experimental results on artificial and real datasets demonstrate that our proposed multiple-sequence compression methods with fast compression order search achieve good compression performance under different levels of similarity among the DNA sequences.

Subject(s)

Algorithms , Data Compression/methods , Databases, Genetic , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Base Sequence , DNA/genetics , Molecular Sequence Data

A joint framework for missing values estimation and biclusters detection in gene expression data.

Cheng, Kin-On; Law, Ngai-Fong; Chan, Yui-Lam; Siu, Wan-Chi.

Int J Bioinform Res Appl ; 10(6): 574-86, 2014.

Article in English | MEDLINE | ID: mdl-25335564

ABSTRACT

DNA microarray experiments unavoidably generate gene expression data with missing values. This complicates subsequent analysis such as bicluster detection, which aims to find a set of co-expressed genes under some experimental conditions. Missing values therefore need to be estimated before bicluster detection. Existing missing value estimation algorithms rely on finding coherence among expression values throughout the data. Given that both missing value estimation and bicluster detection aim at exploiting coherence inside the expression data, we propose to integrate these two steps into a joint framework. The benefits are twofold: the missing value estimation can improve bicluster analysis, and the coherence in detected biclusters can be exploited for accurate missing value estimation. Experimental results show that the bicluster information can significantly improve the accuracy of missing value estimation. The joint framework also enables the detection of biologically meaningful biclusters.

Subject(s)

Algorithms , Data Interpretation, Statistical , Gene Expression Profiling/methods , Models, Statistical , Oligonucleotide Array Sequence Analysis/methods , Proteome/metabolism , Computer Simulation , Sample Size

Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization.

Cheng, Kin-On; Law, Ngai-Fong; Siu, Wan-Chi; Liew, Alan Wee-Chung.

BMC Bioinformatics ; 9: 210, 2008 Apr 23.

Article in English | MEDLINE | ID: mdl-18433478

ABSTRACT

BACKGROUND: DNA microarray technology allows the measurement of expression levels of thousands of genes under tens or hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only [1]. Thus, biclustering, which clusters genes and conditions simultaneously, is preferred over traditional clustering techniques for discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of bicluster called the additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis. RESULTS: We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as the pCluster algorithm. By relaxing the homogeneity constraint, the proposed algorithm has polynomial-time complexity in the worst case instead of the exponential-time complexity of the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. A comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm. CONCLUSION: We have proposed a novel biclustering algorithm which works with PC plots for interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables optimal parameter determination in the biclustering algorithm so as to achieve the best result. In the future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.
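
To make the additive model mentioned in this abstract concrete: a submatrix is additive-coherent when any two of its rows differ by a constant offset, which is what the pScore of the pCluster formulation measures on each 2x2 block. The sketch below is a brute-force check of that condition, not the greedy algorithm proposed in the article; the function names, the tolerance `delta`, and the toy matrix are assumptions made for illustration.

```python
from itertools import combinations


def max_pscore(matrix, rows, cols):
    """Largest pScore over all 2x2 blocks of the chosen rows and columns.

    pScore = |(a[i][x] - a[i][y]) - (a[j][x] - a[j][y])|; it is 0 when
    rows i and j differ by a constant on columns x and y.
    """
    worst = 0.0
    for i, j in combinations(rows, 2):
        for x, y in combinations(cols, 2):
            score = abs((matrix[i][x] - matrix[i][y])
                        - (matrix[j][x] - matrix[j][y]))
            worst = max(worst, score)
    return worst


def is_additive_bicluster(matrix, rows, cols, delta=0.0):
    """True if every 2x2 block deviates from additivity by at most delta."""
    return max_pscore(matrix, rows, cols) <= delta


# Synthetic expression matrix: on columns 0-2, row 1 is row 0 plus 2 and
# row 2 is row 0 minus 1; row 3 is flat and breaks the pattern.
expr = [
    [1, 2, 4, 9],
    [3, 4, 6, 0],
    [0, 1, 3, 5],
    [7, 7, 7, 7],
]
coherent = is_additive_bicluster(expr, rows=[0, 1, 2], cols=[0, 1, 2])
with_flat_row = max_pscore(expr, rows=[0, 1, 2, 3], cols=[0, 1, 2])
```

This exhaustive check costs time quadratic in both the number of rows and the number of columns of the candidate submatrix, so it only verifies a given bicluster; searching for biclusters is the hard part the article addresses.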

Subject(s)

Algorithms , Cluster Analysis , Computer Graphics , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , User-Computer Interface , Artificial Intelligence , Programming Languages

A novel fast and reduced redundancy structure for multiscale directional filter banks.

Cheng, Kin-On; Law, Ngai-Fong; Siu, Wan-Chi.

IEEE Trans Image Process ; 16(8): 2058-68, 2007 Aug.

Article in English | MEDLINE | ID: mdl-17688211

ABSTRACT

The multiscale directional filter bank (MDFB) improves the radial frequency resolution of the contourlet transform by introducing an additional decomposition in the high-frequency band. The increase in frequency resolution is particularly useful for texture description because of the quasi-periodic property of textures. However, the MDFB needs an extra set of scale and directional decomposition, which is performed on the full image size. The rise in computational complexity is, thus, prominent. In this paper, we develop an efficient implementation framework for the MDFB. In the new framework, directional decomposition on the first two scales is performed prior to the scale decomposition. This allows sharing of directional decomposition among the two scales and, hence, reduces the computational complexity significantly. Based on this framework, two fast implementations of the MDFB are proposed. The first one can maintain the same flexibility in directional selectivity in the first two scales while the other has the same redundancy ratio as the contourlet transform. Experimental results show that the first and the second schemes can reduce the computational time by 33.3%-34.6% and 37.1%-37.5%, respectively, compared to the original MDFB algorithm. Meanwhile, the texture retrieval performance of the proposed algorithms is more or less the same as the original MDFB approach which outperforms the steerable pyramid and the contourlet transform approaches.

Subject(s)

Algorithms , Artificial Intelligence , Data Compression/methods , Image Enhancement/methods , Image Interpretation, Computer-Assisted/methods , Pattern Recognition, Automated/methods , Computer Systems , Reproducibility of Results , Sensitivity and Specificity

On relationship of Z-curve and Fourier approaches for DNA coding sequence classification.

Law, Ngai-Fong; Cheng, Kin-On; Siu, Wan-Chi.

Bioinformation ; 1(7): 242-6, 2006 Nov 14.

Article in English | MEDLINE | ID: mdl-17597898

ABSTRACT

Z-curve features are among the popular features used in exon/intron classification. We showed that although both the Z-curve and Fourier approaches are based on detecting 3-periodicity in coding regions, there are significant differences in their spectral formulations. From the spectral formulation of the Z-curve, we obtained three modified sequences that characterize different biological properties. Spectral analysis of the modified sequences showed a much more prominent 3-periodicity peak in coding regions than the Fourier approach. For long sequences, prominent peaks at 2π/3 are observed in coding regions, whereas for short sequences, clearly discernible peaks are still visible. Better classification can be obtained using spectral features derived from the modified sequences.
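
The 3-periodicity that both approaches rely on can be illustrated with the conventional Fourier measure: map the sequence to four binary indicator sequences (one per base) and sum their spectral power at angular frequency 2π/3, that is, one cycle every three bases. The sketch below shows only this baseline measure, not the modified Z-curve sequences introduced in the article; the function name and the synthetic test sequence are assumptions.

```python
import cmath


def spectral_power(seq, freq):
    """Total power of the four base-indicator sequences at `freq`.

    `freq` is in cycles per base, so freq = 1/3 corresponds to the
    angular frequency 2*pi/3.
    """
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(cmath.exp(-2j * cmath.pi * freq * n)
                    for n, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total


# Synthetic, perfectly 3-periodic sequence (not real coding DNA).
periodic = "ATG" * 30
peak = spectral_power(periodic, 1 / 3)        # strong 3-periodicity peak
off_peak = spectral_power(periodic, 1 / 5)    # essentially zero
```

On real data the peak at 1/3 is far less clean than in this synthetic case, so in practice it is compared against the average power over other frequencies rather than against zero.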
