Results 1 - 8 of 8
1.
Bioinformatics; 39(9), 2023 Sep 02.
Article in English | MEDLINE | ID: mdl-37669160

ABSTRACT

MOTIVATION: DNA-based data storage is one of the most attractive research areas for future archival storage. However, high writing and reading costs remain obstacles to practical use. Many efforts have been made to address this problem, but existing schemes are not fully suited to DNA-based data storage, and further cost reduction is needed. RESULTS: We propose complete encoding and decoding procedures for DNA storage. The encoding procedure is built around a single, carefully designed low-density parity-check code used as an inter-oligo code, which efficiently corrects errors and dropouts. We apply new clustering and alignment methods that operate on variable-length reads to aid decoding performance. During the sequence-analysis-aided decoding procedure, we use edit distance and quality scores to discard abnormal reads and exploit high-quality soft information. We store 548.83 KB of an image file in DNA oligos and, compared with two previous works, achieve a 7.46% reduction in writing cost and significant reductions in reading cost of 26.57% and 19.41%, respectively. AVAILABILITY AND IMPLEMENTATION: Data and code for all the algorithms proposed in this study are available at: https://github.com/sjpark0905/DNA-LDPC-codes.
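
As an illustration of how edit distance and quality scores can be used to screen reads before decoding, here is a minimal Python sketch. It is not the authors' implementation; the representative oligo, the distance threshold, and the filtering rule are assumptions made for the example.

```python
# Illustrative sketch (not the authors' pipeline): discard "abnormal" reads
# whose edit distance to a cluster representative is too large, keeping the
# per-base quality scores of the surviving reads as soft information.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two variable-length reads."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def filter_reads(reads, qualities, representative, max_dist=3):
    """Return (read, quality) pairs close enough to the representative oligo."""
    kept = []
    for read, qual in zip(reads, qualities):
        if edit_distance(read, representative) <= max_dist:
            kept.append((read, qual))
    return kept

# Toy usage with a hypothetical 12-nt representative oligo.
rep = "ACGTACGTACGT"
reads = ["ACGTACGTACGT", "ACGTTCGTACG", "TTTTTTTTTTTT"]
quals = [[30] * 12, [28] * 11, [10] * 12]
print(filter_reads(reads, quals, rep, max_dist=2))  # third read is discarded
```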


Subject(s)
Algorithms; Reading; Female; Pregnancy; Humans; Cluster Analysis; DNA
2.
BMC Bioinformatics; 22(1): 606, 2021 Dec 20.
Article in English | MEDLINE | ID: mdl-34930110

ABSTRACT

BACKGROUND: Advances in sequencing technology have drastically reduced sequencing costs, and as a result the amount of sequencing data is growing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially of the quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to further boost the compression rate. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is running time: compressing a 1 GB file can take thousands of seconds. Random access is another desirable feature of a compression algorithm. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random-access functionality. RESULTS: This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves low running times through concurrent programming. Experimental results show that FCLQC is significantly faster than the baseline compressors at both compression and decompression, at the expense of compression ratio. Compared with LCQS (the baseline quality score compression algorithm), FCLQC achieves at least a 31x improvement in compression speed in all settings, while the degradation in compression ratio is at most 13.58% (8.26% on average). Compared with general-purpose compressors (such as 7-zip), FCLQC compresses 3x faster while achieving compression ratios better by at least 2.08% (4.69% on average). Random-access decompression is also faster than the other methods. The concurrency of FCLQC is implemented in Rust, and the performance gain increases near-linearly with the number of threads. CONCLUSION: Its superior compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is free for non-commercial use.
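
The following Python sketch illustrates the general idea of block-wise concurrent lossless compression with a random-access index. It is not FCLQC itself (which is implemented in Rust); the block size, codec (zlib), and index layout are assumptions made purely for illustration.

```python
# Minimal sketch of block-wise, concurrent, lossless quality-score compression
# with random access. NOT the FCLQC implementation; parameters are assumed.
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # quality-score lines per block (assumed parameter)

def compress_blocks(quality_lines, workers=8):
    """Compress blocks independently; return the byte stream and an index."""
    blocks = ['\n'.join(quality_lines[i:i + BLOCK_SIZE]).encode()
              for i in range(0, len(quality_lines), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(zlib.compress, blocks))
    # Index maps block number -> (offset, length) inside the output stream.
    index, offset = [], 0
    for blob in compressed:
        index.append((offset, len(blob)))
        offset += len(blob)
    return b''.join(compressed), index

def random_access(stream, index, block_id):
    """Decompress a single block without touching the others."""
    offset, length = index[block_id]
    return zlib.decompress(stream[offset:offset + length]).decode().split('\n')

# Toy usage: 10,000 synthetic Phred+33 quality lines.
lines = ['I' * 100 for _ in range(10_000)]
stream, index = compress_blocks(lines)
assert random_access(stream, index, 2)[0] == 'I' * 100
```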

3.
Bioinformatics; 37(19): 3136-3143, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33904574

ABSTRACT

MOTIVATION: In DNA storage systems, there is a tradeoff between writing and reading costs. Increasing the code rate of the error-correcting code saves writing cost, but it requires more sequence reads for data retrieval. The sequencing and decoding processes can potentially be improved so that the reading cost induced by this tradeoff is reduced without increasing the writing cost. In previous work, clustering, alignment, and decoding were treated as separate stages, but we believe that jointly using information from all of these processes can improve decoding performance. Actual DNA synthesis and sequencing experiments should be performed, because simulations cannot be relied on to cover all error patterns that arise in practice. RESULTS: For DNA storage systems using a fountain code and Reed-Solomon (RS) codes, we introduce several techniques to improve decoding performance. We design the decoding process around the cooperation of key components: Hamming-distance-based clustering, discarding of abnormal sequence reads, RS error detection and correction, and quality-score-based ordering of sequences. We synthesized 513.6 KB of data into DNA oligo pools and successfully sequenced them with an Illumina MiSeq instrument. Compared with Erlich's work, the proposed decoding method additionally incorporates sequence reads with minor errors that would previously have been discarded, and thus uses 10.6-11.9% more sequence reads from the same sequencing environment, resulting in a 6.5-8.9% reduction in reading cost. Channel characteristics, including sequence coverage and read-length distributions, are provided as well. AVAILABILITY AND IMPLEMENTATION: The raw data files and the source code of our experiments are available at: https://github.com/jhjeong0702/dna-storage.
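
A minimal sketch of Hamming-distance-based clustering of equal-length reads is shown below. It is not the authors' pipeline; the greedy representative selection and the distance threshold are assumptions chosen for the example.

```python
# Hedged sketch of Hamming-distance-based clustering of equal-length reads.
# Not the authors' pipeline; threshold and representative choice are assumed.

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, threshold=4):
    """Greedy clustering: each read joins the first cluster whose
    representative is within `threshold`, otherwise starts a new cluster."""
    representatives, clusters = [], []
    for read in reads:
        for idx, rep in enumerate(representatives):
            if len(read) == len(rep) and hamming(read, rep) <= threshold:
                clusters[idx].append(read)
                break
        else:
            representatives.append(read)
            clusters.append([read])
    return representatives, clusters

reads = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG"]
reps, clusters = cluster_reads(reads, threshold=2)
print(len(clusters))  # 2 clusters expected for this toy input
```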

4.
J Bioinform Comput Biol; 18(6): 2050031, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32938284

ABSTRACT

The amount of sequencing data is growing rapidly due to the revolution in sequencing technologies. Quality scores, which indicate the reliability of each called nucleotide, account for a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy. Hence, a natural way to further reduce the size of sequencing data is to apply lossy compression to the quality scores. Lossy compression results in a loss of precision; however, it has been shown that, at certain rates, lossy compression can achieve variant-calling performance similar to that obtained with the losslessly compressed data (i.e., the original data). We propose Coding with Random Orthogonal Matrices for quality scores (CROMqs), the first lossy compressor for quality scores with the "infinitesimal successive refinability" property. With this property, the encoder compresses the data only once, at a high rate, while the decoder can decompress it iteratively, reconstructing the quality scores at each step with progressively lower distortion. This is particularly useful in sequencing-data compression, since the encoder generally does not know the most appropriate compression rate, e.g., the rate that does not degrade variant-calling accuracy. CROMqs thus avoids the need to compress the data at multiple rates, saving time. In addition, we show that CROMqs obtains rate-distortion performance comparable to state-of-the-art lossy compressors, and that it achieves variant-calling performance comparable to that of the losslessly compressed data while reducing its size by more than 50%.
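
The sketch below illustrates the successive-refinement idea behind coding with a random orthogonal matrix: the encoder computes the transform coefficients once, and the decoder's reconstruction improves monotonically as it consumes more coefficients. It omits quantization and entropy coding, so it is only a toy illustration and not the CROMqs codec itself.

```python
# Toy sketch of successive refinement with a random orthogonal matrix.
# Distortion shrinks as the decoder consumes more coefficients; quantization
# and entropy coding are omitted, so this is not the CROMqs codec.
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.integers(2, 42, size=n).astype(float)   # synthetic Phred-like scores

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
coeffs = Q.T @ x                                 # encoder: computed once

for k in (32, 64, 128, 256):                     # decoder: refine step by step
    x_hat = Q[:, :k] @ coeffs[:k]                # use only first k coefficients
    mse = np.mean((x - x_hat) ** 2)
    print(f"k={k:3d}  MSE={mse:8.2f}")           # MSE decreases as k grows
```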


Subject(s)
Algorithms; Data Compression/methods; High-Throughput Nucleotide Sequencing/methods; Chromosomes, Human, Pair 20/genetics; Computational Biology; Computer Simulation; Data Compression/standards; Data Compression/statistics & numerical data; Databases, Genetic/statistics & numerical data; Fourier Analysis; High-Throughput Nucleotide Sequencing/standards; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Software
5.
Entropy (Basel); 21(2), 2019 Feb 08.
Article in English | MEDLINE | ID: mdl-33266874

ABSTRACT

We establish a universal property of logarithmic loss in the successive refinement problem. We show that if the first decoder operates under logarithmic loss, then any discrete memoryless source is successively refinable under an arbitrary distortion criterion for the second decoder. Based on this result, we propose a low-complexity lossy compression algorithm for any discrete memoryless source.
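
For reference, logarithmic loss is the distortion measure in which the reconstruction is a probability distribution over the source alphabet:

```latex
% Standard definition of the logarithmic-loss distortion: the reconstruction
% \hat{x} is a probability distribution on the source alphabet, and the loss
% is the negative log of the probability it assigns to the realized symbol x.
d(x, \hat{x}) = \log \frac{1}{\hat{x}(x)}
```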

6.
Entropy (Basel); 21(6), 2019 Jun 10.
Article in English | MEDLINE | ID: mdl-33267294

ABSTRACT

We establish the universality of logarithmic loss over a finite alphabet as a distortion criterion in fixed-length lossy compression. For any fixed-length lossy-compression problem under an arbitrary distortion criterion, we show that there is an equivalent lossy-compression problem under logarithmic loss. The equivalence is strong: finding good schemes for the corresponding lossy-compression problem under logarithmic loss is essentially equivalent to finding good schemes for the original problem. This equivalence relation also provides an algebraic structure on the reconstruction alphabet, which allows us to use known techniques from the clustering literature. Furthermore, our result naturally suggests a new algorithm for the categorical data-clustering problem.
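
As a rough illustration of clustering categorical data under logarithmic loss, the Lloyd-style sketch below uses per-feature probability distributions as cluster centers; the log-loss-optimal center of a cluster is the (smoothed) empirical distribution of its members. This is only a sketch suggested by the log-loss/clustering connection, not the paper's algorithm.

```python
# Illustrative Lloyd-style clustering of categorical data under logarithmic
# loss (a sketch, not the paper's algorithm). Each cluster center is one
# probability distribution per feature; assignment minimizes total log loss.
import numpy as np

def log_loss_clustering(data, k, alphabet_size, iters=20, seed=0):
    n, m = data.shape                       # n points, m categorical features
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)
    centers = np.full((k, m, alphabet_size), 1.0 / alphabet_size)
    for _ in range(iters):
        # Update step: center = smoothed empirical distribution per feature.
        for c in range(k):
            pts = data[labels == c]
            if len(pts):
                for j in range(m):
                    counts = np.bincount(pts[:, j], minlength=alphabet_size) + 1e-3
                    centers[c, j] = counts / counts.sum()
        # Assignment step: minimize sum_j -log q_j(x_j) over cluster centers.
        loss = np.zeros((n, k))
        for c in range(k):
            for j in range(m):
                loss[:, c] -= np.log(centers[c, j, data[:, j]])
        labels = loss.argmin(axis=1)
    return labels, centers

# Toy usage: 200 points, 5 features, alphabet {0,1,2}, two latent clusters.
rng = np.random.default_rng(1)
a = rng.choice(3, size=(100, 5), p=[0.8, 0.1, 0.1])
b = rng.choice(3, size=(100, 5), p=[0.1, 0.1, 0.8])
labels, centers = log_loss_clustering(np.vstack([a, b]), k=2, alphabet_size=3)
```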

7.
Entropy (Basel); 20(9), 2018 Sep 10.
Article in English | MEDLINE | ID: mdl-33265777

ABSTRACT

Let $X^n$ be a memoryless uniform Bernoulli source and $Y^n$ be its output through a binary symmetric channel. Courtade and Kumar conjectured that the Boolean function $f : \{0,1\}^n \to \{0,1\}$ that maximizes the mutual information $I(f(X^n); Y^n)$ is a dictator function, i.e., $f(x^n) = x_i$ for some $i$. We propose a clustering problem that is equivalent to the above problem, and we emphasize an information-geometry aspect of this equivalent formulation. Moreover, we define a normalized geometric mean of measures and derive interesting properties of it. We also show that the conjecture holds when the arithmetic and geometric means coincide on a specific set of measures.
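
As a small numerical illustration (not from the paper), the brute-force computation below evaluates $I(f(X^n); Y^n)$ for $n = 3$ over a binary symmetric channel with an assumed crossover probability $p$, comparing a dictator function with majority.

```python
# Brute-force evaluation of I(f(X^n); Y^n) for small n over a BSC(p),
# comparing a dictator function with majority. Illustrative only.
import itertools
import math

def mutual_information(f, n, p):
    """I(f(X^n); Y^n) for X^n uniform on {0,1}^n and Y^n its BSC(p) output."""
    joint = {}                                   # P(f(X)=b, Y=y)
    for x in itertools.product((0, 1), repeat=n):
        b, px = f(x), 2.0 ** (-n)
        for y in itertools.product((0, 1), repeat=n):
            flips = sum(xi != yi for xi, yi in zip(x, y))
            pyx = (p ** flips) * ((1 - p) ** (n - flips))
            joint[(b, y)] = joint.get((b, y), 0.0) + px * pyx
    pb, py = {}, {}
    for (b, y), pr in joint.items():
        pb[b] = pb.get(b, 0.0) + pr
        py[y] = py.get(y, 0.0) + pr
    return sum(pr * math.log2(pr / (pb[b] * py[y]))
               for (b, y), pr in joint.items() if pr > 0)

n, p = 3, 0.1
dictator = lambda x: x[0]
majority = lambda x: int(sum(x) > n / 2)
print("dictator:", mutual_information(dictator, n, p))  # equals 1 - H(p)
print("majority:", mutual_information(majority, n, p))  # smaller here
```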

8.
IEEE Trans Inf Theory; 62(10): 5484-5495, 2016 Oct.
Article in English | MEDLINE | ID: mdl-29375154

ABSTRACT

We begin by presenting a simple lossy compressor operating at near-zero rate: the encoder merely describes the indices of the few maximal source components, while the decoder's reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near-optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme that iterates the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate-distortion function of the memoryless Gaussian source (under squared-error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties that we refer to, respectively, as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than $n^2 / \log^{\beta} n$ per source symbol, for $\beta > 0$, at both the encoder and the decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes of Venkataramanan et al.
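
The toy sketch below illustrates only the first, near-zero-rate step: the encoder sends the indices of the $k$ largest components, and the decoder plugs in a crude fixed estimate for them (here $\sqrt{2 \ln n}$, the usual approximation to the expected maximum of $n$ standard Gaussians) and zero elsewhere. The estimate and parameters are assumptions made for illustration; this is not the paper's full iterated, rateless construction.

```python
# Toy sketch of the near-zero-rate step: send indices of the k largest
# components; reconstruct them with a crude fixed estimate (sqrt(2 ln n),
# an assumption for illustration) and everything else with zero.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 16                     # source length, number of indices sent
x = rng.standard_normal(n)             # memoryless Gaussian source

top = np.argsort(x)[-k:]               # "encoder": indices of the k maxima
x_hat = np.zeros(n)
x_hat[top] = np.sqrt(2 * np.log(n))    # "decoder": rough estimate of maxima

mse_zero = np.mean(x ** 2)             # baseline: reconstruct everything as 0
mse_topk = np.mean((x - x_hat) ** 2)   # near-zero-rate scheme
rate = k * np.log2(n) / n              # bits per symbol spent on the indices
print(f"rate ~ {rate:.5f} bits/symbol, MSE {mse_zero:.4f} -> {mse_topk:.4f}")
```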
