Search | VHL Regional Portal

GradHC: highly reliable gradual hash-based clustering for DNA storage systems.

Ben Shabat, Dvir; Hadad, Adar; Boruchovsky, Avital; Yaakobi, Eitan.

Bioinformatics ; 40(5), 2024 May 02.

Article in English | MEDLINE | ID: mdl-38648049

ABSTRACT

MOTIVATION: As data storage challenges grow and existing technologies approach their limits, synthetic DNA emerges as a promising storage solution due to its remarkable density and durability advantages. While cost remains a concern, emerging sequencing and synthetic technologies aim to mitigate it, yet introduce challenges such as errors in the storage and retrieval process. One crucial task in a DNA storage system is clustering numerous DNA reads into groups that represent the original input strands. RESULTS: In this paper, we review different methods for evaluating clustering algorithms and introduce a novel clustering algorithm for DNA storage systems, named Gradual Hash-based clustering (GradHC). The primary strength of GradHC lies in its capability to cluster various types of designs with excellent accuracy, including varying strand lengths, cluster sizes (including extremely small clusters), and different error ranges. Benchmark analysis demonstrates that GradHC is significantly more stable and robust than other clustering algorithms previously proposed for DNA storage, while also producing highly reliable clustering results. AVAILABILITY AND IMPLEMENTATION: https://github.com/bensdvir/GradHC.

Subject(s)

Algorithms , DNA , Sequence Analysis, DNA , DNA/chemistry , Cluster Analysis , Sequence Analysis, DNA/methods , Software , Information Storage and Retrieval/methods
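
The clustering task named in the abstract above (grouping noisy reads so that each group corresponds to one original strand) can be illustrated with a deliberately naive baseline. The sketch below is not GradHC and does not use hashing; it is a simple greedy, edit-distance-threshold clustering, and the function names `edit_distance` and `greedy_cluster`, the `max_distance` parameter, and the sample reads are all hypothetical, chosen only for illustration.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def greedy_cluster(reads: List[str], max_distance: int) -> List[List[str]]:
    """Naive baseline (NOT GradHC): put each read in the first cluster whose
    representative (its first read) is within max_distance edits; otherwise
    open a new cluster with this read as representative."""
    clusters: List[List[str]] = []
    for read in reads:
        for cluster in clusters:
            if edit_distance(read, cluster[0]) <= max_distance:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


# Hypothetical reads: noisy copies of two distinct strands.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGACGT", "TTTTGGGC"]
clusters = greedy_cluster(reads, max_distance=2)
```

This baseline compares every read against every cluster representative, so it scales poorly with the number of reads; that cost is one reason practical DNA-storage clustering algorithms rely on cheaper signatures such as hashes.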

Reconstruction algorithms for DNA-storage systems.

Sabary, Omer; Yucovich, Alexander; Shapira, Guy; Yaakobi, Eitan.

Sci Rep ; 14(1): 1951, 2024 01 23.

Article in English | MEDLINE | ID: mdl-38263421

ABSTRACT

Motivated by DNA storage systems, this work presents the DNA reconstruction problem, in which a length-n string passes through the DNA-storage channel, which introduces deletion, insertion, and substitution errors. This channel generates multiple noisy copies of the transmitted string, which are called traces. A DNA reconstruction algorithm is a mapping that receives t traces as input and produces an estimation of the original string. The goal in the DNA reconstruction problem is to minimize the edit distance between the original string and the algorithm's estimation. In this work, we present several new algorithms for this problem. Our algorithms look globally at the entire sequence of the traces and use dynamic programming algorithms, which are used for the shortest common supersequence and the longest common subsequence problems, in order to decode the original string. Our algorithms do not require any limitations on the input or the number of traces; moreover, they perform well even for error probabilities as high as 0.27. The algorithms have been tested on simulated data, on data from previous DNA storage experiments, and on a new synthesized dataset, and are shown to outperform previous algorithms in reconstruction accuracy.

Subject(s)

Algorithms , DNA , Motivation , Probability , Records
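
The reconstruction abstract above measures quality by the edit distance between the original string and the estimate, and mentions dynamic programming for the longest common subsequence. The following is a minimal sketch of those two standard dynamic programs only; it is not the authors' reconstruction algorithm, and the function names and sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical example: the estimate differs from the original by one
# deletion (the first T) and one insertion (a trailing T).
original = "ACGTACGT"
estimate = "ACGACGTT"
print(edit_distance(original, estimate))  # 2
print(lcs_length(original, estimate))     # 7
```

Both functions keep only two rows of the dynamic-programming table, so memory grows with the length of one string rather than with the product of both lengths.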

Design of optimal labeling patterns for optical genome mapping via information theory.

Nogin, Yevgeni; Bar-Lev, Daniella; Hanania, Dganit; Detinis Zur, Tahir; Ebenstein, Yuval; Yaakobi, Eitan; Weinberger, Nir; Shechtman, Yoav.

Bioinformatics ; 39(10), 2023 10 03.

Article in English | MEDLINE | ID: mdl-37758248

ABSTRACT

MOTIVATION: Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application. RESULTS: In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy by optimal choice of labeling patterns, which may guide future development of OGM biochemical labeling methods and significantly improve its accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples. AVAILABILITY AND IMPLEMENTATION: https://github.com/yevgenin/PatternCode.

Subject(s)

Information Theory , Software , Humans , Genome , Restriction Mapping , DNA

SOLQC: Synthetic Oligo Library Quality Control tool.

Sabary, Omer; Orlev, Yoav; Shafir, Roy; Anavy, Leon; Yaakobi, Eitan; Yakhini, Zohar.

Bioinformatics ; 37(5): 720-722, 2021 05 05.

Article in English | MEDLINE | ID: mdl-32840559

ABSTRACT

MOTIVATION: Recent years have seen a growing number and an expanding scope of studies using synthetic oligo libraries for a range of applications in synthetic biology. As experiments grow in number and complexity, analysis tools can facilitate quality control and support better assessment and inference. RESULTS: We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries, based on NGS analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates, and their dependence on sequence or library properties. SOLQC produces graphical reports from the analysis, in a flexible format. We demonstrate SOLQC by analyzing literature libraries. We also discuss the potential benefits and relevance of the different components of the analysis. AVAILABILITY AND IMPLEMENTATION: SOLQC is free software for non-commercial use, available at https://app.gitbook.com/@yoav-orlev/s/solqc/. For commercial use please contact the authors. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Libraries , Software , Gene Library , Quality Control , Synthetic Biology
