Search | VHL Regional Portal

NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning.

Wang, Rongshu; Chen, Jianhua.

BMC Genomics ; 25(1): 573, 2024 Jun 07.

Article in English | MEDLINE | ID: mdl-38849740

ABSTRACT

BACKGROUNDS: The single-pass long reads generated by third-generation sequencing technology exhibit a higher error rate. However, the circular consensus sequencing (CCS) produces shorter reads. Thus, it is effective to manage the error rate of long reads algorithmically with the help of the homologous high-precision and low-cost short reads from the Next Generation Sequencing (NGS) technology. METHODS: In this work, a hybrid error correction method (NmTHC) based on a generative neural machine translation model is proposed to automatically capture discrepancies within the aligned regions of long reads and short reads, as well as the contextual relationships within the long reads themselves for error correction. Akin to natural language sequences, the long read can be regarded as a special "genetic language" and be processed with the idea of generative neural networks. The algorithm builds a sequence-to-sequence(seq2seq) framework with Recurrent Neural Network (RNN) as the core layer. The before and post-corrected long reads are regarded as the sentences in the source and target language of translation, and the alignment information of long reads with short reads is used to create the special corpus for training. The well-trained model can be used to predict the corrected long read. RESULTS: NmTHC outperforms the latest mainstream hybrid error correction methods on real-world datasets from two mainstream platforms, including PacBio and Nanopore. Our experimental evaluation results demonstrate that NmTHC can align more bases with the reference genome without any segmenting in the six benchmark datasets, proving that it enhances alignment identity without sacrificing any length advantages of long reads. CONCLUSION: Consequently, NmTHC reasonably adopts the generative Neural Machine Translation (NMT) model to transform hybrid error correction tasks into machine translation problems and provides a novel perspective for solving long-read error correction problems with the ideas of Natural Language Processing (NLP). More remarkably, the proposed methodology is sequencing-technology-independent and can produce more precise reads.

Subject(s)

Algorithms , High-Throughput Nucleotide Sequencing , Neural Networks, Computer , High-Throughput Nucleotide Sequencing/methods , Humans , Machine Learning

Reference-based genome compression using the longest matched substrings with parallelization consideration.

Lu, Zhiwen; Guo, Lu; Chen, Jianhua; Wang, Rongshu.

BMC Bioinformatics ; 24(1): 369, 2023 Sep 30.

Article in English | MEDLINE | ID: mdl-37777730

ABSTRACT

BACKGROUND: A large number of researchers have devoted to accelerating the speed of genome sequencing and reducing the cost of genome sequencing for decades, and they have made great strides in both areas, making it easier for researchers to study and analyze genome data. However, how to efficiently store and transmit the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers. Therefore, the research of genome data compression algorithms to facilitate the efficient representation of genome data has gradually attracted the attention of these researchers. Meanwhile, considering that the current computing devices have multiple cores, how to make full use of the advantages of the computing devices and improve the efficiency of parallel processing is also an important direction for designing genome compression algorithms. RESULTS: We proposed an algorithm (LMSRGC) based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMS) for the compression of genome data in FASTA format. The proposed algorithm utilizes the characteristics of SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence and then utilizes LMSs to compress the target genome sequence. To speed up the operation of the algorithm, we use GPUs to parallelize the construction of SA, while using multiple threads to parallelize the creation of the LCP array and the filtering of LMSs. CONCLUSIONS: Experiment results demonstrate that our algorithm is competitive with the current state-of-the-art algorithms in compression ratio and compression time.

Subject(s)

Data Compression , Data Compression/methods , Sequence Analysis, DNA/methods , Algorithms , Genome , Software , High-Throughput Nucleotide Sequencing/methods

CMIC: an efficient quality score compressor with random access functionality.

Chen, Hansen; Chen, Jianhua; Lu, Zhiwen; Wang, Rongshu.

BMC Bioinformatics ; 23(1): 294, 2022 Jul 23.

Article in English | MEDLINE | ID: mdl-35870880

ABSTRACT

BACKGROUND: Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that needs to be stored and transmitted has grown exponentially. For the standard sequencing data format, FASTQ, compression of the quality score is a key and difficult aspect of FASTQ file compression. Throughout the literature, we found that the majority of the current quality score compression methods do not support random access. Based on the above consideration, it is reasonable to investigate a lossless quality score compressor with a high compression rate, a fast compression and decompression speed, and support for random access. RESULTS: In this paper, we propose CMIC, an adaptive and random access supported compressor for lossless compression of quality score sequences. CMIC is an acronym of the four steps (classification, mapping, indexing and compression) in the paper. Its framework consists of the following four parts: classification, mapping, indexing, and compression. The experimental results show that our compressor has good performance in terms of compression rates on all the tested datasets. The file sizes are reduced by up to 21.91% when compared with LCQS. In terms of compression speed, CMIC is better than all other compressors on most of the tested cases. In terms of random access speed, the CMIC is faster than the LCQS, which provides a random access function for compressed quality scores. CONCLUSIONS: CMIC is a compressor that is especially designed for quality score sequences, which has good performance in terms of compression rate, compression speed, decompression speed, and random access speed. The CMIC can be obtained in the following way: https://github.com/Humonex/Cmic .

Subject(s)

Data Compression , High-Throughput Nucleotide Sequencing , Algorithms , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Software

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL