mzMD: visualization-oriented MS data storage and retrieval.

Yang, Runmin; Ma, Jingjing; Zhang, Shu; Zheng, Yu; Wang, Lusheng; Zhu, Daming.

Bioinformatics ; 38(8): 2333-2340, 2022 04 12.

Article in English | MEDLINE | ID: mdl-35171986

ABSTRACT

MOTIVATION: Drawing the peaks in a data window of an MS dataset happens all the time in MS data visualization applications. This requires retrieving from the MS dataset a selection of peaks in the data window whose image in a display window reflects the visual features of all peaks in that data window. If an algorithm for this purpose must output high-quality solutions in real time, it depends most fundamentally on the storage format of the MS dataset. RESULTS: We present mzMD, a new storage format for MS datasets, and an algorithm that queries a storage system in this format for a summary (a set of selected representative peaks) of a given data window. We propose a criterion, Q-score, to assess the quality of data window summaries. Experimental statistics on real MS datasets verified the high speed of mzMD in retrieving high-quality data window summaries. The data window summaries reported by mzMD had better Q-scores than those reported by mzTree. The query speed of mzMD is the same as that of mzTree, whereas its query speed is more stable. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at https://github.com/yrm9837/mzMD-java. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
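To make the idea of a data window summary concrete, here is a minimal Python sketch. It is not the mzMD algorithm, its storage format, or the Q-score, none of which the abstract specifies. It assumes a naive strategy in which the data window is mapped onto a grid of display cells and only the most intense peak per cell is kept. The names `Peak` and `summarize_window` and all parameters are hypothetical.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    mz: float         # mass-to-charge ratio
    rt: float         # retention time
    intensity: float  # signal intensity


def summarize_window(peaks, mz_range, rt_range, cols, rows):
    """Keep the most intense peak in each cell of a cols x rows display grid.

    Assumption: a naive per-cell maximum stands in for a real summary method.
    Ranges are half-open: lo <= value < hi.
    """
    mz_lo, mz_hi = mz_range
    rt_lo, rt_hi = rt_range
    best = {}
    for p in peaks:
        # Skip peaks that fall outside the data window.
        if not (mz_lo <= p.mz < mz_hi and rt_lo <= p.rt < rt_hi):
            continue
        # Map the peak to a display cell; clamp to guard against rounding.
        col = min(int((p.mz - mz_lo) / (mz_hi - mz_lo) * cols), cols - 1)
        row = min(int((p.rt - rt_lo) / (rt_hi - rt_lo) * rows), rows - 1)
        key = (col, row)
        if key not in best or p.intensity > best[key].intensity:
            best[key] = p
    return sorted(best.values())


peaks = [
    Peak(100.0, 1.0, 5.0),
    Peak(100.1, 1.0, 9.0),
    Peak(150.0, 1.0, 3.0),
    Peak(300.0, 1.0, 7.0),  # outside the m/z range used below
]
summary = summarize_window(peaks, (100.0, 200.0), (0.0, 2.0), cols=2, rows=1)
```

A linear scan like this touches every peak on every query, which illustrates why the abstract stresses the storage format: a real-time system would need an index that avoids reading peaks that cannot appear in the summary.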

Subject(s)

Algorithms , Software , Information Storage and Retrieval , Data Visualization , Data Accuracy

Algorithms and Hardness for Scaffold Filling to Maximize Increased Duo-Preservations.

Ma, Jingjing; Jiang, Haitao; Zhu, Daming; Yang, Runmin.

IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 2071-2079, 2022.

Article in English | MEDLINE | ID: mdl-34038366

ABSTRACT

Scaffold filling is a critical step in DNA assembly. Its basic task is to fill the missing genes (fragments) into an incomplete genome (scaffold) to make it similar to the reference genome. There has been a lot of work on genome comparison under distinct measurements in the literature. For genomes with gene duplications, common string partition reveals the similarity more precisely, since it constructs a one-to-one correspondence between the same segments in the two genomes. In this paper, we adopt duo-preservation as the measurement, which is the complement of common string partition, i.e., the number of duo-preservations added to the number of common strings is exactly the length of a genome. Towards a proper scaffold filling, we focus only on the increased duo-preservations. This problem is called scaffold filling to maximize increased duo-preservations (abbr. SF-MIDP). We show that SF-MIDP is solvable in linear time for a simple version where all the genes of the scaffold are matched in a block-matching, but is MAX SNP-complete for the general version and cannot be approximated within [Formula: see text]. Moreover, we present a basic approximation algorithm of factor 2, by which the optimal solution can be described in a new way, and then improve the approximation factor to [Formula: see text] via a greedy method.
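The complement relation stated in the abstract can be checked on a toy case. The sketch below assumes genes are modeled as single characters and that a common string partition is given as a list of blocks of the first genome plus an ordering that rebuilds the second. The helper names `is_common_partition` and `preserved_duos` are hypothetical, and this is not the paper's algorithm.

```python
def is_common_partition(a, b, blocks, order):
    """Check that `blocks` concatenate to `a` and, reordered by `order`, to `b`."""
    return (
        sorted(order) == list(range(len(blocks)))
        and "".join(blocks) == a
        and "".join(blocks[i] for i in order) == b
    )


def preserved_duos(blocks):
    """Count adjacencies (duos) kept intact: a block of length k keeps k - 1."""
    return sum(len(block) - 1 for block in blocks)


a, b = "abcab", "ababc"
blocks = ["abc", "ab"]  # partition of a
order = [1, 0]          # "ab" + "abc" rebuilds b

valid = is_common_partition(a, b, blocks, order)
duos = preserved_duos(blocks)
```

Here the partition has 2 common strings and preserves 3 duos, and 3 + 2 equals the genome length 5, which matches the identity in the abstract.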

Subject(s)

Algorithms , Genome , DNA , Genome/genetics , Hardness

A graph-based filtering method for top-down mass spectral identification.

Yang, Runmin; Zhu, Daming.

BMC Genomics ; 19(Suppl 7): 666, 2018 Sep 24.

Article in English | MEDLINE | ID: mdl-30255788

ABSTRACT

BACKGROUND: Database search has been the main approach for proteoform identification by top-down tandem mass spectrometry. However, when the target proteoform that produced the spectrum contains post-translational modifications (PTMs) and/or mutations, it is quite time consuming to align a query spectrum against all protein sequences without any PTMs and mutations in a large database. Consequently, it is essential to develop efficient and sensitive filtering algorithms for speeding up database search. RESULTS: In this paper, we propose a spectrum graph matching (SGM) based protein sequence filtering method for top-down mass spectral identification. It uses the subspectra of a query spectrum to generate spectrum graphs and searches them against a protein database to report the best candidates. Whereas the sequence tag and gapped tag approaches need a preprocessing step to extract and select tags, the SGM filtering method circumvents this step, thus simplifying data processing. We evaluated the filtration efficiency of the SGM filtering method with various parameter settings on an Escherichia coli top-down mass spectrometry data set and compared the performance of the SGM filtering method and two tag-based filtering methods on a data set of MCF-7 cells. CONCLUSIONS: Experimental results on the data sets show that the SGM filtering method achieves high sensitivity in protein sequence filtration. When coupled with a spectral alignment algorithm, the SGM filtering method significantly increases the number of identified proteoform spectrum-matches compared with the tag-based methods in top-down mass spectrometry data analysis.
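For readers unfamiliar with spectrum graphs, the following Python sketch shows the general textbook construction: peak masses become nodes, and two nodes are joined when their mass difference matches an amino acid residue mass within a tolerance. It is an illustration under stated assumptions, not the SGM method itself. The abstract does not describe how subspectra are chosen or how graphs are matched against the database, and the function name `spectrum_graph`, the tolerance, and the four-residue table are all simplifications introduced here.

```python
# Monoisotopic residue masses (Da) for a small illustrative subset of amino acids.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}


def spectrum_graph(masses, tol=0.01):
    """Build a spectrum graph from fragment masses.

    Returns the sorted masses (nodes) and a list of edges (i, j, residue),
    where masses[j] - masses[i] matches the residue mass within `tol` Da.
    """
    nodes = sorted(masses)
    edges = []
    for i, low in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            diff = nodes[j] - low
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges.append((i, j, residue))
    return nodes, edges


nodes, edges = spectrum_graph([128.05857, 0.0, 57.02146, 300.0])
```

In this toy input the path through the edges spells the residues G then A, which is the kind of sequence information a graph-based filter can compare against database proteins without first extracting explicit tags.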

Subject(s)

Algorithms , Computer Graphics , Escherichia coli Proteins/analysis , Escherichia coli/metabolism , Proteome/analysis , Tandem Mass Spectrometry/methods , Databases, Protein , Protein Processing, Post-Translational , Sequence Analysis, Protein/methods

A Spectrum Graph-Based Protein Sequence Filtering Algorithm for Proteoform Identification by Top-Down Mass Spectrometry.

Yang, Runmin; Zhu, Daming; Kou, Qiang; Bhat-Nakshatri, Poornima; Nakshatri, Harikrishna; Wu, Si; Liu, Xiaowen.

Proceedings (IEEE Int Conf Bioinformatics Biomed) ; 2017: 222-229, 2017 Nov.

Article in English | MEDLINE | ID: mdl-29503761

ABSTRACT

Database search is the main approach for identifying proteoforms using top-down tandem mass spectra. However, it is extremely slow to align a query spectrum against all protein sequences in a large database when the target proteoform that produced the spectrum contains post-translational modifications and/or mutations. As a result, efficient and sensitive protein sequence filtering algorithms are essential for speeding up database search. In this paper, we propose a novel filtering algorithm, which generates spectrum graphs from subspectra of the query spectrum and searches them against the protein database to find good candidates. Compared with the sequence tag and gapped tag approaches, the proposed method circumvents the step of tag extraction, thus simplifying data processing. Experimental results on real data showed that the proposed method achieved both high speed and high sensitivity in protein sequence filtration.
