Search | VHL Regional Portal

1.

Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data.

Tian, Qinzhong; Zhang, Pinglu; Zhai, Yixiao; Wang, Yansu; Zou, Quan.

Genome Biol Evol ; 16(5)2024 May 02.

Article in English | MEDLINE | ID: mdl-38748485

ABSTRACT

The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.

Subject(s)

High-Throughput Nucleotide Sequencing , Machine Learning , Databases, Genetic , Computational Biology/methods , Classification/methods

2.

TPMA: A two pointers meta-alignment tool to ensemble different multiple nucleic acid sequence alignments.

Zhai, Yixiao; Chao, Jiannan; Wang, Yizheng; Zhang, Pinglu; Tang, Furong; Zou, Quan.

PLoS Comput Biol ; 20(4): e1011988, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38557416

ABSTRACT

Accurate multiple sequence alignment (MSA) is imperative for the comprehensive analysis of biological sequences. However, a notable challenge arises as no single MSA tool consistently outperforms its counterparts across diverse datasets. Users often have to try multiple MSA tools to achieve optimal alignment results, which can be time-consuming and memory-intensive. While the overall accuracy of certain MSA results may be lower, there could be local regions with the highest alignment scores, prompting researchers to seek a tool capable of merging these locally optimal results from multiple initial alignments into a globally optimal alignment. In this study, we introduce Two Pointers Meta-Alignment (TPMA), a novel tool designed for the integration of nucleic acid sequence alignments. TPMA employs two pointers to partition the initial alignments into blocks containing identical sequence fragments. It selects the blocks with high sum-of-pairs (SP) scores and concatenates them into an alignment with an overall SP score superior to that of the initial alignments. Through tests on simulated and real datasets, the experimental results consistently demonstrate that TPMA outperforms M-Coffee in terms of aSP, Q, and total column (TC) scores across most datasets. Even in cases where TPMA's scores are comparable to M-Coffee, TPMA exhibits significantly lower running time and memory consumption. Furthermore, we comprehensively assessed all the MSA tools used in the experiments, considering accuracy, time, and memory consumption. We propose accurate and fast combination strategies for small and large datasets, which streamline the user tool selection process and facilitate large-scale dataset integration. The dataset and source code of TPMA are available on GitHub (https://github.com/malabz/TPMA).

Subject(s)

Algorithms , Nucleic Acids , Sequence Alignment , Coffee , Software
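
The sum-of-pairs (SP) score that the TPMA abstract uses to rank alignment blocks can be illustrated with a minimal sketch. The scoring values below (match +1, mismatch -1, gap -2, gap-gap pairs ignored) and the function name `sp_score` are assumptions chosen for illustration; the abstract does not state the scheme TPMA actually uses.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length rows.

    The default scores are illustrative assumptions, not TPMA's parameters.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    # Walk the alignment column by column and score every pair of rows.
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a gap aligned to a gap contributes nothing here
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sp_score(["ACGT", "ACGT", "AC-T"]))
```

Because the score is a sum over columns, the SP score of a concatenation of blocks is the sum of the block scores under this simple scheme, which is what makes block-wise selection meaningful.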

3.

FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets.

Zhang, Pinglu; Liu, Huan; Wei, Yanming; Zhai, Yixiao; Tian, Qinzhong; Zou, Quan.

Bioinformatics ; 40(1)2024 01 02.

Article in English | MEDLINE | ID: mdl-38200554

ABSTRACT

MOTIVATION: In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly. RESULTS: FMAlign2 leverages the suffix array to identify maximal exact matches, redefining the approach of FMAlign from searching for global chains to partial chains. By using a vertical division strategy, a large-scale problem is decomposed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate subsets, yielding the final result seamlessly. Compared to FMAlign, FMAlign2 markedly augments the segmentation of sequences and significantly reduces the time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by conferring the capability to handle sequences reaching billions in length within an acceptable time frame. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.

Subject(s)

Algorithms , Software , Sequence Alignment , Base Sequence , Computational Biology

4.

PlantCADB: A Comprehensive Plant Chromatin Accessibility Database.

Ding, Ke; Sun, Shanwen; Luo, Yang; Long, Chaoyue; Zhai, Jingwen; Zhai, Yixiao; Wang, Guohua.

Genomics Proteomics Bioinformatics ; 21(2): 311-323, 2023 04.

Article in English | MEDLINE | ID: mdl-36328151

ABSTRACT

Chromatin accessibility landscapes are essential for detecting regulatory elements, illustrating the corresponding regulatory networks, and, ultimately, understanding the molecular basis underlying key biological processes. With the advancement of sequencing technologies, a large volume of chromatin accessibility data has been accumulated and integrated for humans and other mammals. These data have greatly advanced the study of disease pathogenesis, cancer survival prognosis, and tissue development. To advance the understanding of molecular mechanisms regulating plant key traits and biological processes, we developed a comprehensive plant chromatin accessibility database (PlantCADB) from 649 samples of 37 species. These samples are abiotic stress-related (such as heat, cold, drought, and salt; 159 samples), development-related (232 samples), and/or tissue-specific (376 samples). Overall, 18,339,426 accessible chromatin regions (ACRs) were compiled. These ACRs were annotated with genomic information, associated genes, transcription factor footprint, motif, and single-nucleotide polymorphisms (SNPs). Additionally, PlantCADB provides various tools to visualize ACRs and corresponding annotations. It thus forms an integrated, annotated, and analyzed plant-related chromatin accessibility resource, which can aid in better understanding genetic regulatory networks underlying development, important traits, stress adaptations, and evolution. PlantCADB is freely available at https://bioinfor.nefu.edu.cn/PlantCADB/.

Subject(s)

Chromatin , Genomics , Animals , Humans , Chromatin/genetics , Gene Regulatory Networks , Databases, Factual , Mammals/genetics

5.

HAlign 3: Fast Multiple Alignment of Ultra-Large Numbers of Similar DNA/RNA Sequences.

Tang, Furong; Chao, Jiannan; Wei, Yanming; Yang, Fenglong; Zhai, Yixiao; Xu, Lei; Zou, Quan.

Mol Biol Evol ; 39(8)2022 08 03.

Article in English | MEDLINE | ID: mdl-35915051

ABSTRACT

HAlign is a cross-platform program that performs multiple sequence alignments based on the center star strategy. Here we present two major updates of HAlign 3, which helped improve the time efficiency and the alignment quality, and made HAlign 3 a specialized program to process ultra-large numbers of similar DNA/RNA sequences, such as closely related viral or prokaryotic genomes. HAlign 3 can be easily installed via the Anaconda and Java release package on macOS, Linux, Windows subsystem for Linux, and Windows systems, and the source code is available on GitHub (https://github.com/malabz/HAlign-3).

Subject(s)

Algorithms , Software , Base Sequence , DNA/genetics , Sequence Alignment

6.

Effects of DNA Methylation on TFs in Human Embryonic Stem Cells.

Luo, Ximei; Zhang, Tianjiao; Zhai, Yixiao; Wang, Fang; Zhang, Shumei; Wang, Guohua.

Front Genet ; 12: 639461, 2021.

Article in English | MEDLINE | ID: mdl-33708244

ABSTRACT

DNA methylation is an important epigenetic mechanism for gene regulation. The conventional view is that DNA methylation disrupts protein-DNA interactions and represses gene expression. Several recent studies reported that DNA methylation could alter the binding sequence specificity of transcription factors (TFs) in vitro. Here, we took advantage of the large sets of ChIP-seq data for TFs and whole-genome bisulfite sequencing data in many cell types to perform a systematic analysis of the interplay between protein binding and DNA methylation in vivo. We observed that many TFs could bind methylated DNA regions, especially in H1-hESC cells. By locating binding sites, we confirmed that some TFs could bind to methylated CpGs directly. The different proportions of CpGs at TF binding specificity motifs in different methylation statuses show that some TFs are sensitive to methylation and that some, such as CEBPB and CTCF, could bind to methylated DNA with different motifs. At the same time, TF binding could interactively alter local DNA methylation. The hypermethylated TF binding sites extensively overlap with enhancers. We also found that some DNase I hypersensitive sites were specifically hypermethylated in H1-hESC cells. Finally, by comparing TFs' binding regions across multiple cell types, we observed that CTCF binding to highly methylated regions in H1-hESC was not conserved. These pieces of evidence indicate that TFs that bind to hypermethylated DNA in H1-hESC cells may associate with enhancers to regulate special biological functions.

7.

AOPM: Application of Antioxidant Protein Classification Model in Predicting the Composition of Antioxidant Drugs.

Zhai, Yixiao; Zhang, Jingyu; Zhang, Tianjiao; Gong, Yue; Zhang, Zixiao; Zhang, Dandan; Zhao, Yuming.

Front Pharmacol ; 12: 818115, 2021.

Article in English | MEDLINE | ID: mdl-35115948

ABSTRACT

Antioxidant proteins not only balance oxidative stress in the body but are also an important component of antioxidant drugs. Accurate identification of antioxidant proteins is essential to help humans fight diseases and develop new drugs. In this paper, we developed a user-friendly method, AOPM, to identify antioxidant proteins. 188D and the Composition of k-spaced Amino Acid Pairs were adopted as the feature extraction methods. In addition, the Max-Relevance-Max-Distance algorithm (MRMD) and random forest were used for feature selection and classification, respectively. We used 5-fold cross-validation and an independent test dataset to evaluate our model. On the test dataset, AOPM achieved higher performance than the state-of-the-art methods. The sensitivity, specificity, accuracy, Matthews correlation coefficient and area under the curve reached 87.3%, 94.2%, 92.0%, 0.815 and 0.972, respectively. In addition, AOPM still has excellent performance in predicting the catalytic enzymes of antioxidant drugs. This work proved the feasibility of virtual drug screening based on sequence information and provided new ideas and solutions for drug development.
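
For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how sensitivity, specificity, accuracy and the Matthews correlation coefficient are derived from a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the AOPM study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denom)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(binary_metrics(tp=8, fp=2, tn=18, fn=2))
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is commonly reported alongside accuracy for imbalanced protein datasets.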

8.

VTP-Identifier: Vesicular Transport Proteins Identification Based on PSSM Profiles and XGBoost.

Gong, Yue; Dong, Benzhi; Zhang, Zixiao; Zhai, Yixiao; Gao, Bo; Zhang, Tianjiao; Zhang, Jingyu.

Front Genet ; 12: 808856, 2021.

Article in English | MEDLINE | ID: mdl-35047020

ABSTRACT

Vesicular transport proteins are related to many human diseases, and they threaten human health when they undergo pathological changes. Protein function prediction has been one of the most in-depth topics in bioinformatics. In this work, we developed a useful tool to identify vesicular transport proteins. Our strategy is to extract transition probability composition, autocovariance transformation and other information from the position-specific scoring matrix as feature vectors. EditedNearestNeighbours (ENN) is used to address the imbalance of the data set, and the Max-Relevance-Max-Distance (MRMD) algorithm is adopted to reduce the dimension of the feature vector. We used 5-fold cross-validation and independent test sets to evaluate our model. On the test set, VTP-Identifier presented a higher performance compared with GRU. The accuracy, Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) were 83.6%, 0.531 and 0.873, respectively.

9.

Identify DNA-Binding Proteins Through the Extreme Gradient Boosting Algorithm.

Zhao, Ziye; Yang, Wen; Zhai, Yixiao; Liang, Yingjian; Zhao, Yuming.

Front Genet ; 12: 821996, 2021.

Article in English | MEDLINE | ID: mdl-35154264

ABSTRACT

The exploration of DNA-binding proteins (DBPs) is an important aspect of studying biological life activities. Research on life activities requires the support of scientific research results on DBPs. The decline in many life activities is closely related to DBPs. Generally, the detection method for identifying DBPs is achieved through biochemical experiments. This method is inefficient and requires considerable manpower, material resources and time. At present, several computational approaches have been developed to detect DBPs, among which machine learning (ML) algorithm-based computational techniques have shown excellent performance. In our experiments, our method uses fewer features and simpler recognition methods than other methods and simultaneously obtains satisfactory results. First, we use six feature extraction methods to extract sequence features from the same group of DBPs. Then, this feature information is spliced together, and the data are standardized. Finally, the extreme gradient boosting (XGBoost) model is used to construct an effective predictive model. Compared with other excellent methods, our proposed method has achieved better results. The accuracy achieved by our method is 78.26% for PDB2272 and 85.48% for PDB186. The accuracy of the experimental results achieved by our strategy is similar to that of previous detection methods.

10.

Identifying Antioxidant Proteins by Using Amino Acid Composition and Protein-Protein Interactions.

Zhai, Yixiao; Chen, Yu; Teng, Zhixia; Zhao, Yuming.

Front Cell Dev Biol ; 8: 591487, 2020.

Article in English | MEDLINE | ID: mdl-33195258

ABSTRACT

Excessive oxidative stress responses can threaten our health, and thus it is essential to produce antioxidant proteins to regulate the body's oxidative responses. The low number of antioxidant proteins makes it difficult to extract their representative features. Our experimental method did not use structural information but instead studied antioxidant proteins from a sequence perspective while focusing on the impact of data imbalance on sensitivity, thus greatly improving the model's sensitivity for antioxidant protein recognition. We developed a method based on the Composition of k-spaced Amino Acid Pairs (CKSAAP) and the Conjoint Triad (CT) features derived from the amino acid composition and protein-protein interactions. SMOTE and the Max-Relevance-Max-Distance algorithm (MRMD) were utilized to balance the training data and select the optimal feature subset, respectively. The evaluation used 10-fold cross-validation and a random forest algorithm for classification according to the selected feature subset. The sensitivity was 0.792, the specificity was 0.808, and the average accuracy was 0.8.
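
The CKSAAP descriptor named in this abstract can be sketched as follows. This is a generic illustration of the commonly used definition (for each gap k, the frequency of every ordered amino acid pair separated by k residues), not the authors' code; the function name `cksaap` and the default `k_max=2` are assumptions, since the abstract does not give the gap range used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(sequence, k_max=2):
    """Composition of k-spaced amino acid pairs for k = 0..k_max.

    Returns 400 * (k_max + 1) frequencies. k_max=2 is an illustrative
    assumption, not a value taken from the abstract.
    """
    features = []
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        windows = len(sequence) - k - 1  # number of pairs with gap k
        for i in range(max(windows, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
        # Normalise counts by the number of windows for this gap.
        features.extend(
            c / windows if windows > 0 else 0.0 for c in counts.values()
        )
    return features


if __name__ == "__main__":
    vec = cksaap("ACAC", k_max=1)
    print(len(vec), vec[1], vec[20])
```

Vectors of this kind would then be passed to a resampling step and a feature selector before classification, as the abstract describes.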
