Results 1 - 16 of 16
1.
Math Biosci Eng ; 21(2): 2626-2645, 2024 Jan 19.
Article in English | MEDLINE | ID: mdl-38454699

ABSTRACT

Calculating single-source shortest paths (SSSPs) rapidly and precisely in weighted digraphs is a crucial problem in graph theory. As a mathematical model for processing uncertain tasks, rough set theory (RST) has been shown to be capable of addressing graph theory problems. Recently, some efficient RST approaches for discovering different subgraphs (e.g., strongly connected components) have been presented. This work is devoted to discovering the SSSPs of weighted digraphs with the aid of RST. First, the SSSP problem was examined through RST to establish the theoretical foundation for using an RST approach to calculate SSSPs in weighted digraphs. Second, a heuristic search strategy was designed. The edge weights serve as heuristic information to optimize the search of the $k$-step $R$-related set, which is an RST operator. With the heuristic search strategy, some invalid searches can be avoided, thereby improving the efficiency of discovering SSSPs. Finally, the W3SP@R algorithm based on RST was presented to calculate the SSSPs of weighted digraphs. Experiments were conducted to verify the W3SP@R algorithm. The results show that W3SP@R can precisely calculate SSSPs with competitive efficiency.
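
For orientation, the SSSP problem described in this abstract can be illustrated with a conventional baseline. The sketch below is Dijkstra's algorithm, not the W3SP@R algorithm from the paper; it assumes non-negative edge weights, and the adjacency-list layout and the toy digraph are hypothetical choices made for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Return shortest distances from `source` in a weighted digraph.

    `graph` maps each vertex to a list of (neighbor, weight) pairs.
    Assumes all weights are non-negative (a requirement of Dijkstra's algorithm).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry: a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical toy digraph used only for illustration.
toy_graph = {
    "a": [("b", 4), ("c", 1)],
    "c": [("b", 2), ("d", 5)],
    "b": [("d", 1)],
    "d": [],
}
print(dijkstra(toy_graph, "a"))
```

Vertices that cannot be reached from the source are simply absent from the returned dictionary.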

2.
Math Biosci Eng ; 20(7): 12772-12801, 2023 May 31.
Article in English | MEDLINE | ID: mdl-37501466

ABSTRACT

There are approximately 2.2 billion people around the world with varying degrees of visual impairment. Among them, individuals with severe visual impairments predominantly rely on hearing and touch to gather external information. At present, there are limited reading materials for the visually impaired, mostly in the form of audio or text, which cannot satisfy the need of the visually impaired to comprehend graphical content. Although many scholars have devoted their efforts to investigating methods for converting visual images into tactile graphics, tactile graphic translation fails to meet the reading needs of visually impaired individuals due to image type diversity and limitations in image recognition technology. The primary goal of this paper is to enable the visually impaired to gain a greater understanding of the natural sciences by transforming images of mathematical functions into an electronic format for the production of tactile graphics. In an effort to enhance the accuracy and efficiency of graph element recognition and segmentation of function graphs, this paper proposes an MA Mask R-CNN model which utilizes MA ConvNeXt as its improved feature extraction backbone network and MA BiFPN as its improved feature fusion network. The MA ConvNeXt is a novel feature extraction network proposed in this paper, while the MA BiFPN is a novel feature fusion network introduced in this paper. This model combines the information of local relations, global relations and different channels to form an attention mechanism that is able to establish multiple connections, thus increasing the detection capability of the original Mask R-CNN model on slender and multi-type targets by combining a variety of multi-scale features. Finally, the experimental results show that MA Mask R-CNN attains an 89.6% mAP value for target detection and 72.3% mAP value for target segmentation in the instance segmentation of function graphs. This results in a 9% mAP improvement for target detection and 12.8% mAP improvement for target segmentation compared to the original Mask R-CNN.

3.
IEEE/ACM Trans Comput Biol Bioinform ; 20(5): 3205-3214, 2023.
Article in English | MEDLINE | ID: mdl-37289599

ABSTRACT

It has been demonstrated that RNA modifications play essential roles in multiple biological processes. Accurate identification of RNA modifications in the transcriptome is critical for providing insights into the biological functions and mechanisms. Many tools have been developed for predicting RNA modifications at single-base resolution, which employ conventional feature engineering methods that focus on feature design and feature selection processes that require extensive biological expertise and may introduce redundant information. With the rapid development of artificial intelligence technologies, end-to-end methods are favorably received by researchers. Nevertheless, each well-trained model is only suitable for a specific RNA methylation modification type for nearly all of these approaches. In this study, we present MRM-BERT by feeding task-specific sequences into the powerful BERT (Bidirectional Encoder Representations from Transformers) model and implementing fine-tuning, which exhibits competitive performance to the state-of-the-art methods. MRM-BERT avoids repeated de novo training of the model and can predict multiple RNA modifications such as pseudouridine, m6A, m5C, and m1A in Mus musculus, Arabidopsis thaliana, and Saccharomyces cerevisiae. In addition, we analyse the attention heads to provide high attention regions for the prediction, and conduct saturated in silico mutagenesis of the input sequences to discover potential changes of RNA modifications, which can better assist researchers in their follow-up research.


Subject(s)
Arabidopsis , Artificial Intelligence , Mice , Animals , Pseudouridine , Arabidopsis/genetics , Transcriptome , Saccharomyces cerevisiae/genetics , RNA/genetics
4.
Anal Biochem ; 663: 115020, 2023 02 15.
Article in English | MEDLINE | ID: mdl-36521558

ABSTRACT

X-ray crystallography is the major approach for atomic-level protein structure determination. Since not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity is critical to guiding the experimental design and improving the success rate of X-ray crystallography experiments. In this work, we proposed a new deep learning pipeline, GCmapCrys, for multi-stage crystallization propensity prediction by integrating a graph attention network with the predicted protein contact map. Experimental results on 1548 proteins with known crystallization records demonstrated that GCmapCrys increased the value of the Matthews correlation coefficient by 37.0% on average compared to state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantages of GCmapCrys lie in the efficiency of the graph attention network with the predicted contact map, which effectively associates the residue-interaction knowledge with the crystallization pattern. Meanwhile, the four designed sequence-based features are complementary and further enhance crystallization propensity prediction.


Subject(s)
Computational Biology , Proteins , Crystallization/methods , Proteins/chemistry , Crystallography, X-Ray , Computational Biology/methods
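
The Matthews correlation coefficient (MCC) used as the evaluation metric in the abstract above can be computed from the four cells of a binary confusion matrix. The following is a minimal sketch of the standard formula; the function name and the convention of returning 0.0 when the denominator vanishes are assumptions for illustration, not details taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it more informative than accuracy on imbalanced data sets.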
5.
J Chem Inf Model ; 62(17): 4283-4291, 2022 09 12.
Article in English | MEDLINE | ID: mdl-36017565

ABSTRACT

Protein fold recognition refers to predicting the most likely fold type of the query protein and is a critical step in protein structure and function prediction. With the popularity of deep learning in bioinformatics, protein fold recognition has made impressive progress. In this study, to extract fold-specific features that improve protein fold recognition, we proposed a unified deep metric learning framework based on a joint loss function, termed NPCFold. In addition, we proposed an integrated machine learning model based on the similarity of proteins in various properties, termed NPCFoldpro. Benchmark experiments show that both NPCFold and NPCFoldpro outperform existing protein fold recognition methods at the fold level, indicating that our proposed strategies of fusing loss functions and fusing features can improve fold recognition.


Subject(s)
Computational Biology , Proteins , Computational Biology/methods , Machine Learning , Proteins/chemistry
6.
IEEE Trans Neural Netw Learn Syst ; 30(4): 1088-1103, 2019 Apr.
Article in English | MEDLINE | ID: mdl-30137013

ABSTRACT

It is well known that active learning can simultaneously improve the quality of the classification model and decrease the complexity of training instances. However, several previous studies have indicated that the performance of active learning is easily disrupted by an imbalanced data distribution. Some existing imbalanced active learning approaches also suffer from either low performance or high time consumption. To address these problems, this paper describes an efficient solution based on the extreme learning machine (ELM) classification model, called active online-weighted ELM (AOW-ELM). The main contributions of this paper include: 1) the reasons why active learning can be disrupted by an imbalanced instance distribution and its influencing factors are discussed in detail; 2) the hierarchical clustering technique is adopted to select initially labeled instances in order to avoid the missed cluster effect and cold start phenomenon as much as possible; 3) the weighted ELM (WELM) is selected as the base classifier to guarantee the impartiality of instance selection in the procedure of active learning, and an efficient online updated mode of WELM is deduced in theory; and 4) an early stopping criterion that is similar to but more flexible than the margin exhaustion criterion is presented. The experimental results on 32 binary-class data sets with different imbalance ratios demonstrate that the proposed AOW-ELM algorithm is more effective and efficient than several state-of-the-art active learning algorithms that are specifically designed for the class imbalance scenario.

7.
Anal Biochem ; 550: 41-48, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29649472

ABSTRACT

RNA 5-methylcytosine (m5C) is an important post-transcriptional modification that plays an indispensable role in biological processes. The accurate identification of m5C sites from primary RNA sequences is especially useful for deeply understanding the mechanisms and functions of m5C. Because identifying m5C sites with wet-lab techniques is difficult and expensive, fast and accurate machine-learning-based prediction methods are urgently needed. In this study, we proposed a new m5C site predictor, called M5C-HPCR, by introducing a novel heuristic nucleotide physicochemical property reduction (HPCR) algorithm and classifier ensemble. HPCR extracts multiple reducts of physical-chemical properties for encoding discriminative features, while the classifier ensemble is applied to integrate multiple base predictors, each of which is trained based on a separate reduct of the physical-chemical properties obtained from HPCR. Rigorous jackknife tests on two benchmark datasets demonstrate that M5C-HPCR outperforms state-of-the-art m5C site predictors, with the highest values of MCC (0.859) and AUC (0.962). We also implemented the webserver of M5C-HPCR, which is freely available at http://cslab.just.edu.cn:8080/M5C-HPCR/.


Subject(s)
5-Methylcytosine/chemistry , Machine Learning , RNA , Sequence Analysis, RNA/methods , Software , RNA/chemistry , RNA/genetics
8.
IEEE/ACM Trans Comput Biol Bioinform ; 14(6): 1389-1398, 2017.
Article in English | MEDLINE | ID: mdl-27740495

ABSTRACT

Protein-DNA interactions are ubiquitous in a wide variety of biological processes. Correctly locating DNA-binding residues solely from protein sequences is an important but challenging task for protein function annotations and drug discovery, especially in the post-genomic era where large volumes of protein sequences have quickly accumulated. In this study, we report a new predictor, named TargetDNA, for targeting protein-DNA binding residues from primary sequences. TargetDNA uses a protein's evolutionary information and its predicted solvent accessibility as two base features and employs a centered linear kernel alignment algorithm to learn the weights for combining the two features. Based on the weighted combination of features, multiple initial predictors with SVM as classifiers are trained by applying a random under-sampling technique to the original dataset, the purpose of which is to cope with the severe imbalance between the number of DNA-binding and non-binding residues. The final ensembled predictor is obtained by boosting the multiple initially trained predictors. Experimental simulation results demonstrate that the proposed TargetDNA achieves a high prediction performance and outperforms many existing sequence-based protein-DNA binding residue predictors. The TargetDNA web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/TargetDNA/ for academic use.


Subject(s)
Computational Biology/methods , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , Models, Statistical , Support Vector Machine , Amino Acid Sequence , Databases, Protein
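
The random under-sampling step mentioned in the abstract above can be sketched as follows: each base predictor is trained on all minority-class samples plus an equally sized random draw from the majority class. This is a generic illustration under stated assumptions, not the TargetDNA implementation; the function name, the 0/1 label encoding, and the fixed seed are hypothetical.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Build index subsets for an under-sampling ensemble.

    `labels` is a list of 0/1 class labels. Each returned subset contains
    every minority-class index plus an equal number of majority-class
    indices drawn at random without replacement.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))
        subsets.append(sorted(minority + sampled))
    return subsets


# Hypothetical labels: 3 binding residues among 23 residues.
example_labels = [1] * 3 + [0] * 20
example_subsets = balanced_subsets(example_labels, n_subsets=5)
```

One classifier would then be trained per subset and the resulting predictors combined; how they are combined (boosting in the abstract above) is outside this sketch.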
9.
Article in English | MEDLINE | ID: mdl-26357272

ABSTRACT

Disulfide connectivity is an important protein structural characteristic. Accurately predicting disulfide connectivity solely from protein sequence helps to improve the intrinsic understanding of protein structure and function, especially in the post-genome era, where large volumes of sequenced proteins are quickly accumulating without being functionally annotated. In this study, a new feature extracted from the predicted protein 3D structural information is proposed and integrated with traditional features to form discriminative features. Based on the extracted features, a random forest regression model is used to predict protein disulfide connectivity. We compare the proposed method with popular existing predictors by performing both cross-validation and independent validation tests on benchmark datasets. The experimental results demonstrate the superiority of the proposed method over existing predictors. We believe the superiority of the proposed method benefits from both the good discriminative capability of the newly developed features and the powerful modelling capability of the random forest. The web server implementation, called TargetDisulfide, and the benchmark datasets are freely available at: http://csbio.njust.edu.cn/bioinf/TargetDisulfide for academic use.


Subject(s)
Computational Biology/methods , Disulfides/chemistry , Models, Molecular , Protein Conformation , Proteins/chemistry , Amino Acid Sequence , Decision Trees , Regression Analysis , Sequence Analysis, Protein
10.
Biomed Mater Eng ; 26 Suppl 1: S1855-62, 2015.
Article in English | MEDLINE | ID: mdl-26405957

ABSTRACT

To address the imbalanced classification problem emerging in bioinformatics, a boundary movement-based extreme learning machine (ELM) algorithm called BM-ELM was proposed. BM-ELM first explores the prior information about the data distribution by condensing all training instances into the one-dimensional feature space corresponding to the original output in ELM, and then, in the transformed space, finds the optimal moving distance of the classification hyperplane by estimating the probability density distributions of the instances in different classes. Experimental results on four real imbalanced bioinformatics classification data sets indicated that the proposed BM-ELM algorithm outperforms some traditional bias correction algorithms because it can greatly improve the sensitivity of the classification results with as small a loss of specificity as possible. The BM-ELM algorithm also performed better than the widely used support vector machine (SVM) classifier. The algorithm can be widely applied in various large-scale bioinformatics applications.


Subject(s)
Algorithms , High-Throughput Nucleotide Sequencing/methods , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Analysis/methods , Support Vector Machine , Computer Simulation , Data Mining/methods , Databases, Genetic , Machine Learning , Models, Genetic , Models, Statistical
12.
ScientificWorldJournal ; 2014: 538968, 2014.
Article in English | MEDLINE | ID: mdl-25276852

ABSTRACT

A multiscale information system is a new knowledge representation system for expressing knowledge at different levels of granulation. In this paper, by considering unknown values, which are common in real-world applications, the incomplete multiscale information system is first investigated. The descriptor technique is employed to construct rough sets at different scales for analyzing hierarchically structured data. The problem of unravelling decision rules at different scales is also addressed. Finally, reduct descriptors are formulated to simplify the decision rules that can be derived from different scales. Some numerical examples are employed to substantiate the conceptual arguments.


Subject(s)
Algorithms , Decision Making, Computer-Assisted , Information Systems/standards , Models, Theoretical , Reproducibility of Results
13.
PLoS One ; 9(9): e107676, 2014.
Article in English | MEDLINE | ID: mdl-25229688

ABSTRACT

Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, where binding residues are far fewer in number than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of a machine-learning-based predictor for class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results from protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web-server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.


Subject(s)
Algorithms , Computational Biology/methods , Nucleotides/metabolism , Proteins/chemistry , Proteins/metabolism , Supervised Machine Learning , Protein Binding , Support Vector Machine
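
The general idea of synthesizing additional minority-class samples, as described in the abstract above, can be illustrated with a simplified interpolation scheme in the spirit of SMOTE. This is not the supervised over-sampling algorithm proposed in the paper: it interpolates between randomly chosen pairs of minority samples rather than nearest neighbors, and it applies no supervision. The function name, seed, and toy points are hypothetical.

```python
import random


def interpolate_minority(minority, n_new, seed=0):
    """Create `n_new` synthetic samples by linear interpolation.

    `minority` is a list of feature vectors (lists of floats) with at least
    two entries. Each synthetic sample lies on the line segment between two
    distinct, randomly chosen minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority samples
        t = rng.random()                # interpolation factor in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical two-dimensional minority samples.
minority_points = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
new_points = interpolate_minority(minority_points, n_new=10)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies.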
14.
BMC Bioinformatics ; 15: 297, 2014 Sep 05.
Article in English | MEDLINE | ID: mdl-25189131

ABSTRACT

BACKGROUND: Vitamins are typical ligands that play critical roles in various metabolic processes. The accurate identification of the vitamin-binding residues solely based on a protein sequence is of significant importance for the functional annotation of proteins, especially in the post-genomic era, when large volumes of protein sequences are accumulating quickly without being functionally annotated. RESULTS: In this paper, a new predictor called TargetVita is designed and implemented for predicting protein-vitamin binding residues using protein sequences. In TargetVita, features derived from the position-specific scoring matrix (PSSM), predicted protein secondary structure, and vitamin binding propensity are combined to form the original feature space; then, several feature subspaces are selected by performing different feature selection methods. Finally, based on the selected feature subspaces, heterogeneous SVMs are trained and then ensembled for performing prediction. CONCLUSIONS: The experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to the state-of-the-art vitamin-specific predictor, and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests. The TargetVita web server and the datasets used are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetVita or http://www.csbio.sjtu.edu.cn/bioinf/TargetVita.


Subject(s)
Proteins/metabolism , Support Vector Machine , Vitamins/metabolism , Binding Sites , Models, Biological , Position-Specific Scoring Matrices , Protein Binding , Protein Structure, Secondary , Proteins/chemistry , Software
15.
Biomed Res Int ; 2013: 239628, 2013.
Article in English | MEDLINE | ID: mdl-24078908

ABSTRACT

DNA microarray technology can measure the activities of tens of thousands of genes simultaneously, which provides an efficient way to diagnose cancer at the molecular level. Although this strategy has attracted significant research attention, most studies neglect an important problem, namely, that most DNA microarray datasets are skewed, which causes traditional learning algorithms to produce inaccurate results. Some studies have considered this problem, yet they focus only on the binary-class problem. In this paper, we address the multiclass imbalanced classification problem, as encountered in cancer DNA microarray data, by using ensemble learning. We used a one-against-all coding strategy to transform the multiclass problem into multiple binary-class problems, each of which applies feature subspace, an evolved version of random subspace that generates multiple diverse training subsets. Next, we introduced one of two different correction technologies, namely, decision threshold adjustment or random undersampling, into each training subset to alleviate the damage of class imbalance. Specifically, a support vector machine was used as the base classifier, and a novel voting rule called counter voting was presented for making the final decision. Experimental results on eight skewed multiclass cancer microarray datasets indicate that, unlike many traditional classification approaches, our methods are insensitive to class imbalance.


Subject(s)
Algorithms , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis , Statistics as Topic , Databases as Topic , Humans , Support Vector Machine
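
The one-against-all coding strategy named in the abstract above turns one multiclass labeling into one binary labeling per class. The sketch below shows only that label transformation; the feature subspace, imbalance correction, and counter voting steps of the paper are not reproduced, and the function name and sample labels are hypothetical.

```python
def one_against_all(labels):
    """Map multiclass labels to one binary label vector per class.

    For each class c, a sample is labeled 1 if it belongs to c and 0
    otherwise, giving one binary problem per class.
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical tumor subtype labels for four samples.
binary_tasks = one_against_all(["a", "b", "c", "a"])
```

Each binary problem produced this way is typically more imbalanced than the original multiclass problem, because all other classes are pooled into the negative class, which is why a correction step is applied to each one.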
16.
Genomics ; 101(1): 38-48, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23000192

ABSTRACT

An important application of gene expression data is to classify samples in a variety of diagnostic fields. However, high dimensionality and a small number of noisy samples pose significant challenges to existing classification methods. To address the problems of overfitting and sensitivity to noise in the classification of microarray data, we propose an interval-valued analysis method based on a rough set technique to select discriminative genes and to use these genes to classify tissue samples of microarray data. We first select a small subset of genes based on interval-valued rough sets by considering the preference-ordered domains of the gene expression data, and then classify test samples into certain classes using a similarity-degree term. Experiments show that the proposed method is able to reach high prediction accuracies with a small number of selected genes and that its performance is robust to noise.


Subject(s)
Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Confidence Intervals , Data Interpretation, Statistical , Signal-To-Noise Ratio