Results 1 - 10 of 10
1.
Comput Biol Med ; 177: 108623, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38788374

ABSTRACT

Prediction of protein-protein interaction (PPI) types enhances the comprehension of the underlying structural characteristics and functions of proteins, which gives rise to a multi-label classification problem. The nominal features describe the physicochemical characteristics of proteins directly, establishing a more robust correlation with the interaction types between proteins than ordered features. Motivated by this, we propose a multi-label PPI prediction model referred to as CoMPPI (Co-training based Multi-Label prediction of Protein-Protein Interaction). This approach aims to maximize the utility of both ordered and nominal features extracted from protein sequences. Specifically, CoMPPI incorporates a graph convolutional network (GCN) and a 1D convolution operation to process the complementary subsets of features individually, leveraging both local and contextualized information in a more efficient way. In addition, two multi-type PPI datasets were constructed to eliminate the duplication in previous datasets. We compare the performance of CoMPPI with three state-of-the-art methods on three datasets partitioned using distinct schemes (breadth-first search, depth-first search, and random). CoMPPI consistently outperforms the other methods across all cases, demonstrating improvements ranging from 3.81% to 32.40% in Micro-F1. The subsequent ablation experiment confirms the efficacy of employing the co-training framework for multi-label PPI prediction, indicating promising avenues for future advancements in this domain.


Subject(s)
Protein Interaction Mapping , Proteins , Proteins/chemistry , Proteins/metabolism , Protein Interaction Mapping/methods , Databases, Protein , Humans , Computational Biology/methods
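The abstract above reports its gains in Micro-F1. As a reminder of what that metric measures in a multi-label setting, here is a minimal sketch that pools true positives, false positives, and false negatives over all samples and labels before computing F1. The function name and the interaction-type labels in the usage note are illustrative assumptions, not taken from CoMPPI.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data.

    y_true, y_pred: sequences of label sets, one set per sample.
    Counts are pooled over every (sample, label) decision.
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # labels predicted and correct
        fp += len(pred_labels - true_labels)   # labels predicted but wrong
        fn += len(true_labels - pred_labels)   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, with hypothetical labels, `micro_f1([{"binding", "activation"}, {"inhibition"}], [{"binding"}, {"inhibition", "catalysis"}])` pools 2 true positives, 1 false positive, and 1 false negative, giving 4/6.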
2.
Brief Funct Genomics ; 2023 Jun 20.
Article in English | MEDLINE | ID: mdl-37340778

ABSTRACT

Third-generation sequencing (TGS) technologies have revolutionized genome science in the past decade. However, the long-read data produced by TGS platforms suffer from a much higher error rate than that of the previous technologies, thus complicating the downstream analysis. Several error correction tools for long-read data have been developed; these tools can be categorized into hybrid and self-correction tools. So far, these two types of tools have been investigated separately, and their interplay remains understudied. Here, we integrate hybrid and self-correction methods for high-quality error correction. Our procedure leverages the inter-similarity between long-read data and high-accuracy information from short reads. We compare the performance of our method and state-of-the-art error correction tools on Escherichia coli and Arabidopsis thaliana datasets. The results show that the integration approach outperformed the existing error correction methods and holds promise for improving the quality of downstream analyses in genomic research.

3.
Brief Bioinform ; 24(2), 2023 03 19.
Article in English | MEDLINE | ID: mdl-36880207

ABSTRACT

Protein-protein interactions (PPIs) carry out the cellular processes of all living organisms. Experimental methods for PPI detection suffer from high cost and high false-positive rates, hence efficient computational methods are highly desirable for facilitating PPI detection. In recent years, benefiting from the enormous amount of protein data produced by advanced high-throughput technologies, machine learning models have been well developed in the field of PPI prediction. In this paper, we present a comprehensive survey of the recently proposed machine learning-based prediction methods. The machine learning models applied in these methods and details of protein data representation are also outlined. To understand the potential improvements in PPI prediction, we discuss the trend in the development of machine learning-based methods. Finally, we highlight potential directions in PPI prediction, such as the use of computationally predicted protein structures to extend the data source for machine learning models. This review is intended to serve as a companion for further improvements in this field.


Subject(s)
Machine Learning , Protein Interaction Mapping , Protein Interaction Mapping/methods , Proteins/metabolism , Computational Biology/methods
4.
Bioinformatics ; 37(11): 1604-1606, 2021 07 12.
Article in English | MEDLINE | ID: mdl-33112385

ABSTRACT

SUMMARY: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources required by downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. AVAILABILITY AND IMPLEMENTATION: https://github.com/yuansliu/minirmd. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Software , Cluster Analysis , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA
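To illustrate the general idea of minimizer-based clustering that is aware of the reverse-complementary strand, the sketch below buckets reads by the lexicographically smallest k-mer found on either strand. This is a simplified illustration and not minirmd's actual algorithm; the function names, the choice of lexicographic ordering, and the single-round bucketing are all assumptions.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def minimizer(seq, k):
    """Smallest k-mer (lexicographic order) over both strands.

    Assumes len(seq) >= k; returns None for shorter sequences.
    """
    best = None
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            if best is None or kmer < best:
                best = kmer
    return best


def bucket_by_minimizer(reads, k):
    """Group reads that share a strand-independent minimizer."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)
```

Because the minimizer is taken over both strands, a read and its reverse complement fall into the same bucket, where a finer comparison could then decide whether they are near-duplicates. Repeating the bucketing with a different `k` would loosely mirror the multiple rounds mentioned in the abstract.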
5.
Virus Genes ; 56(6): 734-748, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33009986

ABSTRACT

Fowlpox virus (FPV) is used as a vaccine vector to prevent diseases in poultry and mammals. The insertion site is considered one of the main factors influencing foreign gene expression. Therefore, the identification of insertion sites that can stably and efficiently express foreign genes is crucial for the construction of recombinant vaccines. In this study, we found that the insertion of foreign genes into ORF054 and the ORF161/ORF162 intergenic region of the FPV genome did not affect replication, and that the foreign genes inserted into the intergenic region were more efficiently expressed than when they were inserted into a gene. Based on these results, the recombinant virus rFPVNX10-NDV F-E was constructed and immune protection against virulent FPV and Newcastle disease virus (NDV) was evaluated. Tests for anti-FPV antibodies in the vaccinated chickens were positive within 14 days post-vaccination. After challenge with FPV102, no clinical signs of FP were observed in vaccinated chickens, whereas the unvaccinated control group showed 100% morbidity. Low levels of NDV-specific neutralizing antibodies were detected in vaccinated chickens before challenge. After challenge with NDV ck/CH/LHLJ/01/06, all control chickens died within 4 days post-challenge, whereas 5/15 vaccinated chickens died between 4 and 12 days post-challenge. Vaccination provided an immune protection rate of 66.7%, whereas the control group showed 100% mortality. These results indicate that the ORF161/ORF162 intergenic region of FPVNX10 can be used as a recombination site for foreign gene expression in vivo and in vitro.


Subject(s)
Fowlpox virus/genetics , Fowlpox/prevention & control , Newcastle Disease/prevention & control , Poultry Diseases/prevention & control , Viral Fusion Proteins/genetics , Viral Vaccines/genetics , Animals , Cell Line , Chick Embryo , Chickens , DNA, Intergenic , Fibroblasts , Vaccination/veterinary , Vaccines, Synthetic/genetics
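The 66.7% protection rate quoted in the abstract follows directly from the reported counts (5 deaths among 15 vaccinated chickens). A small sketch of that arithmetic, with a hypothetical helper name:

```python
def protection_rate(vaccinated, deaths):
    """Percentage of vaccinated animals that survived the challenge."""
    return 100.0 * (vaccinated - deaths) / vaccinated


# Counts taken from the abstract: 15 vaccinated chickens, 5 deaths.
rate = protection_rate(15, 5)  # 10 survivors out of 15, about 66.7%
```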
6.
BMC Genomics ; 21(1): 627, 2020 Sep 11.
Article in English | MEDLINE | ID: mdl-32917152

ABSTRACT

BACKGROUND: DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because of the lack of effective sequence features and the ad hoc choice of learning algorithms to cope with this problem. This paper aims to propose a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. RESULTS: The feature importance score distributions in datasets of six species are first reported and analyzed. Then the impact of the feature selection on model performance is evaluated by independent testing on benchmark datasets, where ACC and MCC measurements after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using independent tests and 10-fold cross-validation, and our method outperforms them on all datasets, improving the ACC by 3.02% to 7.89% and the MCC by 0.06 to 0.15 in the independent test. Two detailed case studies using the proposed method confirmed the excellent overall performance, correctly identifying 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene. CONCLUSIONS: The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies demonstrate the effectiveness of our method in practical situations.


Subject(s)
DNA Methylation , Machine Learning , Sequence Analysis, DNA/methods , Animals , Arabidopsis , Caenorhabditis elegans , Cytosine/analogs & derivatives , Cytosine/analysis , DNA/chemistry , DNA/genetics , Drosophila melanogaster , Epigenome , Escherichia coli , Software
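The abstract reports results as ACC (accuracy) and MCC (Matthews correlation coefficient). For readers less familiar with MCC, here is a minimal sketch of both metrics computed from binary confusion-matrix counts. The function names are illustrative and the counts in the usage note are hypothetical, not data from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal count is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

With hypothetical counts `tp=8, tn=7, fp=3, fn=2`, accuracy is 0.75 while MCC is roughly 0.50, which shows why MCC is often reported alongside ACC: it accounts for all four cells of the confusion matrix.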
7.
BMC Bioinformatics ; 20(Suppl 19): 661, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31870276

ABSTRACT

BACKGROUND: Drug-drug interactions (DDIs) are a major concern in patients' medication. It is infeasible to identify all potential DDIs using experimental methods, which are time-consuming and expensive. Computational methods provide an effective strategy; however, they face challenges due to the lack of experimentally verified negative samples. RESULTS: To address this problem, we propose a novel positive-unlabeled learning method named DDI-PULearn for large-scale drug-drug interaction predictions. DDI-PULearn first generates seeds of reliable negatives via OCSVM (one-class support vector machine) under a high-recall constraint and via cosine-similarity-based KNN (k-nearest neighbors). Then, trained with all the labeled positives (i.e., the validated DDIs) and the generated seed negatives, DDI-PULearn employs an iterative SVM to identify the entire set of reliable negatives from the unlabeled samples (i.e., the unobserved DDIs). Following that, DDI-PULearn represents all the labeled positives and the identified negatives as vectors of abundant drug properties by a similarity-based method. Finally, DDI-PULearn transforms these vectors into a lower-dimensional space via PCA (principal component analysis) and utilizes the compressed vectors as input for binary classification. The performance of DDI-PULearn is evaluated on simulative prediction for 149,878 possible interactions between 548 drugs, in comparison with two baseline methods and five state-of-the-art methods. The experimental results show that the proposed method for the representation of DDIs characterizes them accurately. DDI-PULearn achieves superior performance owing to the identified reliable negatives, outperforming all other methods significantly. In addition, the predicted novel DDIs suggest that DDI-PULearn is capable of identifying novel DDIs. CONCLUSIONS: The results demonstrate that positive-unlabeled learning paves a new way to tackle the problem caused by the lack of experimentally verified negatives in the computational prediction of DDIs.


Subject(s)
Drug Interactions , Cluster Analysis , Humans , Support Vector Machine
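The figure of 149,878 possible interactions in the abstract is the number of unordered pairs that can be formed from 548 drugs. The sketch below checks that count and shows how such a candidate set could be enumerated; the helper name and the toy drug identifiers in the usage note are assumptions for illustration.

```python
from itertools import combinations
from math import comb


def candidate_pairs(drugs):
    """All unordered pairs of distinct drugs; each pair is one possible DDI."""
    return list(combinations(sorted(drugs), 2))


# 548 drugs give 548 * 547 / 2 unordered pairs.
n_pairs = comb(548, 2)  # 149878
```

For a toy set such as `["c", "a", "b"]`, `candidate_pairs` returns the three pairs `("a", "b")`, `("a", "c")`, and `("b", "c")`. In a positive-unlabeled setting, the validated DDIs among these pairs are the positives and the rest are unlabeled.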
8.
BMC Bioinformatics ; 20(Suppl 23): 605, 2019 Dec 27.
Article in English | MEDLINE | ID: mdl-31881829

ABSTRACT

BACKGROUND: Detection of new drug-target interactions by computational algorithms is of crucial value to both old drug repositioning and new drug discovery. Existing machine-learning methods rely only on experimentally validated drug-target interactions (i.e., positive samples) for the predictions. Their performance is severely impeded by the lack of reliable negative samples. RESULTS: We propose a method to construct highly-reliable negative samples for drug target prediction by a pairwise drug-target similarity measurement and OCSVM with a high-recall constraint. On the one hand, we measure the pairwise similarity between every two drug-target interactions by combining the chemical similarity between their drugs and the Gene Ontology-based similarity between their targets. Then we calculate the accumulative similarity with all known drug-target interactions for each unobserved drug-target interaction. On the other hand, we obtain the signed distance from OCSVM learned from the known interactions with high recall (≥0.95) for each unobserved drug-target interaction. After normalizing all accumulative similarities and signed distances to the range [0,1], we compute the score for each unobserved drug-target interaction by averaging its accumulative similarity and signed distance. Unobserved interactions with lower scores are preferentially selected as reliable negative samples for the classification algorithms. The performance of the proposed method is evaluated on the interaction data between 1094 drugs and 1556 target proteins. Extensive comparison experiments using four classical classifiers and one domain predictive method demonstrate the superior performance of the proposed method. A better decision boundary has been learned from the constructed reliable negative samples. CONCLUSIONS: Proper construction of highly-reliable negative samples can help the classification models learn a clear decision boundary, which contributes to the performance improvement.


Subject(s)
Algorithms , Drug Discovery , Drug Repositioning , Machine Learning , Area Under Curve , Drug Interactions , Humans
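The scoring step described in the abstract (normalize both quantities to [0,1], average them, and prefer low scores as negatives) can be sketched as follows. This assumes min-max normalization and assumes that a larger signed distance means "more like the known interactions"; neither detail is stated in the abstract, and the function names are hypothetical.

```python
def min_max(values):
    """Rescale values linearly to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_negative_candidates(similarities, distances):
    """Score unobserved interactions and order them from most to least
    negative-like.

    similarities: accumulative similarity of each unobserved interaction
                  to all known interactions.
    distances:    signed OCSVM distance for the same interactions.
    Returns (order, scores), where order lists indices by ascending score.
    """
    sim_n = min_max(similarities)
    dist_n = min_max(distances)
    scores = [(s + d) / 2 for s, d in zip(sim_n, dist_n)]
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return order, scores
```

With toy inputs `similarities=[10.0, 2.0, 6.0]` and `distances=[0.5, -1.5, -0.5]`, the scores are `[1.0, 0.0, 0.5]`, so the second interaction (index 1) would be the first choice as a reliable negative.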
9.
BMC Med Genomics ; 12(Suppl 8): 183, 2019 12 20.
Article in English | MEDLINE | ID: mdl-31856830

ABSTRACT

BACKGROUND: The early diagnosis of lung cancer has been a critical problem in clinical practice for a long time, and identifying differentially expressed genes as disease markers is a promising solution. However, most existing gene differential expression analysis (DEA) methods have two main drawbacks: first, these methods are based on fixed statistical hypotheses and are not always effective; second, these methods cannot identify a definite expression level boundary when there is no obvious expression level gap between control and experiment groups. METHODS: This paper proposes a novel approach to identify marker genes and gene expression level boundaries for lung cancer. By calculating a kernel maximum mean discrepancy, our method can evaluate the expression differences between normal, normal adjacent to tumor (NAT) and tumor samples. For the potential marker genes, the expression level boundaries among different groups are defined with the information entropy method. RESULTS: Compared with two conventional methods, t-test and fold change, the top average ranked genes selected by our method achieve better performance under all metrics in 10-fold cross-validation. Then GO and KEGG enrichment analyses are conducted to explore the biological function of the top 100 ranked genes. Finally, we choose the top 10 average ranked genes as lung cancer markers, and their expression boundaries are calculated and reported. CONCLUSION: The proposed approach is effective in identifying gene markers for lung cancer diagnosis. It is not only more accurate than conventional DEA methods but also provides a reliable method to identify the gene expression level boundaries.


Subject(s)
Computational Biology/methods , Entropy , Genetic Markers/genetics , Lung Neoplasms/genetics , Gene Expression Profiling , Humans , Machine Learning
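The abstract relies on a kernel maximum mean discrepancy (MMD) to compare expression levels between sample groups. As a rough illustration of the quantity involved, here is a sketch of the biased empirical estimate of squared MMD for one gene, using a Gaussian kernel on scalar expression values. The kernel choice, the bandwidth `sigma`, and the estimator are assumptions; the abstract does not specify them.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two scalar expression values."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased empirical estimate of squared MMD between two samples.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)
```

Identical samples give a value of zero, and the value grows as the two groups' expression distributions move apart, so genes could be ranked by this statistic computed between, for example, normal and tumor samples.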
10.
BMC Bioinformatics ; 19(Suppl 19): 517, 2018 Dec 31.
Article in English | MEDLINE | ID: mdl-30598065

ABSTRACT

BACKGROUND: Early and accurate identification of potential adverse drug reactions (ADRs) for combined medication is vital for public health. Existing methods either rely on expensive wet-lab experiments or detect existing associations from related records. Thus, they inevitably suffer from under-reporting, delays in reporting, and an inability to detect ADRs for new and rare drugs. The current application of machine learning methods is severely impeded by the lack of proper drug representation and credible negative samples. Therefore, a method to represent drugs properly and to select credible negative samples becomes vital in applying machine learning methods to this problem. RESULTS: In this work, we propose a machine learning method to predict ADRs of combined medication from pharmacologic databases by building up highly-credible negative samples (HCNS-ADR). Specifically, we first fuse heterogeneous information from different databases and represent each drug as a multi-dimensional vector according to its chemical substructures, target proteins, substituents, and related pathways. Then, a drug-pair vector is obtained by appending the vector of one drug to the other. Next, we construct a drug-disease-gene network and devise a scoring method to measure the interaction probability of every drug pair via network analysis. Drug pairs with lower interaction probability are preferentially selected as negative samples. Following that, the validated positive samples and the selected credible negative samples are projected into a lower-dimensional space using principal component analysis. Finally, a classifier is built for each ADR using its positive and negative samples with reduced dimensions. The performance of the proposed method is evaluated on simulative prediction for 1276 ADRs and 1048 drugs, using four machine learning algorithms and in comparison with two baseline approaches. Extensive experiments show that the proposed way to represent drugs characterizes drugs accurately. With highly-credible negative samples selected by HCNS-ADR, the four machine learning algorithms achieve significant performance improvements. HCNS-ADR is also shown to be able to predict both known and novel drug-drug-ADR associations, outperforming two other baseline approaches significantly. CONCLUSIONS: The results demonstrate that the integration of different drug properties to represent drugs is valuable for ADR prediction of combined medication and that the selection of highly-credible negative samples can significantly improve the prediction performance.


Subject(s)
Adverse Drug Reaction Reporting Systems/statistics & numerical data , Databases, Pharmaceutical , Drug Interactions , Drug-Related Side Effects and Adverse Reactions/metabolism , Gene Regulatory Networks , Pharmaceutical Preparations/metabolism , Predictive Value of Tests , Algorithms , Humans , Models, Statistical
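Two steps from the abstract lend themselves to a short illustration: building a drug-pair vector by appending one drug's feature vector to the other's, and preferring low-probability pairs as negative samples. The sketch below takes the interaction probabilities as given; the network-based scoring method itself is not reproduced, and all names and values are hypothetical.

```python
def pair_vector(drug_a, drug_b):
    """Drug-pair representation: features of drug_a followed by drug_b."""
    return list(drug_a) + list(drug_b)


def select_negatives(pair_scores, n):
    """Return the n drug pairs with the lowest interaction probability.

    pair_scores: dict mapping a drug pair to its (precomputed) probability.
    """
    return sorted(pair_scores, key=pair_scores.get)[:n]
```

Note that plain concatenation is order-dependent: `pair_vector(a, b)` and `pair_vector(b, a)` differ, so a fixed ordering of the two drugs is needed if each unordered pair should map to a single vector.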