Results 1 - 7 of 7
1.
Front Med (Lausanne) ; 11: 1391184, 2024.
Article in English | MEDLINE | ID: mdl-39109222

ABSTRACT

Introduction: Tuberculosis (TB) is a paramount global health concern and contributes significantly to worldwide mortality. Effective containment of TB requires cost-efficient screening methods that work with limited resources. To improve the precision of resource allocation in the global fight against TB, this research proposes chest X-ray radiography (CXR) based machine learning screening algorithms, with optimization, benchmarking and tuning, for TB subclassification tasks in clinical application.
Methods: This investigation develops and evaluates a robust ensemble deep learning framework, comprising 43 distinct models, tailored to the identification of active TB cases and the categorization of their clinical subtypes. The proposed framework is essentially an ensemble model with multiple feature extractors and one of three fusion strategies (voting, attention-based, or concatenation) in the fusion stage before a final classification. The de-identified dataset contains records of 915 active TB patients and 1,276 healthy controls, with subtype-specific information. Thus, the realizations of our framework are capable of diagnosis with subclass identification. The subclass tags include: secondary tuberculosis/tuberculous pleurisy; non-cavity/cavity; secondary tuberculosis only/secondary tuberculosis and tuberculous pleurisy; tuberculous pleurisy only/secondary tuberculosis and tuberculous pleurisy.
Results: Based on the dataset and on model selection and tuning, the ensemble models demonstrate a self-correction capability in subclass identification and render robust clinical predictions. The best double-CNN-extractor model with concatenation/attention fusion strategies may potentially be the successful model for subclass tasks in real application. In-depth analysis of the ensemble model's performance across different fusion strategies is verified with visualization techniques.
Discussion: The findings underscore the potential of such ensemble approaches in augmenting TB diagnostics with subclassification. Even with a limited dataset, the self-correction within the ensemble models still maintains accuracy to some level for potential clinical decision-making processes in TB management. Ultimately, this study shows a direction for better TB screening in future TB response strategies.
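
The three fusion strategies named in the Methods (voting, attention-based, concatenation) can be illustrated with a minimal sketch. All feature vectors, class probabilities, and attention scores below are hypothetical toy values, not outputs of the paper's models, and the attention variant shown (a softmax over per-extractor scores followed by a weighted sum) is an assumed, commonly used formulation rather than the one the authors necessarily implemented.

```python
import math

def vote_fusion(prob_lists):
    """Soft voting: average the class probabilities from each model."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

def concat_fusion(feature_lists):
    """Concatenation: join the extractors' feature vectors end to end."""
    return [x for feats in feature_lists for x in feats]

def attention_fusion(feature_lists, scores):
    """Attention-style fusion: softmax the scores, then take a weighted sum."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(feature_lists[0])
    return [sum(w * f[i] for w, f in zip(weights, feature_lists)) for i in range(dim)]

# Hypothetical outputs of two CNN feature extractors for one chest X-ray.
feats_a, feats_b = [0.2, 0.8], [0.6, 0.4]
probs_a, probs_b = [0.3, 0.7], [0.1, 0.9]

voted = vote_fusion([probs_a, probs_b])
concatenated = concat_fusion([feats_a, feats_b])
attended = attention_fusion([feats_a, feats_b], scores=[0.0, 0.0])
```

Voting fuses at the prediction level, while concatenation and attention fuse at the feature level and still need a classifier on top of the fused vector. With equal attention scores, the attention variant reduces to a plain average of the feature vectors.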

2.
Genomics ; 113(6): 4052-4060, 2021 11.
Article in English | MEDLINE | ID: mdl-34666191

ABSTRACT

A super-enhancer (SE) is a cluster of active typical enhancers (TEs) with high levels of the Mediator complex, master transcription factors, and chromatin regulators. SEs play a key role in the control of cell identity and disease. Traditionally, scientists have used a variety of high-throughput data on different transcription factors or chromatin marks to distinguish SEs from TEs. Such experimental methods are usually costly and time-consuming. In this paper, we propose DeepSE, a model based on a deep convolutional neural network, to distinguish SEs from TEs. DeepSE represents DNA sequences using dna2vec feature embeddings. With only the DNA sequence information, DeepSE outperformed all state-of-the-art methods. In addition, DeepSE generalizes well across different cell lines, which implies that cell-type specific SEs may share hidden sequence patterns across different cell lines. The source code and data are available on GitHub (https://github.com/QiaoyingJi/DeepSE).
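
As a rough illustration of what representing a DNA sequence with k-mer embeddings involves, the sketch below splits a sequence into overlapping k-mers and looks each one up in an embedding table. The table here is a hypothetical stand-in filled with random vectors; a real pipeline would load pretrained dna2vec vectors, and the choices of k = 3 and a 4-dimensional embedding are assumptions made only to keep the example small.

```python
import random

def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed(seq, table, k=3, dim=4):
    """Map each k-mer to its vector; k-mers missing from the table get zeros."""
    return [table.get(km, [0.0] * dim) for km in kmers(seq, k)]

# Hypothetical stand-in for a pretrained embedding table: random vectors.
rng = random.Random(0)
tokens = kmers("ACGTAC")
table = {km: [rng.uniform(-1, 1) for _ in range(4)] for km in sorted(set(tokens))}

# One row per k-mer, one column per embedding dimension.
matrix = embed("ACGTAC", table)
```

The resulting matrix (number of k-mers by embedding dimension) is the kind of input a convolutional network can scan along the sequence axis.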


Subjects
Chromatin; Enhancer Elements, Genetic; Cell Line; Chromatin/genetics; Neural Networks, Computer; Transcription Factors/genetics; Transcription Factors/metabolism
3.
Molecules ; 23(8)2018 Aug 01.
Article in English | MEDLINE | ID: mdl-30071670

ABSTRACT

Machine learning based predictions of protein-protein interactions (PPIs) could provide valuable insights into protein functions, disease occurrence, and therapy design on a large scale. The intensive feature engineering in most of these methods makes the prediction task more tedious and trivial. The emerging deep learning technology, which enables automatic feature engineering, is gaining great success in various fields. However, the over-fitting and generalization of its models are not yet well investigated in most scenarios. Here, we present a deep neural network framework (DNN-PPI) for predicting PPIs using features learned automatically from protein primary sequences alone. Within the framework, the sequences of two interacting proteins are sequentially fed into the encoding, embedding, convolutional neural network (CNN), and long short-term memory (LSTM) neural network layers. Then, a concatenated vector of the two outputs from the previous layer is wired as the input of the fully connected neural network. Finally, the Adam optimizer is applied to learn the network weights in a back-propagation fashion. The different types of features, including semantic associations between amino acids, position-related sequence segments (motifs), and their long- and short-term dependencies, are captured in the embedding, CNN and LSTM layers, respectively. When the model was trained on Pan's human PPI dataset, it achieved a prediction accuracy of 98.78% at a Matthews correlation coefficient (MCC) of 97.57%. The prediction accuracies for six external datasets ranged from 92.80% to 97.89%, making them superior to those achieved with previous methods. When applied to Escherichia coli, Drosophila, and Caenorhabditis elegans datasets, DNN-PPI obtained prediction accuracies of 95.949%, 98.389%, and 98.669%, respectively. The performances in cross-species testing among the four species above were consistent with their evolutionary distances. However, when testing Mus musculus using the models from those species, they all obtained prediction accuracies of over 92.43%, which is difficult to achieve and worthy of note for further study. These results suggest that DNN-PPI has remarkable generalization and is a promising tool for identifying protein interactions.
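
The abstract reports accuracy together with the Matthews correlation coefficient (MCC). For reference, both can be computed from a binary confusion matrix as in the sketch below. The counts are hypothetical and chosen only for illustration; they are not taken from the paper. MCC ranges from -1 to 1, so a value quoted as 97.57% corresponds to 0.9757.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion-matrix counts for an interaction / no-interaction task.
acc = accuracy(tp=90, tn=85, fp=15, fn=10)
score = mcc(tp=90, tn=85, fp=15, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix symmetrically, which makes it less sensitive to class imbalance.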


Subjects
Neural Networks, Computer; Protein Interaction Mapping; Amino Acid Sequence; Animals; Humans; Memory, Short-Term; Protein Binding
4.
Genes (Basel) ; 9(8)2018 Aug 01.
Article in English | MEDLINE | ID: mdl-30071697

ABSTRACT

Nowadays, various machine learning-based approaches using sequence information alone have been proposed for identifying DNA-binding proteins, which are crucial to many cellular processes, such as DNA replication, DNA repair and DNA modification. Among these methods, building a meaningful feature representation of the sequences and choosing an appropriate classifier are the most trivial tasks. Disclosing the significance and contributions of different feature spaces and classifiers to the final prediction is of the utmost importance, not only for prediction performance, but also for practical guidance in designing biological experiments. In this study, we propose a model stacking framework that orchestrates multi-view features and classifiers (MSFBinder) to investigate how to integrate and evaluate loosely-coupled models for predicting DNA-binding proteins. The framework integrates multi-view features including Local_DPP, 188D, Position-Specific Scoring Matrix (PSSM)_DWT and auto-cross covariance of secondary structures (AC_Struc), which were extracted based on evolutionary information, sequence composition, physicochemical properties and predicted structural information, respectively. These features are fed into various loosely-coupled classifiers such as SVM and random forest. Then, a logistic regression model is applied to evaluate the contributions of these individual classifiers and to make the final prediction. When evaluated on the training dataset PDB1075, the proposed method achieves an accuracy of 83.53%. On the independent dataset PDB186, the method achieves an accuracy of 81.72%, which outperforms many existing methods. These results suggest that the framework is able to orchestrate various prediction models flexibly with good performance.
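
The stacking step described above, in which a logistic regression combines the outputs of individual classifiers, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MSFBinder implementation: the base-classifier probabilities and labels are hypothetical toy data, and the meta-learner is a hand-written logistic regression trained by stochastic gradient descent with an assumed learning rate and epoch count.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta(base_probs, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on base-classifier outputs (plain SGD)."""
    n_feats = len(base_probs[0])
    w, b = [0.0] * n_feats, 0.0
    for _ in range(epochs):
        for x, y in zip(base_probs, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Final stacked probability for one protein."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical out-of-fold probabilities from two base classifiers
# (say, an SVM and a random forest); the first column tracks the label,
# the second carries little signal.
base_probs = [[0.9, 0.5], [0.8, 0.4], [0.2, 0.5], [0.1, 0.6], [0.7, 0.6], [0.3, 0.4]]
labels = [1, 1, 0, 0, 1, 0]

w, b = fit_meta(base_probs, labels)
```

The learned weights give a rough reading of how much each base classifier contributes to the final decision, which is the role the abstract assigns to the logistic regression layer.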

5.
PLoS One ; 12(12): e0188129, 2017.
Article in English | MEDLINE | ID: mdl-29287069

ABSTRACT

DNA-binding proteins play pivotal roles in alternative splicing, RNA editing, methylation and many other biological functions for both eukaryotic and prokaryotic proteomes. Predicting the functions of these proteins from primary amino acid sequences is becoming one of the major challenges in functional annotation of genomes. Traditional prediction methods often devote themselves to extracting physicochemical features from sequences but ignore motif information and location information between motifs. Meanwhile, the small scale of data volumes and large noise in training data result in lower accuracy and reliability of predictions. In this paper, we propose a deep learning based method to identify DNA-binding proteins from primary sequences alone. It utilizes two stages of convolutional neural networks to detect the function domains of protein sequences, a long short-term memory neural network to identify their long-term dependencies, and binary cross-entropy to evaluate the quality of the neural networks. When the proposed method is tested with a realistic DNA-binding protein dataset, it achieves a prediction accuracy of 94.2% at a Matthews correlation coefficient of 0.961. Compared with LibSVM on the Arabidopsis and yeast datasets via independent tests, the accuracy rises by 9% and 4%, respectively. Comparative experiments using different feature extraction methods show that our model achieves accuracy similar to the best of the others, but its sensitivity, specificity and AUC increase by 27.83%, 1.31% and 16.21%, respectively. These results suggest that our method is a promising tool for identifying DNA-binding proteins.
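
Binary cross-entropy, the loss the abstract names for evaluating the network, can be written out directly. The sketch below is a plain-Python version with hypothetical labels and predicted probabilities; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical predictions for one binding and one non-binding protein.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure = binary_cross_entropy([1, 0], [0.5, 0.5])
```

Confident correct predictions give a loss near zero, while a prediction of 0.5 for every sample gives ln 2, roughly 0.693.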


Subjects
DNA-Binding Proteins/metabolism; Amino Acid Sequence; Arabidopsis/genetics; DNA-Binding Proteins/chemistry; Models, Theoretical; Reproducibility of Results; Yeasts/genetics
6.
PLoS One ; 12(8): e0181426, 2017.
Article in English | MEDLINE | ID: mdl-28792503

ABSTRACT

Nowadays, a number of computational approaches have been developed to effectively and accurately predict protein interactions. However, most of these methods typically perform worse when other biological data sources (e.g., protein structure information, protein domains, or gene neighborhood information) are not available. In the present work, we propose a method for predicting protein interactions that makes full use of the physicochemical characteristics of amino acids. A protein sequence is encoded at multiple scales by seven properties of amino acids, including their qualitative and quantitative descriptions. Five kinds of protein descriptors (frequency, composition, transformation, distribution and auto covariance) are extracted from these encodings to represent each protein sequence. The newly formed feature representation, consisting of 347 dimensions, is able to capture not only the compositional and positional information of amino acids in the sequence but also their statistical significance. Based on such a feature representation, the gradient boosting decision tree algorithm is introduced to predict the protein interaction class. When the proposed method is tested with the PPI data of S. cerevisiae, it achieves a prediction accuracy of 95.28% at a Matthews correlation coefficient of 90.68%. Compared with state-of-the-art works on H. pylori and Human, the accuracies can be raised to 89.27% and 98.00%, respectively. Extensive experiments are performed on a crossover protein-protein interaction network, and the prediction accuracies are also very promising. Because of the learning capabilities of the gradient boosting decision tree and the multi-scale feature representation scheme, the proposed method might be a useful tool for future proteomics studies.
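
Of the five descriptor types listed above, auto covariance is the easiest to show in isolation. The sketch below encodes a short peptide with a single numeric property and computes the auto covariance at a few lags, using the common definition (mean-centered products averaged over n - lag positions). The property values and the peptide are hypothetical illustrations, and the paper's exact formula and normalization may differ.

```python
def auto_covariance(values, lag):
    """Average product of mean-centered values that sit `lag` positions apart."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    pairs = ((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return sum(pairs) / (n - lag)

# Hypothetical hydrophobicity-like scale for four residues (illustrative only).
PROPERTY = {"A": 0.62, "L": 1.06, "K": -1.5, "D": -0.9}

# Encode a toy peptide, then build a fixed-length descriptor from lags 1 to 3.
encoded = [PROPERTY[aa] for aa in "ALKDAL"]
descriptor = [auto_covariance(encoded, lag) for lag in (1, 2, 3)]
```

Because the number of lags is fixed, the descriptor has the same length for every protein regardless of sequence length, which is what allows it to feed a tree-based classifier.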


Subjects
Amino Acid Sequence; Decision Trees; Protein Interaction Mapping/methods; Bacterial Proteins/genetics; Bacterial Proteins/metabolism; Computational Biology; Datasets as Topic; Helicobacter pylori; Humans; Saccharomyces cerevisiae; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/metabolism; Wnt Proteins/genetics; Wnt Proteins/metabolism
7.
J Integr Bioinform ; 9(2): 194, 2012 Jul 09.
Article in English | MEDLINE | ID: mdl-22773159

ABSTRACT

The expression and regulation of genes in different tissues are fundamental questions to be answered in biology. Knowledge enrichment analysis for tissue specific (TS) and housekeeping (HK) genes may help identify their roles in biological processes or diseases and yield new biological insights. In this paper, we performed knowledge enrichment analysis for 17,343 genes in 84 human tissues using Gene Set Enrichment Analysis (GSEA) and Hypergeometric Analysis (HA) against three biological ontologies: Gene Ontology (GO), KEGG pathways and Disease Ontology (DO). The analysis results demonstrated that the functions of most gene groups are consistent with their tissue origins. Meanwhile, three interesting new associations for HK genes and skeletal muscle tissue genes were found. Firstly, hypergeometric analysis against the KEGG database for HK genes disclosed that three disease terms (Parkinson's disease, Huntington's disease, Alzheimer's disease) are intensively enriched. Secondly, hypergeometric analysis against the KEGG database for skeletal muscle tissue genes shows that two cardiac diseases, "Hypertrophic cardiomyopathy (HCM)" and "Arrhythmogenic right ventricular cardiomyopathy (ARVC)", are heavily enriched, although they are considered to have no relationship with skeletal functions. Thirdly, "Prostate cancer" is intensively enriched in hypergeometric analysis against the Disease Ontology (DO) for the skeletal muscle tissue genes, which is a highly unexpected phenomenon.
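
The hypergeometric analysis mentioned above tests whether a gene set overlaps an annotation term more often than chance would predict. A minimal sketch of the one-sided p-value follows. Apart from the background size of 17,343 genes, which comes from the abstract, every number is a hypothetical example, and the paper's exact procedure (including any multiple-testing correction) is not specified here.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k): chance of at least k annotated genes among n drawn from N,
    when K of the N background genes carry the annotation."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

# Hypothetical example: 100 genes annotated to a term, a tissue gene set of
# 200 genes, and 10 of those 200 carrying the annotation.
p_value = hypergeom_pvalue(N=17343, K=100, n=200, k=10)
```

Here the expected overlap under random sampling is about 1.2 genes, so an observed overlap of 10 yields a very small p-value, which is what "enriched" means in this context.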


Subjects
Gene Expression Profiling; Genes, Essential; Knowledge Bases; Databases, Genetic; Gene Frequency; Humans; Information Storage and Retrieval/methods