Results 1 - 20 of 55
1.
Front Microbiol ; 15: 1339156, 2024.
Article in English | MEDLINE | ID: mdl-38572227

ABSTRACT

Traditional alignment-based methods face serious challenges in genome sequence comparison and phylogeny reconstruction because of their high computational complexity. Here, we propose a new alignment-free method for analyzing the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method characterize the frequency information and the context information of k-mers in a sequence, respectively. For each DNA or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model, and this vector is used to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested by constructing phylogenetic trees from both DNA and protein sequences in 8 virus datasets. For each dataset, we compared the Robinson-Foulds (RF) distance between the reference tree and the tree constructed by CGRWDL with the RF distances obtained by other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL accurately classify the viruses, and that the RF distances between these trees and the reference trees are smaller than those obtained with other methods.
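
To make the chaos game representation mentioned in this abstract concrete, the following is a minimal Python sketch of the standard CGR construction for a DNA sequence, together with a frequency grid (often called FCGR) that counts how many CGR points fall into each cell. It is not the CGRWDL implementation: the corner assignment, the function names `cgr_points` and `fcgr`, and the choice to skip ambiguous bases are assumptions made for illustration, and the DL-model weighting described in the abstract is not reproduced.

```python
# Corner assignment is one common convention; it is an assumption here.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR points of a DNA sequence in the unit square.

    Each point is the midpoint between the previous point and the corner
    of the current base, starting from the centre (0.5, 0.5).
    Characters other than A, C, G, T are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count CGR points in a 2^k x 2^k grid, indexed as grid[ix][iy].

    After at least k steps, the cell of a point is determined by the last
    k bases, so the grid is a table of k-mer counts laid out in the plane
    (for sequences without skipped characters).
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq)[k - 1:]:
        grid[int(x * size)][int(y * size)] += 1
    return grid
```

Flattening the grid returned by `fcgr` gives one possible fixed-length feature vector per sequence; any weighting scheme would be applied on top of these counts.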

2.
Chaos ; 34(1)2024 Jan 01.
Article in English | MEDLINE | ID: mdl-38198680

ABSTRACT

The significance of accurate long-term forecasting of air quality for a long-term policy decision for controlling air pollution and for evaluating its impacts on human health has attracted greater attention recently. This paper proposes an ensemble multi-scale framework to refine the previous version with ensemble empirical mode decomposition (EMD) and nonstationary oscillation resampling (NSOR) for long-term forecasting. Within the proposed ensemble multi-scale framework, we on one hand apply modified EMD to produce more regular and stable EMD components, allowing the long-range oscillation characteristics of the original time series to be better captured. On the other hand, we provide an ensemble mechanism to alleviate the error propagation problem in forecasts caused by iterative implementation of NSOR at all lead times and name it improved NSOR. Application of the proposed multi-scale framework to long-term forecasting of the daily PM2.5 at 14 monitoring stations in Hong Kong demonstrates that it can effectively capture the long-term variation in air pollution processes and significantly increase the forecasting performance. Specifically, the framework can, respectively, reduce the average root-mean-square error and the mean absolute error over all 14 stations by 8.4% and 9.2% for a lead time of 100 days, compared to previous studies. Additionally, better robustness can be obtained by the proposed ensemble framework for 180-day and 365-day long-term forecasting scenarios. It should be emphasized that the proposed ensemble multi-scale framework is a feasible framework, which is applicable for long-term time series forecasting in general.

3.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37401373

ABSTRACT

Recent advances in artificial intelligence (AI), including deep and graph learning models, have established their usefulness in biomedical applications, especially in the study of drug-drug interactions (DDIs). A DDI is a change in the effect of one drug due to the presence of another drug in the human body, and DDIs play an essential role in drug discovery and clinical research. Predicting DDIs through traditional clinical trials and experiments is expensive and time-consuming. To apply advanced AI and deep learning correctly, developers and users face various challenges, such as the availability and encoding of data resources and the design of computational methods. This review summarizes chemical structure-based, network-based, natural language processing-based and hybrid methods, providing an updated and accessible guide for the broad research and development community with different domain knowledge. We introduce widely used molecular representations and describe the theoretical frameworks of graph neural network models for representing molecular structures. We present the advantages and disadvantages of deep and graph learning methods by performing comparative experiments. We discuss the potential technical challenges and highlight future directions of deep and graph learning models for accelerating DDI prediction.


Subjects
Artificial Intelligence; Neural Networks, Computer; Humans; Drug Interactions; Natural Language Processing; Drug Discovery
4.
Front Immunol ; 14: 1160397, 2023.
Article in English | MEDLINE | ID: mdl-37377963

ABSTRACT

Introduction: An increasing number of studies have shown substantial links between autoimmune diseases, and one hypothesis for this comorbidity is a common genetic cause. Methods: In this paper, a large-scale cross-trait genome-wide association study (GWAS) analysis was conducted to investigate the genetic overlap among rheumatoid arthritis, multiple sclerosis, inflammatory bowel disease and type 1 diabetes. Results and discussion: Local genetic correlation analysis revealed 2 regions with locally significant genetic associations between rheumatoid arthritis and multiple sclerosis, and 4 regions with locally significant genetic associations between rheumatoid arthritis and type 1 diabetes. Cross-trait meta-analysis identified, with genome-wide significance, 58 independent loci associated with rheumatoid arthritis and multiple sclerosis, 86 independent loci associated with rheumatoid arthritis and inflammatory bowel disease, and 107 independent loci associated with rheumatoid arthritis and type 1 diabetes. In addition, 82 common risk genes were found through genetic identification. Gene set enrichment analysis showed that the shared genes are enriched in exposed dermal system, calf, musculoskeletal, subcutaneous fat, thyroid and other tissues, and are also significantly enriched in 35 biological pathways. To verify the association between diseases, Mendelian randomization analysis was performed, which shows possible causal associations between rheumatoid arthritis and multiple sclerosis, and between rheumatoid arthritis and type 1 diabetes. These studies explored the common genetic structure of rheumatoid arthritis, multiple sclerosis, inflammatory bowel disease and type 1 diabetes, and this discovery is expected to lead to new ideas for clinical treatment.


Subjects
Arthritis, Rheumatoid; Autoimmune Diseases; Diabetes Mellitus, Type 1; Inflammatory Bowel Diseases; Multiple Sclerosis; Humans; Genome-Wide Association Study; Diabetes Mellitus, Type 1/genetics; Genetic Predisposition to Disease; Autoimmune Diseases/genetics; Arthritis, Rheumatoid/genetics; Genetic Loci; Multiple Sclerosis/genetics; Inflammatory Bowel Diseases/genetics
5.
Front Cell Infect Microbiol ; 13: 1117421, 2023.
Article in English | MEDLINE | ID: mdl-36779183

ABSTRACT

Introduction: The species diversity of microbiomes is a cutting-edge concept in metagenomic research. In this study, we propose a multifractal analysis for metagenomic research. Method and Results: First, we visualized the chaos game representation (CGR) of simulated and real metagenomes and found that the resulting plots exhibit self-similarity. We then defined and calculated the multifractal dimension of the CGR plots of the simulated and real metagenomes. By analyzing the Pearson correlation coefficients between the multifractal dimension and the traditional species diversity indices, we found that the correlation coefficients between the multifractal dimension and the species richness index and Shannon diversity index reached their maximum values when q = 0, 1, and that the correlation coefficient between the multifractal dimension and the Simpson diversity index reached its maximum value when q = 5. Finally, we applied our method to real metagenomes of the gut microbiota of 100 infants sampled as newborns and at 4 and 12 months of age. The results show that the multifractal dimensions of an infant's gut microbiome can distinguish age differences. Conclusion and Discussion: There is self-similarity among the CGRs of WGS of metagenomes, and the multifractal spectrum is an important characteristic of metagenomes. The traditional diversity indicators can be unified under the framework of multifractal analysis. These results are consistent with similar results in macrobial ecology. The multifractal spectrum of infants' gut microbiomes is related to the development of the infants.


Subjects
Gastrointestinal Microbiome; Microbiota; Humans; Infant; Infant, Newborn; Metagenome; Microbiota/genetics; Gastrointestinal Microbiome/genetics; Metagenomics/methods; Ecology
6.
Mol Phylogenet Evol ; 179: 107662, 2023 02.
Article in English | MEDLINE | ID: mdl-36375789

ABSTRACT

Alignment-based methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational complexity. Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Here, we explore an alignment-free approach that uses inner distance distributions of k-mer pairs in biological sequences for phylogeny inference. For every sequence in a dataset, our method transforms the sequence into a numeric feature vector consisting of features each representing a specific k-mer pair's contribution to the characterization of the sequentiality uniqueness of the sequence. This newly defined k-mer pair's contribution is an integration of the reverse Kullback-Leibler divergence, pseudo mode and the classic entropy of an inner distance distribution of the k-mer pair in the sequence. Our method has been tested on datasets of complete genome sequences, complete protein sequences, and gene sequences of rRNA of various lengths. Our method achieves the best performance in comparison with state-of-the-art alignment-free methods as measured by the Robinson-Foulds distance between the reference and the constructed phylogeny trees.
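
The Robinson-Foulds distance used as the evaluation measure in this abstract can be illustrated with a short Python sketch. The version below is the rooted (cluster-based) variant: it counts the clusters of leaves that appear in one tree but not in the other. The nested-tuple tree encoding and the function names `clusters` and `rf_distance_rooted` are assumptions made for this example; published comparisons often use the unrooted, split-based definition instead.

```python
def clusters(tree):
    """Return the set of non-trivial leaf clusters of a rooted tree.

    A tree is a nested tuple whose leaves are strings, for example
    (("A", "B"), ("C", "D")). Single leaves and the full leaf set are
    excluded because every tree on the same taxa shares them.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)
    return found


def rf_distance_rooted(tree_a, tree_b):
    """Number of clusters present in exactly one of the two trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))
```

A distance of 0 means the two rooted trees have identical topology; larger values mean more clusters disagree, which is why a smaller distance to the reference tree is read as a better reconstruction.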


Subjects
Algorithms; Genome; Phylogeny
7.
Brief Funct Genomics ; 21(5): 399-407, 2022 09 16.
Article in English | MEDLINE | ID: mdl-35942693

ABSTRACT

Identification and classification of enhancers are highly significant because they play crucial roles in controlling gene transcription. Recently, several deep learning-based methods for identifying enhancers and their strengths have been developed. However, existing methods are usually limited because they use only local or only global features. The combination of local and global features is critical to further improve the prediction performance. In this work, we propose a novel deep learning-based method, called iEnhancer-DLRA, to identify enhancers and their strengths. iEnhancer-DLRA extracts local and multi-scale global features of sequences by using a residual convolutional network and two bidirectional long short-term memory networks. Then, a self-attention fusion strategy is proposed to deeply integrate these local and global features. The experimental results on the independent test dataset indicate that iEnhancer-DLRA performs better than nine existing state-of-the-art methods in both identification and classification of enhancers in almost all metrics. iEnhancer-DLRA achieves 13.8% (for identifying enhancers) and 12.6% (for classifying strengths) improvement in accuracy compared with the best existing state-of-the-art method. This is the first time that the accuracy of an enhancer identifier exceeds 0.9 and the accuracy of the enhancer classifier exceeds 0.8 on the independent test set. Moreover, iEnhancer-DLRA achieves superior predictive performance on the rice dataset compared with the state-of-the-art method RiceENN.


Subjects
Attention; Enhancer Elements, Genetic; Enhancer Elements, Genetic/genetics
8.
Interdiscip Sci ; 14(2): 439-451, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35106702

ABSTRACT

N4-Acetylcytidine (ac4C) is a highly conserved and widespread post-transcriptional RNA modification that plays versatile roles in cellular processes. Because of limitations in techniques and knowledge, large-scale identification of ac4C is still a challenging task. RNA sequences are like sentences that carry semantics in natural language. Inspired by the semantics of language, we proposed a hybrid model for ac4C prediction. The model used a long short-term memory network and a convolutional neural network to extract the semantic features hidden in the sequences. The semantic features and two traditional features (k-nucleotide frequencies and pseudo tri-tuple nucleotide composition) were combined to represent ac4C or non-ac4C sequences. eXtreme Gradient Boosting was used as the learning algorithm. Five-fold cross-validation over the training set consisting of 1160 ac4C and 10,855 non-ac4C sequences obtained an area under the receiver operating characteristic curve (AUROC) of 0.9004, and the independent test over 469 ac4C and 4343 non-ac4C sequences reached an AUROC of 0.8825. The model obtained a sensitivity of 0.6474 in the five-fold cross-validation and 0.6290 in the independent test, outperforming two state-of-the-art methods. The performance of the semantic features alone was better than that of k-nucleotide frequencies and pseudo tri-tuple nucleotide composition, implying that ac4C sequences carry semantics. The proposed hybrid model was implemented as a user-friendly web server, which is freely available to the scientific community: http://47.113.117.61/ac4c/. The presented model and tool are useful for identifying ac4C on a large scale.


Subjects
Cytidine; Nucleotides; Algorithms; Cytidine/analogs & derivatives; Cytidine/genetics; ROC Curve
9.
Front Genet ; 12: 766496, 2021.
Article in English | MEDLINE | ID: mdl-34745231

ABSTRACT

Alignment methods are at a disadvantage in sequence comparison and phylogeny reconstruction because of their high computational costs in terms of time and space complexity. Alignment-free methods, on the other hand, incur low computational costs and have recently gained popularity in the field of bioinformatics. Here we propose a new alignment-free method for phylogenetic tree reconstruction based on whole genome sequences. A key component is a measure called the information-entropy position-weighted k-mer relative measure (IEPWRMkmer), which combines the position-weighted measure of k-mers proposed by our group with the information entropy of k-mer frequencies. The Manhattan distance is used to calculate the pairwise distance between species. Finally, we use the Neighbor-Joining method to construct the phylogenetic tree. To evaluate the performance of this method, we perform phylogenetic analysis on two datasets used by other researchers. The results demonstrate that the IEPWRMkmer method is efficient and reliable. The source code of our method is available at https://github.com/wuyaoqun37/IEPWRMkmer.
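
Two of the generic building blocks named in this abstract, the information entropy of k-mer frequencies and the Manhattan distance between sequences, can be sketched in a few lines of Python. This is not the IEPWRMkmer measure itself, whose position-weighting is not specified in the abstract; the function names and the plain frequency profile are assumptions made for illustration, and the Neighbor-Joining step is left out.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k, alphabet="ACGT"):
    """Relative frequencies of all |alphabet|^k k-mers, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def shannon_entropy(profile):
    """Shannon entropy (in bits) of a frequency profile."""
    return -sum(p * math.log2(p) for p in profile if p > 0)


def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def distance_matrix(seqs, k):
    """Pairwise Manhattan distances between the k-mer profiles of the sequences."""
    profiles = [kmer_profile(s, k) for s in seqs]
    return [[manhattan(p, q) for q in profiles] for p in profiles]
```

The matrix returned by `distance_matrix` has the form a distance-based tree builder such as Neighbor-Joining expects as input.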

10.
Front Genet ; 12: 752732, 2021.
Article in English | MEDLINE | ID: mdl-34764983

ABSTRACT

Knowledge about protein-protein interactions is beneficial in understanding cellular mechanisms. Protein-protein interactions are usually determined according to their protein-protein interaction sites. Due to the limitations of current techniques, it is still a challenging task to detect protein-protein interaction sites. In this article, we presented a method based on deep learning and XGBoost (called DeepPPISP-XGB) for predicting protein-protein interaction sites. The deep learning model served as a feature extractor to remove redundant information from protein sequences. The Extreme Gradient Boosting algorithm was used to construct a classifier for predicting protein-protein interaction sites. The DeepPPISP-XGB achieved the following results: area under the receiver operating characteristic curve of 0.681, a recall of 0.624, and area under the precision-recall curve of 0.339, being competitive with the state-of-the-art methods. We also validated the positive role of global features in predicting protein-protein interaction sites.

11.
Biomedicines ; 9(9)2021 Sep 03.
Article in English | MEDLINE | ID: mdl-34572337

ABSTRACT

Abnormal miRNA functions are widely involved in many diseases recorded in the database of experimentally supported human miRNA-disease associations (HMDD). Some of the associations are complicated: there can be up to five heterogeneous association types between an miRNA and the same disease, namely the genetics type, epigenetics type, circulating miRNAs type, miRNA tissue expression type and miRNA-target interaction type. When one type of association is known for an miRNA-disease pair, it is important to predict any other types of association for a better understanding of the disease mechanism. It is even more important to reveal associations for currently unassociated miRNAs and diseases. Methods have recently been proposed to predict the association types of miRNA-disease pairs through restricted Boltzmann machines, label propagation theories and tensor completion algorithms. None of them has exploited the non-linear characteristics of the miRNA-disease association network to improve performance. We propose to use attributed multi-layer heterogeneous network embedding to learn the latent representations of miRNAs and diseases from each association type and then to predict the existence of each association type for all miRNA-disease pairs. Our method is compared with the two newest methods via 10-fold cross-validation on the database HMDD v3.2, demonstrating its superior prediction performance under different settings. Moreover, our real predictions made beyond the HMDD database can all be validated by the NCBI literature, confirming that our method is capable of accurately predicting new associations of miRNAs with diseases as well as their association types.

12.
Biomed Res Int ; 2021: 9923112, 2021.
Article in English | MEDLINE | ID: mdl-34159204

ABSTRACT

Lysine succinylation is a typical protein post-translational modification and plays a crucial regulatory role in cellular processes. Identifying succinylation sites is fundamental to exploring its functions. Although many computational methods have been developed for this task, few consider the semantic relationships between residues. We combined long short-term memory (LSTM) and convolutional neural network (CNN) into a deep learning method for predicting succinylation sites. The proposed method obtained a Matthews correlation coefficient of 0.2508 on the independent test, outperforming state-of-the-art methods. We also performed an enrichment analysis of succinylated proteins. The results showed that the functions of succinylation are conserved across species but differ to a certain extent between species. On the basis of the proposed method, we developed a user-friendly web server for predicting succinylation sites.


Subjects
Algorithms; Deep Learning; Neural Networks, Computer; Succinic Acid/chemistry; Animals; Area Under Curve; Computational Biology/methods; Escherichia coli; Humans; Internet; Protein Processing, Post-Translational; Proteins/metabolism; Reproducibility of Results; Sensitivity and Specificity
13.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34111889

ABSTRACT

Single-cell sequencing is a biotechnology that sequences one layer of genomic information for individual cells in a tissue sample. For example, single-cell DNA sequencing sequences the DNA of every single cell. Increasing in complexity, single-cell multi-omics sequencing, or single-cell multimodal omics sequencing, profiles multiple layers of omics information from a single cell in parallel. In practice, single-cell multi-omics sequencing detects multiple traits, such as DNA, RNA, methylation information and/or protein profiles, from the same cell for many individual cells in a tissue sample. Multi-omics sequencing has been widely applied to systematically unravel the interplay mechanisms of key components and pathways in cells. This survey reviews recent developments in single-cell multi-omics sequencing and their applications to understanding complex diseases, in particular the COVID-19 pandemic. We also summarize the machine learning and bioinformatics techniques used in the analysis of the intercorrelated multilayer heterogeneous data. We observed that variational inference and graph-based learning are popular approaches, and that Seurat V3 is a commonly used tool to transfer missing variables and labels. We also discuss two intensively studied issues relating to data consistency and diversity, and comment on issues of current concern surrounding the error correction of data pairs and data imputation methods. The survey concludes with some open questions and opportunities for this extraordinary field.


Subjects
COVID-19/genetics; Pandemics; Proteomics; SARS-CoV-2/genetics; Algorithms; COVID-19/virology; Computational Biology; Data Analysis; Genomics; Humans; Machine Learning; SARS-CoV-2/pathogenicity; Single-Cell Analysis
14.
BMC Bioinformatics ; 22(Suppl 6): 142, 2021 Jun 02.
Article in English | MEDLINE | ID: mdl-34078284

ABSTRACT

BACKGROUND: Genomic reads from sequencing platforms contain random errors. Global correction algorithms have been developed, aiming to rectify all possible errors in the reads using generic genome-wide patterns. However, the non-uniform sequencing depths hinder the global approach to conduct effective error removal. As some genes may get under-corrected or over-corrected by the global approach, we conduct instance-based error correction for short reads of disease-associated genes or pathways. The paramount requirement is to ensure the relevant reads, instead of the whole genome, are error-free to provide significant benefits for single-nucleotide polymorphism (SNP) or variant calling studies on the specific genes. RESULTS: To rectify possible errors in the short reads of disease-associated genes, our novel idea is to exploit local sequence features and statistics directly related to these genes. Extensive experiments are conducted in comparison with state-of-the-art methods on both simulated and real datasets of lung cancer associated genes (including single-end and paired-end reads). The results demonstrated the superiority of our method with the best performance on precision, recall and gain rate, as well as on sequence assembly results (e.g., N50, the length of contig and contig quality). CONCLUSION: Instance-based strategy makes it possible to explore fine-grained patterns focusing on specific genes, providing high precision error correction and convincing gene sequence assembly. SNP case studies show that errors occurring at some traditional SNP areas can be accurately corrected, providing high precision and sensitivity for investigations on disease-causing point mutations.


Subjects
Genome; High-Throughput Nucleotide Sequencing; Algorithms; Genomics; Sequence Analysis, DNA
15.
Phys Rev E ; 103(4-1): 043303, 2021 Apr.
Article in English | MEDLINE | ID: mdl-34005996

ABSTRACT

Among various algorithms of multifractal analysis (MFA) for complex networks, the sandbox MFA algorithm behaves with the best computational efficiency. However, the existing sandbox algorithm is still computationally expensive for MFA of large-scale networks with tens of millions of nodes. It is also not clear whether MFA results can be improved by a largely increased size of a theoretical network. To tackle these challenges, a computationally efficient sandbox algorithm (CESA) is presented in this paper for MFA of large-scale networks. Distinct from the existing sandbox algorithm that uses the shortest-path distance matrix to obtain the required information for MFA of networks, our CESA employs the compressed sparse row format of the adjacency matrix and the breadth-first search technique to directly search the neighbor nodes of each layer of center nodes, and then to retrieve the required information. A theoretical analysis reveals that the CESA reduces the time complexity of the existing sandbox algorithm from cubic to quadratic, and also improves the space complexity from quadratic to linear. Then the CESA is demonstrated to be effective, efficient, and feasible through the MFA results of (u,v)-flower model networks from the fifth to the 12th generations. It enables us to study the multifractality of networks of the size of about 11 million nodes with a normal desktop computer. Furthermore, we have also found that increasing the size of (u,v)-flower model network does improve the accuracy of MFA results. Finally, our CESA is applied to a few typical real-world networks of large scale.
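
The core idea described in this abstract, replacing a full shortest-path distance matrix with a breadth-first search over a compressed sparse row (CSR) adjacency structure, can be sketched as follows. This is an illustrative Python sketch and not the CESA code: the function names `build_csr` and `sandbox_counts` are hypothetical, the graph is assumed to be undirected and unweighted, and the subsequent multifractal fitting over many centers and radii is omitted.

```python
from collections import deque


def build_csr(num_nodes, edges):
    """Build CSR arrays (indptr, indices) for an undirected graph.

    The neighbours of node u are indices[indptr[u]:indptr[u + 1]].
    """
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    indptr = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        indptr[i + 1] = indptr[i] + degree[i]
    indices = [0] * indptr[-1]
    cursor = indptr[:-1]  # next free slot for each node (a copy of the offsets)
    for u, v in edges:
        indices[cursor[u]] = v
        cursor[u] += 1
        indices[cursor[v]] = u
        cursor[v] += 1
    return indptr, indices


def sandbox_counts(indptr, indices, center, max_radius):
    """Return M(r), the number of nodes within distance r of the center,
    for r = 0..max_radius, using a radius-limited breadth-first search."""
    dist = {center: 0}
    counts = [0] * (max_radius + 1)
    counts[0] = 1
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == max_radius:
            continue  # do not expand beyond the largest sandbox
        for v in indices[indptr[u]:indptr[u + 1]]:
            if v not in dist:
                dist[v] = dist[u] + 1
                counts[dist[v]] += 1
                queue.append(v)
    for r in range(1, max_radius + 1):
        counts[r] += counts[r - 1]  # layer sizes -> cumulative sandbox sizes
    return counts
```

Because each search stores only the CSR arrays and the nodes it has visited, memory grows with the number of edges rather than with the square of the number of nodes, which is the kind of saving the abstract attributes to avoiding the distance matrix.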

16.
Environ Pollut ; 271: 116381, 2021 Feb 15.
Article in English | MEDLINE | ID: mdl-33421843

ABSTRACT

Air quality forecasting for Hong Kong is a challenge. Even taking the advantages of auto-regressive integrated moving average and some state-of-the-art numerical models, a recently-developed hybrid method for one-day (two- and three-day) ahead forecasting performs similarly to (slightly better than) a simple persistence forecasting. Long-term forecasting also remains an important issue, especially for policy decision for better control of air pollution and for evaluation of the long-term impacts on public health. Given the well-recognized negative effects of PM2.5, NO2 and O3 on public health, we study their time series under the multi-scale framework with empirical mode decomposition and nonstationary oscillation resampling to explore the possibility of long-term forecasting and to improve short-term forecasts in Hong Kong. Applied to a dataset from January 2016 to December 2018, the long-term forecasting (with lead time about 100 days) of the multi-scale framework has the root-mean-square error (RMSE) comparable with that of the short-term (with lead time of one or two days) forecasting by the persistence method, while its improvement for short-term forecasting (with lead time of one, two or three days) is quite substantial over the persistence forecasting, with RMSEs reduced by respectively 44%-47%, 30%-45%, and 40%-60% for PM2.5, NO2, and O3. Compared to the hybrid method, it turns out that, for short-term forecasting for the same data, the multi-scale framework can reduce RMSE by about 25% (respectively 30%) for PM2.5 (respectively NO2 and O3). In addition, we find no significant difference in the forecasting performance of the multi-scale framework among different types of stations. The multi-scale framework is feasible for time series forecasting and applicable to other pollutants in other cities.
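
The persistence baseline and the RMSE comparisons reported in this abstract follow a simple recipe, sketched below in Python. The sketch assumes a regularly sampled series with no missing values; the function names are hypothetical, and the multi-scale framework itself (empirical mode decomposition and nonstationary oscillation resampling) is not reproduced.

```python
import math


def persistence_pairs(series, lead):
    """Pair each persistence forecast with the value it tries to predict.

    The persistence forecast for time t + lead is simply the observation
    at time t.
    """
    forecasts = list(series[:-lead])
    actuals = list(series[lead:])
    return forecasts, actuals


def rmse(forecasts, actuals):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(f - a) ** 2 for f, a in zip(forecasts, actuals)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_rmse, model_rmse):
    """Fraction by which a model lowers the RMSE relative to a baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse
```

Under this definition, a statement such as "RMSE reduced by 44%" corresponds to `relative_reduction(...)` returning 0.44 when the persistence RMSE is the baseline.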


Subjects
Air Pollutants; Air Pollution; Air Pollutants/analysis; Air Pollution/analysis; Cities; Forecasting; Hong Kong; Particulate Matter/analysis
17.
Biochim Biophys Acta Proteins Proteom ; 1869(1): 140539, 2021 01.
Article in English | MEDLINE | ID: mdl-32947024

ABSTRACT

The mono-methylation of histone H3 on lysine 27 (H3K27me1) plays key roles in cellular processes. H3K27me1 interacts with the DNA sequences of miRNAs and regulates their transcription. Therefore, the biological roles of H3K27me1 are closely related to the downstream miRNAs. We proposed a machine learning-based computational method to predict H3K27me1-associated miRNAs and obtained AUCs of 0.6866 and 0.6849 on leave-one-out and five-fold cross-validation, respectively. We also performed enrichment analysis of the transcription factors, GO terms and pathways of H3K27me1-associated miRNAs. Among the top 10 significantly enriched transcription factors, five were unfavorable prognostic markers in renal cancer. The enrichment analysis of molecular function showed that the H3K27me1-associated miRNAs were linked to RNA binding and protein binding, which are involved in the regulation of transcription and translation. The pathway enrichment analysis showed that H3K27me1-associated miRNAs were mainly involved in pathways related to cancers, signaling and viruses.


Subjects
Computational Biology/methods; Histones/genetics; MicroRNAs/genetics; Neoplasm Proteins/genetics; Neoplasms/genetics; Transcription Factors/genetics; Computer Simulation; Gene Expression Regulation, Neoplastic; Gene Ontology; Histones/metabolism; Humans; Machine Learning; Methylation; MicroRNAs/metabolism; Models, Genetic; Neoplasm Proteins/metabolism; Neoplasms/metabolism; Neoplasms/pathology; Signal Transduction; Transcription Factors/metabolism
18.
Bioinformatics ; 37(6): 750-758, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33063094

ABSTRACT

MOTIVATION: Infection with strains of different subtypes and the subsequent crossover reading between the two strands of genomic RNAs by host cells' reverse transcriptase are the main causes of the vast HIV-1 sequence diversity. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) after widespread transmissions in a population. Complete prediction of all the subtype sources of a CRF strain is a complicated machine learning problem. It is also difficult to understand whether a strain is an emerging new subtype and if so, how to accurately identify the new components of the genetic source. RESULTS: We introduce a multi-label learning algorithm for the complete prediction of multiple sources of a CRF sequence as well as the prediction of its chronological number. The prediction is strengthened by a voting of various multi-label learning methods to avoid biased decisions. In our steps, frequency and position features of the sequences are both extracted to capture signature patterns of pure subtypes and CRFs. The method was applied to 7185 HIV-1 sequences, comprising 5530 pure subtype sequences and 1655 CRF sequences. Results have demonstrated that the method can achieve very high accuracy (reaching 99%) in the prediction of the complete set of labels of HIV-1 recombinant forms. A few wrong predictions are actually incomplete predictions, very close to the complete set of genuine labels. AVAILABILITY AND IMPLEMENTATION: https://github.com/Runbin-tang/The-source-of-HIV-CRFs-prediction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
HIV Infections; HIV-1; Genetic Variation; HIV Infections/genetics; HIV-1/genetics; Humans; Molecular Epidemiology; Phylogeny
19.
Entropy (Basel) ; 22(2)2020 Feb 23.
Article in English | MEDLINE | ID: mdl-33286029

ABSTRACT

HIV-1 viruses, which are predominant in the family of HIV viruses, have strong pathogenicity and infectivity. They can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. Our method combines the position distribution information and the counts of the k-mers together. We also propose a metric to determine the optimal k value. We name our method the Position-Weighted k-mers (PWkmer) method. Validation and comparison with the Robinson-Foulds distance method and the modified bootstrap method on a benchmark dataset show that our method is reliable for the phylogenetic analysis of HIV-1 viruses. PWkmer can resolve within-group variations for different known subtypes of Group M of HIV-1 viruses. This method is simple and computationally fast for whole genome phylogenetic analysis.

20.
Entropy (Basel) ; 22(3)2020 Mar 13.
Article in English | MEDLINE | ID: mdl-33286103

ABSTRACT

Genome-wide association study (GWAS) has turned out to be an essential technology for exploring the genetic mechanism of complex traits. To reduce the complexity of computation, it is well accepted to remove unrelated single nucleotide polymorphisms (SNPs) before GWAS, e.g., by using iterative sure independence screening expectation-maximization Bayesian Lasso (ISIS EM-BLASSO) method. In this work, a modified version of ISIS EM-BLASSO is proposed, which reduces the number of SNPs by a screening methodology based on Pearson correlation and mutual information, then estimates the effects via EM-Bayesian Lasso (EM-BLASSO), and finally detects the true quantitative trait nucleotides (QTNs) through likelihood ratio test. We call our method a two-stage mutual information based Bayesian Lasso (MBLASSO). Under three simulation scenarios, MBLASSO improves the statistical power and retains the higher effect estimation accuracy when comparing with three other algorithms. Moreover, MBLASSO performs best on model fitting, the accuracy of detected associations is the highest, and 21 genes can only be detected by MBLASSO in Arabidopsis thaliana datasets.
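
The screening stage described in this abstract, reducing the number of SNPs using Pearson correlation and mutual information before effect estimation, might look roughly like the Python sketch below. Everything specific here is an assumption made for illustration: the thresholds `r_min` and `mi_min`, the median split used to discretize the phenotype, the "either criterion" keep rule, and the function names. The EM-Bayesian Lasso estimation and the likelihood ratio test are not shown.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def screen_snps(genotypes, phenotype, r_min=0.3, mi_min=0.05):
    """Return indices of SNPs that pass either screening criterion.

    genotypes is a list of per-SNP genotype vectors (e.g. coded 0/1/2);
    the phenotype is split at its median for the mutual information
    (an assumed discretization).
    """
    median = sorted(phenotype)[len(phenotype) // 2]
    binned = [int(v >= median) for v in phenotype]
    keep = []
    for j, snp in enumerate(genotypes):
        if (abs(pearson(snp, phenotype)) >= r_min
                or mutual_information(snp, binned) >= mi_min):
            keep.append(j)
    return keep
```

Using both statistics lets a screen retain SNPs whose relationship with the trait is not purely linear, which Pearson correlation alone could miss.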
