Results 1 - 17 of 17
1.
Nature ; 622(7984): 810-817, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37853121

ABSTRACT

Highly pathogenic avian influenza (HPAI) H5N1 activity has intensified globally since 2021, increasingly causing mass mortality in wild birds and poultry and incidental infections in mammals [1-3]. However, the ecological and virological properties that underscore future mitigation strategies still remain unclear. Using epidemiological, spatial and genomic approaches, we demonstrate changes in the origins of resurgent HPAI H5 and reveal significant shifts in virus ecology and evolution. Outbreak data show key resurgent events in 2016-2017 and 2020-2021, contributing to the emergence and panzootic spread of H5N1 in 2021-2022. Genomic analysis reveals that the 2016-2017 epizootics originated in Asia, where HPAI H5 reservoirs are endemic. In 2020-2021, 2.3.4.4b H5N8 viruses emerged in African poultry, featuring mutations altering HA structure and receptor binding. In 2021-2022, a new H5N1 virus evolved through reassortment in wild birds in Europe, undergoing further reassortment with low-pathogenic avian influenza in wild and domestic birds during global dissemination. These results highlight a shift in the HPAI H5 epicentre beyond Asia and indicate that increasing persistence of HPAI H5 in wild birds is facilitating geographic and host range expansion, accelerating dispersion velocity and increasing reassortment potential. As earlier outbreaks of H5N1 and H5N8 were caused by more stable genomic constellations, these recent changes reflect adaptation across the domestic-bird-wild-bird interface. Elimination strategies in domestic birds therefore remain a high priority to limit future epizootics.


Subject(s)
Birds , Disease Outbreaks , Influenza A Virus, H5N1 Subtype , Influenza in Birds , Internationality , Animals , Africa/epidemiology , Animals, Wild/virology , Asia/epidemiology , Birds/virology , Disease Outbreaks/prevention & control , Disease Outbreaks/statistics & numerical data , Disease Outbreaks/veterinary , Europe/epidemiology , Evolution, Molecular , Host Specificity , Influenza A Virus, H5N1 Subtype/classification , Influenza A Virus, H5N1 Subtype/genetics , Influenza A Virus, H5N1 Subtype/isolation & purification , Influenza A Virus, H5N1 Subtype/pathogenicity , Influenza A Virus, H5N8 Subtype/genetics , Influenza A Virus, H5N8 Subtype/isolation & purification , Influenza in Birds/epidemiology , Influenza in Birds/mortality , Influenza in Birds/transmission , Influenza in Birds/virology , Mammals/virology , Mutation , Phylogeny , Poultry/virology
2.
Nat Commun ; 14(1): 2422, 2023 04 27.
Article in English | MEDLINE | ID: mdl-37105966

ABSTRACT

Hong Kong experienced a surge of Omicron BA.2 infections in early 2022, resulting in one of the highest per-capita death rates of COVID-19. The outbreak occurred in a dense population with low immunity towards natural SARS-CoV-2 infection, high vaccine hesitancy in vulnerable populations, comprehensive disease surveillance and the capacity for stringent public health and social measures (PHSMs). By analyzing genome sequences and epidemiological data, we reconstructed the epidemic trajectory of the BA.2 wave and found that the initial BA.2 community transmission emerged from cross-infection within hotel quarantine. The rapid implementation of PHSMs suppressed early epidemic growth, but the effective reproduction number (Re) increased again during the Spring Festival in early February and remained around 1 until early April. Independent estimates of point prevalence and incidence using phylodynamics also showed extensive superspreading at this time, which likely contributed to the rapid expansion of the epidemic. Discordant inferences based on genomic and epidemiological data underscore the need for research to improve near real-time epidemic growth estimates by combining multiple disparate data sources to better inform outbreak response policy.


Subject(s)
COVID-19 , Humans , COVID-19/epidemiology , Hong Kong/epidemiology , SARS-CoV-2/genetics , Disease Outbreaks , Basic Reproduction Number
3.
Virus Evol ; 8(2): veac062, 2022.
Article in English | MEDLINE | ID: mdl-35919872

ABSTRACT

China experienced a resurgence of seasonal influenza activity throughout 2021 despite intermittent control measures and prolonged international border closure. We show genomic evidence for multiple A(H3N2), A(H1N1), and B/Victoria transmission lineages circulating over 3 years, with the 2021 resurgence mainly driven by two B/Victoria clades. Phylodynamic analysis revealed unsampled ancestry prior to widespread outbreaks in December 2020, showing that influenza lineages can circulate cryptically under non-pharmaceutical interventions enacted against COVID-19. Novel haemagglutinin gene mutations and altered age profiles of infected individuals were observed, and Jiangxi province was identified as a major source for nationwide outbreaks. Following major holiday periods, fluctuations in the effective reproduction number were observed, underscoring the importance of influenza vaccination prior to holiday periods or travel. Extensive heterogeneity in seasonal influenza circulation patterns in China determined by historical strain circulation indicates that a better understanding of demographic patterns is needed for improving effective controls.

4.
Nat Commun ; 13(1): 2884, 2022 05 24.
Article in English | MEDLINE | ID: mdl-35610217

ABSTRACT

Human respiratory syncytial virus (RSV) is an important cause of acute respiratory infection with the most severe disease in the young and elderly. Non-pharmaceutical interventions and travel restrictions for controlling COVID-19 have impacted the circulation of most respiratory viruses including RSV globally, particularly in Australia, where during 2020 the normal winter epidemics were notably absent. However, in late 2020, unprecedented widespread RSV outbreaks occurred, beginning in spring, and extending into summer across two widely separated regions of the Australian continent, New South Wales (NSW) and Australian Capital Territory (ACT) in the east, and Western Australia. Through genomic sequencing we reveal a major reduction in RSV genetic diversity following COVID-19 emergence with two genetically distinct RSV-A clades circulating cryptically, likely localised for several months prior to an epidemic surge in cases upon relaxation of COVID-19 control measures. The NSW/ACT clade subsequently spread to the neighbouring state of Victoria, causing extensive outbreaks and hospitalisations in early 2021. These findings highlight the need for continued surveillance and sequencing of RSV and other respiratory viruses during and after the COVID-19 pandemic, as mitigation measures may disrupt seasonal patterns, causing larger or more severe outbreaks.


Subject(s)
COVID-19 , Respiratory Syncytial Virus Infections , Respiratory Syncytial Virus, Human , Aged , COVID-19/epidemiology , COVID-19/prevention & control , Humans , Infant , Pandemics/prevention & control , Respiratory Syncytial Virus Infections/epidemiology , Respiratory Syncytial Virus Infections/prevention & control , Respiratory Syncytial Virus, Human/genetics , Seasons , Victoria
5.
Nat Commun ; 13(1): 1721, 2022 03 31.
Article in English | MEDLINE | ID: mdl-35361789

ABSTRACT

Annual epidemics of seasonal influenza cause hundreds of thousands of deaths, high levels of morbidity, and substantial economic loss. Yet, global influenza circulation has been heavily suppressed by public health measures and travel restrictions since the onset of the COVID-19 pandemic. Notably, the influenza B/Yamagata lineage has not been conclusively detected since April 2020, and A(H3N2), A(H1N1), and B/Victoria viruses have since circulated with considerably less genetic diversity. Travel restrictions have largely confined regional outbreaks of A(H3N2) to South and Southeast Asia, B/Victoria to China, and A(H1N1) to West Africa. Seasonal influenza transmission lineages continue to perish globally, except in these select hotspots, which will likely seed future epidemics. Waning population immunity and sporadic case detection will further challenge influenza vaccine strain selection and epidemic control. We offer a perspective on the potential short- and long-term evolutionary dynamics of seasonal influenza and discuss potential consequences and mitigation strategies as global travel gradually returns to pre-pandemic levels.


Subject(s)
COVID-19 , Influenza A Virus, H1N1 Subtype , Influenza Vaccines , Influenza, Human , COVID-19/epidemiology , Humans , Influenza A Virus, H3N2 Subtype , Influenza, Human/epidemiology , Influenza, Human/prevention & control , Pandemics/prevention & control , Seasons
6.
Nat Commun ; 13(1): 736, 2022 02 08.
Article in English | MEDLINE | ID: mdl-35136039

ABSTRACT

Hong Kong employed a strategy of intermittent public health and social measures alongside increasingly stringent travel regulations to eliminate domestic SARS-CoV-2 transmission. By analyzing 1899 genome sequences (>18% of confirmed cases) from 23-January-2020 to 26-January-2021, we reveal the effects of fluctuating control measures on the evolution and epidemiology of SARS-CoV-2 lineages in Hong Kong. Despite numerous importations, only three introductions were responsible for 90% of locally-acquired cases. Community outbreaks were caused by novel introductions rather than a resurgence of circulating strains. Thus, local outbreak prevention requires strong border control and community surveillance, especially during periods of less stringent social restriction. Non-adherence to prolonged preventative measures may explain sustained local transmission observed during wave four in late 2020 and early 2021. We also found that, due to a tight transmission bottleneck, transmission of low-frequency single nucleotide variants between hosts is rare.


Subject(s)
COVID-19/epidemiology , SARS-CoV-2/genetics , COVID-19/transmission , COVID-19/virology , Genomics , Hong Kong/epidemiology , Humans , Public Health , SARS-CoV-2/isolation & purification , SARS-CoV-2/physiology , Travel
7.
J Travel Med ; 28(8)2021 12 29.
Article in English | MEDLINE | ID: mdl-34542623

ABSTRACT

BACKGROUND: A large cluster of 59 cases was linked to a single flight with 146 passengers from New Delhi to Hong Kong in April 2021. This outbreak coincided with early reports of exponential pandemic growth in New Delhi, which reached a peak of > 400 000 newly confirmed cases on 7 May 2021. METHODS: Epidemiological information including date of symptom onset, date of positive-sample detection and travel and contact history for individual cases from this flight was collected. Whole genome sequencing was performed, and sequences were classified based on the dynamic Pango nomenclature system. Maximum-likelihood phylogenetic analysis compared sequences from this flight alongside other cases imported from India to Hong Kong on 26 flights between June 2020 and April 2021, as well as sequences from India or associated with India-related travel from February to April 2021 and 1217 reference sequences. RESULTS: Sequence analysis identified six lineages of SARS-CoV-2 belonging to two variants of concern (Alpha and Delta) and one variant of public health interest (Kappa) involved in this outbreak. Phylogenetic analysis confirmed at least three independent sub-lineages of Alpha with limited onward transmission, a superspreading event comprising 37 cases of Kappa and transmission of Delta to only one passenger. Additional analysis of another 26 flights from India to Hong Kong confirmed widespread circulation of all three variants in India since early March 2021. CONCLUSIONS: The broad spectrum of disease severity and long incubation period of SARS-CoV-2 pose a challenge for surveillance and control. As illustrated by this particular outbreak, opportunistic infections of SARS-CoV-2 can occur irrespective of variant lineage, and requiring a nucleic acid test within 72 hours of departure may be insufficient to prevent importation or in-flight transmission.


Subject(s)
Air Travel , COVID-19 , Travel-Related Illness , COVID-19/epidemiology , COVID-19/transmission , Disease Outbreaks , Hong Kong , Humans , India , Phylogeny
8.
J Virol ; 95(24): e0126721, 2021 11 23.
Article in English | MEDLINE | ID: mdl-34586866

ABSTRACT

Introduction of non-pharmaceutical interventions to control COVID-19 in early 2020 coincided with a global decrease in active influenza circulation. However, between July and November 2020, an influenza A(H3N2) epidemic occurred in Cambodia and in other neighboring countries in the Greater Mekong Subregion in Southeast Asia. We characterized the genetic and antigenic evolution of A(H3N2) in Cambodia and found that the 2020 epidemic comprised genetically and antigenically similar viruses of clade 3C2a1b/131K/94N, but they were distinct from the WHO-recommended influenza A(H3N2) vaccine virus components for the 2020-2021 Northern Hemisphere season. Phylogenetic analysis revealed multiple virus migration events between Cambodia and bordering countries, with Lao PDR and Vietnam also reporting similar A(H3N2) epidemics immediately following the Cambodia outbreak; however, there was limited circulation of these viruses elsewhere globally. In February 2021, a virus from the Cambodian outbreak was recommended by WHO as the prototype virus for inclusion in the 2021-2022 Northern Hemisphere influenza vaccine. IMPORTANCE: The 2019 coronavirus disease (COVID-19) pandemic has significantly altered the circulation patterns of respiratory diseases worldwide and disrupted continued surveillance in many countries. Introduction of control measures in early 2020 against Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) infection has resulted in a remarkable reduction in the circulation of many respiratory diseases. Influenza activity has remained at historically low levels globally since March 2020, even when increased influenza testing was performed in some countries. Maintenance of the influenza surveillance system in Cambodia in 2020 allowed for the detection and response to an influenza A(H3N2) outbreak in late 2020, resulting in the inclusion of this virus in the 2021-2022 Northern Hemisphere influenza vaccine.


Subject(s)
COVID-19/epidemiology , Influenza A Virus, H3N2 Subtype/genetics , Influenza Vaccines/immunology , Influenza, Human/complications , Influenza, Human/immunology , Cambodia/epidemiology , Disease Outbreaks , Humans , Influenza, Human/epidemiology , Influenza, Human/virology , Laos , Likelihood Functions , Phylogeny , SARS-CoV-2 , Vietnam
9.
Emerg Infect Dis ; 27(10): 2666-2668, 2021 10.
Article in English | MEDLINE | ID: mdl-34545799

ABSTRACT

We sequenced 10% of imported severe acute respiratory syndrome coronavirus 2 infections detected in travelers to Hong Kong and revealed the genomic diversity of regions of origin, including lineages not previously reported from those countries. Our results suggest that international or regional travel hubs might be useful surveillance sites to monitor sequence diversity.


Subject(s)
COVID-19 , Communicable Diseases, Imported , Genetic Variation , Hong Kong/epidemiology , Humans , SARS-CoV-2
10.
medRxiv ; 2021 Jun 23.
Article in English | MEDLINE | ID: mdl-34189537

ABSTRACT

Hong Kong utilized an elimination strategy with intermittent use of public health and social measures and increasingly stringent travel regulations to control SARS-CoV-2 transmission. By analyzing >1700 genome sequences representing 17% of confirmed cases from 23-January-2020 to 26-January-2021, we reveal the effects of fluctuating control measures on the evolution and epidemiology of SARS-CoV-2 lineages in Hong Kong. Despite numerous importations, only three introductions were responsible for 90% of locally-acquired cases, two of which circulated cryptically for weeks while less stringent measures were in place. We found that SARS-CoV-2 within-host diversity was most similar among transmission pairs and epidemiological clusters due to a strong transmission bottleneck through which similar genetic background generates similar within-host diversity. ONE SENTENCE SUMMARY: Out of the 170 detected introductions of SARS-CoV-2 in Hong Kong during 2020, three introductions caused 90% of community cases.

11.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-32599617

ABSTRACT

Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is increasingly difficult to identify novel VFs using existing tools that were developed on outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performance, as the majority of tools focus on extracting very few types of features. By addressing the aforementioned issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome-wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that uses the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks, are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy. 
Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user's viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.
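
The stacking strategy named in this abstract can be illustrated with a toy example. The sketch below is not the DeepVF implementation: the base learners (one-feature threshold classifiers), the logistic-regression meta model, the synthetic data and every function name are assumptions chosen only to show the idea that a meta model is trained on out-of-fold predictions from the baseline models.

```python
import math
import random

def fit_stump(xs, ys, feature):
    """Toy base learner: threshold halfway between the two class means."""
    pos = [x[feature] for x, y in zip(xs, ys) if y == 1]
    neg = [x[feature] for x, y in zip(xs, ys) if y == 0]
    return feature, (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def stump_score(model, x):
    """Signed distance to the threshold (positive suggests class 1)."""
    feature, threshold = model
    return x[feature] - threshold

def fit_logistic(rows, ys, epochs=100, lr=0.05):
    """Toy meta model: logistic regression fitted by stochastic gradient descent."""
    w, b = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for r, y in zip(rows, ys):
            z = sum(wi * ri for wi, ri in zip(w, r)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            grad = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * grad * ri for wi, ri in zip(w, r)]
            b -= lr * grad
    return w, b

def stack_fit(xs, ys, n_features, k=5):
    """Train base models per fold, then a meta model on out-of-fold scores."""
    n = len(xs)
    meta_rows = [None] * n
    for fold in range(k):
        held = set(range(fold, n, k))
        tr_x = [xs[i] for i in range(n) if i not in held]
        tr_y = [ys[i] for i in range(n) if i not in held]
        bases = [fit_stump(tr_x, tr_y, f) for f in range(n_features)]
        for i in held:
            # Each sample is scored only by models that never saw it.
            meta_rows[i] = [stump_score(m, xs[i]) for m in bases]
    meta = fit_logistic(meta_rows, ys)
    # Refit the base models on all data for use at prediction time.
    bases = [fit_stump(xs, ys, f) for f in range(n_features)]
    return bases, meta

def stack_predict(bases, meta, x):
    w, b = meta
    z = sum(wi * stump_score(m, x) for wi, m in zip(w, bases)) + b
    return 1 if z > 0 else 0

# Synthetic two-feature data (illustrative only, unrelated to the paper's dataset).
random.seed(0)
xs, ys = [], []
for i in range(200):
    label = i % 2
    centre = 1.0 if label else -1.0
    xs.append([random.gauss(centre, 1.0), random.gauss(centre, 1.0)])
    ys.append(label)

bases, meta = stack_fit(xs, ys, n_features=2)
accuracy = sum(stack_predict(bases, meta, x) == y for x, y in zip(xs, ys)) / len(xs)
```

Using out-of-fold scores as meta features keeps the meta model from simply rewarding base models that have memorised their own training samples.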


Subject(s)
Bacteria , Bacterial Proteins/genetics , Databases, Protein , Deep Learning , Genome, Bacterial , Virulence Factors/genetics , Bacteria/genetics , Bacteria/pathogenicity
12.
Nucleic Acids Res ; 49(D1): D630-D638, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33137193

ABSTRACT

Anti-CRISPR (Acr) proteins naturally inhibit CRISPR-Cas adaptive immune systems across bacterial and archaeal domains of life. This emerging field has caused a paradigm shift in the way we think about the CRISPR-Cas system, and promises a number of useful applications from gene editing to phage therapy. As the number of verified and predicted Acrs rapidly expands, few online resources have been developed to deal with this wealth of information. To overcome this shortcoming, we developed AcrHub, an integrative database to provide an all-in-one solution for investigating, predicting and mapping Acr proteins. AcrHub catalogs 339 non-redundant experimentally validated Acrs and over 70 000 predicted Acrs extracted from genome sequence data from a diverse range of prokaryotic organisms and their viruses. It integrates state-of-the-art predictors to predict potential Acrs, and incorporates three analytical modules: similarity analysis, phylogenetic analysis and homology network analysis, to analyze their relationships with known Acrs. By interconnecting all modules as a platform, AcrHub presents enriched and in-depth analysis of known and potential Acrs and therefore provides new and exciting insights into the future of Acr discovery and validation. AcrHub is freely available at http://pacrispr.erc.monash.edu/AcrHub/.


Subject(s)
CRISPR-Cas Systems/genetics , Databases, Protein , Data Analysis , Internet
13.
Nucleic Acids Res ; 49(D1): D651-D659, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33084862

ABSTRACT

Gram-negative bacteria utilize secretion systems to export substrates into their surrounding environment or directly into neighboring cells. These substrates are proteins that function to promote bacterial survival: by facilitating nutrient collection, disabling competitor species or, for pathogens, to disable host defenses. Following a rapid development of computational techniques, a growing number of substrates have been discovered and subsequently validated by wet lab experiments. To date, several online databases have been developed to catalogue these substrates but they have limited user options for in-depth analysis, and typically focus on a single type of secreted substrate. We therefore developed a universal platform, BastionHub, that incorporates extensive functional modules to facilitate substrate analysis and integrates the five major Gram-negative secreted substrate types (i.e. from types I-IV and VI secretion systems). To our knowledge, BastionHub is not only the most comprehensive online database available, it is also the first to incorporate substrates secreted by type I or type II secretion systems. By providing the most up-to-date details of secreted substrates and state-of-the-art prediction and visualized relationship analysis tools, BastionHub will be an important platform that can assist biologists in uncovering novel substrates and formulating new hypotheses. BastionHub is freely available at http://bastionhub.erc.monash.edu/.


Subject(s)
Databases as Topic , Gram-Negative Bacteria/metabolism , Data Curation , Molecular Sequence Annotation , Substrate Specificity
14.
Nucleic Acids Res ; 48(W1): W348-W357, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32459325

ABSTRACT

Anti-CRISPRs are widespread amongst bacteriophage and promote bacteriophage infection by inactivating the bacterial host's CRISPR-Cas defence system. Identifying and characterizing anti-CRISPR proteins opens an avenue to explore and control CRISPR-Cas machineries for the development of new CRISPR-Cas based biotechnological and therapeutic tools. Past studies have identified anti-CRISPRs in several model phage genomes, but a challenge exists to comprehensively screen for anti-CRISPRs accurately and efficiently from genome and metagenome sequence data. Here, we have developed an ensemble learning based predictor, PaCRISPR, to accurately identify anti-CRISPRs from protein datasets derived from genome and metagenome sequencing projects. PaCRISPR employs different types of feature recognition united within an ensemble framework. Extensive cross-validation and independent tests show that PaCRISPR achieves a significantly more accurate performance compared with homology-based baseline predictors and an existing toolkit. The performance of PaCRISPR was further validated in discovering anti-CRISPRs that were not part of the training for PaCRISPR, but which were recently demonstrated to function as anti-CRISPRs for phage infections. Data visualization on anti-CRISPR relationships, highlighting sequence similarity and phylogenetic considerations, is part of the output from the PaCRISPR toolkit, which is freely available at http://pacrispr.erc.monash.edu/.


Subject(s)
Bacteriophages , CRISPR-Cas Systems , Software , Viral Proteins/chemistry , Computer Graphics , Machine Learning , Sequence Analysis, Protein
15.
Bioinformatics ; 36(3): 704-712, 2020 02 01.
Article in English | MEDLINE | ID: mdl-31393553

ABSTRACT

MOTIVATION: Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Currently, several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data. RESULTS: In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as a machine learning approach and the necessary parameter optimization was performed by a particle swarm optimization strategy. 
All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. Consequently, the final ensemble model achieved a superior performance with an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, and outperformed previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors. AVAILABILITY AND IMPLEMENTATION: http://pengaroo.erc.monash.edu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
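
Accuracy, F-value and the Matthews correlation coefficient, as reported in this abstract, are all functions of a binary confusion matrix. The following is a minimal sketch of the standard definitions; the function name and the counts are hypothetical and are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, F1 and MCC from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative for imbalanced classes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}

# Hypothetical counts for illustration only (not results from the paper).
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
```

The area under the curve is omitted here because it requires ranked prediction scores rather than confusion-matrix counts.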


Subject(s)
Algorithms , Machine Learning , Computational Biology , Peptides , Proteins
16.
Brief Bioinform ; 20(6): 2185-2199, 2019 11 27.
Article in English | MEDLINE | ID: mdl-30351377

ABSTRACT

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.


Subject(s)
Computational Biology , Lysine/metabolism , Machine Learning , Malonates/metabolism , Animals , Humans
17.
Bioinformatics ; 35(12): 2017-2028, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30388198

ABSTRACT

MOTIVATION: Type III secreted effectors (T3SEs) can be injected into host cell cytoplasm via type III secretion systems (T3SSs) to modulate interactions between Gram-negative bacterial pathogens and their hosts. Due to their relevance in pathogen-host interactions, significant computational efforts have been put toward identification of T3SEs and these in turn have stimulated new T3SE discoveries. However, as T3SEs with new characteristics are discovered, these existing computational tools reveal important limitations: (i) most of the trained machine learning models are based on the N-terminus (or incorporating also the C-terminus) instead of the proteins' complete sequences, and (ii) the underlying models (trained with classic algorithms) employed only few features, most of which were extracted based on sequence-information alone. To achieve better T3SE prediction, we must identify more powerful, informative features and investigate how to effectively integrate these into a comprehensive model. RESULTS: In this work, we present Bastion3, a two-layer ensemble predictor developed to accurately identify type III secreted effectors from protein sequence data. In contrast with existing methods that employ single models with few features, Bastion3 explores a wide range of features, from various types, trains single models based on these features and finally integrates these models through ensemble learning. We trained the models using a new gradient boosting machine, LightGBM and further boosted the models' performances through a novel genetic algorithm (GA) based two-step parameter optimization strategy. Our benchmark test demonstrates that Bastion3 achieves a much better performance compared to commonly used methods, with an ACC value of 0.959, F-value of 0.958, MCC value of 0.917 and AUC value of 0.956, which comprehensively outperformed all other toolkits by more than 5.6% in ACC value, 5.7% in F-value, 12.4% in MCC value and 5.8% in AUC value. 
Based on our proposed two-layer ensemble model, we further developed a user-friendly online toolkit, maximizing convenience for experimental scientists toward T3SE prediction. With its design to ease future discoveries of novel T3SEs and improved performance, Bastion3 is poised to become a widely used, state-of-the-art toolkit for T3SE prediction. AVAILABILITY AND IMPLEMENTATION: http://bastion3.erc.monash.edu/. CONTACT: selkrig@embl.de or wyztli@163.com or trevor.lithgow@monash.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Machine Learning , Algorithms , Amino Acid Sequence , Bacterial Proteins , Computational Biology , Gram-Negative Bacteria , Software