TB-ML-a framework for comparing machine learning approaches to predict drug resistance of Mycobacterium tuberculosis.

Libiseller-Egger, Julian; Wang, Linfeng; Deelder, Wouter; Campino, Susana; Clark, Taane G; Phelan, Jody E.

Bioinform Adv ; 3(1): vbad040, 2023.

Article in English | MEDLINE | ID: mdl-37033466

ABSTRACT

Motivation: Machine learning (ML) has shown impressive performance in predicting antimicrobial resistance (AMR) from sequence data, including for Mycobacterium tuberculosis, the causative agent of tuberculosis. However, current ML development and publication practices make it difficult for researchers and clinicians to use, test or reproduce published models. Results: We packaged a number of published and unpublished ML models for predicting AMR of M. tuberculosis into Docker containers. Similarly, the pipelines required for pre-processing genomic data into the formats required by the models were also packaged into separate containers. By following a minimal container I/O standard, we ensured as much interoperability as possible. We also created a command-line application, TB-ML, which can be used to easily combine pre-processing and prediction containers into complete pipelines ready for predicting resistance from novel, raw data with a single command. As long as there is adherence to this minimal standard for the container interface, containers produced by researchers holding new models can likewise be included in these pipelines, making benchmark comparisons of different models simple and facilitating faster uptake in the clinic. Availability and implementation: TB-ML contains a simple Docker API written in Python and is available at https://github.com/jodyphelan/tb-ml. Example Docker containers for resistance prediction and corresponding data pre-processing as well as a tutorial on how to create new containers for TB-ML are available at https://tb-ml.github.io/tb-ml-containers/. Contact: jody.phelan@lshtm.ac.uk.
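The idea of chaining a pre-processing container into a prediction container can be illustrated with a small Python sketch. This is not the TB-ML command-line interface and does not reproduce its container I/O standard, which the abstract does not spell out. It assumes, purely for illustration, that the first container writes its output to stdout and the second reads it from stdin. The image names and the `/data` mount point are hypothetical.

```python
import shlex


def docker_cmd(image, mount_dir=None):
    """Build a `docker run` argument list for one pipeline stage."""
    # --rm removes the container on exit; -i keeps stdin open for piping.
    cmd = ["docker", "run", "--rm", "-i"]
    if mount_dir is not None:
        # Read-only bind mount of host data. The /data target is an assumption.
        cmd += ["-v", f"{mount_dir}:/data:ro"]
    cmd.append(image)
    return cmd


def pipeline_shell_line(preprocess_image, predict_image, mount_dir):
    """Return a shell line that pipes the pre-processing stage into prediction."""
    stages = (docker_cmd(preprocess_image, mount_dir), docker_cmd(predict_image))
    return " | ".join(shlex.join(stage) for stage in stages)


if __name__ == "__main__":
    # Hypothetical image names, used only to show the shape of the command.
    print(pipeline_shell_line("example/preprocess", "example/predict", "/tmp/reads"))
```

The sketch only builds and prints the command, so it can be inspected without Docker installed. For real usage, the TB-ML repository and the container tutorial linked in the abstract are the authoritative source.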

Geographical classification of malaria parasites through applying machine learning to whole genome sequence data.

Deelder, Wouter; Manko, Emilia; Phelan, Jody E; Campino, Susana; Palla, Luigi; Clark, Taane G.

Sci Rep ; 12(1): 21150, 2022 Dec 07.

Article in English | MEDLINE | ID: mdl-36476815

ABSTRACT

Malaria, caused by Plasmodium parasites, is a major global health challenge. Whole genome sequencing (WGS) of Plasmodium falciparum and Plasmodium vivax genomes is providing insights into parasite genetic diversity, transmission patterns, and can inform decision making for clinical and surveillance purposes. Advances in sequencing technologies are helping to generate timely and big genomic datasets, with the prospect of applying Artificial Intelligence analytical techniques (e.g., machine learning) to support programmatic malaria control and elimination. Here, we assess the potential of applying deep learning convolutional neural network approaches to predict the geographic origin of infections (continents, countries, GPS locations) using WGS data of P. falciparum (n = 5957; 27 countries) and P. vivax (n = 659; 13 countries) isolates. Using identified high-quality genome-wide single nucleotide polymorphisms (SNPs) (P. falciparum: 750 k, P. vivax: 588 k), an analysis of population structure and ancestry revealed clustering at the country-level. When predicting locations for both species, classification (compared to regression) methods had the lowest distance errors, and > 90% accuracy at a country level. Our work demonstrates the utility of machine learning approaches for geo-classification of malaria parasites. With timelier WGS data generation across more malaria-affected regions, the performance of machine learning approaches for geo-classification will improve, thereby supporting disease control activities.
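The "distance error" used to compare location predictions can be illustrated with a great-circle distance between true and predicted GPS coordinates. The abstract does not state which distance measure the authors used, so the haversine formula and the mean Earth radius below are assumptions, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_distance_error_km(true_coords, predicted_coords):
    """Average haversine distance over paired (lat, lon) tuples."""
    if len(true_coords) != len(predicted_coords) or not true_coords:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        haversine_km(t[0], t[1], p[0], p[1])
        for t, p in zip(true_coords, predicted_coords)
    )
    return total / len(true_coords)
```

For a classifier, the predicted coordinate could be a representative point of the predicted class (for example a country centroid), which makes its distance error directly comparable with that of a regression model that outputs coordinates.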

Subject(s)

Artificial Intelligence , Genomics , Geography , Machine Learning

COVID-profiler: a webserver for the analysis of SARS-CoV-2 sequencing data.

Phelan, Jody; Deelder, Wouter; Ward, Daniel; Campino, Susana; Hibberd, Martin L; Clark, Taane G.

BMC Bioinformatics ; 23(1): 137, 2022 Apr 15.

Article in English | MEDLINE | ID: mdl-35428185

ABSTRACT

BACKGROUND: SARS-CoV-2 virus sequencing has been applied to track the COVID-19 pandemic spread and assist the development of PCR-based diagnostics, serological assays, and vaccines. With sequencing becoming routine globally, bioinformatic tools are needed to assist in the robust processing of resulting genomic data. RESULTS: We developed a web-based bioinformatic pipeline ("COVID-Profiler") that inputs raw or assembled sequencing data, displays raw alignments for quality control, annotates mutations found and performs phylogenetic analysis. The pipeline software can be applied to other (re-) emerging pathogens. CONCLUSIONS: The webserver is available at http://genomics.lshtm.ac.uk/ . The source code is available at https://github.com/jodyphelan/covid-profiler .

Subject(s)

COVID-19 , SARS-CoV-2 , Genomics , Humans , Pandemics , Phylogeny , SARS-CoV-2/genetics

A modified decision tree approach to improve the prediction and mutation discovery for drug resistance in Mycobacterium tuberculosis.

Deelder, Wouter; Napier, Gary; Campino, Susana; Palla, Luigi; Phelan, Jody; Clark, Taane G.

BMC Genomics ; 23(1): 46, 2022 Jan 11.

Article in English | MEDLINE | ID: mdl-35016609

ABSTRACT

BACKGROUND: Drug resistant Mycobacterium tuberculosis is complicating the effective treatment and control of tuberculosis disease (TB). With the adoption of whole genome sequencing as a diagnostic tool, machine learning approaches are being employed to predict M. tuberculosis resistance and identify underlying genetic mutations. However, machine learning approaches can overfit and fail to identify causal mutations if they are applied out of the box and not adapted to the disease-specific context. We introduce a machine learning approach that is customized to the TB setting, which extracts a library of genomic variants re-occurring across individual studies to improve genotypic profiling. RESULTS: We developed a customized decision tree approach, called Treesist-TB, that performs TB drug resistance prediction by extracting and evaluating genomic variants across multiple studies. The application of Treesist-TB to rifampicin (RIF), isoniazid (INH) and ethambutol (EMB) drugs, for which resistance mutations are known, demonstrated a level of predictive accuracy similar to the widely used TB-Profiler tool (Treesist-TB vs. TB-Profiler tool: RIF 97.5% vs. 97.6%; INH 96.8% vs. 96.5%; EMB 96.8% vs. 95.8%). Application of Treesist-TB to less understood second-line drugs of interest, ethionamide (ETH), cycloserine (CYS) and para-aminosalicylic acid (PAS), led to the identification of new variants (52, 6 and 11, respectively), with a high number absent from the TB-Profiler library (45, 4, and 6, respectively). Thereby, Treesist-TB had improved predictive sensitivity (Treesist-TB vs. TB-Profiler tool: PAS 64.3% vs. 38.8%; CYS 45.3% vs. 30.7%; ETH 72.1% vs. 71.1%). CONCLUSION: Our work reinforces the utility of machine learning for drug resistance prediction, while highlighting the need to customize approaches to the disease-specific context. Through applying a modified decision learning approach (Treesist-TB) across a range of anti-TB drugs, we identified plausible resistance-encoding genomic variants with high predictive ability, whilst potentially overcoming the overfitting challenges that can affect standard machine learning applications.

Subject(s)

Drug Resistance, Multiple, Bacterial/genetics , Mycobacterium tuberculosis , Antitubercular Agents/pharmacology , Decision Trees , Humans , Microbial Sensitivity Tests , Mutation , Mycobacterium tuberculosis/genetics , Tuberculosis, Multidrug-Resistant/diagnosis , Tuberculosis, Multidrug-Resistant/drug therapy

Using deep learning to identify recent positive selection in malaria parasite sequence data.

Deelder, Wouter; Benavente, Ernest Diez; Phelan, Jody; Manko, Emilia; Campino, Susana; Palla, Luigi; Clark, Taane G.

Malar J ; 20(1): 270, 2021 Jun 14.

Article in English | MEDLINE | ID: mdl-34126997

ABSTRACT

BACKGROUND: Malaria, caused by Plasmodium parasites, is a major global public health problem. To assist an understanding of malaria pathogenesis, including drug resistance, there is a need for the timely detection of underlying genetic mutations and their spread. With the increasing use of whole-genome sequencing (WGS) of Plasmodium DNA, the potential of deep learning models to detect loci under recent positive selection, historically signals of drug resistance, was evaluated. METHODS: A deep learning-based approach (called "DeepSweep") was developed, which can be trained on haplotypic images from genetic regions with known sweeps, to identify loci under positive selection. DeepSweep software is available from https://github.com/WDee/Deepsweep . RESULTS: Using simulated genomic data, DeepSweep could detect recent sweeps with high predictive accuracy (areas under ROC curve > 0.95). DeepSweep was applied to Plasmodium falciparum (n = 1125; genome size 23 Mbp) and Plasmodium vivax (n = 368; genome size 29 Mbp) WGS data, and the genes identified overlapped with two established extended haplotype homozygosity methods (within-population iHS, across-population Rsb) (~ 60-75% overlap of hits at P < 0.0001). DeepSweep hits included regions proximal to known drug resistance loci for both P. falciparum (e.g. pfcrt, pfdhps and pfmdr1) and P. vivax (e.g. pvmrp1). CONCLUSION: The deep learning approach can detect positive selection signatures in malaria parasite WGS data. Further, as the approach is generalizable, it may be trained to detect other types of selection. With the ability to rapidly generate WGS data at low cost, machine learning approaches (e.g. DeepSweep) have the potential to assist parasite genome-based surveillance and inform malaria control decision-making.

Subject(s)

Deep Learning/statistics & numerical data , Genome Size , Genome, Protozoan , Plasmodium falciparum/genetics , Plasmodium vivax/genetics , Selection, Genetic , Sequence Analysis, DNA

Machine Learning Predicts Accurately Mycobacterium tuberculosis Drug Resistance From Whole Genome Sequencing Data.

Deelder, Wouter; Christakoudi, Sofia; Phelan, Jody; Benavente, Ernest Diez; Campino, Susana; McNerney, Ruth; Palla, Luigi; Clark, Taane G.

Front Genet ; 10: 922, 2019.

Article in English | MEDLINE | ID: mdl-31616478

ABSTRACT

Background: Tuberculosis disease, caused by Mycobacterium tuberculosis, is a major public health problem. The emergence of M. tuberculosis strains resistant to existing treatments threatens to derail control efforts. Resistance is mainly conferred by mutations in genes coding for drug targets or converting enzymes, but our knowledge of these mutations is incomplete. Whole genome sequencing (WGS) is an increasingly common approach to rapidly characterize isolates and identify mutations predicting antimicrobial resistance and thereby providing a diagnostic tool to assist clinical decision making. Methods: We applied machine learning approaches to 16,688 M. tuberculosis isolates that have undergone WGS and laboratory drug-susceptibility testing (DST) across 14 antituberculosis drugs, with 22.5% of samples being multidrug resistant and 2.1% being extensively drug resistant. We used non-parametric classification-tree and gradient-boosted-tree models to predict drug resistance and uncover any associated novel putative mutations. We fitted separate models for each drug, with and without "co-occurrent resistance" markers known to be causing resistance to drugs other than the one of interest. Predictive performance was measured using sensitivity, specificity, and the area under the receiver operating characteristic curve, assuming DST results as the gold standard. Results: The predictive performance was highest for resistance to first-line drugs, amikacin, kanamycin, ciprofloxacin, moxifloxacin, and multidrug-resistant tuberculosis (area under the receiver operating characteristic curve above 96%), and lowest for third-line drugs such as D-cycloserine and para-aminosalicylic acid (area under the curve below 85%). The inclusion of co-occurrent resistance markers led to improved performance for some drugs and superior results when compared to similar models in other large-scale studies, which had smaller sample sizes. Overall, the gradient-boosted-tree models performed better than the classification-tree models. The mutation-rank analysis detected no new single nucleotide polymorphisms linked to drug resistance. Discordance between DST and genotypically inferred resistance may be explained by DST errors, novel rare mutations, hetero-resistance, and nongenomic drivers such as efflux-pump upregulation. Conclusion: Our work demonstrates the utility of machine learning as a flexible approach to drug resistance prediction that is able to accommodate a much larger number of predictors and to summarize their predictive ability, thus assisting clinical decision making and single nucleotide polymorphism detection in an era of increasing WGS data generation.
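The three performance measures named in the Methods (sensitivity, specificity, and area under the ROC curve, with DST as the gold standard) can be computed as in the sketch below. It is a plain-Python illustration, not the authors' evaluation code. The label encoding (1 = resistant by DST, 0 = susceptible) and the function names are assumptions.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = resistant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one resistant and one susceptible sample")
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the probability that a resistant sample outscores a susceptible one.

    Uses the pairwise (Mann-Whitney) formulation; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one resistant and one susceptible sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration. For a dataset of the size described here, a rank-based implementation from an established library would be the practical choice.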

Funding the elimination of viral hepatitis: donors needed.

Gore, Charles; Hicks, Jessica; Deelder, Wouter.

Lancet Gastroenterol Hepatol ; 2(12): 843-845, 2017 Dec.

Article in English | MEDLINE | ID: mdl-29100843

Subject(s)

Global Health/economics , Healthcare Financing , Hepatitis B, Chronic/prevention & control , Hepatitis C, Chronic/prevention & control , Preventive Health Services/economics , Hepatitis B, Chronic/epidemiology , Hepatitis C, Chronic/epidemiology , Humans
