Search | VHL Regional Portal (test)

1.

Comparative Genomic Analysis of Bacterial Data in BV-BRC: An Example Exploring Antimicrobial Resistance.

Wattam, Alice R; Bowers, Nicole; Brettin, Thomas; Conrad, Neal; Cucinell, Clark; Davis, James J; Dickerman, Allan W; Dietrich, Emily M; Kenyon, Ronald W; Machi, Dustin; Mao, Chunhong; Nguyen, Marcus; Olson, Robert D; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D; Shukla, Maulik; Stevens, Rick L; Vonstein, Veronika; Warren, Andrew S.

Methods Mol Biol ; 2802: 547-571, 2024.

Article in English | MEDLINE | ID: mdl-38819571

ABSTRACT

As genomic and related data continue to expand, research biologists are often hampered by the computational hurdles required to analyze their data. The National Institute of Allergy and Infectious Diseases (NIAID) established the Bioinformatics Resource Centers (BRC) to assist researchers with their analysis of genome sequence and other omics-related data. Recently, the PAThosystems Resource Integration Center (PATRIC), the Influenza Research Database (IRD), and the Virus Pathogen Database and Analysis Resource (ViPR) BRCs merged to form the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) at https://www.bv-brc.org/ . The combined BV-BRC leverages the functionality of the original resources for bacterial and viral research communities with a unified data model, enhanced web-based visualization and analysis tools, and bioinformatics services. Here we demonstrate how antimicrobial resistance data can be analyzed in the new resource.

Subjects

Bacteria; Computational Biology; Databases, Genetic; Drug Resistance, Bacterial; Genomics; Genomics/methods; Computational Biology/methods; Drug Resistance, Bacterial/genetics; Bacteria/genetics; Bacteria/drug effects; Humans; Software; Genome, Bacterial; Anti-Bacterial Agents/pharmacology; Web Browser; United States; National Institute of Allergy and Infectious Diseases (U.S.)

2.

A Comprehensive Investigation of Active Learning Strategies for Conducting Anti-Cancer Drug Screening.

Vasanthakumari, Priyanka; Zhu, Yitan; Brettin, Thomas; Partin, Alexander; Shukla, Maulik; Xia, Fangfang; Narykov, Oleksandr; Weil, Michael Ryan; Stevens, Rick L.

Cancers (Basel) ; 16(3)2024 Jan 26.

Article in English | MEDLINE | ID: mdl-38339281

ABSTRACT

It is well-known that cancers of the same histology type can respond differently to a treatment. Thus, computational drug response prediction is of paramount importance for both preclinical drug screening studies and clinical treatment design. To build drug response prediction models, treatment response data need to be generated through screening experiments and used as input to train the prediction models. In this study, we investigate various active learning strategies of selecting experiments to generate response data for the purposes of (1) improving the performance of drug response prediction models built on the data and (2) identifying effective treatments. Here, we focus on constructing drug-specific response prediction models for cancer cell lines. Various approaches have been designed and applied to select cell lines for screening, including a random, greedy, uncertainty, diversity, combination of greedy and uncertainty, sampling-based hybrid, and iteration-based hybrid approach. All of these approaches are evaluated and compared using two criteria: (1) the number of identified hits that are selected experiments validated to be responsive, and (2) the performance of the response prediction model trained on the data of selected experiments. The analysis was conducted for 57 drugs and the results show a significant improvement on identifying hits using active learning approaches compared with the random and greedy sampling method. Active learning approaches also show an improvement on response prediction performance for some of the drugs and analysis runs compared with the greedy sampling method.
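The selection strategies named in this abstract can be illustrated with a minimal sketch. It assumes (these details are not in the abstract) that an ensemble of models predicts a response value per candidate cell line, that lower predicted values mean a stronger response, and that ensemble spread is used as the uncertainty measure. The function name `select_batch` and the example data are hypothetical.

```python
import random
import statistics


def select_batch(ensemble_preds, batch_size, strategy, seed=0):
    """Choose which cell lines to screen next.

    ensemble_preds maps a cell line name to a list of predicted responses,
    one per ensemble member (assumption: lower value = more responsive).
    """
    names = sorted(ensemble_preds)
    if strategy == "random":
        # Baseline: ignore the model entirely.
        return random.Random(seed).sample(names, batch_size)
    if strategy == "greedy":
        # Exploit: take the candidates predicted to respond best.
        ranked = sorted(names, key=lambda n: statistics.mean(ensemble_preds[n]))
        return ranked[:batch_size]
    if strategy == "uncertainty":
        # Explore: take the candidates the ensemble disagrees on most.
        ranked = sorted(
            names,
            key=lambda n: statistics.pstdev(ensemble_preds[n]),
            reverse=True,
        )
        return ranked[:batch_size]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical predictions for four candidate cell lines.
preds = {
    "A": [0.20, 0.25, 0.22],
    "B": [0.90, 0.10, 0.50],
    "C": [0.60, 0.62, 0.61],
    "D": [0.30, 0.80, 0.50],
}
greedy_pick = select_batch(preds, 2, "greedy")
uncertain_pick = select_batch(preds, 2, "uncertainty")
random_pick = select_batch(preds, 2, "random")
```

With these made-up numbers, the greedy pick favors the cell lines with the lowest mean prediction, while the uncertainty pick favors those with the widest ensemble spread. The hybrid approaches mentioned in the abstract would mix the two criteria in ways the abstract does not specify.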

3.

US National Institutes of Health Prioritization of SARS-CoV-2 Variants.

Turner, Sam; Alisoltani, Arghavan; Bratt, Debbie; Cohen-Lavi, Liel; Dearlove, Bethany L; Drosten, Christian; Fischer, Will M; Fouchier, Ron A M; Gonzalez-Reiche, Ana Silvia; Jaroszewski, Lukasz; Khalil, Zain; LeGresley, Eric; Johnson, Marc; Jones, Terry C; Mühlemann, Barbara; O'Connor, David; Sedova, Mayya; Shukla, Maulik; Theiler, James; Wallace, Zachary S; Yoon, Hyejin; Zhang, Yun; van Bakel, Harm; Degrace, Marciela M; Ghedin, Elodie; Godzik, Adam; Hertz, Tomer; Korber, Bette; Lemieux, Jacob; Niewiadomska, Anna M; Post, Diane J; Rolland, Morgane; Scheuermann, Richard; Smith, Derek J.

Emerg Infect Dis ; 29(5)2023 05.

Article in English | MEDLINE | ID: mdl-37054986

ABSTRACT

Since late 2020, SARS-CoV-2 variants have regularly emerged with competitive and phenotypic differences from previously circulating strains, sometimes with the potential to escape from immunity produced by prior exposure and infection. The Early Detection group is one of the constituent groups of the US National Institutes of Health National Institute of Allergy and Infectious Diseases SARS-CoV-2 Assessment of Viral Evolution program. The group uses bioinformatic methods to monitor the emergence, spread, and potential phenotypic properties of emerging and circulating strains to identify the most relevant variants for experimental groups within the program to phenotypically characterize. Since April 2021, the group has prioritized variants monthly. Prioritization successes include rapidly identifying most major variants of SARS-CoV-2 and providing experimental groups within the National Institutes of Health program easy access to regularly updated information on the recent evolution and epidemiology of SARS-CoV-2 that can be used to guide phenotypic investigations.

Subjects

COVID-19; SARS-CoV-2; United States/epidemiology; Humans; SARS-CoV-2/genetics; COVID-19/epidemiology; National Institutes of Health (U.S.)

4.

Data augmentation and multimodal learning for predicting drug response in patient-derived xenografts from gene expressions and histology images.

Partin, Alexander; Brettin, Thomas; Zhu, Yitan; Dolezal, James M; Kochanny, Sara; Pearson, Alexander T; Shukla, Maulik; Evrard, Yvonne A; Doroshow, James H; Stevens, Rick L.

Front Med (Lausanne) ; 10: 1058919, 2023.

Article in English | MEDLINE | ID: mdl-36960342

ABSTRACT

Patient-derived xenografts (PDXs) are an appealing platform for preclinical drug studies. A primary challenge in modeling drug response prediction (DRP) with PDXs and neural networks (NNs) is the limited number of drug response samples. We investigate a multimodal neural network (MM-Net) and data augmentation for DRP in PDXs. The MM-Net learns to predict response using drug descriptors, gene expressions (GE), and histology whole-slide images (WSIs). We explore whether combining WSIs with GE improves predictions as compared with models that use GE alone. We propose two data augmentation methods that allow us to train multimodal and unimodal NNs on a single larger dataset without changing architectures: 1) combining single-drug and drug-pair treatments by homogenizing drug representations, and 2) augmenting drug pairs, which doubles the sample size of all drug-pair samples. Unimodal NNs which use GE are compared to assess the contribution of data augmentation. The NN that uses the original and the augmented drug-pair treatments as well as single-drug treatments outperforms NNs that ignore either the augmented drug-pairs or the single-drug treatments. In assessing the multimodal learning based on the MCC metric, MM-Net outperforms all the baselines. Our results show that data augmentation and integration of histology images with GE can improve prediction performance of drug response in PDXs.
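One plausible reading of the two augmentation steps is sketched below. It rests on two assumptions that the abstract does not confirm: that a single-drug treatment is "homogenized" by writing it as a pair of the same drug, and that drug pairs are augmented by swapping the order of the two drugs. The function names and sample records are hypothetical.

```python
def homogenize(sample):
    """Represent every treatment as a (drug1, drug2) pair.

    Assumption: a single-drug treatment is written as the same drug twice,
    so one network input layout serves both treatment types.
    """
    if "drug" in sample:
        return {
            "drug1": sample["drug"],
            "drug2": sample["drug"],
            "response": sample["response"],
        }
    return dict(sample)


def augment_drug_pairs(samples):
    """Add an order-swapped copy of every true drug pair.

    Assumption: the response does not depend on the order in which the two
    drugs are listed, so (A, B) and (B, A) share a label.
    """
    augmented = []
    for s in samples:
        augmented.append(s)
        if s["drug1"] != s["drug2"]:
            augmented.append(dict(s, drug1=s["drug2"], drug2=s["drug1"]))
    return augmented


# Hypothetical records: one drug pair and one single-drug treatment.
raw = [
    {"drug1": "drugA", "drug2": "drugB", "response": 1},
    {"drug": "drugC", "response": 0},
]
dataset = augment_drug_pairs([homogenize(s) for s in raw])
```

Under these assumptions the drug-pair sample appears twice (once per order) and the single-drug sample once, which matches the abstract's statement that the drug-pair sample size doubles.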

5.

Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR.

Olson, Robert D; Assaf, Rida; Brettin, Thomas; Conrad, Neal; Cucinell, Clark; Davis, James J; Dempsey, Donald M; Dickerman, Allan; Dietrich, Emily M; Kenyon, Ronald W; Kuscuoglu, Mehmet; Lefkowitz, Elliot J; Lu, Jian; Machi, Dustin; Macken, Catherine; Mao, Chunhong; Niewiadomska, Anna; Nguyen, Marcus; Olsen, Gary J; Overbeek, Jamie C; Parrello, Bruce; Parrello, Victoria; Porter, Jacob S; Pusch, Gordon D; Shukla, Maulik; Singh, Indresh; Stewart, Lucy; Tan, Gene; Thomas, Chris; VanOeffelen, Margo; Vonstein, Veronika; Wallace, Zachary S; Warren, Andrew S; Wattam, Alice R; Xia, Fangfang; Yoo, Hyunseung; Zhang, Yun; Zmasek, Christian M; Scheuermann, Richard H; Stevens, Rick L.

Nucleic Acids Res ; 51(D1): D678-D689, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36350631

ABSTRACT

The National Institute of Allergy and Infectious Diseases (NIAID) established the Bioinformatics Resource Center (BRC) program to assist researchers with analyzing the growing body of genome sequence and other omics-related data. In this report, we describe the merger of the PAThosystems Resource Integration Center (PATRIC), the Influenza Research Database (IRD) and the Virus Pathogen Database and Analysis Resource (ViPR) BRCs to form the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) https://www.bv-brc.org/. The combined BV-BRC leverages the functionality of the bacterial and viral resources to provide a unified data model, enhanced web-based visualization and analysis tools, bioinformatics services, and a powerful suite of command line tools that benefit the bacterial and viral research communities.

Subjects

Genomics; Software; Viruses; Humans; Bacteria/genetics; Computational Biology; Databases, Genetic; Influenza, Human; Viruses/genetics

6.

Integration of Computational Docking into Anti-Cancer Drug Response Prediction Models.

Narykov, Oleksandr; Zhu, Yitan; Brettin, Thomas; Evrard, Yvonne A; Partin, Alexander; Shukla, Maulik; Xia, Fangfang; Clyde, Austin; Vasanthakumari, Priyanka; Doroshow, James H; Stevens, Rick L.

Cancers (Basel) ; 16(1)2023 Dec 21.

Article in English | MEDLINE | ID: mdl-38201477

ABSTRACT

Cancer is a heterogeneous disease in that tumors of the same histology type can respond differently to a treatment. Anti-cancer drug response prediction is of paramount importance for both drug development and patient treatment design. Although various computational methods and data have been used to develop drug response prediction models, it remains a challenging problem due to the complexities of cancer mechanisms and cancer-drug interactions. To better characterize the interaction between cancer and drugs, we investigate the feasibility of integrating computationally derived features of molecular mechanisms of action into prediction models. Specifically, we add docking scores of drug molecules and target proteins in combination with cancer gene expressions and molecular drug descriptors for building response models. The results demonstrate a marginal improvement in drug response prediction performance when adding docking scores as additional features, through tests on large drug screening data. We discuss the limitations of the current approach and provide the research community with a baseline dataset of the large-scale computational docking for anti-cancer drugs.

7.

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.

Zvyagin, Maxim; Brace, Alexander; Hippe, Kyle; Deng, Yuntian; Zhang, Bin; Bohorquez, Cindy Orozco; Clyde, Austin; Kale, Bharat; Perez-Rivera, Danilo; Ma, Heng; Mann, Carla M; Irvin, Michael; Pauloski, J Gregory; Ward, Logan; Hayot-Sasson, Valerie; Emani, Murali; Foreman, Sam; Xie, Zhen; Lin, Diangen; Shukla, Maulik; Nie, Weili; Romero, Josh; Dallago, Christian; Vahdat, Arash; Xiao, Chaowei; Gibbs, Thomas; Foster, Ian; Davis, James J; Papka, Michael E; Brettin, Thomas; Stevens, Rick; Anandkumar, Anima; Vishwanath, Venkatram; Ramanathan, Arvind.

bioRxiv ; 2022 Nov 23.

Article in English | MEDLINE | ID: mdl-36451881

ABSTRACT

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.

8.

TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks.

Jones, Sara; Beyers, Matthew; Shukla, Maulik; Xia, Fangfang; Brettin, Thomas; Stevens, Rick; Weil, M Ryan; Ranganathan Ganakammal, Satishkumar.

Cancer Inform ; 21: 11769351221139491, 2022.

Article in English | MEDLINE | ID: mdl-36507076

ABSTRACT

Background: With cancer as one of the leading causes of death worldwide, accurate primary tumor type prediction is critical in identifying genetic factors that can inhibit or slow tumor progression. There have been efforts to categorize primary tumor types with gene expression data using machine learning, and more recently with deep learning, in the last several years. Methods: In this paper, we developed four 1-dimensional (1D) Convolutional Neural Network (CNN) models to classify RNA-seq count data as one of 17 highly represented primary tumor types or 32 primary tumor types regardless of imbalanced representation. Additionally, we adapted the models to take as input either all Ensembl genes (60,483) or protein coding genes only (19,758). Unlike previous work, we avoided selection bias by not filtering genes based on expression values. RNA-seq count data expressed as FPKM-UQ of 9,025 and 10,940 samples from The Cancer Genome Atlas (TCGA) were downloaded from the Genomic Data Commons (GDC) corresponding to 17 and 32 primary tumor types respectively for training and validating the models. Results: All 4 1D-CNN models had an overall accuracy of 94.7% to 97.6% on the test dataset. Further evaluation indicates that the models with protein coding genes only as features performed with better accuracy compared to the models with all Ensembl genes for both 17 and 32 primary tumor types. For all models, the accuracy by primary tumor type was above 80% for most primary tumor types. Conclusions: We packaged all 4 models as a Python-based deep learning classification tool called TULIP (TUmor CLassIfication Predictor) for performing quality control on primary tumor samples and characterizing cancer samples of unknown tumor type. Further optimization of the models is needed to improve the accuracy of certain primary tumor types.

9.

Early detection of emerging SARS-CoV-2 variants of interest for experimental evaluation.

Wallace, Zachary S; Davis, James; Niewiadomska, Anna Maria; Olson, Robert D; Shukla, Maulik; Stevens, Rick; Zhang, Yun; Zmasek, Christian M; Scheuermann, Richard H.

Front Bioinform ; 2: 1020189, 2022.

Article in English | MEDLINE | ID: mdl-36353215

ABSTRACT

Since the beginning of the COVID-19 pandemic, SARS-CoV-2 has demonstrated its ability to rapidly and continuously evolve, leading to the emergence of thousands of different sequence variants, many with distinctive phenotypic properties. Fortunately, the broad application of next generation sequencing (NGS) across the globe has produced a wealth of SARS-CoV-2 genome sequences, offering a comprehensive picture of how this virus is evolving so that accurate diagnostics, reliable therapeutics, and prophylactic vaccines against COVID-19 can be developed and maintained. The millions of SARS-CoV-2 sequences deposited into genomic sequencing databases, including GenBank, BV-BRC, and GISAID, are annotated with the dates and geographic locations of sample collection, and can be aligned to and compared with the Wuhan-Hu-1 reference genome to extract their constellation of nucleotide and amino acid substitutions. By aggregating these data into concise datasets, the spread of variants through space and time can be assessed. Variant tracking efforts have initially focused on the Spike protein due to its critical role in viral tropism and antibody neutralization. To identify emerging variants of concern as early as possible, we developed a computational pipeline to process the genomic data and assign risk scores based on both epidemiological and functional parameters. Epidemiological dynamics are used to identify variants exhibiting substantial growth over time and spread across geographical regions. Experimental data that quantify Spike protein regions targeted by adaptive immunity and critical for other virus characteristics are used to predict variants with consequential immunogenic and pathogenic impacts. The growth assessment and functional impact scores are combined to produce a Composite Score for any set of Spike substitutions detected. With this systematic method to routinely score and rank emerging variants, we have established an approach to identify threatening variants early and prioritize them for experimental evaluation.

10.

A cross-study analysis of drug response prediction in cancer cell lines.

Xia, Fangfang; Allen, Jonathan; Balaprakash, Prasanna; Brettin, Thomas; Garcia-Cardona, Cristina; Clyde, Austin; Cohn, Judith; Doroshow, James; Duan, Xiaotian; Dubinkina, Veronika; Evrard, Yvonne; Fan, Ya Ju; Gans, Jason; He, Stewart; Lu, Pinyi; Maslov, Sergei; Partin, Alexander; Shukla, Maulik; Stahlberg, Eric; Wozniak, Justin M; Yoo, Hyunseung; Zaki, George; Zhu, Yitan; Stevens, Rick.

Brief Bioinform ; 23(1)2022 01 17.

Article in English | MEDLINE | ID: mdl-34524425

ABSTRACT

To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets. To provide a more rigorous assessment of model generalizability between different studies, we use machine learning to analyze five publicly available cell line-based data sets: National Cancer Institute 60, Cancer Therapeutics Response Portal (CTRP), Genomics of Drug Sensitivity in Cancer, Cancer Cell Line Encyclopedia and Genentech Cell Line Screening Initiative (gCSI). Based on observed experimental variability across studies, we explore estimates of prediction upper bounds. We report performance results of a variety of machine learning models, with a multitasking deep neural network achieving the best cross-study generalizability. By multiple measures, models trained on CTRP yield the most accurate predictions on the remaining testing data, and gCSI is the most predictable among the cell line data sets included in this study. With these experiments and further simulations on partial data, two lessons emerge: (1) differences in viability assays can limit model generalizability across studies and (2) drug diversity, more than tumor diversity, is crucial for raising model generalizability in preclinical screening.

Subjects

Neoplasms; Algorithms; Cell Line; Humans; Machine Learning; Neoplasms/drug therapy; Neoplasms/genetics; Neural Networks, Computer

11.

Analysis of the ARTIC Version 3 and Version 4 SARS-CoV-2 Primers and Their Impact on the Detection of the G142D Amino Acid Substitution in the Spike Protein.

Davis, James J; Long, S Wesley; Christensen, Paul A; Olsen, Randall J; Olson, Robert; Shukla, Maulik; Subedi, Sishir; Stevens, Rick; Musser, James M.

Microbiol Spectr ; 9(3): e0180321, 2021 12 22.

Article in English | MEDLINE | ID: mdl-34878296

ABSTRACT

The ARTIC Network provides a common resource of PCR primer sequences and recommendations for amplifying SARS-CoV-2 genomes. The initial tiling strategy was developed with the reference genome Wuhan-01, and subsequent iterations have addressed areas of low amplification and sequence drop out. Recently, a new version (V4) was released, based on new variant genome sequences, in response to the realization that some V3 primers were located in regions with key mutations. Herein, we compare the performance of the ARTIC V3 and V4 primer sets with a matched set of 663 SARS-CoV-2 clinical samples sequenced with an Illumina NovaSeq 6000 instrument. We observe general improvements in sequencing depth and quality, and improved resolution of the SNP causing the D950N variation in the spike protein. Importantly, we also find nearly universal presence of spike protein substitution G142D in Delta-lineage samples. Due to the prior release and widespread use of the ARTIC V3 primers during the initial surge of the Delta variant, it is likely that the G142D amino acid substitution is substantially underrepresented among early Delta variant genomes deposited in public repositories. In addition to the improved performance of the ARTIC V4 primer set, this study also illustrates the importance of the primer scheme in downstream analyses. IMPORTANCE: ARTIC Network primers are commonly used by laboratories worldwide to amplify and sequence SARS-CoV-2 present in clinical samples. As new variants have evolved and spread, it was found that the V3 primer set poorly amplified several key mutations. In this report, we compare the results of sequencing a matched set of samples with the V3 and V4 primer sets. We find that adoption of the ARTIC V4 primer set is critical for accurate sequencing of the SARS-CoV-2 spike region. The absence of metadata describing the primer scheme used will negatively impact the downstream use of publicly available SARS-CoV-2 sequencing reads and assembled genomes.

Subjects

Amino Acid Substitution; COVID-19/virology; SARS-CoV-2/classification; SARS-CoV-2/genetics; SARS-CoV-2/isolation & purification; Spike Glycoprotein, Coronavirus/genetics; Base Sequence; Genome, Viral; Humans; Mutation; Whole Genome Sequencing

12.

A genomic data resource for predicting antimicrobial resistance from laboratory-derived antimicrobial susceptibility phenotypes.

VanOeffelen, Margo; Nguyen, Marcus; Aytan-Aktug, Derya; Brettin, Thomas; Dietrich, Emily M; Kenyon, Ronald W; Machi, Dustin; Mao, Chunhong; Olson, Robert; Pusch, Gordon D; Shukla, Maulik; Stevens, Rick; Vonstein, Veronika; Warren, Andrew S; Wattam, Alice R; Yoo, Hyunseung; Davis, James J.

Brief Bioinform ; 22(6)2021 11 05.

Article in English | MEDLINE | ID: mdl-34379107

ABSTRACT

Antimicrobial resistance (AMR) is a major global health threat that affects millions of people each year. Funding agencies worldwide and the global research community have expended considerable capital and effort tracking the evolution and spread of AMR by isolating and sequencing bacterial strains and performing antimicrobial susceptibility testing (AST). For the last several years, we have been capturing these efforts by curating data from the literature and data resources and building a set of assembled bacterial genome sequences that are paired with laboratory-derived AST data. This collection currently contains AST data for over 67 000 genomes encompassing approximately 40 genera and over 100 species. In this paper, we describe the characteristics of this collection, highlighting areas where sampling is comparatively deep or shallow, and showing areas where attention is needed from the research community to improve sampling and tracking efforts. In addition to using the data to track the evolution and spread of AMR, it also serves as a useful starting point for building machine learning models for predicting AMR phenotypes. We demonstrate this by describing two machine learning models that are built from the entire dataset to show where the predictive power is comparatively high or low. This AMR metadata collection is freely available and maintained on the Bacterial and Viral Bioinformatics Center (BV-BRC) FTP site ftp://ftp.bvbrc.org/RELEASE_NOTES/PATRIC_genomes_AMR.txt.

Subjects

Computational Biology/methods; Databases, Genetic; Drug Resistance, Microbial; Genomics/methods; Microbial Sensitivity Tests; Artificial Intelligence; Bacteria/drug effects; Bacteria/genetics; Genome, Bacterial; Humans; Laboratories; Machine Learning; Phenotype

13.

Publisher Correction: Converting tabular data into images for deep learning with convolutional neural networks.

Zhu, Yitan; Brettin, Thomas; Xia, Fangfang; Partin, Alexander; Shukla, Maulik; Yoo, Hyunseung; Evrard, Yvonne A; Doroshow, James H; Stevens, Rick L.

Sci Rep ; 11(1): 14036, 2021 Jul 01.

Article in English | MEDLINE | ID: mdl-34211076

14.

Converting tabular data into images for deep learning with convolutional neural networks.

Zhu, Yitan; Brettin, Thomas; Xia, Fangfang; Partin, Alexander; Shukla, Maulik; Yoo, Hyunseung; Evrard, Yvonne A; Doroshow, James H; Stevens, Rick L.

Sci Rep ; 11(1): 11325, 2021 05 31.

Article in English | MEDLINE | ID: mdl-34059739

ABSTRACT

Convolutional neural networks (CNNs) have been successfully used in many applications where important information about data is embedded in the order of features, such as speech and imaging. However, most tabular data do not assume a spatial relationship between features, and thus are unsuitable for modeling using CNNs. To meet this challenge, we develop a novel algorithm, image generator for tabular data (IGTD), to transform tabular data into images by assigning features to pixel positions so that similar features are close to each other in the image. The algorithm searches for an optimized assignment by minimizing the difference between the ranking of distances between features and the ranking of distances between their assigned pixels in the image. We apply IGTD to transform gene expression profiles of cancer cell lines (CCLs) and molecular descriptors of drugs into their respective image representations. Compared with existing transformation methods, IGTD generates compact image representations with better preservation of feature neighborhood structure. Evaluated on benchmark drug screening datasets, CNNs trained on IGTD image representations of CCLs and drugs exhibit a better performance of predicting anti-cancer drug response than both CNNs trained on alternative image representations and prediction models trained on the original tabular data.
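The objective described in this abstract, matching the ranking of feature distances to the ranking of pixel distances, can be sketched as follows. This is not the published IGTD search procedure: the optimizer here is a plain pairwise-swap hill climb, ties in the ranking are broken arbitrarily, and the absolute-difference error, the function names, and the toy distance matrix are all assumptions made for illustration.

```python
import itertools
import math


def rank_matrix(dist):
    """Replace each pairwise distance by its rank among all pairs (1 = closest)."""
    n = len(dist)
    pairs = sorted(itertools.combinations(range(n), 2),
                   key=lambda p: dist[p[0]][p[1]])
    ranks = [[0] * n for _ in range(n)]
    for r, (i, j) in enumerate(pairs, start=1):
        ranks[i][j] = ranks[j][i] = r
    return ranks


def error(feat_rank, pix_rank, assign):
    """Sum of absolute rank differences; assign[f] is the pixel of feature f."""
    return sum(abs(feat_rank[a][b] - pix_rank[assign[a]][assign[b]])
               for a, b in itertools.combinations(range(len(assign)), 2))


def optimize(feat_rank, pix_rank, assign):
    """Swap pairs of features while any swap lowers the error."""
    assign = list(assign)
    best = error(feat_rank, pix_rank, assign)
    improved = True
    while improved:
        improved = False
        for a, b in itertools.combinations(range(len(assign)), 2):
            assign[a], assign[b] = assign[b], assign[a]
            e = error(feat_rank, pix_rank, assign)
            if e < best:
                best, improved = e, True
            else:
                assign[a], assign[b] = assign[b], assign[a]  # undo the swap
    return assign, best


# Toy example: 4 features placed on a 2x2 image.
feat_dist = [[0.0, 0.9, 0.1, 0.8],
             [0.9, 0.0, 0.7, 0.2],
             [0.1, 0.7, 0.0, 0.6],
             [0.8, 0.2, 0.6, 0.0]]
pixels = [(0, 0), (0, 1), (1, 0), (1, 1)]
pix_dist = [[math.dist(p, q) for q in pixels] for p in pixels]

feat_rank, pix_rank = rank_matrix(feat_dist), rank_matrix(pix_dist)
start = [0, 1, 2, 3]
start_err = error(feat_rank, pix_rank, start)
final_assign, final_err = optimize(feat_rank, pix_rank, start)
```

Because each accepted swap strictly lowers a non-negative integer error, the loop always terminates, though only at a local optimum.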

Subjects

Deep Learning; Image Processing, Computer-Assisted; Software; Cell Line, Tumor; Humans

15.

Learning curves for drug response prediction in cancer cell lines.

Partin, Alexander; Brettin, Thomas; Evrard, Yvonne A; Zhu, Yitan; Yoo, Hyunseung; Xia, Fangfang; Jiang, Songhao; Clyde, Austin; Shukla, Maulik; Fonstein, Michael; Doroshow, James H; Stevens, Rick L.

BMC Bioinformatics ; 22(1): 252, 2021 May 17.

Article in English | MEDLINE | ID: mdl-34001007

ABSTRACT

BACKGROUND: Motivated by the size and availability of cell line drug sensitivity data, researchers have been developing machine learning (ML) models for predicting drug response to advance cancer treatment. As drug sensitivity studies continue generating drug response data, a common question is whether the generalization performance of existing prediction models can be further improved with more training data. METHODS: We utilize empirical learning curves for evaluating and comparing the data scaling properties of two neural networks (NNs) and two gradient boosting decision tree (GBDT) models trained on four cell line drug screening datasets. The learning curves are accurately fitted to a power law model, providing a framework for assessing the data scaling behavior of these models. RESULTS: The curves demonstrate that no single model dominates in terms of prediction performance across all datasets and training sizes, thus suggesting that the actual shape of these curves depends on the unique pair of an ML model and a dataset. The multi-input NN (mNN), in which gene expressions of cancer cells and molecular drug descriptors are input into separate subnetworks, outperforms a single-input NN (sNN), where the cell and drug features are concatenated for the input layer. In contrast, a GBDT with hyperparameter tuning exhibits superior performance as compared with both NNs at the lower range of training set sizes for two of the tested datasets, whereas the mNN consistently performs better at the higher range of training sizes. Moreover, the trajectory of the curves suggests that increasing the sample size is expected to further improve prediction scores of both NNs. These observations demonstrate the benefit of using learning curves to evaluate prediction models, providing a broader perspective on the overall data scaling characteristics. CONCLUSIONS: A fitted power law learning curve provides a forward-looking metric for analyzing prediction performance and can serve as a co-design tool to guide experimental biologists and computational scientists in the design of future experiments in prospective research studies.
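Fitting a power law to a learning curve can be sketched as below. The abstract does not give the exact parameterization, so this assumes the simplified two-parameter form error = a * m^b (with b < 0 and no irreducible-error offset), fitted by least squares in log-log space. The function names and the data points are hypothetical.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error = a * m**b by ordinary least squares on log-transformed data."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # intercept mapped back from log space
    return a, b


def predict_error(a, b, m):
    """Extrapolate the fitted curve to a training set of size m."""
    return a * m ** b


# Hypothetical learning curve: error halves each time the data quadruples.
train_sizes = [100, 400, 1600, 6400]
val_errors = [0.2, 0.1, 0.05, 0.025]

a, b = fit_power_law(train_sizes, val_errors)
forecast = predict_error(a, b, 25600)
```

The extrapolation step is what makes the fitted curve a forward-looking metric: it estimates how much error reduction a larger screening campaign could buy before the data are generated.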

Subjects

Neoplasms; Pharmaceutical Preparations; Cell Line; Learning Curve; Machine Learning; Neoplasms/drug therapy; Neoplasms/genetics; Prospective Studies

16.

Molecular Architecture of Early Dissemination and Massive Second Wave of the SARS-CoV-2 Virus in a Major Metropolitan Area.

Long, S Wesley; Olsen, Randall J; Christensen, Paul A; Bernard, David W; Davis, James J; Shukla, Maulik; Nguyen, Marcus; Saavedra, Matthew Ojeda; Yerramilli, Prasanti; Pruitt, Layne; Subedi, Sishir; Kuo, Hung-Che; Hendrickson, Heather; Eskandari, Ghazaleh; Nguyen, Hoang A T; Long, J Hunter; Kumaraswami, Muthiah; Goike, Jule; Boutz, Daniel; Gollihar, Jimmy; McLellan, Jason S; Chou, Chia-Wei; Javanmardi, Kamyab; Finkelstein, Ilya J; Musser, James M.

medRxiv ; 2020 Sep 29.

Article in English | MEDLINE | ID: mdl-33024977

ABSTRACT

We sequenced the genomes of 5,085 SARS-CoV-2 strains causing two COVID-19 disease waves in metropolitan Houston, Texas, an ethnically diverse region with seven million residents. The genomes were from viruses recovered in the earliest recognized phase of the pandemic in Houston, and an ongoing massive second wave of infections. The virus was originally introduced into Houston many times independently. Virtually all strains in the second wave have a Gly614 amino acid replacement in the spike protein, a polymorphism that has been linked to increased transmission and infectivity. Patients infected with the Gly614 variant strains had significantly higher virus loads in the nasopharynx on initial diagnosis. We found little evidence of a significant relationship between virus genotypes and altered virulence, stressing the linkage between disease severity, underlying medical conditions, and host genetics. Some regions of the spike protein - the primary target of global vaccine efforts - are replete with amino acid replacements, perhaps indicating the action of selection. We exploited the genomic data to generate defined single amino acid replacements in the receptor binding domain of spike protein that, importantly, produced decreased recognition by the neutralizing monoclonal antibody CR30022. Our study is the first analysis of the molecular architecture of SARS-CoV-2 in two infection waves in a major metropolitan region. The findings will help us to understand the origin, composition, and trajectory of future infection waves, and the potential effect of the host immune response and therapeutic maneuvers on SARS-CoV-2 evolution.

17.

Predicting antimicrobial resistance using conserved genes.

Nguyen, Marcus; Olson, Robert; Shukla, Maulik; VanOeffelen, Margo; Davis, James J.

PLoS Comput Biol ; 16(10): e1008319, 2020 Oct.

Article in English | MEDLINE | ID: mdl-33075053

ABSTRACT

A growing number of studies are using machine learning models to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. Although these studies are showing promise, the models are typically trained using features derived from comprehensive sets of AMR genes or whole genome sequences and may not be suitable for use when genomes are incomplete. In this study, we explore the possibility of predicting AMR phenotypes using incomplete genome sequence data. Models were built from small sets of randomly-selected core genes after removing the AMR genes. For Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we report that it is possible to classify susceptible and resistant phenotypes with average F1 scores ranging from 0.80-0.89 with as few as 100 conserved non-AMR genes, with very major error rates ranging from 0.11-0.23 and major error rates ranging from 0.10-0.20. Models built from core genes have predictive power in cases where the primary AMR mechanisms result from SNPs or horizontal gene transfer. By randomly sampling non-overlapping sets of core genes, we show that F1 scores and error rates are stable and have little variance between replicates. Although these small core gene models have lower accuracies and higher error rates than models built from the corresponding assembled genomes, the results suggest that sufficient variation exists in the core non-AMR genes of a species for predicting AMR phenotypes.

Subjects

Conserved Sequence/genetics , Drug Resistance, Bacterial/genetics , Genome, Bacterial/genetics , Genomics/methods , Machine Learning , Algorithms , Anti-Bacterial Agents/pharmacology , Bacteria/drug effects , Bacteria/genetics , Phenotype
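
The evaluation metrics named in the abstract above (F1 score, very major error rate, major error rate) can be illustrated with a minimal sketch. It assumes the conventional AMR definitions, which the abstract itself does not spell out: a very major error is a resistant isolate predicted susceptible, and a major error is a susceptible isolate predicted resistant. The function name `amr_metrics`, the "R"/"S" labels, and the choice of the resistant class as the positive class for F1 are illustrative assumptions, not the authors' code.

```python
from typing import Dict, Sequence


def amr_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute F1 and error rates for binary AMR phenotype predictions.

    Labels are "R" (resistant) and "S" (susceptible). The resistant class
    is treated as the positive class (an assumption of this sketch).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == "R" and p == "R")
    fn = sum(1 for t, p in pairs if t == "R" and p == "S")  # very major errors
    fp = sum(1 for t, p in pairs if t == "S" and p == "R")  # major errors
    tn = sum(1 for t, p in pairs if t == "S" and p == "S")

    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # Very major error rate: fraction of truly resistant isolates called susceptible.
    vme_rate = fn / (tp + fn) if (tp + fn) else 0.0

    # Major error rate: fraction of truly susceptible isolates called resistant.
    me_rate = fp / (fp + tn) if (fp + tn) else 0.0

    return {
        "f1": f1,
        "very_major_error_rate": vme_rate,
        "major_error_rate": me_rate,
    }
```

Under these definitions, the very major error rate is the complement of recall on the resistant class, which is why it is reported separately from F1: it isolates the clinically riskier mistake.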

18.

Molecular Architecture of Early Dissemination and Massive Second Wave of the SARS-CoV-2 Virus in a Major Metropolitan Area.

Long, S Wesley; Olsen, Randall J; Christensen, Paul A; Bernard, David W; Davis, James J; Shukla, Maulik; Nguyen, Marcus; Saavedra, Matthew Ojeda; Yerramilli, Prasanti; Pruitt, Layne; Subedi, Sishir; Kuo, Hung-Che; Hendrickson, Heather; Eskandari, Ghazaleh; Nguyen, Hoang A T; Long, J Hunter; Kumaraswami, Muthiah; Goike, Jule; Boutz, Daniel; Gollihar, Jimmy; McLellan, Jason S; Chou, Chia-Wei; Javanmardi, Kamyab; Finkelstein, Ilya J; Musser, James M.

mBio ; 11(6), 2020 Oct 30.

Article in English | MEDLINE | ID: mdl-33127862

ABSTRACT

We sequenced the genomes of 5,085 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains causing two coronavirus disease 2019 (COVID-19) disease waves in metropolitan Houston, TX, an ethnically diverse region with 7 million residents. The genomes were from viruses recovered in the earliest recognized phase of the pandemic in Houston and from viruses recovered in an ongoing massive second wave of infections. The virus was originally introduced into Houston many times independently. Virtually all strains in the second wave have a Gly614 amino acid replacement in the spike protein, a polymorphism that has been linked to increased transmission and infectivity. Patients infected with the Gly614 variant strains had significantly higher virus loads in the nasopharynx on initial diagnosis. We found little evidence of a significant relationship between virus genotype and altered virulence, stressing the linkage between disease severity, underlying medical conditions, and host genetics. Some regions of the spike protein, the primary target of global vaccine efforts, are replete with amino acid replacements, perhaps indicating the action of selection. We exploited the genomic data to generate defined single amino acid replacements in the receptor binding domain of spike protein that, importantly, produced decreased recognition by the neutralizing monoclonal antibody CR3022. Our report represents the first analysis of the molecular architecture of SARS-CoV-2 in two infection waves in a major metropolitan region. The findings will help us to understand the origin, composition, and trajectory of future infection waves and the potential effect of the host immune response and therapeutic maneuvers on SARS-CoV-2 evolution.

IMPORTANCE There is concern about second and subsequent waves of COVID-19 caused by the SARS-CoV-2 coronavirus occurring in communities globally that had an initial disease wave. Metropolitan Houston, TX, with a population of 7 million, is experiencing a massive second disease wave that began in late May 2020. To understand SARS-CoV-2 molecular population genomic architecture and evolution and the relationship between virus genotypes and patient features, we sequenced the genomes of 5,085 SARS-CoV-2 strains from these two waves. Our report provides the first molecular characterization of SARS-CoV-2 strains causing two distinct COVID-19 disease waves.

Subjects

Betacoronavirus/genetics , Coronavirus Infections/virology , Pneumonia, Viral/virology , Spike Glycoprotein, Coronavirus/chemistry , Spike Glycoprotein, Coronavirus/genetics , Amino Acid Sequence , Amino Acid Substitution , Antibodies, Neutralizing/immunology , Base Sequence , Betacoronavirus/immunology , COVID-19 , COVID-19 Testing , Clinical Laboratory Techniques , Coronavirus Infections/diagnosis , Coronavirus Infections/epidemiology , Coronavirus Infections/immunology , Coronavirus RNA-Dependent RNA Polymerase , Genome, Viral , Genotype , Humans , Machine Learning , Models, Molecular , Molecular Diagnostic Techniques , Pandemics , Phylogeny , Pneumonia, Viral/epidemiology , Pneumonia, Viral/immunology , RNA-Dependent RNA Polymerase/chemistry , RNA-Dependent RNA Polymerase/genetics , SARS-CoV-2 , Sequence Analysis, Protein , Spike Glycoprotein, Coronavirus/immunology , Texas/epidemiology , Viral Nonstructural Proteins/chemistry , Viral Nonstructural Proteins/genetics

19.

Ensemble transfer learning for the prediction of anti-cancer drug response.

Zhu, Yitan; Brettin, Thomas; Evrard, Yvonne A; Partin, Alexander; Xia, Fangfang; Shukla, Maulik; Yoo, Hyunseung; Doroshow, James H; Stevens, Rick L.

Sci Rep ; 10(1): 18040, 2020 Oct 22.

Article in English | MEDLINE | ID: mdl-33093487

ABSTRACT

Transfer learning, which transfers patterns learned on a source dataset to a related target dataset for constructing prediction models, has been shown effective in many applications. In this paper, we investigate whether transfer learning can be used to improve the performance of anti-cancer drug response prediction models. Previous transfer learning studies for drug response prediction focused on building models to predict the response of tumor cells to a specific drug treatment. We target the more challenging task of building general prediction models that can make predictions for both new tumor cells and new drugs. Uniquely, we investigate the power of transfer learning for three drug response prediction applications including drug repurposing, precision oncology, and new drug development, through different data partition schemes in cross-validation. We extend the classic transfer learning framework through ensemble and demonstrate its general utility with three representative prediction algorithms including a gradient boosting model and two deep neural networks. The ensemble transfer learning framework is tested on benchmark in vitro drug screening datasets. The results demonstrate that our framework broadly improves the prediction performance in all three drug response prediction applications with all three prediction algorithms.

Subjects

Antineoplastic Agents/pharmacology , Datasets as Topic , Deep Learning , Drug Screening Assays, Antitumor , Neoplasms/drug therapy , Neoplasms/pathology , Algorithms , Antineoplastic Agents/therapeutic use , Drug Development , Drug Repositioning , Humans , Models, Biological , Neural Networks, Computer , Precision Medicine

20.

Enhanced Co-Expression Extrapolation (COXEN) Gene Selection Method for Building Anti-Cancer Drug Response Prediction Models.

Zhu, Yitan; Brettin, Thomas; Evrard, Yvonne A; Xia, Fangfang; Partin, Alexander; Shukla, Maulik; Yoo, Hyunseung; Doroshow, James H; Stevens, Rick L.

Genes (Basel) ; 11(9), 2020 Sep 11.

Article in English | MEDLINE | ID: mdl-32933072

ABSTRACT

The co-expression extrapolation (COXEN) method has been successfully used in multiple studies to select genes for predicting the response of tumor cells to a specific drug treatment. Here, we enhance the COXEN method to select genes that are predictive of the efficacies of multiple drugs for building general drug response prediction models that are not specific to a particular drug. The enhanced COXEN method first ranks the genes according to their prediction power for each individual drug and then takes a union of top predictive genes of all the drugs, among which the algorithm further selects genes whose co-expression patterns are well preserved between cancer cases for building prediction models. We apply the proposed method on benchmark in vitro drug screening datasets and compare the performance of prediction models built based on the genes selected by the enhanced COXEN method to that of models built on genes selected by the original COXEN method and randomly picked genes. Models built with the enhanced COXEN method always present a statistically significantly improved prediction performance (adjusted p-value ≤ 0.05). Our results demonstrate the enhanced COXEN method can dramatically increase the power of gene expression data for predicting drug response.

Subjects

Antineoplastic Agents/pharmacology , Biomarkers, Tumor/genetics , Drug Screening Assays, Antitumor/methods , Gene Expression Profiling/methods , Models, Statistical , Neoplasms/drug therapy , Neoplasms/genetics , Algorithms , Humans
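
The three selection steps described in the abstract above (rank genes per drug, take the union of the top genes, keep genes whose co-expression patterns are preserved between two datasets) can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: it assumes absolute Pearson correlation as the measure of "prediction power", and scores co-expression preservation by correlating each gene's co-expression profile across the two datasets. The function names, parameter names, and dictionary-based data layout are all hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def enhanced_coxen_sketch(
    expr_a: Dict[str, List[float]],     # gene -> expression in dataset A
    responses: Dict[str, List[float]],  # drug -> response in dataset A
    expr_b: Dict[str, List[float]],     # gene -> expression in dataset B
    top_per_drug: int,
    n_final: int,
) -> List[str]:
    # Step 1 and 2: rank genes per drug by |correlation with response|
    # (assumed measure), then take the union of each drug's top genes.
    candidates = set()
    for response in responses.values():
        ranked = sorted(
            expr_a,
            key=lambda g: abs(pearson(expr_a[g], response)),
            reverse=True,
        )
        candidates.update(ranked[:top_per_drug])

    # Only genes measured in both datasets can be compared.
    cand = sorted(g for g in candidates if g in expr_b)

    def coexpression_profile(expr: Dict[str, List[float]], gene: str) -> List[float]:
        # Correlation of one gene with every other candidate gene.
        return [pearson(expr[gene], expr[other]) for other in cand if other != gene]

    # Step 3: a gene scores highly when its co-expression profile looks
    # similar in dataset A and dataset B.
    preservation = {
        g: pearson(coexpression_profile(expr_a, g), coexpression_profile(expr_b, g))
        for g in cand
    }
    return sorted(cand, key=lambda g: preservation[g], reverse=True)[:n_final]
```

Taking the union before the preservation filter is what makes the selected gene set drug-agnostic: a gene needs to be predictive for at least one drug, not for a particular one, before its cross-dataset stability is assessed.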
