Results 1 - 20 of 25
1.
Front Public Health ; 11: 1150228, 2023.
Article in English | MEDLINE | ID: mdl-37920576

ABSTRACT

Introduction: Dog-mediated rabies is enzootic in Vietnam, resulting in at least 70 reported human deaths and 500,000 human rabies exposures annually. In 2016, an integrated bite case management (IBCM) based surveillance program was developed to improve knowledge of the dog-mediated rabies burden in Phu Tho Province of Vietnam. Methods: The Vietnam Animal Rabies Surveillance Program (VARSP) was established in four stages: (1) Laboratory development, (2) Training of community One Health workers, (3) Paper-based reporting (VARSP 1.0), and (4) Electronic case reporting (VARSP 2.0). Investigation and diagnostic data collected from March 2016 to December 2019 were compared with historical records of animal rabies cases dating back to January 2012. A risk analysis was conducted to evaluate the probability of a rabies exposure resulting in death after a dog bite, based on data collected over the course of an IBCM investigation. Results: Prior to the implementation of VARSP, between 2012 and 2015, there was an average of one rabies investigation per year, resulting in two confirmed and two probable animal rabies cases. During the 46 months that VARSP was operational (2016 - 2019), 1048 animal investigations were conducted, which identified 79 (8%) laboratory-confirmed rabies cases and 233 (22%) clinically confirmed (probable) cases. VARSP produced a 78-fold increase in annual animal rabies case detection (one case detected per year pre-VARSP vs 78 cases per year under VARSP). The risk of succumbing to rabies for bite victims of apparently healthy dogs available for home quarantine was three deaths for every 10,000 untreated exposures. Discussion: A pilot IBCM model used in Phu Tho Province showed promising results for improving rabies surveillance, with a 26-fold increase in annual case detection after implementation of a One Health model. The risk for a person bitten by an apparently healthy dog to develop rabies in the absence of rabies PEP was very low, which supports the WHO recommendations to delay PEP for this category of bite victims, when trained animal assessors are available and routinely communicate with the medical sector. Recent adoption of an electronic IBCM system is likely to expedite adoption of VARSP 2.0 to other Provinces and improve accuracy of field decisions and data collection.
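The detection figures quoted in the Results can be checked with a few lines of arithmetic. The sketch below uses only counts stated in the abstract. Dividing by four calendar years (2016 to 2019) rather than by 46 months is an assumption, chosen because it reproduces the reported 78 cases per year.

```python
# Counts taken from the abstract above.
investigations = 1048
lab_confirmed = 79
probable = 233

# Pre-VARSP baseline: two confirmed + two probable cases over 2012-2015.
baseline_cases_per_year = (2 + 2) / 4

# Assumption: the annual rate is computed over four calendar years (2016-2019).
varsp_cases_per_year = (lab_confirmed + probable) / 4

fold_increase = varsp_cases_per_year / baseline_cases_per_year
pct_confirmed = round(100 * lab_confirmed / investigations)
pct_probable = round(100 * probable / investigations)

print(varsp_cases_per_year, fold_increase, pct_confirmed, pct_probable)
```

Under that assumption the 78-fold figure and both percentages follow directly from the stated counts.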


Subject(s)
Bites and Stings , Rabies , Humans , Dogs , Animals , Rabies/epidemiology , Rabies/therapy , Rabies/veterinary , Case Management , Vietnam/epidemiology , Patient Acceptance of Health Care , Risk Assessment , Bites and Stings/epidemiology
2.
IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1020-1029, 2023.
Article in English | MEDLINE | ID: mdl-35820003

ABSTRACT

Many high-performance drug-target affinity (DTA) deep learning models have been proposed, but they are mostly black-box and thus lack human interpretability. Explainable AI (XAI) can make DTA models more trustworthy, and allows biological knowledge to be distilled from the models. Counterfactual explanation is one popular approach to explaining the behaviour of a deep neural network, which works by systematically answering the question "How would the model output change if the inputs were changed in this way?". We propose a multi-agent reinforcement learning framework, Multi-Agent Counterfactual Drug-target binding Affinity (MACDA), to generate counterfactual explanations for the drug-protein complex. Our proposed framework provides human-interpretable counterfactual instances while optimizing both the input drug and target for counterfactual generation at the same time. We benchmark the proposed MACDA framework using the Davis and PDBBind datasets and find that our framework produces more parsimonious explanations with no loss in explanation validity, as measured by encoding similarity. We then present a case study involving ABL1 and Nilotinib to demonstrate how MACDA can explain the behaviour of a DTA model in the underlying substructure interaction between inputs in its prediction, revealing mechanisms that align with prior domain knowledge.
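The counterfactual question quoted in the abstract can be illustrated with a toy example. This is a minimal sketch, not MACDA: the scoring function `toy_affinity`, the feature names, and the single-feature search are all hypothetical, and no reinforcement learning is involved.

```python
from typing import Callable, Dict, List, Tuple

def toy_affinity(features: Dict[str, int]) -> float:
    """Hypothetical stand-in for a trained DTA model (not a real predictor)."""
    weights = {"ring_a": 2.0, "amide": 1.5, "methyl": 0.25}
    return sum(weights[name] * present for name, present in features.items())

def single_feature_counterfactuals(
    model: Callable[[Dict[str, int]], float],
    features: Dict[str, int],
) -> List[Tuple[str, float]]:
    """Remove one present feature at a time and record the change in output."""
    baseline = model(features)
    changes = []
    for name, present in features.items():
        if not present:
            continue
        edited = dict(features, **{name: 0})  # counterfactual input
        changes.append((name, model(edited) - baseline))
    # Largest drop first: these features matter most to the prediction.
    return sorted(changes, key=lambda item: item[1])

drug = {"ring_a": 1, "amide": 1, "methyl": 1}
ranked = single_feature_counterfactuals(toy_affinity, drug)
print(ranked)
```

Each entry in `ranked` answers one "what if this substructure were absent?" question. A framework such as the one described would search over edits to both the drug and the target rather than over single features of one input.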


Subject(s)
Benchmarking , Neural Networks, Computer , Humans , Drug Development
3.
Int J Data Sci Anal ; : 1-16, 2022 Nov 18.
Article in English | MEDLINE | ID: mdl-36440369

ABSTRACT

Discovering new medicines is the hallmark of the human endeavor to live a better and longer life. Yet the pace of discovery has slowed as we must venture into largely unexplored biomedical space to find medicines that meet today's high standards. Modern AI, enabled by powerful computing, large biomedical databases, and breakthroughs in deep learning, offers new hope of breaking this loop, as it is rapidly maturing and ready to make a huge impact in the area. In this paper, we review recent advances in AI methodologies that aim to crack this challenge. We organize the vast and rapidly growing literature on AI for drug discovery into three relatively stable sub-areas: (a) representation learning over molecular sequences and geometric graphs; (b) data-driven reasoning where we predict molecular properties and their binding, optimize existing compounds, generate de novo molecules, and plan the synthesis of target molecules; and (c) knowledge-based reasoning where we discuss the construction of and reasoning over biomedical knowledge graphs. We also identify open challenges and chart possible research directions for the years to come.

4.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35788823

ABSTRACT

Predicting the drug-target interaction is crucial for drug discovery as well as drug repurposing. Machine learning is commonly used for the drug-target affinity (DTA) problem. However, machine learning models face the cold-start problem, where model performance drops when predicting the interaction of a novel drug or target. Previous works try to solve the cold-start problem by learning the drug or target representation using unsupervised learning. While the drug or target representation can be learned in an unsupervised manner, it still lacks the interaction information, which is critical in drug-target interaction. To incorporate the interaction information into the drug and protein representations, we propose using transfer learning from chemical-chemical interaction (CCI) and protein-protein interaction (PPI) tasks to the drug-target interaction task. The representations learned by the CCI and PPI tasks can be transferred smoothly to the DTA task due to the similar nature of the tasks. The results on the DTA datasets show that our proposed method has advantages over other pre-training methods in the DTA task.
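The cold-start setting described above is commonly evaluated by holding out whole drugs (or targets) rather than individual pairs. A minimal sketch of such a split, assuming a list of hypothetical `(drug, target, affinity)` records. The function name and the data are illustrative and not taken from the paper.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str, float]  # (drug id, target id, affinity)

def cold_drug_split(pairs: List[Pair], test_fraction: float, seed: int = 0):
    """Split so that no drug in the test set appears in the training set."""
    drugs = sorted({drug for drug, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(drugs)
    n_test = max(1, int(len(drugs) * test_fraction))
    test_drugs = set(drugs[:n_test])
    train = [p for p in pairs if p[0] not in test_drugs]
    test = [p for p in pairs if p[0] in test_drugs]
    return train, test

# Hypothetical toy data.
pairs = [
    ("d1", "t1", 5.0), ("d1", "t2", 6.1),
    ("d2", "t1", 7.3), ("d3", "t2", 4.8),
    ("d4", "t1", 5.5), ("d4", "t3", 6.9),
]
train, test = cold_drug_split(pairs, test_fraction=0.25)
print(len(train), len(test))
```

Splitting on the target column instead gives the cold-target setting. A random split over pairs would leak every drug into training and hide the performance drop the abstract describes.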


Subject(s)
Drug Development , Machine Learning , Drug Discovery/methods , Drug Repositioning
5.
Article in English | MEDLINE | ID: mdl-33606633

ABSTRACT

BACKGROUND: Drug response prediction is an important problem in computational personalized medicine. Many machine-learning-based methods, especially deep learning-based ones, have been proposed for this task. However, these methods often represent the drugs as strings, which are not a natural way to depict molecules. Also, interpretation (e.g., which mutations or copy number aberrations contribute to the drug response) has not been considered thoroughly. METHODS: In this study, we propose a novel method, GraphDRP, based on graph convolutional networks for the problem. In GraphDRP, drugs were represented as molecular graphs directly capturing the bonds among atoms, while cell lines were depicted as binary vectors of genomic aberrations. Representative features of drugs and cell lines were learned by convolution layers, then combined to represent each drug-cell line pair. Finally, the response value of each drug-cell line pair was predicted by a fully-connected neural network. Four variants of graph convolutional networks were used for learning the features of drugs. RESULTS: We found that GraphDRP outperforms tCNNS in all performance measures for all experiments. Also, through saliency maps of the resulting GraphDRP models, we discovered the contribution of the genomic aberrations to the responses. CONCLUSION: Representing drugs as graphs can improve the performance of drug response prediction. Availability of data and materials: Data and source code can be downloaded at https://github.com/hauldhut/GraphDRP.


Subject(s)
Neural Networks, Computer , Pharmaceutical Preparations , Genomics , Machine Learning , Software
6.
Article in English | MEDLINE | ID: mdl-34197324

ABSTRACT

Predicting the interaction between a compound and a target is crucial for rapid drug repurposing. Deep learning has been successfully applied to the drug-target affinity (DTA) problem. However, previous deep learning-based methods ignore modeling the direct interactions between drug and protein residues. This would lead to inaccurate learning of target representation which may change due to the drug binding effects. In addition, previous DTA methods learn protein representation solely based on a small number of protein sequences in DTA datasets while neglecting the use of proteins outside of the DTA datasets. We propose GEFA (Graph Early Fusion Affinity), a novel graph-in-graph neural network with attention mechanism to address the changes in target representation because of the binding effects. Specifically, a drug is modeled as a graph of atoms, which then serves as a node in a larger graph of residues-drug complex. The resulting model is an expressive deep nested graph neural network. We also use pre-trained protein representation powered by the recent effort of learning contextualized protein representation. The experiments are conducted under different settings to evaluate scenarios such as novel drugs or targets. The results demonstrate the effectiveness of the pre-trained protein embedding and the advantages of our GEFA in modeling the nested graph for drug-target interaction.


Subject(s)
Drug Development , Neural Networks, Computer , Amino Acid Sequence , Drug Development/methods , Drug Repositioning , Proteins/chemistry
7.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-34020545

ABSTRACT

MOTIVATION: Predicting cell locations is important since with the understanding of cell locations, we may estimate the function of cells and their integration with the spatial environment. Thus, the DREAM challenge on single-cell transcriptomics required participants to predict the locations of single cells in the Drosophila embryo using single-cell transcriptomic data. RESULTS: We have developed over 50 pipelines by combining different ways of preprocessing the RNA-seq data, selecting the genes, predicting the cell locations and validating predicted cell locations, resulting in the winning methods which were ranked second in sub-challenge 1, first in sub-challenge 2 and third in sub-challenge 3. In this paper, we present an R package, SCTCwhatateam, which includes all the methods we developed and the Shiny web application to facilitate the research on single-cell spatial reconstruction. All the data and the example use cases are available in the Supplementary data.


Subject(s)
Single-Cell Analysis/methods , Transcriptome , Algorithms , Animals , Computational Biology/methods , Drosophila/embryology , Sequence Analysis, RNA/methods
8.
PLoS One ; 16(5): e0251787, 2021.
Article in English | MEDLINE | ID: mdl-34010314

ABSTRACT

Data generated within social media platforms may present a new way to identify individuals who are experiencing mental illness. This study aimed to investigate the associations between linguistic features in individuals' blog data and their symptoms of depression, generalised anxiety, and suicidal ideation. Individuals who blogged were invited to participate in a longitudinal study in which they completed fortnightly symptom scales for depression and anxiety (PHQ-9, GAD-7) for a period of 36 weeks. Blog data published in the same period was also collected, and linguistic features were analysed using the LIWC tool. Bivariate and multivariate analyses were performed to investigate the correlations between the linguistic features and symptoms between subjects. Multivariate regression models were used to predict longitudinal changes in symptoms within subjects. A total of 153 participants consented to the study. The final sample consisted of the 38 participants who completed the required number of symptom scales and generated blog data during the study period. Between-subject analysis revealed that the linguistic features "tentativeness" and "non-fluencies" were significantly correlated with symptoms of depression and anxiety, but not suicidal thoughts. Within-subject analysis showed no robust correlations between linguistic features and changes in symptoms. The findings may provide evidence of a relationship between some linguistic features in social media data and mental health; however, the study was limited by missing data and other important considerations. The findings also suggest that linguistic features observed at the group level may not generalise to, or be useful for, detecting individual symptom change over time.


Subject(s)
Anxiety/psychology , Depression/psychology , Mental Health , Social Media , Suicidal Ideation , Adolescent , Adult , Aged , Female , Humans , Language , Longitudinal Studies , Male , Middle Aged , Patient Health Questionnaire , Risk Factors
9.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2841-2847, 2021.
Article in English | MEDLINE | ID: mdl-33909569

ABSTRACT

The classification of clinical samples based on gene expression data is an important part of precision medicine. In this manuscript, we show how transforming gene expression data into a set of personalized (sample-specific) networks can allow us to harness existing graph-based methods to improve classifier performance. Existing approaches to personalized gene networks have the limitation that they depend on other samples in the data and must get re-computed whenever a new sample is introduced. Here, we propose a novel method, called Personalized Annotation-based Networks (PAN), that avoids this limitation by using curated annotation databases to transform gene expression data into a graph. Unlike competing methods, PANs are calculated for each sample independent of the population, making it a more efficient way to obtain single-sample networks. Using three breast cancer datasets as a case study, we show that PAN classifiers not only predict cancer relapse better than gene features alone, but also outperform PPI (protein-protein interactions) and population-level graph-based classifiers. This work demonstrates the practical advantages of graph-based classification for high-dimensional genomic data, while offering a new approach to making sample-specific networks. Supplementary information: PAN and the baselines are implemented in Python. Source code and data are available at https://github.com/thinng/PAN.


Subject(s)
Breast Neoplasms , Genomics/methods , Molecular Sequence Annotation/methods , Neoplasm Recurrence, Local , Precision Medicine/methods , Algorithms , Breast Neoplasms/diagnosis , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Databases, Genetic , Female , Humans , Neoplasm Recurrence, Local/diagnosis , Neoplasm Recurrence, Local/genetics , Neoplasm Recurrence, Local/metabolism , Neoplasm Recurrence, Local/pathology , Protein Interaction Maps/genetics , Software , Transcriptome/genetics
10.
Bioinformatics ; 37(19): 3285-3292, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33904576

ABSTRACT

MOTIVATION: Unravelling cancer driver genes is important in cancer research. Although computational methods have been developed to identify cancer drivers, most of them detect cancer drivers at population level. However, two patients who have the same cancer type and receive the same treatment may have different outcomes because each patient has a different genome and their disease might be driven by different driver genes. Therefore new methods are being developed for discovering cancer drivers at individual level, but existing personalized methods only focus on coding drivers while microRNAs (miRNAs) have been shown to drive cancer progression as well. Thus, novel methods are required to discover both coding and miRNA cancer drivers at individual level. RESULTS: We propose the novel method, pDriver, to discover personalized cancer drivers. pDriver includes two stages: (i) constructing gene networks for each cancer patient and (ii) discovering cancer drivers for each patient based on the constructed gene networks. To demonstrate the effectiveness of pDriver, we have applied it to five TCGA cancer datasets and compared it with the state-of-the-art methods. The result indicates that pDriver is more effective than other methods. Furthermore, pDriver can also detect miRNA cancer drivers and most of them have been confirmed to be associated with cancer by literature. We further analyze the predicted personalized drivers for breast cancer patients and the result shows that they are significantly enriched in many GO processes and KEGG pathways involved in breast cancer. AVAILABILITY AND IMPLEMENTATION: pDriver is available at https://github.com/pvvhoang/pDriver. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

11.
Sci Rep ; 11(1): 3487, 2021 02 10.
Article in English | MEDLINE | ID: mdl-33568759

ABSTRACT

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly pathogenic virus that has caused the global COVID-19 pandemic. Tracing the evolution and transmission of the virus is crucial to respond to and control the pandemic through appropriate intervention strategies. This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. Prediction results suggest that mutation D614G in the virus spike protein, which has attracted much attention from researchers, is unlikely to make changes in protein secondary structure and relative solvent accessibility. Based on 6324 viral genome sequences, we create a spreadsheet dataset of point mutations that can facilitate the investigation of SARS-CoV-2 from many perspectives, especially in tracing the evolution and worldwide spread of the virus. Our analysis results also show that coding genes E, M, ORF6, ORF7a, ORF7b and ORF10 are the most stable, and potentially suitable targets for vaccine and drug development.


Subject(s)
COVID-19/virology , Genome, Viral , Mutation , Protein Structure, Secondary , SARS-CoV-2/genetics , DNA, Viral , Genomics , Humans , SARS-CoV-2/metabolism , Spike Glycoprotein, Coronavirus/genetics , Spike Glycoprotein, Coronavirus/metabolism
12.
Bioinformatics ; 37(8): 1140-1147, 2021 05 23.
Article in English | MEDLINE | ID: mdl-33119053

ABSTRACT

SUMMARY: The development of new drugs is costly, time consuming and often accompanied with safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. In order to repurpose drugs effectively, it is useful to know which proteins are targeted by which drugs. Computational models that estimate the interaction strength of new drug-target pairs have the potential to expedite drug repurposing. Several models have been proposed for this task. However, these models represent the drugs as strings, which is not a natural way to represent molecules. We propose a new model called GraphDTA that represents drugs as graphs and uses graph neural networks to predict drug-target affinity. We show that graph neural networks not only predict drug-target affinity better than non-deep learning models, but also outperform competing deep learning methods. Our results confirm that deep learning models are appropriate for drug-target binding affinity prediction, and that representing drugs as graphs can lead to further improvements. AVAILABILITY OF IMPLEMENTATION: The proposed models are implemented in Python. Related data, pre-trained models and source code are publicly available at https://github.com/thinng/GraphDTA. All scripts and data needed to reproduce the post hoc statistical analysis are available from https://doi.org/10.5281/zenodo.3603523. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
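To make the contrast between string and graph representations concrete, the sketch below encodes ethanol both ways and performs one round of neighbour averaging, the basic operation that graph neural networks build on. It is an illustration in plain Python, not the GraphDTA implementation. The feature encoding (a single number per atom) is a deliberate simplification.

```python
# Ethanol as a string (SMILES) and as a graph of heavy atoms.
smiles = "CCO"
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # undirected edges between atom indices

# Adjacency sets with self-loops, as used by many graph convolution variants.
neighbours = {i: {i} for i in range(len(atoms))}
for a, b in bonds:
    neighbours[a].add(b)
    neighbours[b].add(a)

# Simplified node feature: atomic number (assumption for illustration).
atomic_number = {"C": 6.0, "O": 8.0}
features = [atomic_number[symbol] for symbol in atoms]

def aggregate(values, adjacency):
    """One message-passing step: each atom takes the mean over itself and its neighbours."""
    return [
        sum(values[j] for j in adjacency[i]) / len(adjacency[i])
        for i in range(len(values))
    ]

updated = aggregate(features, neighbours)
print(updated)
```

After one step the two carbons no longer share the same value, because only one of them is bonded to the oxygen. The string form carries this connectivity only implicitly, while the graph form makes it available to the model directly.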


Subject(s)
Neural Networks, Computer , Pharmaceutical Preparations , Drug Repositioning , Proteins , Software
13.
Multimed Tools Appl ; 80(5): 7187-7204, 2021.
Article in English | MEDLINE | ID: mdl-33132740

ABSTRACT

We propose in this work a graph-based approach for automatic public health analysis using social media. In our approach, graphs are created to model the interactions between features and between tweets in social media. We investigated different graph properties and methods in constructing graph-based representations for population health analysis. The proposed approach is applied in two case studies: (1) estimating health indices, and (2) classifying health situation of counties in the US. We evaluate our approach on a dataset including more than one billion tweets collected in three years 2014, 2015, and 2016, and the health surveys from the Behavioral Risk Factor Surveillance System. We conducted realistic and large-scale experiments on various textual features and graph-based representations. Experimental results verified the robustness of the proposed approach and its superiority over existing ones in both case studies, confirming the potential of graph-based approach for modeling interactions in social networks for population health analysis.

14.
Crit Care Resusc ; 2020 Sep 24.
Article in English | MEDLINE | ID: mdl-33105920

ABSTRACT

Using geotagged Twitter data in Victoria, we created a mobility index and studied the changes during the staged restrictions during the coronavirus disease 2019 (COVID-19) pandemic. We describe preliminary evidence that geotagged Twitter data may be used to provide real-time population mobility data and information on the impact of restrictions on such mobility.

15.
Life Sci Alliance ; 3(11)2020 11.
Article in English | MEDLINE | ID: mdl-32972997

ABSTRACT

Single-cell RNA-sequencing (scRNAseq) technologies are rapidly evolving. Although very informative, in standard scRNAseq experiments, the spatial organization of the cells in the tissue of origin is lost. Conversely, spatial RNA-seq technologies designed to maintain cell localization have limited throughput and gene coverage. Mapping scRNAseq to genes with spatial information increases coverage while providing spatial location. However, methods to perform such mapping have not yet been benchmarked. To fill this gap, we organized the DREAM Single-Cell Transcriptomics challenge focused on the spatial reconstruction of cells from the Drosophila embryo from scRNAseq data, leveraging, as a silver standard, genes with in situ hybridization data from the Berkeley Drosophila Transcription Network Project reference atlas. The 34 participating teams used diverse algorithms for gene selection and location prediction, and were able to correctly localize clusters of cells. Selection of predictor genes was essential for this task. Predictor genes showed a relatively high expression entropy and high spatial clustering, and included prominent developmental genes such as gap and pair-rule genes and tissue markers. Application of the top 10 methods to a zebrafish embryo dataset yielded performance and statistical properties of the selected genes similar to those in the Drosophila data. This suggests that methods developed in this challenge are able to extract generalizable properties of genes that are useful to accurately reconstruct the spatial arrangement of cells in tissues.


Subject(s)
Computational Biology/methods , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Spatial Analysis , Algorithms , Animals , Databases, Genetic , Drosophila/genetics , Forecasting/methods , Gene Expression Regulation, Developmental/genetics , Gene Regulatory Networks/genetics , Sequence Analysis, RNA/methods , Transcriptome/genetics , Zebrafish/genetics
16.
Crit Care Resusc ; 22(4): 293-294, 2020 Dec.
Article in English | MEDLINE | ID: mdl-38046867

ABSTRACT

Using geotagged Twitter data in Victoria, we created a mobility index and studied the changes during the staged restrictions during the coronavirus disease 2019 (COVID-19) pandemic. We describe preliminary evidence that geotagged Twitter data may be used to provide real-time population mobility data and information on the impact of restrictions on such mobility.

17.
J Biomed Inform ; 99: 103277, 2019 11.
Article in English | MEDLINE | ID: mdl-31521858

ABSTRACT

Public health measurement is important for government administration as it provides indicators and implications for public healthcare strategies. The measurement of health status has traditionally been conducted via surveys in the form of pre-designed questionnaires to collect responses from targeted participants. Despite its benefits, the traditional approach is costly, time-consuming, and not scalable. These limitations are a major obstacle for policy makers developing up-to-date healthcare programs. This paper studies the use of health-related information conveyed in user-generated content from social media for prediction of health outcomes at population level. Specifically, we investigate linguistic features for analysing textual data. We propose the use of visual features learnt from deep neural networks for understanding visual data. We introduce collective social capital information from location-based social media data. We conducted extensive experiments on large-scale datasets collected from two online social networks, Foursquare and Flickr, on the task of predicting U.S. county health indices. Experimental results showed that visual and collective social capital data achieved comparable prediction performance and outperformed textual information. These promising results also suggest the potential of social media for health analysis at population scales.


Subject(s)
Health Status , Population Health/statistics & numerical data , Social Media , Data Visualization , Empirical Research , Humans , Neural Networks, Computer , Psycholinguistics , Public Health , Surveys and Questionnaires
18.
Mol Biol Rep ; 46(6): 5919-5930, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31410687

ABSTRACT

In the progression of cancer, cells acquire genetic mutations that cause uncontrolled growth. Over time, the primary tumour may undergo additional mutations that allow for the cancerous cells to spread throughout the body as metastases. Since metastatic development typically results in markedly worse patient outcomes, research into the identity and function of metastasis-associated biomarkers could eventually translate into clinical diagnostics or novel therapeutics. Although the general processes underpinning metastatic progression are understood, no clear cross-cancer biomarker profile has emerged. However, the literature suggests that some microRNAs (miRNAs) may play an important role in the metastatic progression of several cancer types. Using a subset of The Cancer Genome Atlas (TCGA) data, we performed an integrated analysis of mRNA and miRNA expression with paired metastatic and primary tumour samples to interrogate how the miRNA-mRNA regulatory axis influences metastatic progression. From this, we successfully built mRNA- and miRNA-specific classifiers that can discriminate pairs of metastatic and primary samples across 11 cancer types. In addition, we identified a number of miRNAs whose metastasis-associated dysregulation could predict mRNA metastasis-associated dysregulation. Among the most predictive miRNAs, we found several previously implicated in cancer progression, including miR-301b, miR-1296, and miR-423. Taken together, our results suggest that metastatic samples have a common cross-cancer signature when compared with their primary tumour pair, and that these miRNA biomarkers can be used to predict metastatic status as well as mRNA expression.


Subject(s)
Gene Expression Regulation, Neoplastic/genetics , Neoplasm Metastasis/genetics , Neoplasms/genetics , Biomarkers, Tumor/genetics , Databases, Genetic , Forecasting/methods , Gene Expression Profiling/methods , Humans , MicroRNAs/genetics , Oligonucleotide Array Sequence Analysis/methods , RNA, Messenger/genetics
19.
Front Genet ; 10: 599, 2019.
Article in English | MEDLINE | ID: mdl-31312210

ABSTRACT

Since the turn of the century, researchers have sought to diagnose cancer based on gene expression signatures measured from the blood or biopsy as biomarkers. This task, known as classification, is typically solved using a suite of algorithms that learn a mathematical rule capable of discriminating one group ("cases") from another ("controls"). However, discriminatory methods can only identify cancerous samples that resemble those that the algorithm already saw during training. As such, discriminatory methods may be ill-suited for the classification of cancer: because the possibility space of cancer is definitively large, the existence of a one-of-a-kind gene expression signature is likely. Instead, we propose using an established surveillance method that detects anomalous samples based on their deviation from a learned normal steady-state structure. By transferring this method to transcriptomic data, we can create an anomaly detector for tissue transcriptomes, a "tissue detector," that is capable of identifying cancer without ever seeing a single cancer example. As a proof-of-concept, we train a "tissue detector" on normal GTEx samples that can classify TCGA samples with >90% AUC for 3 out of 6 tissues. Importantly, we find that the classification accuracy is improved simply by adding more healthy samples. We conclude this report by emphasizing the conceptual advantages of anomaly detection and by highlighting future directions for this field of study.
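The idea of learning only from healthy samples can be sketched with a simple z-score detector. This is a stand-in chosen for brevity, not the surveillance method used in the paper. The expression values and the threshold are made up for illustration.

```python
from statistics import mean, stdev
from typing import List

def fit_normal_profile(normal_samples: List[List[float]]):
    """Learn per-gene mean and standard deviation from normal samples only."""
    genes = list(zip(*normal_samples))  # transpose: one tuple per gene
    return [mean(g) for g in genes], [stdev(g) for g in genes]

def anomaly_score(sample: List[float], means, stdevs) -> float:
    """Mean absolute z-score across genes; larger means further from normal."""
    return mean([abs(x - m) / s for x, m, s in zip(sample, means, stdevs)])

# Hypothetical expression values for three genes in four normal samples.
normal = [
    [1.0, 5.0, 3.0],
    [1.2, 5.2, 2.8],
    [0.8, 4.8, 3.2],
    [1.0, 5.0, 3.0],
]
means, stdevs = fit_normal_profile(normal)

typical = [1.1, 5.1, 2.9]
unusual = [3.0, 1.0, 6.0]
threshold = 3.0  # illustrative cut-off, not a tuned value

print(anomaly_score(typical, means, stdevs) > threshold)
print(anomaly_score(unusual, means, stdevs) > threshold)
```

No abnormal example is needed to fit the profile, which is the conceptual advantage the abstract emphasizes. Adding more normal samples only refines the estimated means and spreads.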

20.
Am J Med Genet B Neuropsychiatr Genet ; 180(7): 508-518, 2019 10.
Article in English | MEDLINE | ID: mdl-31025483

ABSTRACT

Although neuropsychiatric disorders have an established genetic background, their molecular foundations remain elusive. This has prompted many investigators to search for explanatory biomarkers that can predict clinical outcomes. One approach uses machine learning to classify patients based on blood mRNA expression. However, these endeavors typically fail to achieve the high level of performance, stability, and generalizability required for clinical translation. Moreover, these classifiers can lack interpretability because not all genes have relevance to researchers. For this study, we hypothesized that annotation-based classifiers can improve classification performance, stability, generalizability, and interpretability. To this end, we evaluated the models of four classification algorithms on six neuropsychiatric data sets using four annotation databases. Our results suggest that the Gene Ontology Biological Process database can transform gene expression into an annotation-based feature space that is accurate and stable. We also show how annotation features can improve the interpretability of classifiers: as annotations are used to assign biological importance to genes, the biological importance of annotation-based features is the features themselves. In evaluating the annotation features, we find that top ranked annotations tend to contain top ranked genes, suggesting that the most predictive annotations are a superset of the most predictive genes. Based on this, and the fact that annotations are used routinely to assign biological importance to genetic data, we recommend transforming gene-level expression into annotation-level expression prior to the classification of neuropsychiatric conditions.
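The recommended transformation from gene-level to annotation-level expression can be sketched as follows. The aggregation rule (mean expression over each annotation's member genes), the gene names, and the annotation sets are assumptions for illustration. The abstract does not specify how the paper aggregates.

```python
from statistics import mean
from typing import Dict, List

def to_annotation_features(
    expression: Dict[str, float],
    annotations: Dict[str, List[str]],
) -> Dict[str, float]:
    """Summarise one sample's gene expression per annotation term."""
    features = {}
    for term, genes in annotations.items():
        measured = [expression[g] for g in genes if g in expression]
        if measured:  # skip terms with no measured genes
            features[term] = mean(measured)
    return features

# Hypothetical sample and annotation sets.
sample = {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 9.0}
annotations = {
    "term_1": ["GENE_A", "GENE_B"],
    "term_2": ["GENE_B", "GENE_C", "GENE_D"],  # GENE_D not measured
    "term_3": ["GENE_E"],                      # no measured genes
}
annotation_features = to_annotation_features(sample, annotations)
print(annotation_features)
```

The resulting dictionary can be fed to any classifier in place of the gene-level vector, and each feature name is itself a biological annotation, which is the interpretability argument made in the abstract.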


Subject(s)
Mental Disorders/classification , Nervous System Diseases/classification , Neuropsychiatry/methods , Algorithms , Computational Biology/methods , Databases, Genetic , Gene Ontology , Humans