Results 1 - 20 of 25
1.
Front Public Health ; 11: 1150228, 2023.
Article in English | MEDLINE | ID: mdl-37920576

ABSTRACT

Introduction: Dog-mediated rabies is enzootic in Vietnam, resulting in at least 70 reported human deaths and 500,000 human rabies exposures annually. In 2016, an integrated bite case management (IBCM) based surveillance program was developed to improve knowledge of the dog-mediated rabies burden in Phu Tho Province of Vietnam. Methods: The Vietnam Animal Rabies Surveillance Program (VARSP) was established in four stages: (1) Laboratory development, (2) Training of community One Health workers, (3) Paper-based reporting (VARSP 1.0), and (4) Electronic case reporting (VARSP 2.0). Investigation and diagnostic data collected from March 2016 to December 2019 were compared with historical records of animal rabies cases dating back to January 2012. A risk analysis was conducted to evaluate the probability of a rabies exposure resulting in death after a dog bite, based on data collected over the course of an IBCM investigation. Results: Prior to the implementation of VARSP, between 2012 and 2015, there was an average of one rabies investigation per year, resulting in two confirmed and two probable animal rabies cases. During the 46 months that VARSP was operational (2016 - 2019), 1048 animal investigations were conducted, which identified 79 (8%) laboratory-confirmed rabies cases and 233 (22%) clinically confirmed (probable) cases. VARSP produced a 78-fold increase in annual animal rabies case detection (one case detected per year pre-VARSP vs 78 cases per year under VARSP). The risk of succumbing to rabies for bite victims of apparently healthy dogs available for home quarantine was three deaths for every 10,000 untreated exposures. Discussion: A pilot IBCM model used in Phu Tho Province showed promising results for improving rabies surveillance, with a 26-fold increase in annual case detection after implementation of a One Health model. The risk for a person bitten by an apparently healthy dog to develop rabies in the absence of rabies PEP was very low, which supports the WHO recommendations to delay PEP for this category of bite victims when trained animal assessors are available and routinely communicate with the medical sector. Recent adoption of an electronic IBCM system is likely to expedite adoption of VARSP 2.0 in other provinces and improve the accuracy of field decisions and data collection.


Subjects
Bites and Stings , Rabies , Humans , Dogs , Animals , Rabies/epidemiology , Rabies/therapy , Rabies/veterinary , Case Management , Vietnam/epidemiology , Patient Acceptance of Health Care , Risk Assessment , Bites and Stings/epidemiology
2.
IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1020-1029, 2023.
Article in English | MEDLINE | ID: mdl-35820003

ABSTRACT

Many high-performance drug-target binding affinity (DTA) deep learning models have been proposed, but they are mostly black-box and thus lack human interpretability. Explainable AI (XAI) can make DTA models more trustworthy and allows biological knowledge to be distilled from the models. Counterfactual explanation is one popular approach to explaining the behaviour of a deep neural network, which works by systematically answering the question "How would the model output change if the inputs were changed in this way?". We propose a multi-agent reinforcement learning framework, Multi-Agent Counterfactual Drug-target binding Affinity (MACDA), to generate counterfactual explanations for the drug-protein complex. Our proposed framework provides human-interpretable counterfactual instances while optimizing both the input drug and target for counterfactual generation at the same time. We benchmark the proposed MACDA framework using the Davis and PDBBind datasets and find that our framework produces more parsimonious explanations with no loss in explanation validity, as measured by encoding similarity. We then present a case study involving ABL1 and Nilotinib to demonstrate how MACDA can explain the behaviour of a DTA model in the underlying substructure interaction between inputs in its prediction, revealing mechanisms that align with prior domain knowledge.
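
The counterfactual question quoted in this abstract can be illustrated with a toy sketch. This is not MACDA and involves no reinforcement learning: `toy_model`, its weights, the binary features, and the threshold are all hypothetical stand-ins, and the search is a plain brute-force loop over feature flips, smallest sets first.

```python
from itertools import combinations


def toy_model(features):
    """Hypothetical stand-in for a trained predictor: a fixed linear score."""
    weights = [0.9, -0.4, 0.6, 0.1]
    return sum(w * x for w, x in zip(weights, features))


def counterfactual(features, threshold=1.0):
    """Find the smallest set of binary-feature flips that moves the
    model score to the other side of the threshold."""
    original_side = toy_model(features) >= threshold
    for size in range(1, len(features) + 1):
        for idx in combinations(range(len(features)), size):
            candidate = list(features)
            for i in idx:
                candidate[i] = 1 - candidate[i]  # flip one binary feature
            if (toy_model(candidate) >= threshold) != original_side:
                return idx, candidate
    return None  # no combination of flips changes the decision


flips, changed = counterfactual([1, 0, 1, 0])
print(flips, changed)
```

Searching smaller flip sets first is one simple way to favour parsimonious explanations; a real molecular setting would need edits that keep the drug and protein chemically valid.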


Subjects
Benchmarking , Neural Networks, Computer , Humans , Drug Development
3.
Int J Data Sci Anal ; : 1-16, 2022 Nov 18.
Article in English | MEDLINE | ID: mdl-36440369

ABSTRACT

Discovering new medicines is the hallmark of the human endeavor to live a better and longer life. Yet the pace of discovery has slowed down as we need to venture into more wildly unexplored biomedical space to find medicines that match today's high standards. Modern AI, enabled by powerful computing, large biomedical databases, and breakthroughs in deep learning, offers new hope to break this loop, as AI is rapidly maturing and ready to make a huge impact in the area. In this paper, we review recent advances in AI methodologies that aim to crack this challenge. We organize the vast and rapidly growing literature on AI for drug discovery into three relatively stable sub-areas: (a) representation learning over molecular sequences and geometric graphs; (b) data-driven reasoning where we predict molecular properties and their binding, optimize existing compounds, generate de novo molecules, and plan the synthesis of target molecules; and (c) knowledge-based reasoning where we discuss the construction of and reasoning over biomedical knowledge graphs. We also identify open challenges and chart possible research directions for the years to come.

4.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35788823

ABSTRACT

Predicting drug-target interaction is crucial for drug discovery as well as drug repurposing. Machine learning is commonly used for the drug-target affinity (DTA) problem. However, machine learning models face the cold-start problem, where model performance drops when predicting the interaction of a novel drug or target. Previous works try to solve the cold-start problem by learning the drug or target representation using unsupervised learning. While the drug or target representation can be learned in an unsupervised manner, it still lacks the interaction information, which is critical in drug-target interaction. To incorporate the interaction information into the drug and protein representation, we propose using transfer learning from the chemical-chemical interaction (CCI) and protein-protein interaction (PPI) tasks to the drug-target interaction task. The representation learned by the CCI and PPI tasks can be transferred smoothly to the DTA task due to the similar nature of the tasks. The results on the DTA datasets show that our proposed method has advantages compared to other pre-training methods in the DTA task.


Subjects
Drug Development , Machine Learning , Drug Discovery/methods , Drug Repositioning
5.
Article in English | MEDLINE | ID: mdl-33606633

ABSTRACT

BACKGROUND: Drug response prediction is an important problem in computational personalized medicine. Many machine-learning-based methods, especially deep learning-based ones, have been proposed for this task. However, these methods often represent the drugs as strings, which are not a natural way to depict molecules. Also, interpretation (e.g., which mutations or copy number aberrations contribute to the drug response) has not been considered thoroughly. METHODS: In this study, we propose a novel method, GraphDRP, based on graph convolutional networks for the problem. In GraphDRP, drugs were represented as molecular graphs directly capturing the bonds among atoms, while cell lines were depicted as binary vectors of genomic aberrations. Representative features of drugs and cell lines were learned by convolution layers, then combined to represent each drug-cell line pair. Finally, the response value of each drug-cell line pair was predicted by a fully-connected neural network. Four variants of graph convolutional networks were used for learning the features of drugs. RESULTS: We found that GraphDRP outperforms tCNNS in all performance measures for all experiments. Also, through saliency maps of the resulting GraphDRP models, we discovered the contribution of the genomic aberrations to the responses. CONCLUSION: Representing drugs as graphs can improve the performance of drug response prediction. Availability of data and materials: Data and source code can be downloaded at https://github.com/hauldhut/GraphDRP.


Subjects
Neural Networks, Computer , Pharmaceutical Preparations , Genomics , Machine Learning , Software
6.
Article in English | MEDLINE | ID: mdl-34197324

ABSTRACT

Predicting the interaction between a compound and a target is crucial for rapid drug repurposing. Deep learning has been successfully applied to the drug-target affinity (DTA) problem. However, previous deep learning-based methods do not model the direct interactions between drug and protein residues. This can lead to inaccurate learning of the target representation, which may change due to drug binding effects. In addition, previous DTA methods learn protein representation solely from the small number of protein sequences in DTA datasets, neglecting proteins outside of those datasets. We propose GEFA (Graph Early Fusion Affinity), a novel graph-in-graph neural network with an attention mechanism, to address the changes in target representation caused by binding effects. Specifically, a drug is modeled as a graph of atoms, which then serves as a node in a larger graph of the residues-drug complex. The resulting model is an expressive deep nested graph neural network. We also use pre-trained protein representation powered by the recent effort of learning contextualized protein representation. The experiments are conducted under different settings to evaluate scenarios such as novel drugs or targets. The results demonstrate the effectiveness of the pre-trained protein embedding and the advantages of GEFA in modeling the nested graph for drug-target interaction.


Subjects
Drug Development , Neural Networks, Computer , Amino Acid Sequence , Drug Development/methods , Drug Repositioning , Proteins/chemistry
7.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-34020545

ABSTRACT

MOTIVATION: Predicting cell locations is important since with the understanding of cell locations, we may estimate the function of cells and their integration with the spatial environment. Thus, the DREAM challenge on single-cell transcriptomics required participants to predict the locations of single cells in the Drosophila embryo using single-cell transcriptomic data. RESULTS: We have developed over 50 pipelines by combining different ways of preprocessing the RNA-seq data, selecting the genes, predicting the cell locations and validating predicted cell locations, resulting in the winning methods which were ranked second in sub-challenge 1, first in sub-challenge 2 and third in sub-challenge 3. In this paper, we present an R package, SCTCwhatateam, which includes all the methods we developed and the Shiny web application to facilitate the research on single-cell spatial reconstruction. All the data and the example use cases are available in the Supplementary data.


Subjects
Single-Cell Analysis/methods , Transcriptome , Algorithms , Animals , Computational Biology/methods , Drosophila/embryology , Sequence Analysis, RNA/methods
8.
PLoS One ; 16(5): e0251787, 2021.
Article in English | MEDLINE | ID: mdl-34010314

ABSTRACT

Data generated within social media platforms may present a new way to identify individuals who are experiencing mental illness. This study aimed to investigate the associations between linguistic features in individuals' blog data and their symptoms of depression, generalised anxiety, and suicidal ideation. Individuals who blogged were invited to participate in a longitudinal study in which they completed fortnightly symptom scales for depression and anxiety (PHQ-9, GAD-7) for a period of 36 weeks. Blog data published in the same period was also collected, and linguistic features were analysed using the LIWC tool. Bivariate and multivariate analyses were performed to investigate the correlations between the linguistic features and symptoms between subjects. Multivariate regression models were used to predict longitudinal changes in symptoms within subjects. A total of 153 participants consented to the study. The final sample consisted of the 38 participants who completed the required number of symptom scales and generated blog data during the study period. Between-subject analysis revealed that the linguistic features "tentativeness" and "non-fluencies" were significantly correlated with symptoms of depression and anxiety, but not suicidal thoughts. Within-subject analysis showed no robust correlations between linguistic features and changes in symptoms. The findings may provide evidence of a relationship between some linguistic features in social media data and mental health; however, the study was limited by missing data and other important considerations. The findings also suggest that linguistic features observed at the group level may not generalise to, or be useful for, detecting individual symptom change over time.


Subjects
Anxiety/psychology , Depression/psychology , Mental Health , Social Media , Suicidal Ideation , Adolescent , Adult , Aged , Female , Humans , Language , Longitudinal Studies , Male , Middle Aged , Patient Health Questionnaire , Risk Factors
9.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2841-2847, 2021.
Article in English | MEDLINE | ID: mdl-33909569

ABSTRACT

The classification of clinical samples based on gene expression data is an important part of precision medicine. In this manuscript, we show how transforming gene expression data into a set of personalized (sample-specific) networks can allow us to harness existing graph-based methods to improve classifier performance. Existing approaches to personalized gene networks have the limitation that they depend on other samples in the data and must get re-computed whenever a new sample is introduced. Here, we propose a novel method, called Personalized Annotation-based Networks (PAN), that avoids this limitation by using curated annotation databases to transform gene expression data into a graph. Unlike competing methods, PANs are calculated for each sample independent of the population, making it a more efficient way to obtain single-sample networks. Using three breast cancer datasets as a case study, we show that PAN classifiers not only predict cancer relapse better than gene features alone, but also outperform PPI (protein-protein interactions) and population-level graph-based classifiers. This work demonstrates the practical advantages of graph-based classification for high-dimensional genomic data, while offering a new approach to making sample-specific networks. Supplementary information: PAN and the baselines are implemented in Python. Source code and data are available at https://github.com/thinng/PAN.


Subjects
Breast Neoplasms , Genomics/methods , Molecular Sequence Annotation/methods , Neoplasm Recurrence, Local , Precision Medicine/methods , Algorithms , Breast Neoplasms/diagnosis , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Databases, Genetic , Female , Humans , Neoplasm Recurrence, Local/diagnosis , Neoplasm Recurrence, Local/genetics , Neoplasm Recurrence, Local/metabolism , Neoplasm Recurrence, Local/pathology , Protein Interaction Maps/genetics , Software , Transcriptome/genetics
10.
Bioinformatics ; 37(19): 3285-3292, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33904576

ABSTRACT

MOTIVATION: Unravelling cancer driver genes is important in cancer research. Although computational methods have been developed to identify cancer drivers, most of them detect cancer drivers at population level. However, two patients who have the same cancer type and receive the same treatment may have different outcomes because each patient has a different genome and their disease might be driven by different driver genes. Therefore new methods are being developed for discovering cancer drivers at individual level, but existing personalized methods only focus on coding drivers while microRNAs (miRNAs) have been shown to drive cancer progression as well. Thus, novel methods are required to discover both coding and miRNA cancer drivers at individual level. RESULTS: We propose the novel method, pDriver, to discover personalized cancer drivers. pDriver includes two stages: (i) constructing gene networks for each cancer patient and (ii) discovering cancer drivers for each patient based on the constructed gene networks. To demonstrate the effectiveness of pDriver, we have applied it to five TCGA cancer datasets and compared it with the state-of-the-art methods. The result indicates that pDriver is more effective than other methods. Furthermore, pDriver can also detect miRNA cancer drivers and most of them have been confirmed to be associated with cancer by literature. We further analyze the predicted personalized drivers for breast cancer patients and the result shows that they are significantly enriched in many GO processes and KEGG pathways involved in breast cancer. AVAILABILITY AND IMPLEMENTATION: pDriver is available at https://github.com/pvvhoang/pDriver. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

11.
Sci Rep ; 11(1): 3487, 2021 02 10.
Article in English | MEDLINE | ID: mdl-33568759

ABSTRACT

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly pathogenic virus that has caused the global COVID-19 pandemic. Tracing the evolution and transmission of the virus is crucial to respond to and control the pandemic through appropriate intervention strategies. This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. Prediction results suggest that mutation D614G in the virus spike protein, which has attracted much attention from researchers, is unlikely to make changes in protein secondary structure and relative solvent accessibility. Based on 6324 viral genome sequences, we create a spreadsheet dataset of point mutations that can facilitate the investigation of SARS-CoV-2 in many perspectives, especially in tracing the evolution and worldwide spread of the virus. Our analysis results also show that coding genes E, M, ORF6, ORF7a, ORF7b and ORF10 are most stable, potentially suitable to be targeted for vaccine and drug development.


Subjects
COVID-19/virology , Genome, Viral , Mutation , Protein Structure, Secondary , SARS-CoV-2/genetics , DNA, Viral , Genomics , Humans , SARS-CoV-2/metabolism , Spike Glycoprotein, Coronavirus/genetics , Spike Glycoprotein, Coronavirus/metabolism
12.
Bioinformatics ; 37(8): 1140-1147, 2021 05 23.
Article in English | MEDLINE | ID: mdl-33119053

ABSTRACT

SUMMARY: The development of new drugs is costly, time consuming and often accompanied by safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. In order to repurpose drugs effectively, it is useful to know which proteins are targeted by which drugs. Computational models that estimate the interaction strength of new drug-target pairs have the potential to expedite drug repurposing. Several models have been proposed for this task. However, these models represent the drugs as strings, which is not a natural way to represent molecules. We propose a new model called GraphDTA that represents drugs as graphs and uses graph neural networks to predict drug-target affinity. We show that graph neural networks not only predict drug-target affinity better than non-deep learning models, but also outperform competing deep learning methods. Our results confirm that deep learning models are appropriate for drug-target binding affinity prediction, and that representing drugs as graphs can lead to further improvements. AVAILABILITY AND IMPLEMENTATION: The proposed models are implemented in Python. Related data, pre-trained models and source code are publicly available at https://github.com/thinng/GraphDTA. All scripts and data needed to reproduce the post hoc statistical analysis are available from https://doi.org/10.5281/zenodo.3603523. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
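
As a rough illustration of what "representing drugs as graphs" means, the sketch below encodes the heavy atoms of ethanol (C-C-O) as nodes and bonds as edges, then applies one neighbour-averaging step. It is not the GraphDTA code: the two-element one-hot features and the unweighted mean aggregation are simplifying assumptions, whereas a real graph neural network layer would apply learned weights.

```python
def neighbor_mean_step(features, edges):
    """One message-passing step: each atom's new feature vector is the
    mean of its own vector and those of its bonded neighbours."""
    n = len(features)
    neighbors = {i: [i] for i in range(n)}  # include a self-loop
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in neighbors[i]) / len(neighbors[i])
         for d in range(dim)]
        for i in range(n)
    ]


# Ethanol heavy atoms as a graph: nodes are atoms, edges are bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
# Assumed toy features: one-hot over [carbon, oxygen].
one_hot = [[1.0, 0.0] if a == "C" else [0.0, 1.0] for a in atoms]

updated = neighbor_mean_step(one_hot, bonds)
for atom, vec in zip(atoms, updated):
    print(atom, vec)
```

After one step the middle carbon already carries information about the neighbouring oxygen, which a character-by-character string encoding does not expose as directly.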


Subjects
Neural Networks, Computer , Pharmaceutical Preparations , Drug Repositioning , Proteins , Software
13.
Multimed Tools Appl ; 80(5): 7187-7204, 2021.
Article in English | MEDLINE | ID: mdl-33132740

ABSTRACT

We propose in this work a graph-based approach for automatic public health analysis using social media. In our approach, graphs are created to model the interactions between features and between tweets in social media. We investigated different graph properties and methods in constructing graph-based representations for population health analysis. The proposed approach is applied in two case studies: (1) estimating health indices, and (2) classifying health situation of counties in the US. We evaluate our approach on a dataset including more than one billion tweets collected in three years 2014, 2015, and 2016, and the health surveys from the Behavioral Risk Factor Surveillance System. We conducted realistic and large-scale experiments on various textual features and graph-based representations. Experimental results verified the robustness of the proposed approach and its superiority over existing ones in both case studies, confirming the potential of graph-based approach for modeling interactions in social networks for population health analysis.

14.
Crit Care Resusc ; 2020 Sep 24.
Article in English | MEDLINE | ID: mdl-33105920

ABSTRACT

Using geotagged Twitter data in Victoria, we created a mobility index and studied the changes during the staged restrictions during the coronavirus disease 2019 (COVID-19) pandemic. We describe preliminary evidence that geotagged Twitter data may be used to provide real-time population mobility data and information on the impact of restrictions on such mobility.

15.
Life Sci Alliance ; 3(11)2020 11.
Article in English | MEDLINE | ID: mdl-32972997

ABSTRACT

Single-cell RNA-sequencing (scRNAseq) technologies are rapidly evolving. Although very informative, in standard scRNAseq experiments, the spatial organization of the cells in the tissue of origin is lost. Conversely, spatial RNA-seq technologies designed to maintain cell localization have limited throughput and gene coverage. Mapping scRNAseq to genes with spatial information increases coverage while providing spatial location. However, methods to perform such mapping have not yet been benchmarked. To fill this gap, we organized the DREAM Single-Cell Transcriptomics challenge focused on the spatial reconstruction of cells from the Drosophila embryo from scRNAseq data, leveraging as silver standard genes with in situ hybridization data from the Berkeley Drosophila Transcription Network Project reference atlas. The 34 participating teams used diverse algorithms for gene selection and location prediction, while being able to correctly localize clusters of cells. Selection of predictor genes was essential for this task. Predictor genes showed a relatively high expression entropy and high spatial clustering, and included prominent developmental genes such as gap and pair-rule genes and tissue markers. Application of the top 10 methods to a zebrafish embryo dataset yielded performance and statistical properties of the selected genes similar to those in the Drosophila data. This suggests that methods developed in this challenge are able to extract generalizable properties of genes that are useful to accurately reconstruct the spatial arrangement of cells in tissues.


Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Spatial Analysis , Algorithms , Animals , Databases, Genetic , Drosophila/genetics , Forecasting/methods , Gene Expression Regulation, Developmental/genetics , Gene Regulatory Networks/genetics , Sequence Analysis, RNA/methods , Transcriptome/genetics , Zebrafish/genetics
16.
Crit Care Resusc ; 22(4): 293-294, 2020 Dec.
Article in English | MEDLINE | ID: mdl-38046867

ABSTRACT

Using geotagged Twitter data in Victoria, we created a mobility index and studied the changes during the staged restrictions during the coronavirus disease 2019 (COVID-19) pandemic. We describe preliminary evidence that geotagged Twitter data may be used to provide real-time population mobility data and information on the impact of restrictions on such mobility.

17.
J Biomed Inform ; 99: 103277, 2019 11.
Article in English | MEDLINE | ID: mdl-31521858

ABSTRACT

Public health measurement is important for government administration as it provides indicators and implications for public healthcare strategies. The measurement of health status has traditionally been conducted via surveys in the form of pre-designed questionnaires to collect responses from targeted participants. Despite its benefits, the traditional approach is costly, time-consuming, and not scalable. These limitations pose a major obstacle for policy makers developing up-to-date healthcare programs. This paper studies the use of health-related information conveyed in user-generated content from social media for prediction of health outcomes at population level. Specifically, we investigate linguistic features for analysing textual data. We propose the use of visual features learnt from deep neural networks for understanding visual data. We introduce collective social capital information from location-based social media data. We conducted extensive experiments on large-scale datasets collected from two online social networks, Foursquare and Flickr, on the task of predicting the U.S. county health indices. Experimental results showed that visual and collective social capital data achieved comparable prediction performance and outperformed textual information. These promising results also suggest the potential of social media for health analysis at population scales.


Subjects
Health Status , Population Health/statistics & numerical data , Social Media , Data Visualization , Empirical Research , Humans , Neural Networks, Computer , Psycholinguistics , Public Health , Surveys and Questionnaires
18.
Mol Biol Rep ; 46(6): 5919-5930, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31410687

ABSTRACT

In the progression of cancer, cells acquire genetic mutations that cause uncontrolled growth. Over time, the primary tumour may undergo additional mutations that allow for the cancerous cells to spread throughout the body as metastases. Since metastatic development typically results in markedly worse patient outcomes, research into the identity and function of metastasis-associated biomarkers could eventually translate into clinical diagnostics or novel therapeutics. Although the general processes underpinning metastatic progression are understood, no clear cross-cancer biomarker profile has emerged. However, the literature suggests that some microRNAs (miRNAs) may play an important role in the metastatic progression of several cancer types. Using a subset of The Cancer Genome Atlas (TCGA) data, we performed an integrated analysis of mRNA and miRNA expression with paired metastatic and primary tumour samples to interrogate how the miRNA-mRNA regulatory axis influences metastatic progression. From this, we successfully built mRNA- and miRNA-specific classifiers that can discriminate pairs of metastatic and primary samples across 11 cancer types. In addition, we identified a number of miRNAs whose metastasis-associated dysregulation could predict mRNA metastasis-associated dysregulation. Among the most predictive miRNAs, we found several previously implicated in cancer progression, including miR-301b, miR-1296, and miR-423. Taken together, our results suggest that metastatic samples have a common cross-cancer signature when compared with their primary tumour pair, and that these miRNA biomarkers can be used to predict metastatic status as well as mRNA expression.


Subjects
Gene Expression Regulation, Neoplastic/genetics , Neoplasm Metastasis/genetics , Neoplasms/genetics , Biomarkers, Tumor/genetics , Databases, Genetic , Forecasting/methods , Gene Expression Profiling/methods , Humans , MicroRNAs/genetics , Oligonucleotide Array Sequence Analysis/methods , RNA, Messenger/genetics
19.
Front Genet ; 10: 599, 2019.
Article in English | MEDLINE | ID: mdl-31312210

ABSTRACT

Since the turn of the century, researchers have sought to diagnose cancer based on gene expression signatures measured from the blood or biopsy as biomarkers. This task, known as classification, is typically solved using a suite of algorithms that learn a mathematical rule capable of discriminating one group ("cases") from another ("controls"). However, discriminatory methods can only identify cancerous samples that resemble those that the algorithm already saw during training. As such, discriminatory methods may be ill-suited for the classification of cancer: because the possibility space of cancer is definitively large, the existence of a one-of-a-kind gene expression signature is likely. Instead, we propose using an established surveillance method that detects anomalous samples based on their deviation from a learned normal steady-state structure. By transferring this method to transcriptomic data, we can create an anomaly detector for tissue transcriptomes, a "tissue detector," that is capable of identifying cancer without ever seeing a single cancer example. As a proof-of-concept, we train a "tissue detector" on normal GTEx samples that can classify TCGA samples with >90% AUC for 3 out of 6 tissues. Importantly, we find that the classification accuracy is improved simply by adding more healthy samples. We conclude this report by emphasizing the conceptual advantages of anomaly detection and by highlighting future directions for this field of study.
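
The idea of flagging cancer "without ever seeing a single cancer example" can be sketched as one-class anomaly detection. The code below is a minimal illustration, not the surveillance method used in the paper: the per-feature z-score profile, the threshold of 3, and the toy expression values are all assumptions.

```python
from statistics import mean, pstdev


def fit_normal_profile(normal_samples):
    """Learn per-feature mean and standard deviation from normal samples only."""
    columns = list(zip(*normal_samples))
    return [(mean(c), pstdev(c)) for c in columns]


def anomaly_score(sample, profile):
    """Largest absolute z-score of the sample across all features."""
    # `s or 1.0` avoids division by zero for constant features.
    return max(abs(x - m) / (s or 1.0) for x, (m, s) in zip(sample, profile))


def is_anomalous(sample, profile, threshold=3.0):
    return anomaly_score(sample, profile) > threshold


# Toy "healthy tissue" expression values for two hypothetical genes.
healthy = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
profile = fit_normal_profile(healthy)

print(is_anomalous([2.0, 11.0], profile))  # sample sitting at the healthy mean
print(is_anomalous([9.0, 11.0], profile))  # first gene far outside the healthy range
```

Because the profile is estimated from healthy samples alone, adding more healthy samples refines the estimate of normal variation, which is consistent with the abstract's observation about training-set size.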

20.
Am J Med Genet B Neuropsychiatr Genet ; 180(7): 508-518, 2019 10.
Article in English | MEDLINE | ID: mdl-31025483

ABSTRACT

Although neuropsychiatric disorders have an established genetic background, their molecular foundations remain elusive. This has prompted many investigators to search for explanatory biomarkers that can predict clinical outcomes. One approach uses machine learning to classify patients based on blood mRNA expression. However, these endeavors typically fail to achieve the high level of performance, stability, and generalizability required for clinical translation. Moreover, these classifiers can lack interpretability because not all genes have relevance to researchers. For this study, we hypothesized that annotation-based classifiers can improve classification performance, stability, generalizability, and interpretability. To this end, we evaluated the models of four classification algorithms on six neuropsychiatric data sets using four annotation databases. Our results suggest that the Gene Ontology Biological Process database can transform gene expression into an annotation-based feature space that is accurate and stable. We also show how annotation features can improve the interpretability of classifiers: as annotations are used to assign biological importance to genes, the biological importance of an annotation-based feature is the feature itself. In evaluating the annotation features, we find that top-ranked annotations tend to contain top-ranked genes, suggesting that the most predictive annotations are a superset of the most predictive genes. Based on this, and the fact that annotations are used routinely to assign biological importance to genetic data, we recommend transforming gene-level expression into annotation-level expression prior to the classification of neuropsychiatric conditions.
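
The recommended transformation from gene-level to annotation-level expression can be sketched as follows. The abstract does not state how genes are aggregated within an annotation, so the mean used here is an assumption, and the gene names and annotation terms are hypothetical placeholders.

```python
def to_annotation_features(expression, annotations):
    """Collapse gene-level expression into one value per annotation term
    by averaging the measured genes that belong to that term."""
    features = {}
    for term, genes in annotations.items():
        values = [expression[g] for g in genes if g in expression]
        if values:  # skip terms with no measured genes
            features[term] = sum(values) / len(values)
    return features


# Hypothetical sample: expression per gene.
expression = {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 9.0}
# Hypothetical annotation database: term -> member genes.
annotations = {
    "term_1": ["GENE_A", "GENE_B"],
    "term_2": ["GENE_C", "GENE_X"],  # GENE_X is not measured in this sample
    "term_3": ["GENE_Y"],            # no measured genes, so the term is dropped
}

annotation_features = to_annotation_features(expression, annotations)
print(annotation_features)
```

Each sample is transformed using only its own expression values and the fixed annotation database, so the resulting features do not depend on other samples in the data set.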


Subjects
Mental Disorders/classification , Nervous System Diseases/classification , Neuropsychiatry/methods , Algorithms , Computational Biology/methods , Databases, Genetic , Gene Ontology , Humans