Who's your data? Primary immune deficiency differential diagnosis prediction via machine learning and data mining of the USIDNET registry.

Méndez Barrera, Jose Alfredo; Rocha Guzmán, Samuel; Hierro Cascajares, Elisa; Garabedian, Elizabeth K; Fuleihan, Ramsay L; Sullivan, Kathleen E; Lugo Reyes, Saul O

Affiliation

Méndez Barrera JA; Data Science Department, Autonomous Technological Institute of Mexico, Mexico City, Mexico.
Rocha Guzmán S; Data Science Department, Autonomous Technological Institute of Mexico, Mexico City, Mexico.
Hierro Cascajares E; Immune deficiencies Lab, National Institute of Pediatrics, Secretariat of Health, Mexico City, Mexico.
Garabedian EK; National Institutes of Health, Bethesda, MD, USA.
Fuleihan RL; Division of Pediatric Allergy, Immunology and Rheumatology at Columbia University, New York City, NY, USA.
Sullivan KE; Children's Hospital of Philadelphia, PA, USA.
Lugo Reyes SO; Immune deficiencies Lab, National Institute of Pediatrics, Secretariat of Health, Mexico City, Mexico. Electronic address: dr.lugo.reyes@gmail.com.

Clin Immunol; 255: 109759, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37678719

ABSTRACT

PURPOSE: There are currently more than 480 primary immune deficiency (PID) diseases and about 7000 rare diseases, which together afflict around 1 in every 17 humans. Computational aids based on data mining and machine learning might facilitate diagnosis by extracting rules from large datasets and making predictions when faced with new cases. In a proof-of-concept data mining study, we aimed to predict PID diagnoses using a supervised machine learning algorithm based on classification tree boosting.

METHODS: Through a data query at the USIDNET registry, we obtained a database of 2396 patients with common diagnoses of PID, including their clinical and laboratory features. We kept 286 features and all 12 diagnoses for inclusion in the model. We used the XGBoost package with parallel tree boosting for the supervised classification model, and SHAP for variable importance interpretation, in Python v3.7. The patient database was split into training and testing subsets; after boosting through gradient descent, the predictive model provided measures of diagnostic prediction accuracy and individual feature importance. After a baseline performance test, we used the class weighting hyperparameter (scale_pos_weight) to correct for class imbalance.

RESULTS: The twelve PID diagnoses were CVID (1098 patients), DiGeorge syndrome, chronic granulomatous disease, congenital agammaglobulinemia, PID not otherwise classified, specific antibody deficiency, complement deficiency, hyper-IgM, leukocyte adhesion deficiency, ectodermal dysplasia with immune deficiency, severe combined immune deficiency, and Wiskott-Aldrich syndrome. For CVID, the model achieved an accuracy of 0.80 on the training sample, with an area under the ROC curve (AUC) of 0.80 and a Gini coefficient of 0.60. In the test subset, accuracy was 0.76, AUC 0.75, and Gini 0.51. Positive feature importance for predicting CVID was highest for upper respiratory infections, asthma, autoimmunity, and hypogammaglobulinemia. Features with the highest negative predictive value were high IgE, growth delay, abscess, lymphopenia, and congenital heart disease. For the rest of the diagnoses, accuracy stayed between 0.75 and 0.99, AUC 0.46-0.87, Gini 0.07-0.75, and LogLoss 0.09-8.55.

DISCUSSION: Clinicians should remember to consider the negative predictive features together with the positive ones. We consider this a proof-of-concept study on which to base further exploration. The good performance is encouraging, and feature importance might aid feature selection in future work. In the meantime, we can learn from the rules derived by the model and build a user-friendly decision tree to generate differential diagnoses.
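As a rough illustration of two quantities mentioned in the abstract, the sketch below computes a class weight for the CVID class and converts an AUC into a Gini coefficient. It rests on assumptions that the abstract does not state: that each diagnosis is modelled as a binary one-vs-rest problem, that the weight follows the negative-to-positive ratio heuristic suggested in the XGBoost documentation, and that Gini is defined as 2 × AUC − 1. The function names are hypothetical, and this is not the authors' code.

```python
def gini_from_auc(auc: float) -> float:
    """Gini coefficient under the common convention Gini = 2 * AUC - 1 (assumed)."""
    return 2.0 * auc - 1.0


def scale_pos_weight(n_positive: int, n_total: int) -> float:
    """Negative-to-positive ratio, a common heuristic for XGBoost's scale_pos_weight."""
    if n_positive <= 0 or n_positive >= n_total:
        raise ValueError("need at least one positive and one negative case")
    return (n_total - n_positive) / n_positive


# Counts taken from the abstract: 1098 CVID patients out of 2396 in total.
# Treating CVID vs. all other diagnoses as a binary task is an assumption.
cvid_weight = scale_pos_weight(n_positive=1098, n_total=2396)

# AUC values reported for CVID on the training and test subsets.
train_gini = gini_from_auc(0.80)
test_gini = gini_from_auc(0.75)

print(f"scale_pos_weight for CVID vs. rest: {cvid_weight:.3f}")
print(f"Gini from train AUC 0.80: {train_gini:.2f}")
print(f"Gini from test AUC 0.75: {test_gini:.2f}")
```

Under these assumptions, the training figures agree with the reported Gini of 0.60. The reported test Gini of 0.51 would correspond to an unrounded AUC slightly above 0.75. A weight computed this way could be passed as the `scale_pos_weight` argument of `xgboost.XGBClassifier`; rarer diagnoses would yield much larger weights than CVID.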

Subjects

Primary Immunodeficiency Diseases; Wiskott-Aldrich Syndrome; Humans; Diagnosis, Differential; Machine Learning; Data Mining

Keywords

Classification; Data mining; Diagnosis prediction; Extreme gradient boosting; Inborn errors of immunity; Machine learning; Primary immune deficiencies; Rare diseases; Registry

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Wiskott-Aldrich Syndrome / Primary Immunodeficiency Diseases | Type of study: Diagnostic_studies / Prognostic_studies / Risk_factors_studies | Limits: Humans | Language: En | Journal: Clin Immunol | Journal subject: ALLERGY AND IMMUNOLOGY | Year of publication: 2023 | Document type: Article | Country of affiliation: Mexico | Country of publication: United States
