ABSTRACT
Genomic selection (GS) is changing plant breeding by significantly reducing the resources needed for phenotyping. However, its accuracy can be compromised by mismatches between training and testing sets, which reduce efficiency when the predictive model does not adequately reflect the genetic and environmental conditions of the target population. To address this challenge, this study introduces a straightforward method that uses binary-Lasso regression to estimate β coefficients. In this approach, the response variable assigns 1 to testing set inputs and 0 to training set inputs. Subsequently, Lasso, Ridge, and Elastic Net regression models use the inverse of the absolute values of these β coefficients as weights during training (WLasso, WRidge, and WElastic Net). This weighting gives less importance to features that discriminate more between training and testing sets. The effectiveness of the method is evaluated across six datasets, demonstrating consistent improvements in terms of the normalized root mean square error. Importantly, the model can be implemented with the glmnet library, which supports straightforward weighting of β coefficients.
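As an illustration of the weighting idea, the following minimal Python sketch (not the authors' glmnet code) fits a Lasso to the 0/1 train-versus-test indicator and turns the absolute coefficients into feature weights. Several choices here are assumptions: the function names `lasso_cd`, `shift_weights` and `fit_weighted_lasso` are hypothetical, the weight is taken as 1/(1 + |β|) rather than 1/|β| so that coefficients shrunk to exactly zero do not cause division by zero, and the weight is applied by rescaling each feature column, which is one reading consistent with discriminating features receiving less importance.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent.

    Minimizes (1/2n) * ||y - b0 - X b||^2 + lam * ||b||_1.
    Returns (intercept, coefficients).
    """
    n, p = X.shape
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                      # center features
    resid = y - y.mean()                 # residual for beta = 0
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n   # per-feature mean square
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += Xc[:, j] * (beta[j] - new)
            beta[j] = new
    return y.mean() - x_mean @ beta, beta


def shift_weights(X_train, X_test, lam=0.01):
    """Weights from a 'binary Lasso': response is 0 for training rows, 1 for testing rows."""
    X = np.vstack([X_train, X_test])
    d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    _, beta = lasso_cd(X, d, lam)
    # Assumption: add 1 to |beta| so zero coefficients give weight 1, not infinity.
    return 1.0 / (1.0 + np.abs(beta))


def fit_weighted_lasso(X_train, y_train, weights, lam=0.05):
    """Fit a Lasso on rescaled features; return coefficients on the original scale."""
    b0, b_scaled = lasso_cd(X_train * weights, y_train, lam)
    return b0, b_scaled * weights
```

Shrinking a column by a factor w < 1 before applying a uniform L1 penalty is equivalent to penalizing that feature's original-scale coefficient by a factor 1/w, so features that separate the training and testing sets are shrunk harder.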
Subject(s)
Genomics , Models, Genetic , Plant Breeding , Genomics/methods , Plant Breeding/methods , Genome, Plant , Selection, Genetic , Phenotype , Regression Analysis
ABSTRACT
OBJECTIVE: Pre-eclampsia (PE) is a serious complication of pregnancy associated with maternal and fetal morbidity and mortality. As current prediction models have limitations and may not be applicable in resource-limited settings, we aimed to develop a machine-learning (ML) algorithm that offers a potential solution for developing accurate and efficient first-trimester prediction of PE. METHODS: We conducted a prospective cohort study in Mexico City, Mexico to develop a first-trimester prediction model for preterm PE (pPE) using ML. Maternal characteristics and locally derived multiples of the median (MoM) values for mean arterial pressure, uterine artery pulsatility index and serum placental growth factor were used for variable selection. The dataset was split into training, validation and test sets. An elastic-net method was employed for predictor selection, and model performance was evaluated using area under the receiver-operating-characteristics curve (AUC) and detection rates (DR) at 10% false-positive rates (FPR). RESULTS: The final analysis included 3050 pregnant women, of whom 124 (4.07%) developed PE. The ML model showed good performance, with AUCs of 0.897, 0.963 and 0.778 for pPE, early-onset PE (ePE) and any type of PE (all-PE), respectively. The DRs at 10% FPR were 76.5%, 88.2% and 50.1% for pPE, ePE and all-PE, respectively. CONCLUSIONS: Our ML model demonstrated high accuracy in predicting pPE and ePE using first-trimester maternal characteristics and locally derived MoM. The model may provide an efficient and accessible tool for early prediction of PE, facilitating timely intervention and improved maternal and fetal outcome. © 2023 The Authors. Ultrasound in Obstetrics & Gynecology published by John Wiley & Sons Ltd on behalf of International Society of Ultrasound in Obstetrics and Gynecology.
Performance of a machine-learning approach for the prediction of pre-eclampsia in a middle-income country. OBJECTIVE: Pre-eclampsia (PE) is a serious complication of pregnancy associated with maternal and fetal morbidity and mortality. Because current prediction models have limitations and may not be applicable in resource-limited settings, we set out to develop a machine-learning (ML) algorithm that offers a potential solution for accurate and efficient first-trimester prediction of PE. METHODS: A prospective cohort study was conducted in Mexico City to develop a first-trimester prediction model for preterm PE (pPE) using ML. Maternal characteristics and locally derived multiples of the median (MoM) for mean arterial pressure, uterine artery pulsatility index and serum placental growth factor were used for variable selection. The dataset was split into training, validation and test subsets. An elastic-net method was used for predictor selection, and model performance was evaluated using the area under the receiver-operating-characteristics curve (AUC) and detection rates (DR) at a 10% false-positive rate (FPR). RESULTS: The final analysis included 3050 pregnant women, of whom 124 (4.07%) developed PE. The ML model showed good performance, with AUCs of 0.897, 0.963 and 0.778 for pPE, early-onset PE (ePE) and any type of PE (all-PE), respectively. The DRs at 10% FPR were 76.5%, 88.2% and 50.1% for pPE, ePE and all-PE, respectively. CONCLUSIONS: Our ML model demonstrated high accuracy in predicting pPE and ePE using first-trimester maternal characteristics and locally calculated MoM. The model may provide an efficient and accessible tool for early prediction of PE, facilitating timely intervention and improved maternal and fetal outcomes.
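The detection rate at a fixed false-positive rate reported above can be computed from predicted risks as in the following minimal sketch. It is not the study's code: the function name `detection_rate_at_fpr` is hypothetical, and the rule for choosing the cut-off (the highest threshold that keeps the FPR at or below the target) is an assumption.

```python
def detection_rate_at_fpr(scores, labels, target_fpr=0.10):
    """Return (threshold, detection rate) with FPR not exceeding target_fpr.

    scores: predicted risks; labels: 1 for affected, 0 for unaffected.
    A pregnancy screens positive when its score is strictly above the threshold.
    """
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one affected and one unaffected case")
    # Number of unaffected cases allowed to screen positive.
    k = int(target_fpr * len(neg))
    # At most k unaffected scores lie strictly above the (k+1)-th highest one.
    threshold = neg[k]
    detected = sum(s > threshold for s in pos)
    return threshold, detected / len(pos)
```

With ten unaffected and four affected scores, for example, one unaffected case is allowed above the cut-off, and the detection rate is the fraction of affected cases that also exceed it.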
Subject(s)
Pre-Eclampsia , Infant, Newborn , Pregnancy , Female , Humans , Pre-Eclampsia/diagnosis , Placenta Growth Factor , Prospective Studies , Biomarkers , Pregnancy Trimester, First
ABSTRACT
Background and Objectives: We developed a predictive statistical model to identify donor-recipient characteristics related to kidney graft survival in the Chilean population. Given the large number of potential predictors relative to the sample size, we implemented an automated variable selection mechanism that could be revised in future studies as more national data are collected. Materials and Methods: A retrospective multicenter study was conducted to analyze data from 822 adult kidney transplant recipients from adult donors between 1998 and 2018. To the best of our knowledge, this is the largest kidney transplant database to date in Chile. A procedure based on a cross-validated regularized Cox regression using the Elastic Net penalty was applied to objectively identify predictors of death-censored graft failure. Hazard ratios were estimated by fitting a multivariable Cox regression with the selected predictors. Results: Seven variables were associated with the risk of death-censored graft failure; four from the donor: age (HR = 1.02, 95% CI: 1.00-1.03), male sex (HR = 0.64, 95% CI: 0.46-0.90), history of hypertension (HR = 1.49, 95% CI: 0.98-2.28), and history of diabetes (HR = 2.04, 95% CI: 0.97-4.29); two from the recipient: log-transformed years on dialysis (HR = 1.29, 95% CI: 0.99-1.67) and history of previous solid organ transplantation (HR = 2.02, 95% CI: 1.18-3.47); and one from the transplant: number of HLA mismatches (HR = 1.13, 95% CI: 0.99-1.28). Only the last of these is considered for patient prioritization in deceased kidney allocation in Chile. Conclusions: A risk model for kidney graft failure was developed and trained for the Chilean population, providing objective criteria which can be used to improve efficiency in deceased kidney allocation.
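For readers unfamiliar with the method, a Cox regression with an Elastic Net penalty can be written as the penalized partial likelihood below. This is the generic textbook form, not a formula taken from the study. The symbols (λ for the penalty strength, α for the Lasso/Ridge mix, δ_i for the event indicator, R(t_i) for the risk set at time t_i) and the 1/n scaling are notational assumptions, and software conventions for the scaling differ. The second line is a worked reading of the reported HLA hazard ratio, assuming it applies per additional mismatch.

```latex
\hat{\beta} = \arg\max_{\beta}\;
  \frac{1}{n}\sum_{i:\,\delta_i = 1}
    \Bigl[ x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\bigl(x_j^{\top}\beta\bigr) \Bigr]
  \;-\; \lambda \Bigl[ \alpha \lVert \beta \rVert_1
        + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Bigr]

\mathrm{HR}_k = \exp(\beta_k), \qquad
\beta_{\mathrm{HLA}} = \ln(1.13) \approx 0.122, \qquad
\mathrm{HR}_{3\ \mathrm{mismatches}} = 1.13^{3} \approx 1.44
```

The penalty only decides which predictors stay in the model. The hazard ratios quoted in the abstract come from the subsequent Cox regression fitted with the selected predictors.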
Subject(s)
Graft Survival , Kidney Transplantation , Adult , Male , Humans , Chile/epidemiology , Renal Dialysis , Kidney Transplantation/methods , Kidney , Retrospective Studies , Graft Rejection , Risk Factors
ABSTRACT
BACKGROUND: Patients with obsessive-compulsive disorder (OCD) are at increased risk for suicide attempt (SA) compared to the general population. However, the significant risk factors for SA in this population remain unclear, in particular whether they are associated with the disorder itself or with extrinsic factors such as comorbidities and sociodemographic variables. This study aimed to identify predictors of SA in OCD patients using a machine learning algorithm. METHODS: A total of 959 outpatients with OCD were included. An elastic net model was fitted to identify the predictors of SA among OCD patients, using clinical and sociodemographic variables. RESULTS: The prevalence of SA in our sample was 10.8%. The relevant predictors of SA found by the elastic net algorithm were previous suicide planning, previous suicide thoughts, lifetime depressive episode, and intermittent explosive disorder. Our elastic net model performed well, with an area under the curve of 0.95. CONCLUSIONS: This is the first study to evaluate risk factors for SA among OCD patients using machine learning algorithms. Our results demonstrate that an accurate risk algorithm can be created using clinical and sociodemographic variables. All aspects of suicidal phenomena need to be carefully investigated by clinicians in every evaluation of OCD patients. Particular attention should be given to comorbidity with depressive symptoms.
Subject(s)
Obsessive-Compulsive Disorder , Suicide, Attempted , Comorbidity , Humans , Machine Learning , Obsessive-Compulsive Disorder/diagnosis , Obsessive-Compulsive Disorder/epidemiology , Prevalence , Suicidal Ideation
ABSTRACT
Breast cancer is a disease that exhibits heterogeneity from the genomic to the clinical level. This heterogeneity is thought to be captured (at least partially) by the so-called breast cancer molecular subtypes. These molecular subtypes were initially defined based on the unsupervised clustering of gene expression and its correlation with histological, morphological, phenotypic and clinical features already known. Later, a 50-gene signature, PAM50, was defined in order to identify the biological subtype of a given sample within the clinical setting. The PAM50 signature was obtained by the use of unsupervised statistical methods, and therefore no limitation was set on the biological relevance (or lack thereof) of the selected genes beyond their predictive capacity. An open question is which regulatory elements drive the various expression behaviors of this set of genes in the different molecular subtypes. This question becomes more relevant as the measurement of more biological layers of regulation becomes accessible. In this work, we analyzed the gene expression regulation of the 50 genes in the PAM50 signature, in terms of (a) gene co-expression, (b) transcription factors, (c) micro-RNAs, and (d) methylation. Using data from The Cancer Genome Atlas (TCGA) for the Luminal A and B, Basal, and HER2-enriched molecular subtypes as well as normal tumor-adjacent tissue, we identified predictors for gene expression through the use of an elastic net model. We compare and contrast the sets of identified regulators for the gene signature in each molecular subtype, and systematically compare them to the current literature. We also identified a unique set of predictors for the expression of genes in the PAM50 signature associated with each of the molecular subtypes. Most selected predictors are exclusive to a single PAM50 gene, and predictors are not shared across subtypes. There are only 13 coding transcripts and 2 miRNAs selected for the four subtypes. MiR-21 and miR-10b connect almost all the PAM50 genes in all the subtypes and normal tissue, but do so in an exclusive manner, suggesting a cancer switch from miR-10b coordination in normal tissue to miR-21. The PAM50 gene sets whose selected predictors are enriched for a function across subtypes support the view that different regulatory molecular mechanisms are taking place. With this study we aim at a wider understanding of the regulatory mechanisms that differentiate the expression of the PAM50 signature, which in turn could help explain the molecular basis of the differences between the molecular subtypes.
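The comparison of selected predictors across subtypes can be illustrated with simple set operations. The sketch below is not the study's pipeline: the function name `shared_and_exclusive` and the input layout (a mapping from subtype to the set of predictors selected for it) are hypothetical, and the predictor names in the usage comment are placeholders rather than results.

```python
def shared_and_exclusive(selected):
    """Split predictors into those shared by every subtype and those unique to one.

    selected: dict mapping a subtype name to a set of predictor names,
    e.g. the non-zero coefficients of an elastic net fitted per subtype.
    Returns (shared, exclusive), where exclusive maps subtype -> set.
    """
    subtypes = list(selected)
    if not subtypes:
        return set(), {}
    # Predictors selected in every subtype.
    shared = set.intersection(*(set(selected[s]) for s in subtypes))
    exclusive = {}
    for s in subtypes:
        # Everything selected in any other subtype.
        others = set().union(*(set(selected[t]) for t in subtypes if t != s))
        exclusive[s] = set(selected[s]) - others
    return shared, exclusive


# Usage with placeholder names:
# shared, exclusive = shared_and_exclusive(
#     {"LumA": {"p1", "p2", "p3"}, "Basal": {"p1", "p4"}, "HER2": {"p1", "p2"}}
# )
```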
ABSTRACT
In this paper, we present a novel methodology to solve the classification problem, based on sparse (data-driven) regressions combined with techniques for ensuring stability, which is especially useful for high-dimensional datasets with small sample sizes. The sensitivity and specificity of the classifiers are assessed by a stable ROC procedure, which uses a non-parametric algorithm for estimating the area under the ROC curve. This method allows the performance of the classification to be assessed by the ROC technique when more than two groups are involved in the classification problem, i.e., when the gold standard is not binary. We apply this methodology to EEG spectral signatures to find biomarkers that allow discriminating between (and predicting membership of) different subgroups of children diagnosed with Learning Disabilities Not Otherwise Specified (LD-NOS). Children with LD-NOS have notable learning difficulties that affect their education but cannot be placed in a specific category such as reading (Dyslexia), Mathematics (Dyscalculia), or Writing (Dysgraphia). By using the EEG spectra, we aim to identify EEG patterns that may be related to specific learning disabilities in an individual case. This could be useful for developing subject-based methods of therapy, based on information provided by the EEG. Here we study 85 LD-NOS children, divided into three subgroups previously selected by a clustering technique applied to the scores of cognitive tests. The classification equation produced stable marginal areas under the ROC of 0.71 for discrimination between Group 1 vs. Group 2; 0.91 for Group 1 vs. Group 3; and 0.75 for Group 2 vs. Group 1. A discussion of the EEG characteristics of each group related to the cognitive scores is also presented.
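A non-parametric area under the ROC curve for each pair of groups can be sketched with the Mann-Whitney statistic, as below. This is an illustrative assumption about what a pairwise, non-parametric AUC looks like, not the stable ROC procedure of the paper, which the abstract does not specify in detail. The function names `auc_mann_whitney` and `pairwise_aucs` are hypothetical.

```python
from itertools import combinations


def auc_mann_whitney(scores_a, scores_b):
    """Non-parametric AUC: P(score_b > score_a) + 0.5 * P(score_b == score_a)."""
    if not scores_a or not scores_b:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))


def pairwise_aucs(scores, groups):
    """AUC for every pair of groups, keyed by (lower label, higher label)."""
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    return {
        (g1, g2): auc_mann_whitney(by_group[g1], by_group[g2])
        for g1, g2 in combinations(sorted(by_group), 2)
    }
```

With three groups this yields three pairwise areas, each measuring how well the classifier score separates one pair of groups.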