ABSTRACT
PURPOSE: Accurate lesion segmentation is important for measurements of lesion load and atrophy in subjects with multiple sclerosis (MS). International MS lesion challenges show a preference for convolutional neural network (CNN) strategies, such as nicMSlesions. However, since this software is trained on fairly homogeneous training data, we aimed to test the performance of nicMSlesions in an independent dataset with manual and other automatic lesion segmentations, to determine whether this method is suitable for larger, multi-center studies.

METHODS: Manual lesion segmentation was performed in fourteen subjects with MS on sagittal 3D FLAIR images from a 3 T GE whole-body scanner with an 8-channel head coil. We compared five categories of automated lesion segmentation methods for their volumetric and spatial agreement with manual segmentation: (i) unsupervised, untrained (LesionTOADS); (ii) supervised, untrained (LST-LPA and nicMSlesions with default settings); (iii) supervised, untrained with threshold adjustment (LST-LPA optimized for the current data); (iv) supervised, trained with leave-one-out cross-validation on fourteen subjects with MS (nicMSlesions and BIANCA); and (v) supervised, trained on a single subject with MS (nicMSlesions). Volumetric accuracy was determined by the intra-class correlation coefficient (ICC) and spatial accuracy by Dice's similarity index (SI). Volumes and SI were compared between methods using repeated-measures ANOVA or Friedman tests with post-hoc pairwise comparisons.

RESULTS: The best volumetric and spatial agreement with manual segmentation was obtained with the supervised, trained methods nicMSlesions and BIANCA (ICC absolute agreement > 0.968 and median SI > 0.643), and the worst with the unsupervised, untrained method LesionTOADS (ICC absolute agreement = 0.140 and median SI = 0.444). Agreement with manual segmentation after single-subject network training of nicMSlesions was poor for input with low lesion volumes (i.e., two subjects with lesion volumes ≤ 3.0 ml). For the other twelve subjects, ICC varied from 0.593 to 0.973 and median SI varied from 0.535 to 0.606. In all cases, the single-subject-trained nicMSlesions segmentations outperformed LesionTOADS, and in almost all cases they also outperformed LST-LPA.

CONCLUSION: Input from only one subject to re-train the deep-learning CNN nicMSlesions is sufficient for adequate lesion segmentation, with on average higher volumetric and spatial agreement with manual segmentation than obtained with the untrained methods LesionTOADS and LST-LPA.
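Both volumetric and spatial agreement metrics above are standard; as a concrete illustration, spatial agreement via Dice's similarity index (SI = 2·|A ∩ B| / (|A| + |B|)) can be sketched for two binary lesion masks as follows. This is a generic sketch on hypothetical toy masks, not the pipeline used in the study:

```python
import numpy as np

def dice_si(seg_a, seg_b):
    """Dice's similarity index between two binary masks:
    SI = 2 * |A intersect B| / (|A| + |B|)."""
    a = np.asarray(seg_a, dtype=bool)
    b = np.asarray(seg_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# toy 2D "lesion masks" (hypothetical data, for illustration only)
manual = np.zeros((4, 4), dtype=bool)
manual[1:3, 1:3] = True        # 4 voxels marked as lesion
auto = np.zeros((4, 4), dtype=bool)
auto[1:3, 1:4] = True          # 6 voxels, 4 of which overlap with manual
print(dice_si(manual, auto))   # 2*4 / (4+6) = 0.8
```

SI ranges from 0 (no overlap) to 1 (identical masks), which is why the trained methods' median SI > 0.643 indicates substantially better spatial agreement than LesionTOADS' 0.444.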
Subject(s)
Brain/diagnostic imaging, Deep Learning, Multiple Sclerosis, Relapsing-Remitting/diagnostic imaging, Supervised Machine Learning, Unsupervised Machine Learning, Adult, Female, Humans, Image Processing, Computer-Assisted, Magnetic Resonance Imaging, Male, Middle Aged, Neural Networks, Computer

ABSTRACT
INTRODUCTION: Atrophy of the spinal cord is known to occur in multiple sclerosis (MS). The mean upper cervical cord area (MUCCA) can be used to measure this atrophy. Currently, several (semi-)automated methods for MUCCA measurement exist, but validation in clinical magnetic resonance (MR) images is lacking.

METHODS: Five methods to measure MUCCA (SCT-PropSeg, SCT-DeepSeg, NeuroQLab, Xinapse JIM and ITK-SNAP) were investigated in a predefined upper cervical cord region. First, within-scanner reproducibility and between-scanner robustness were assessed using the intra-class correlation coefficient (ICC) and Dice's similarity index (SI) in scan-rescan 3D T1-weighted images (brain, including cervical spine, using a head coil) acquired on three 3 T MR machines (GE MR750, Philips Ingenuity, Toshiba Vantage Titan) in 21 subjects with MS and 6 healthy controls (dataset A). Second, sensitivity of MUCCA measurement to lesions in the upper cervical cord was assessed with cervical 3D T1-weighted images (3 T GE HDxT using a head-neck-spine coil) in 7 subjects with MS without and 14 subjects with MS with cervical lesions (dataset B), using ICC and SI with manual reference segmentations.

RESULTS: In dataset A, MUCCA differed between MR machines (p < 0.001) and methods (p < 0.001) used, but not between scan sessions. With respect to MUCCA values, Xinapse JIM showed the highest within-scanner reproducibility (ICC absolute agreement = 0.995), while Xinapse JIM and SCT-PropSeg showed the highest between-scanner robustness (ICC consistency = 0.981 and 0.976, respectively). Reproducibility of segmentations between scan sessions was highest for Xinapse JIM and SCT-PropSeg (median SI ≥ 0.921), with a significant main effect of method (p < 0.001), but not of MR machine or subject group. In dataset B, SI with manual outlines did not differ between patients with or without cervical lesions for any of the segmentation methods (p > 0.176). However, there was an effect of method for both volumetric and voxel-wise agreement of the segmentations (both p < 0.001). The highest volumetric and voxel-wise agreement was obtained with Xinapse JIM (ICC absolute agreement = 0.940 and median SI = 0.962).

CONCLUSION: Although MUCCA is highly reproducible within a scanner for each individual measurement method, MUCCA differs between scanners and between methods. Cervical cord lesions do not affect MUCCA measurement performance.
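By definition, MUCCA is the mean cross-sectional cord area over axial slices within a predefined upper cervical region. A minimal sketch of that computation from a binary cord segmentation, assuming a 3D mask ordered (slices, rows, cols) and known voxel dimensions (toy data, not any of the evaluated tools):

```python
import numpy as np

def mucca(cord_mask, voxel_size_mm, slice_range):
    """Mean upper cervical cord area (MUCCA), sketched as the mean
    cross-sectional cord area (mm^2) over the axial slices of a
    predefined upper cervical region.

    cord_mask     : 3D binary array, axes ordered (slice, row, col)
    voxel_size_mm : (dz, dy, dx) voxel dimensions in mm
    slice_range   : (first, last) slice indices of the region, inclusive
    """
    dz, dy, dx = voxel_size_mm
    first, last = slice_range
    in_plane_area = dy * dx  # mm^2 per in-plane voxel
    # area per slice = number of cord voxels in that slice * voxel area
    areas = cord_mask[first:last + 1].sum(axis=(1, 2)) * in_plane_area
    return float(areas.mean())

# toy segmentation: 10 slices, cord cross-section of 80 voxels per slice
mask = np.zeros((10, 64, 64), dtype=bool)
mask[:, 28:36, 28:38] = True                 # 8 x 10 = 80 voxels per slice
print(mucca(mask, (1.0, 1.0, 1.0), (2, 7)))  # 80 voxels * 1 mm^2 = 80.0
```

Because the per-slice area scales directly with the segmented voxel count, systematic over- or under-segmentation by a given method shifts MUCCA, which is consistent with the between-method differences reported above even when within-scanner reproducibility is high.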