Search | VHL Regional Portal

1.

Uncertainty-aware multiple-instance learning for reliable classification: Application to optical coherence tomography.

de Vente, Coen; van Ginneken, Bram; Hoyng, Carel B; Klaver, Caroline C W; Sánchez, Clara I.

Med Image Anal ; 97: 103259, 2024 Jun 27.

Article in English | MEDLINE | ID: mdl-38959721

ABSTRACT

Deep learning classification models for medical image analysis often perform well on data from scanners that were used to acquire the training data. However, when these models are applied to data from different vendors, their performance tends to drop substantially. Artifacts that only occur within scans from specific scanners are major causes of this poor generalizability. We aimed to enhance the reliability of deep learning classification models using a novel method called Uncertainty-Based Instance eXclusion (UBIX). UBIX is an inference-time module that can be employed in multiple-instance learning (MIL) settings. MIL is a paradigm in which instances (generally crops or slices) of a bag (generally an image) contribute towards a bag-level output. Instead of assuming equal contribution of all instances to the bag-level output, UBIX detects instances corrupted due to local artifacts on-the-fly using uncertainty estimation, reducing or fully ignoring their contributions before MIL pooling. In our experiments, instances are 2D slices and bags are volumetric images, but alternative definitions are also possible. Although UBIX is generally applicable to diverse classification tasks, we focused on the staging of age-related macular degeneration in optical coherence tomography. Our models were trained on data from a single scanner and tested on external datasets from different vendors, which included vendor-specific artifacts. UBIX showed reliable behavior, with a slight decrease in performance (a decrease of the quadratic weighted kappa (κw) from 0.861 to 0.708), when applied to images from different vendors containing artifacts; while a state-of-the-art 3D neural network without UBIX suffered from a significant detriment of performance (κw from 0.852 to 0.084) on the same test set. We showed that instances with unseen artifacts can be identified with OOD detection. UBIX can reduce their contribution to the bag-level predictions, improving reliability without retraining on new data. This potentially increases the applicability of artificial intelligence models to data from other scanners than the ones for which they were developed. The source code for UBIX, including trained model weights, is publicly available through https://github.com/qurAI-amsterdam/ubix-for-reliable-classification.

2.

Artificial intelligence and radiologists in prostate cancer detection on MRI (PI-CAI): an international, paired, non-inferiority, confirmatory study.

Saha, Anindo; Bosma, Joeran S; Twilt, Jasper J; van Ginneken, Bram; Bjartell, Anders; Padhani, Anwar R; Bonekamp, David; Villeirs, Geert; Salomon, Georg; Giannarini, Gianluca; Kalpathy-Cramer, Jayashree; Barentsz, Jelle; Maier-Hein, Klaus H; Rusu, Mirabela; Rouvière, Olivier; van den Bergh, Roderick; Panebianco, Valeria; Kasivisvanathan, Veeru; Obuchowski, Nancy A; Yakar, Derya; Elschot, Mattijs; Veltman, Jeroen; Fütterer, Jurgen J; de Rooij, Maarten; Huisman, Henkjan.

Lancet Oncol ; 25(7): 879-887, 2024 Jul.

Article in English | MEDLINE | ID: mdl-38876123

ABSTRACT

BACKGROUND: Artificial intelligence (AI) systems can potentially aid the diagnostic pathway of prostate cancer by alleviating the increasing workload, preventing overdiagnosis, and reducing the dependence on experienced radiologists. We aimed to investigate the performance of AI systems at detecting clinically significant prostate cancer on MRI in comparison with radiologists using the Prostate Imaging-Reporting and Data System version 2.1 (PI-RADS 2.1) and the standard of care in multidisciplinary routine practice at scale. METHODS: In this international, paired, non-inferiority, confirmatory study, we trained and externally validated an AI system (developed within an international consortium) for detecting Gleason grade group 2 or greater cancers using a retrospective cohort of 10 207 MRI examinations from 9129 patients. Of these examinations, 9207 cases from three centres (11 sites) based in the Netherlands were used for training and tuning, and 1000 cases from four centres (12 sites) based in the Netherlands and Norway were used for testing. In parallel, we facilitated a multireader, multicase observer study with 62 radiologists (45 centres in 20 countries; median 7 [IQR 5-10] years of experience in reading prostate MRI) using PI-RADS (2.1) on 400 paired MRI examinations from the testing cohort. Primary endpoints were the sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC) of the AI system in comparison with that of all readers using PI-RADS (2.1) and in comparison with that of the historical radiology readings made during multidisciplinary routine practice (ie, the standard of care with the aid of patient history and peer consultation). Histopathology and at least 3 years (median 5 [IQR 4-6] years) of follow-up were used to establish the reference standard. The statistical analysis plan was prespecified with a primary hypothesis of non-inferiority (considering a margin of 0·05) and a secondary hypothesis of superiority towards the AI system, if non-inferiority was confirmed. This study was registered at ClinicalTrials.gov, NCT05489341. FINDINGS: Of the 10 207 examinations included from Jan 1, 2012, through Dec 31, 2021, 2440 cases had histologically confirmed Gleason grade group 2 or greater prostate cancer. In the subset of 400 testing cases in which the AI system was compared with the radiologists participating in the reader study, the AI system showed a statistically superior and non-inferior AUROC of 0·91 (95% CI 0·87-0·94; p<0·0001), in comparison to the pool of 62 radiologists with an AUROC of 0·86 (0·83-0·89), with a lower boundary of the two-sided 95% Wald CI for the difference in AUROC of 0·02. At the mean PI-RADS 3 or greater operating point of all readers, the AI system detected 6·8% more cases with Gleason grade group 2 or greater cancers at the same specificity (57·7%, 95% CI 51·6-63·3), or 50·4% fewer false-positive results and 20·0% fewer cases with Gleason grade group 1 cancers at the same sensitivity (89·4%, 95% CI 85·3-92·9). In all 1000 testing cases where the AI system was compared with the radiology readings made during multidisciplinary practice, non-inferiority was not confirmed, as the AI system showed lower specificity (68·9% [95% CI 65·3-72·4] vs 69·0% [65·5-72·5]) at the same sensitivity (96·1%, 94·0-98·2) as the PI-RADS 3 or greater operating point. The lower boundary of the two-sided 95% Wald CI for the difference in specificity (-0·04) was greater than the non-inferiority margin (-0·05) and a p value below the significance threshold was reached (p<0·001). INTERPRETATION: An AI system was superior to radiologists using PI-RADS (2.1), on average, at detecting clinically significant prostate cancer and comparable to the standard of care. Such a system shows the potential to be a supportive tool within a primary diagnostic setting, with several associated benefits for patients and radiologists. Prospective validation is needed to test clinical applicability of this system. FUNDING: Health~Holland and EU Horizon 2020.

Subject(s)

Artificial Intelligence , Magnetic Resonance Imaging , Prostatic Neoplasms , Radiologists , Humans , Male , Prostatic Neoplasms/diagnostic imaging , Prostatic Neoplasms/pathology , Aged , Retrospective Studies , Middle Aged , Neoplasm Grading , Netherlands , ROC Curve

3.

The STOIC2021 COVID-19 AI challenge: Applying reusable training methodologies to private data.

Boulogne, Luuk H; Lorenz, Julian; Kienzle, Daniel; Schön, Robin; Ludwig, Katja; Lienhart, Rainer; Jégou, Simon; Li, Guang; Chen, Cong; Wang, Qi; Shi, Derik; Maniparambil, Mayug; Müller, Dominik; Mertes, Silvan; Schröter, Niklas; Hellmann, Fabio; Elia, Miriam; Dirks, Ine; Bossa, Matías Nicolás; Berenguer, Abel Díaz; Mukherjee, Tanmoy; Vandemeulebroucke, Jef; Sahli, Hichem; Deligiannis, Nikos; Gonidakis, Panagiotis; Huynh, Ngoc Dung; Razzak, Imran; Bouadjenek, Reda; Verdicchio, Mario; Borrelli, Pasquale; Aiello, Marco; Meakin, James A; Lemm, Alexander; Russ, Christoph; Ionasec, Razvan; Paragios, Nikos; van Ginneken, Bram; Revel-Dubois, Marie-Pierre.

Med Image Anal ; 97: 103230, 2024 Jun 05.

Article in English | MEDLINE | ID: mdl-38875741

ABSTRACT

Challenges drive the state-of-the-art of automated medical image analysis. The quantity of public training data that they provide can limit the performance of their solutions. Public access to the training methodology for these solutions remains absent. This study implements the Type Three (T3) challenge format, which allows for training solutions on private data and guarantees reusable training methodologies. With T3, challenge organizers train a codebase provided by the participants on sequestered training data. T3 was implemented in the STOIC2021 challenge, with the goal of predicting from a computed tomography (CT) scan whether subjects had a severe COVID-19 infection, defined as intubation or death within one month. STOIC2021 consisted of a Qualification phase, where participants developed challenge solutions using 2000 publicly available CT scans, and a Final phase, where participants submitted their training methodologies with which solutions were trained on CT scans of 9724 subjects. The organizers successfully trained six of the eight Final phase submissions. The submitted codebases for training and running inference were released publicly. The winning solution obtained an area under the receiver operating characteristic curve for discerning between severe and non-severe COVID-19 of 0.815. The Final phase solutions of all finalists improved upon their Qualification phase solutions.

4.

Machine Learning-Based Prediction of Hemoglobinopathies Using Complete Blood Count Data.

Schipper, Anoeska; Rutten, Matthieu; van Gammeren, Adriaan; Harteveld, Cornelis L; Urrechaga, Eloísa; Weerkamp, Floor; den Besten, Gijs; Krabbe, Johannes; Slomp, Jennichjen; Schoonen, Lise; Broeren, Maarten; van Wijnen, Merel; Huijskens, Mirelle J A J; Koopmann, Tamara; van Ginneken, Bram; Kusters, Ron; Kurstjens, Steef.

Clin Chem ; 2024 Jun 22.

Article in English | MEDLINE | ID: mdl-38906831

ABSTRACT

BACKGROUND: Hemoglobinopathies, the most common inherited blood disorder, are frequently underdiagnosed. Early identification of carriers is important for genetic counseling of couples at risk. The aim of this study was to develop and validate a novel machine learning model on a multicenter data set, covering a wide spectrum of hemoglobinopathies based on routine complete blood count (CBC) testing. METHODS: Hemoglobinopathy test results from 10 322 adults were extracted retrospectively from 8 Dutch laboratories. eXtreme Gradient Boosting (XGB) and logistic regression models were developed to differentiate negative from positive hemoglobinopathy cases, using 7 routine CBC parameters. External validation was conducted on a data set from an independent Dutch laboratory, with an additional external validation on a Spanish data set (n = 2629) specifically for differentiating thalassemia from iron deficiency anemia (IDA). RESULTS: The XGB and logistic regression models achieved an area under the receiver operating characteristic (AUROC) of 0.88 and 0.84, respectively, in distinguishing negative from positive hemoglobinopathy cases in the independent external validation set. Subclass analysis showed that the XGB model reached an AUROC of 0.97 for ß-thalassemia, 0.98 for α0-thalassemia, 0.95 for homozygous α+-thalassemia, 0.78 for heterozygous α+-thalassemia, and 0.94 for the structural hemoglobin variants Hemoglobin C, Hemoglobin D, Hemoglobin E. Both models attained AUROCs of 0.95 in differentiating IDA from thalassemia. CONCLUSIONS: Both the XGB and logistic regression model demonstrate high accuracy in predicting a broad range of hemoglobinopathies and are effective in differentiating hemoglobinopathies from IDA. Integration of these models into the laboratory information system facilitates automated hemoglobinopathy detection using routine CBC parameters.

5.

Integrating molecular and radiological screening tools during community-based active case-finding for tuberculosis and COVID-19 in southern Africa.

Scott, Alex John; Limbada, Mohammed; Perumal, Tahlia; Jaumdally, Shameem; Kotze, Andrea; van der Merwe, Charnay; Cheeba, Maina; Milimo, Deborah; Murphy, Keelin; van Ginneken, Bram; de Kock, Mariana; Warren, Robin Mark; Gina, Phindile; Swanepoel, Jeremi; Kühn, Louié; Oelofse, Suzette; Pooran, Anil; Esmail, Aliasgar; Ayles, Helen; Dheda, Keertan.

Int J Infect Dis ; 145: 107081, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38701914

ABSTRACT

OBJECTIVES: To evaluate diagnostic yield and feasibility of integrating testing for TB and COVID-19 using molecular and radiological screening tools during community-based active case-finding (ACF). METHODS: Community-based participants with presumed TB and/or COVID-19 were recruited using a mobile clinic. Participants underwent simultaneous point-of-care (POC) testing for TB (sputum; Xpert Ultra) and COVID-19 (nasopharyngeal swabs; Xpert SARS-CoV-2). Sputum culture and SARS-CoV-2 RT-PCR served as reference standards. Participants underwent ultra-portable POC chest radiography with computer-aided detection (CAD). TB infectiousness was evaluated using smear microscopy, cough aerosol sampling studies (CASS), and chest radiographic cavity detection. Feasibility of POC testing was evaluated via user-appraisals. RESULTS: Six hundred and one participants were enrolled, with 144/601 (24.0%) reporting symptoms suggestive of TB and/or COVID-19. 16/144 (11.1%) participants tested positive for TB, while 10/144 (6.9%) tested positive for COVID-19 (2/144 [1.4%] had concurrent TB/COVID-19). Seven (7/16 [43.8%]) individuals with TB were probably infectious. Test-specific sensitivity and specificity (95% CI) were: Xpert Ultra 75.0% (42.8-94.5) and 96.9% (92.4-99.2); Xpert SARS-CoV-2 66.7% (22.3-95.7) and 97.1% (92.7-99.2). Area under the curve (AUC) for CAD4TB was 0.90 (0.82-0.97). User appraisals indicated POC Xpert to have 'good' user-friendliness. CONCLUSIONS: Integrating TB/COVID-19 screening during community-based ACF using POC molecular and radiological tools is feasible, has a high diagnostic yield, and can identity probably infectious persons.

Subject(s)

COVID-19 , SARS-CoV-2 , Humans , COVID-19/diagnosis , COVID-19/epidemiology , Male , Female , Adult , Middle Aged , Mass Screening/methods , Point-of-Care Testing , Sputum/microbiology , Sputum/virology , Tuberculosis/diagnosis , Tuberculosis/epidemiology , Tuberculosis/diagnostic imaging , Africa, Southern/epidemiology , Sensitivity and Specificity , Feasibility Studies , Tuberculosis, Pulmonary/diagnosis , Tuberculosis, Pulmonary/diagnostic imaging , Tuberculosis, Pulmonary/epidemiology

6.

Performance of AI to exclude normal chest radiographs to reduce radiologists' workload.

Schalekamp, Steven; van Leeuwen, Kicky; Calli, Erdi; Murphy, Keelin; Rutten, Matthieu; Geurts, Bram; Peters-Bax, Liesbeth; van Ginneken, Bram; Prokop, Mathias.

Eur Radiol ; 2024 May 17.

Article in English | MEDLINE | ID: mdl-38758252

ABSTRACT

INTRODUCTION: This study investigates the performance of a commercially available artificial intelligence (AI) system to identify normal chest radiographs and its potential to reduce radiologist workload. METHODS: Retrospective analysis included consecutive chest radiographs from two medical centers between Oct 1, 2016 and Oct 14, 2016. Exclusions comprised follow-up exams within the inclusion period, bedside radiographs, incomplete images, imported radiographs, and pediatric radiographs. Three chest radiologists categorized findings into normal, clinically irrelevant, clinically relevant, urgent, and critical. A commercial AI system processed all radiographs, scoring 10 chest abnormalities on a 0-100 confidence scale. AI system performance was evaluated using the area under the ROC curve (AUC), assessing the detection of normal radiographs. Sensitivity was calculated for the default and a conservative operating point. the detection of negative predictive value (NPV) for urgent and critical findings, as well as the potential workload reduction, was calculated. RESULTS: A total of 2603 radiographs were acquired in 2141 unique patients. Post-exclusion, 1670 radiographs were analyzed. Categories included 479 normal, 332 clinically irrelevant, 339 clinically relevant, 501 urgent, and 19 critical findings. The AI system achieved an AUC of 0.92. Sensitivity for normal radiographs was 92% at default and 53% at the conservative operating point. At the conservative operating point, NPV was 98% for urgent and critical findings, and could result in a 15% workload reduction. CONCLUSION: A commercially available AI system effectively identifies normal chest radiographs and holds the potential to lessen radiologists' workload by omitting half of the normal exams from reporting. CLINICAL RELEVANCE STATEMENT: The AI system is able to detect half of all normal chest radiographs at a clinically acceptable operating point, thereby potentially reducing the workload for the radiologists by 15%. KEY POINTS: The AI system reached an AUC of 0.92 for the detection of normal chest radiographs. Fifty-three percent of normal chest radiographs were identified with a NPV of 98% for urgent findings. AI can reduce the workload of chest radiography reporting by 15%.

7.

Artificial intelligence for automated detection and measurements of carpal instability signs on conventional radiographs.

Hendrix, Nils; Hendrix, Ward; Maresch, Bas; van Amersfoort, Job; Oosterveld-Bonsma, Tineke; Kolderman, Stephanie; Vestering, Myrthe; Zielinski, Stephanie; Rutten, Karlijn; Dammeier, Jan; Ong, Lee-Ling Sharon; van Ginneken, Bram; Rutten, Matthieu.

Eur Radiol ; 2024 Apr 18.

Article in English | MEDLINE | ID: mdl-38634877

ABSTRACT

OBJECTIVES: To develop and validate an artificial intelligence (AI) system for measuring and detecting signs of carpal instability on conventional radiographs. MATERIALS AND METHODS: Two case-control datasets of hand and wrist radiographs were retrospectively acquired at three hospitals (hospitals A, B, and C). Dataset 1 (2178 radiographs from 1993 patients, hospitals A and B, 2018-2019) was used for developing an AI system for measuring scapholunate (SL) joint distances, SL and capitolunate (CL) angles, and carpal arc interruptions. Dataset 2 (481 radiographs from 217 patients, hospital C, 2017-2021) was used for testing, and with a subsample (174 radiographs from 87 patients), an observer study was conducted to compare its performance to five clinicians. Evaluation metrics included mean absolute error (MAE), sensitivity, and specificity. RESULTS: Dataset 2 included 258 SL distances, 189 SL angles, 191 CL angles, and 217 carpal arc labels obtained from 217 patients (mean age, 51 years ± 23 [standard deviation]; 133 women). The MAE in measuring SL distances, SL angles, and CL angles was respectively 0.65 mm (95%CI: 0.59, 0.72), 7.9 degrees (95%CI: 7.0, 8.9), and 5.9 degrees (95%CI: 5.2, 6.6). The sensitivity and specificity for detecting arc interruptions were 83% (95%CI: 74, 91) and 64% (95%CI: 56, 71). The measurements were largely comparable to those of the clinicians, while arc interruption detections were more accurate than those of most clinicians. CONCLUSION: This study demonstrates that a newly developed automated AI system accurately measures and detects signs of carpal instability on conventional radiographs. CLINICAL RELEVANCE STATEMENT: This system has the potential to improve detections of carpal arc interruptions and could be a promising tool for supporting clinicians in detecting carpal instability.

8.

Lumbar spine segmentation in MR images: a dataset and a public benchmark.

van der Graaf, Jasper W; van Hooff, Miranda L; Buckens, Constantinus F M; Rutten, Matthieu; van Susante, Job L C; Kroeze, Robert Jan; de Kleuver, Marinus; van Ginneken, Bram; Lessmann, Nikolas.

Sci Data ; 11(1): 264, 2024 Mar 02.

Article in English | MEDLINE | ID: mdl-38431692

ABSTRACT

This paper presents a large publicly available multi-center lumbar spine magnetic resonance imaging (MRI) dataset with reference segmentations of vertebrae, intervertebral discs (IVDs), and spinal canal. The dataset includes 447 sagittal T1 and T2 MRI series from 218 patients with a history of low back pain and was collected from four different hospitals. An iterative data annotation approach was used by training a segmentation algorithm on a small part of the dataset, enabling semi-automatic segmentation of the remaining images. The algorithm provided an initial segmentation, which was subsequently reviewed, manually corrected, and added to the training data. We provide reference performance values for this baseline algorithm and nnU-Net, which performed comparably. Performance values were computed on a sequestered set of 39 studies with 97 series, which were additionally used to set up a continuous segmentation challenge that allows for a fair comparison of different segmentation algorithms. This study may encourage wider collaboration in the field of spine segmentation and improve the diagnostic value of lumbar spine MRI.

Subject(s)

Intervertebral Disc , Lumbar Vertebrae , Humans , Algorithms , Image Processing, Computer-Assisted/methods , Intervertebral Disc/pathology , Lumbar Vertebrae/diagnostic imaging , Magnetic Resonance Imaging/methods , Low Back Pain

9.

The emperor has few clothes: a realistic appraisal of current AI in radiology.

Huisman, Merel; van Ginneken, Bram; Harvey, Hugh.

Eur Radiol ; 2024 Mar 07.

Article in English | MEDLINE | ID: mdl-38451323

10.

Comparing deep learning and pathologist quantification of cell-level PD-L1 expression in non-small cell lung cancer whole-slide images.

van Eekelen, Leander; Spronck, Joey; Looijen-Salamon, Monika; Vos, Shoko; Munari, Enrico; Girolami, Ilaria; Eccher, Albino; Acs, Balazs; Boyaci, Ceren; de Souza, Gabriel Silva; Demirel-Andishmand, Muradije; Meesters, Luca Dulce; Zegers, Daan; van der Woude, Lieke; Theelen, Willemijn; van den Heuvel, Michel; Grünberg, Katrien; van Ginneken, Bram; van der Laak, Jeroen; Ciompi, Francesco.

Sci Rep ; 14(1): 7136, 2024 03 26.

Article in English | MEDLINE | ID: mdl-38531958

ABSTRACT

Programmed death-ligand 1 (PD-L1) expression is currently used in the clinic to assess eligibility for immune-checkpoint inhibitors via the tumor proportion score (TPS), but its efficacy is limited by high interobserver variability. Multiple papers have presented systems for the automatic quantification of TPS, but none report on the task of determining cell-level PD-L1 expression and often reserve their evaluation to a single PD-L1 monoclonal antibody or clinical center. In this paper, we report on a deep learning algorithm for detecting PD-L1 negative and positive tumor cells at a cellular level and evaluate it on a cell-level reference standard established by six readers on a multi-centric, multi PD-L1 assay dataset. This reference standard also provides for the first time a benchmark for computer vision algorithms. In addition, in line with other papers, we also evaluate our algorithm at slide-level by measuring the agreement between the algorithm and six pathologists on TPS quantification. We find a moderately low interobserver agreement at cell-level level (mean reader-reader F1 score = 0.68) which our algorithm sits slightly under (mean reader-AI F1 score = 0.55), especially for cases from the clinical center not included in the training set. Despite this, we find good AI-pathologist agreement on quantifying TPS compared to the interobserver agreement (mean reader-reader Cohen's kappa = 0.54, 95% CI 0.26-0.81, mean reader-AI kappa = 0.49, 95% CI 0.27-0.72). In conclusion, our deep learning algorithm demonstrates promise in detecting PD-L1 expression at a cellular level and exhibits favorable agreement with pathologists in quantifying the tumor proportion score (TPS). We publicly release our models for use via the Grand-Challenge platform.

Subject(s)

Carcinoma, Non-Small-Cell Lung , Deep Learning , Lung Neoplasms , Humans , Carcinoma, Non-Small-Cell Lung/pathology , Lung Neoplasms/pathology , Pathologists , B7-H1 Antigen/metabolism , Immunohistochemistry , Biomarkers, Tumor/metabolism

11.

Combining public datasets for automated tooth assessment in panoramic radiographs.

van Nistelrooij, Niels; Ghoul, Khalid El; Xi, Tong; Saha, Anindo; Kempers, Steven; Cenci, Max; Loomans, Bas; Flügge, Tabea; van Ginneken, Bram; Vinayahalingam, Shankeeth.

BMC Oral Health ; 24(1): 387, 2024 Mar 26.

Article in English | MEDLINE | ID: mdl-38532414

ABSTRACT

OBJECTIVE: Panoramic radiographs (PRs) provide a comprehensive view of the oral and maxillofacial region and are used routinely to assess dental and osseous pathologies. Artificial intelligence (AI) can be used to improve the diagnostic accuracy of PRs compared to bitewings and periapical radiographs. This study aimed to evaluate the advantages and challenges of using publicly available datasets in dental AI research, focusing on solving the novel task of predicting tooth segmentations, FDI numbers, and tooth diagnoses, simultaneously. MATERIALS AND METHODS: Datasets from the OdontoAI platform (tooth instance segmentations) and the DENTEX challenge (tooth bounding boxes with associated diagnoses) were combined to develop a two-stage AI model. The first stage implemented tooth instance segmentation with FDI numbering and extracted regions of interest around each tooth segmentation, whereafter the second stage implemented multi-label classification to detect dental caries, impacted teeth, and periapical lesions in PRs. The performance of the automated tooth segmentation algorithm was evaluated using a free-response receiver-operating-characteristics (FROC) curve and mean average precision (mAP) metrics. The diagnostic accuracy of detection and classification of dental pathology was evaluated with ROC curves and F1 and AUC metrics. RESULTS: The two-stage AI model achieved high accuracy in tooth segmentations with a FROC score of 0.988 and a mAP of 0.848. High accuracy was also achieved in the diagnostic classification of impacted teeth (F1 = 0.901, AUC = 0.996), whereas moderate accuracy was achieved in the diagnostic classification of deep caries (F1 = 0.683, AUC = 0.960), early caries (F1 = 0.662, AUC = 0.881), and periapical lesions (F1 = 0.603, AUC = 0.974). The model's performance correlated positively with the quality of annotations in the used public datasets. Selected samples from the DENTEX dataset revealed cases of missing (false-negative) and incorrect (false-positive) diagnoses, which negatively influenced the performance of the AI model. CONCLUSIONS: The use and pooling of public datasets in dental AI research can significantly accelerate the development of new AI models and enable fast exploration of novel tasks. However, standardized quality assurance is essential before using the datasets to ensure reliable outcomes and limit potential biases.

Subject(s)

Dental Caries , Tooth, Impacted , Tooth , Humans , Artificial Intelligence , Radiography, Panoramic , Bone and Bones

12.

Nodule detection and generation on chest X-rays: NODE21 Challenge.

Sogancioglu, Ecem; Van Ginneken, Bram; Behrendt, Finn; Bengs, Marcel; Schlaefer, Alexander; Radu, Miron; Xu, Di; Sheng, Ke; Scalzo, Fabien; Marcus, Eric; Papa, Samuele; Teuwen, Jonas; Scholten, Ernst Th; Schalekamp, Steven; Hendrix, Nils; Jacobs, Colin; Hendrix, Ward; Sanchez, Clara I; Murphy, Keelin.

IEEE Trans Med Imaging ; PP2024 Mar 26.

Article in English | MEDLINE | ID: mdl-38530714

ABSTRACT

Pulmonary nodules may be an early manifestation of lung cancer, the leading cause of cancer-related deaths among both men and women. Numerous studies have established that deep learning methods can yield high-performance levels in the detection of lung nodules in chest X-rays. However, the lack of gold-standard public datasets slows down the progression of the research and prevents benchmarking of methods for this task. To address this, we organized a public research challenge, NODE21, aimed at the detection and generation of lung nodules in chest X-rays. While the detection track assesses state-of-the-art nodule detection systems, the generation track determines the utility of nodule generation algorithms to augment training data and hence improve the performance of the detection systems. This paper summarizes the results of the NODE21 challenge and performs extensive additional experiments to examine the impact of the synthetically generated nodule training images on the detection algorithm performance.

13.

Development and validation of AI-based automatic measurement of coronal Cobb angles in degenerative scoliosis using sagittal lumbar MRI.

van der Graaf, Jasper W; van Hooff, Miranda L; van Ginneken, Bram; Huisman, Merel; Rutten, Matthieu; Lamers, Dominique; Lessmann, Nikolas; de Kleuver, Marinus.

Eur Radiol ; 2024 Feb 21.

Article in English | MEDLINE | ID: mdl-38383922

ABSTRACT

OBJECTIVES: Severity of degenerative scoliosis (DS) is assessed by measuring the Cobb angle on anteroposterior radiographs. However, MRI images are often available to study the degenerative spine. This retrospective study aims to develop and evaluate the reliability of a novel automatic method that measures coronal Cobb angles on lumbar MRI in DS patients. MATERIALS AND METHODS: Vertebrae and intervertebral discs were automatically segmented using a 3D AI algorithm, trained on 447 lumbar MRI series. The segmentations were used to calculate all possible angles between the vertebral endplates, with the largest being the Cobb angle. The results were validated with 50 high-resolution sagittal lumbar MRI scans of DS patients, in which three experienced readers measured the Cobb angle. Reliability was determined using the intraclass correlation coefficient (ICC). RESULTS: The ICCs between the readers ranged from 0.90 (95% CI 0.83-0.94) to 0.93 (95% CI 0.88-0.96). The ICC between the maximum angle found by the algorithm and the average manually measured Cobb angles was 0.83 (95% CI 0.71-0.90). In 9 out of the 50 cases (18%), all readers agreed on both vertebral levels for Cobb angle measurement. When using the algorithm to extract the angles at the vertebral levels chosen by the readers, the ICCs ranged from 0.92 (95% CI 0.87-0.96) to 0.97 (95% CI 0.94-0.98). CONCLUSION: The Cobb angle can be accurately measured on MRI using the newly developed algorithm in patients with DS. The readers failed to consistently choose the same vertebral level for Cobb angle measurement, whereas the automatic approach ensures the maximum angle is consistently measured. CLINICAL RELEVANCE STATEMENT: Our AI-based algorithm offers reliable Cobb angle measurement on routine MRI for degenerative scoliosis patients, potentially reducing the reliance on conventional radiographs, ensuring consistent assessments, and therefore improving patient care. KEY POINTS: â¢ While often available, MRI images are rarely utilized to determine the severity of degenerative scoliosis. â¢ The presented MRI Cobb angle algorithm is more reliable than humans in patients with degenerative scoliosis. â¢ Radiographic imaging for Cobb angle measurements is mitigated when lumbar MRI images are available.

14.

Understanding metric-related pitfalls in image analysis validation.

Reinke, Annika; Tizabi, Minu D; Baumgartner, Michael; Eisenmann, Matthias; Heckmann-Nötzel, Doreen; Kavur, A Emre; Rädsch, Tim; Sudre, Carole H; Acion, Laura; Antonelli, Michela; Arbel, Tal; Bakas, Spyridon; Benis, Arriel; Buettner, Florian; Cardoso, M Jorge; Cheplygina, Veronika; Chen, Jianxu; Christodoulou, Evangelia; Cimini, Beth A; Farahani, Keyvan; Ferrer, Luciana; Galdran, Adrian; van Ginneken, Bram; Glocker, Ben; Godau, Patrick; Hashimoto, Daniel A; Hoffman, Michael M; Huisman, Merel; Isensee, Fabian; Jannin, Pierre; Kahn, Charles E; Kainmueller, Dagmar; Kainz, Bernhard; Karargyris, Alexandros; Kleesiek, Jens; Kofler, Florian; Kooi, Thijs; Kopp-Schneider, Annette; Kozubek, Michal; Kreshuk, Anna; Kurc, Tahsin; Landman, Bennett A; Litjens, Geert; Madani, Amin; Maier-Hein, Klaus; Martel, Anne L; Meijering, Erik; Menze, Bjoern; Moons, Karel G M; Müller, Henning.

Nat Methods ; 21(2): 182-194, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38347140

ABSTRACT

Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.

Subject(s)

Artificial Intelligence

15.

Metrics reloaded: recommendations for image analysis validation.

Maier-Hein, Lena; Reinke, Annika; Godau, Patrick; Tizabi, Minu D; Buettner, Florian; Christodoulou, Evangelia; Glocker, Ben; Isensee, Fabian; Kleesiek, Jens; Kozubek, Michal; Reyes, Mauricio; Riegler, Michael A; Wiesenfarth, Manuel; Kavur, A Emre; Sudre, Carole H; Baumgartner, Michael; Eisenmann, Matthias; Heckmann-Nötzel, Doreen; Rädsch, Tim; Acion, Laura; Antonelli, Michela; Arbel, Tal; Bakas, Spyridon; Benis, Arriel; Blaschko, Matthew B; Cardoso, M Jorge; Cheplygina, Veronika; Cimini, Beth A; Collins, Gary S; Farahani, Keyvan; Ferrer, Luciana; Galdran, Adrian; van Ginneken, Bram; Haase, Robert; Hashimoto, Daniel A; Hoffman, Michael M; Huisman, Merel; Jannin, Pierre; Kahn, Charles E; Kainmueller, Dagmar; Kainz, Bernhard; Karargyris, Alexandros; Karthikesalingam, Alan; Kofler, Florian; Kopp-Schneider, Annette; Kreshuk, Anna; Kurc, Tahsin; Landman, Bennett A; Litjens, Geert; Madani, Amin.

Nat Methods ; 21(2): 195-212, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38347141

ABSTRACT

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint-a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.

Subject(s)

Algorithms , Image Processing, Computer-Assisted , Machine Learning , Semantics

16.

Estimating lung function from computed tomography at the patient and lobe level using machine learning.

Boulogne, Luuk H; Charbonnier, Jean-Paul; Jacobs, Colin; van der Heijden, Erik H F M; van Ginneken, Bram.

Med Phys ; 51(4): 2834-2845, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38329315

ABSTRACT

BACKGROUND: Automated estimation of Pulmonary function test (PFT) results from Computed Tomography (CT) could advance the use of CT in screening, diagnosis, and staging of restrictive pulmonary diseases. Estimating lung function per lobe, which cannot be done with PFTs, would be helpful for risk assessment for pulmonary resection surgery and bronchoscopic lung volume reduction. PURPOSE: To automatically estimate PFT results from CT and furthermore disentangle the individual contribution of pulmonary lobes to a patient's lung function. METHODS: We propose I3Dr, a deep learning architecture for estimating global measures from an image that can also estimate the contributions of individual parts of the image to this global measure. We apply it to estimate the separate contributions of each pulmonary lobe to a patient's total lung function from CT, while requiring only CT scans and patient level lung function measurements for training. I3Dr consists of a lobe-level and a patient-level model. The lobe-level model extracts all anatomical pulmonary lobes from a CT scan and processes them in parallel to produce lobe level lung function estimates that sum up to a patient level estimate. The patient-level model directly estimates patient level lung function from a CT scan and is used to re-scale the output of the lobe-level model to increase performance. After demonstrating the viability of the proposed approach, the I3Dr model is trained and evaluated for PFT result estimation using a large data set of 8 433 CT volumes for training, 1 775 CT volumes for validation, and 1 873 CT volumes for testing. RESULTS: First, we demonstrate the viability of our approach by showing that a model trained with a collection of digit images to estimate their sum implicitly learns to assign correct values to individual digits. Next, we show that our models can estimate lobe-level quantities, such as COVID-19 severity scores, pulmonary volume (PV), and functional pulmonary volume (FPV) from CT while only provided with patient-level quantities during training. Lastly, we train and evaluate models for producing spirometry and diffusion capacity of carbon mono-oxide (DLCO) estimates at the patient and lobe level. For producing Forced Expiratory Volume in one second (FEV1), Forced Vital Capacity (FVC), and DLCO estimates, I3Dr obtains mean absolute errors (MAE) of 0.377 L, 0.297 L, and 2.800 mL/min/mm Hg respectively. We release the resulting algorithms for lung function estimation to the research community at https://grand-challenge.org/algorithms/lobe-wise-lung-function-estimation/ CONCLUSIONS: I3Dr can estimate global measures from an image, as well as the contributions of individual parts of the image to this global measure. It offers a promising approach for estimating PFT results from CT scans and disentangling the individual contribution of pulmonary lobes to a patient's lung function. The findings presented in this work may advance the use of CT in screening, diagnosis, and staging of restrictive pulmonary diseases as well as in risk assessment for pulmonary resection surgery and bronchoscopic lung volume reduction.

Subject(s)

Lung Diseases , Lung , Humans , Lung/diagnostic imaging , Lung/surgery , Tomography, X-Ray Computed/methods , Vital Capacity , Machine Learning

17.

In reply.

Berg, Hidde Ten; van Bakel, Bram; van de Wouw, Lieke; Jie, Kim E; Schipper, Anoeska; Jansen, Henry; O'Connor, Rory D; van Ginneken, Bram; Kurstjens, Steef.

Ann Emerg Med ; 83(3): 287-288, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38388085

18.

Comparison of Commercial AI Software Performance for Radiograph Lung Nodule Detection and Bone Age Prediction.

van Leeuwen, Kicky G; Schalekamp, Steven; Rutten, Matthieu J C M; Huisman, Merel; Schaefer-Prokop, Cornelia M; de Rooij, Maarten; van Ginneken, Bram; Maresch, Bas; Geurts, Bram H J; van Dijke, Cornelius F; Laupman-Koedam, Emmeline; Hulleman, Enzo V; Verhoeff, Eric L; Meys, Evelyne M J; Mohamed Hoesein, Firdaus A A; Ter Brugge, Floor M; van Hoorn, Francois; van der Wel, Frank; van den Berk, Inge A H; Luyendijk, Jacqueline M; Meakin, James; Habets, Jesse; Verbeke, Jonathan I M L; Nederend, Joost; Meys, Karlijn M E; Deden, Laura N; Langezaal, Lucianne C M; Nasrollah, Mahtab; Meij, Marleen; Boomsma, Martijn F; Vermeulen, Matthijs; Vestering, Myrthe M; Vijlbrief, Onno; Algra, Paul; Algra, Selma; Bollen, Stijn M; Samson, Tijs; von Brucken Fock, Yntor H G.

Radiology ; 310(1): e230981, 2024 Jan.

Article in English | MEDLINE | ID: mdl-38193833

ABSTRACT

Background Multiple commercial artificial intelligence (AI) products exist for assessing radiographs; however, comparable performance data for these algorithms are limited. Purpose To perform an independent, stand-alone validation of commercially available AI products for bone age prediction based on hand radiographs and lung nodule detection on chest radiographs. Materials and Methods This retrospective study was carried out as part of Project AIR. Nine of 17 eligible AI products were validated on data from seven Dutch hospitals. For bone age prediction, the root mean square error (RMSE) and Pearson correlation coefficient were computed. The reference standard was set by three to five expert readers. For lung nodule detection, the area under the receiver operating characteristic curve (AUC) was computed. The reference standard was set by a chest radiologist based on CT. Randomized subsets of hand (n = 95) and chest (n = 140) radiographs were read by 14 and 17 human readers, respectively, with varying experience. Results Two bone age prediction algorithms were tested on hand radiographs (from January 2017 to January 2022) in 326 patients (mean age, 10 years ± 4 [SD]; 173 female patients) and correlated strongly with the reference standard (r = 0.99; P < .001 for both). No difference in RMSE was observed between algorithms (0.63 years [95% CI: 0.58, 0.69] and 0.57 years [95% CI: 0.52, 0.61]) and readers (0.68 years [95% CI: 0.64, 0.73]). Seven lung nodule detection algorithms were validated on chest radiographs (from January 2012 to May 2022) in 386 patients (mean age, 64 years ± 11; 223 male patients). Compared with readers (mean AUC, 0.81 [95% CI: 0.77, 0.85]), four algorithms performed better (AUC range, 0.86-0.93; P value range, <.001 to .04). Conclusions Compared with human readers, four AI algorithms for detecting lung nodules on chest radiographs showed improved performance, whereas the remaining algorithms tested showed no evidence of a difference in performance. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Omoumi and Richiardi in this issue.

Subject(s)

Artificial Intelligence , Software , Humans , Female , Male , Child , Middle Aged , Retrospective Studies , Algorithms , Lung

19.

Full resolution reconstruction of whole-mount sections from digitized individual tissue fragments.

Schouten, Daan; van der Laak, Jeroen; van Ginneken, Bram; Litjens, Geert.

Sci Rep ; 14(1): 1497, 2024 01 17.

Article in English | MEDLINE | ID: mdl-38233535

ABSTRACT

Whole-mount sectioning is a technique in histopathology where a full slice of tissue, such as a transversal cross-section of a prostate specimen, is prepared on a large microscope slide without further sectioning into smaller fragments. Although this technique can offer improved correlation with pre-operative imaging and is paramount for multimodal research, it is not commonly employed due to its technical difficulty, associated cost and cumbersome integration in (digital) pathology workflows. In this work, we present a computational tool named PythoStitcher which reconstructs artificial whole-mount sections from digitized tissue fragments, thereby bringing the benefits of whole-mount sections to pathology labs currently unable to employ this technique. Our proposed algorithm consists of a multi-step approach where it (i) automatically determines how fragments need to be reassembled, (ii) iteratively optimizes the stitch using a genetic algorithm and (iii) efficiently reconstructs the final artificial whole-mount section on full resolution (0.25 µm/pixel). PythoStitcher was validated on a total of 198 cases spanning five datasets with a varying number of tissue fragments originating from different organs from multiple centers. PythoStitcher successfully reconstructed the whole-mount section in 86-100% of cases for a given dataset with a residual registration mismatch of 0.65-2.76 mm on automatically selected landmarks. It is expected that our algorithm can aid pathology labs unable to employ whole-mount sectioning through faster clinical case evaluation and improved radiology-pathology correlation workflows.

Subject(s)

Algorithms , Diagnostic Imaging , Image Processing, Computer-Assisted , Humans

20.

Computer-aided detection thresholds for digital chest radiography interpretation in tuberculosis diagnostic algorithms.

Vanobberghen, Fiona; Keter, Alfred Kipyegon; Jacobs, Bart K M; Glass, Tracy R; Lynen, Lutgarde; Law, Irwin; Murphy, Keelin; van Ginneken, Bram; Ayakaka, Irene; van Heerden, Alastair; Maama, Llang; Reither, Klaus.

ERJ Open Res ; 10(1)2024 Jan.

Article in English | MEDLINE | ID: mdl-38196890

ABSTRACT

Objectives: Use of computer-aided detection (CAD) software is recommended to improve tuberculosis screening and triage, but threshold determination is challenging if reference testing has not been performed in all individuals. We aimed to determine such thresholds through secondary analysis of the 2019 Lesotho national tuberculosis prevalence survey. Methods: Symptom screening and chest radiographs were performed in participants aged ≥15âyears; those symptomatic or with abnormal chest radiographs provided samples for Xpert MTB/RIF and culture testing. Chest radiographs were processed using CAD4TB version 7. We used six methodological approaches to deal with participants who did not have bacteriological test results to estimate pulmonary tuberculosis prevalence and assess diagnostic accuracy. Results: Among 17 070 participants, 5214 (31%) had their tuberculosis status determined; 142 had tuberculosis. Prevalence estimates varied between methodological approaches (0.83-2.72%). Using multiple imputation to estimate tuberculosis status for those eligible but not tested, and assuming those not eligible for testing were negative, a CAD4TBv7 threshold of 13 had a sensitivity of 89.7% (95% CI 84.6-94.8) and a specificity of 74.2% (73.6-74.9), close to World Health Organization (WHO) target product profile criteria. Assuming all those not tested were negative produced similar results. Conclusions: This is the first study to evaluate CAD4TB in a community screening context employing a range of approaches to account for unknown tuberculosis status. The assumption that those not tested are negative - regardless of testing eligibility status - was robust. As threshold determination must be context specific, our analytically straightforward approach should be adopted to leverage prevalence surveys for CAD threshold determination in other settings with a comparable proportion of eligible but not tested participants.

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL