Results 1 - 2 of 2
1.
Ophthalmol Sci ; 4(5): 100477, 2024.
Article in English | MEDLINE | ID: mdl-38827491

ABSTRACT

Purpose: To gain an understanding of the data labeling requirements for training deep learning models to measure geographic atrophy (GA) on fundus autofluorescence (FAF) images.

Design: Evaluation of artificial intelligence (AI) algorithms.

Subjects: Age-Related Eye Disease Study 2 (AREDS2) images were used for training and cross-validation, and GA clinical trial images were used for testing.

Methods: Training data consisted of 2 sets of FAF images: one with area measurements only and no indication of GA location (Weakly labeled) and a second with GA segmentation masks (Strongly labeled).

Main Outcome Measures: Bland-Altman plots and scatter plots were used to compare GA area measurements between ground truth and AI. The Dice coefficient was used to assess the segmentation accuracy of the Strongly labeled model.

Results: In the cross-validation AREDS2 data set (n = 601), the mean (standard deviation [SD]) GA area measured by the human grader, the Weakly labeled AI model, and the Strongly labeled AI model was 6.65 (6.3) mm2, 6.83 (6.29) mm2, and 6.58 (6.24) mm2, respectively. The mean difference between ground truth and AI was 0.18 mm2 (95% confidence interval [CI], -7.57 to 7.92) for the Weakly labeled model and -0.07 mm2 (95% CI, -1.61 to 1.47) for the Strongly labeled model. With the GlaxoSmithKline testing data (n = 156), the mean (SD) GA area was 9.79 (5.6) mm2, 8.82 (4.61) mm2, and 9.55 (5.66) mm2 for the human grader, the Strongly labeled AI model, and the Weakly labeled AI model, respectively. The mean difference between ground truth and AI for the 2 models was -0.97 mm2 (95% CI, -4.36 to 2.41) and -0.24 mm2 (95% CI, -4.98 to 4.49), respectively. The Dice coefficient was 0.99 for intergrader agreement, 0.89 for the cross-validation data, and 0.92 for the testing data.

Conclusions: Deep learning models can achieve reasonable accuracy even with Weakly labeled data. Training methods that integrate large volumes of Weakly labeled images with a small number of Strongly labeled images offer a promising way to reduce the cost and time burden of data labeling.

Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
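Both headline metrics in this abstract are straightforward to compute. The following is a minimal Python sketch (not the authors' code; function and variable names are illustrative assumptions) of the Dice coefficient for binary segmentation masks and the Bland-Altman statistics for paired area measurements. The sign convention (AI minus grader) is inferred from the reported means, and the wide 95% intervals quoted above appear to correspond to Bland-Altman limits of agreement (mean difference +/- 1.96 SD), which is what the sketch returns.

import numpy as np

def dice_coefficient(pred_mask: np.ndarray, true_mask: np.ndarray) -> float:
    # Dice = 2 * |A & B| / (|A| + |B|) for binary segmentation masks.
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    overlap = np.logical_and(pred, true).sum()
    denom = pred.sum() + true.sum()
    return 2.0 * overlap / denom if denom > 0 else 1.0

def bland_altman(ai_area_mm2: np.ndarray, grader_area_mm2: np.ndarray):
    # Mean paired difference (AI minus grader, matching the reported means)
    # and 95% limits of agreement, i.e., mean difference +/- 1.96 SD.
    diff = ai_area_mm2 - grader_area_mm2
    mean_diff = diff.mean()
    sd = diff.std(ddof=1)
    return mean_diff, (mean_diff - 1.96 * sd, mean_diff + 1.96 * sd)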

2.
Ophthalmol Sci ; 2(4): 100198, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36531570

ABSTRACT

Purpose: The curation of images using human resources is time intensive but an essential step in developing artificial intelligence (AI) algorithms. Our goal was to develop and implement an AI algorithm for image curation in a high-volume setting. We also explored AI tools to support a tiered approach, in which the AI model labels images and flags potential mislabels for human review.

Design: Implementation of an AI algorithm.

Participants: Seven-field stereoscopic images from multiple clinical trials.

Methods: The 7-field stereoscopic image protocol includes 7 pairs of images from various parts of the central retina along with images of the anterior part of the eye. All images were labeled for field number by reading center graders. The model output classified the retinal images into 8 field numbers. Probability scores (0-1) were generated to identify misclassified images, with 1 indicating a high probability of a correct label.

Main Outcome Measures: Agreement of the AI prediction with grader classification of field number, and the use of probability scores to identify mislabeled images.

Results: The AI model was trained and validated on 17 529 images and tested on 3004 images. The pooled agreement in field number between grader classification and the AI model was 88.3% (kappa = 0.87). The pooled mean probability score was 0.97 (standard deviation [SD], 0.08) for images for which the graders agreed with the AI-generated labels and 0.77 (SD, 0.19) for images for which they disagreed (P < 0.0001). Using receiver operating characteristic curves, a probability score of 0.99 was identified as the cutoff for distinguishing mislabeled images. A tiered workflow that sends images with probability scores < 0.99 for human review would include 27.6% of the 3004 images and reduce the error rate from 11.7% to 1.5%.

Conclusions: The implementation of AI algorithms requires measures beyond model validation. Tools that flag potential errors in AI-generated labels will reduce inaccuracies, increase trust in the system, and provide data for continuous model development.
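The tiered workflow described above reduces to a single threshold on the model's per-image probability score. The following is a minimal Python sketch under that reading; the names and placeholder scores are illustrative assumptions, not the reading center's actual pipeline.

import numpy as np

PROB_CUTOFF = 0.99  # cutoff identified via ROC analysis in the abstract

def flag_for_review(prob_scores: np.ndarray) -> np.ndarray:
    # Boolean mask: True where the AI-generated label should go to a human.
    return prob_scores < PROB_CUTOFF

# Worked numbers from the abstract: 27.6% of the 3004 test images (about 829)
# fall below the cutoff and are routed to human graders; assuming reviewers
# correct every flagged mislabel, the residual error rate on the full set
# drops from 11.7% to 1.5%.
scores = np.random.default_rng(0).uniform(0.5, 1.0, size=3004)  # placeholder scores
review_queue = np.flatnonzero(flag_for_review(scores))
print(f"{review_queue.size} of {scores.size} images flagged for human review")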
