Results 1 - 2 of 2
1.
Article in English | MEDLINE | ID: mdl-38995704

ABSTRACT

The potential benefits of automatic radiology report generation, such as reducing misdiagnosis rates and improving the efficiency of clinical diagnosis, are significant. However, existing data-driven methods lack essential medical prior knowledge, which hampers their performance. Moreover, establishing global correspondences between radiology images and their reports, while also achieving local alignment between prior-knowledge-related images and text, remains challenging. To address these shortcomings, we introduce a novel Eye Gaze Guided Cross-modal Alignment Network (EGGCA-Net) for generating accurate medical reports. Our approach incorporates prior knowledge from radiologists' Eye Gaze Region (EGR) to improve the fidelity and comprehensibility of the generated reports. Specifically, we design a Dual Fine-Grained Branch (DFGB) and a Multi-Task Branch (MTB) that collaboratively align visual and textual semantics at multiple levels. To establish fine-grained alignment between EGR-related images and sentences, we introduce the Sentence Fine-grained Prototype Module (SFPM) within DFGB to capture cross-modal information at different levels. To learn the alignment of EGR-related image topics, we introduce the Multi-task Feature Fusion Module (MFFM) within MTB to refine the encoder output. Finally, a purpose-built label matching mechanism ensures that the generated reports are consistent with the predicted disease states. Experimental results show that the proposed method outperforms previous advanced methods on two widely used benchmark datasets, Open-i and MIMIC-CXR.
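To make the gaze-guided alignment idea concrete, the sketch below is a minimal, hypothetical PyTorch module that pools image patch features with an eye-gaze-region mask and aligns the pooled visual vector with a sentence embedding via a contrastive loss. All module names, tensor shapes, and the choice of loss are assumptions for illustration; they are not taken from the paper and do not reproduce EGGCA-Net's DFGB/MTB design.

```python
# Minimal sketch of eye-gaze-guided cross-modal alignment (illustrative only,
# not the authors' code). Assumed inputs: patch features from a vision backbone,
# a per-image eye-gaze-region (EGR) weight map, and sentence embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeGuidedAlignment(nn.Module):
    """Aligns EGR-weighted visual features with report-sentence embeddings."""
    def __init__(self, vis_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)   # project patch features
        self.txt_proj = nn.Linear(txt_dim, joint_dim)   # project sentence features
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def forward(self, patch_feats, gaze_mask, sent_feats):
        # patch_feats: (B, N, vis_dim) patch embeddings
        # gaze_mask:   (B, N) soft or binary weights derived from eye gaze
        # sent_feats:  (B, txt_dim) one report-sentence embedding per image
        w = gaze_mask / (gaze_mask.sum(dim=1, keepdim=True) + 1e-6)
        gaze_pooled = torch.einsum('bn,bnd->bd', w, patch_feats)  # EGR-weighted pooling
        v = F.normalize(self.vis_proj(gaze_pooled), dim=-1)
        t = F.normalize(self.txt_proj(sent_feats), dim=-1)
        logits = v @ t.t() / self.temperature          # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # symmetric InfoNCE loss: matched image-sentence pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# usage with random tensors, just to show the expected shapes
model = GazeGuidedAlignment()
loss = model(torch.randn(4, 49, 2048), torch.rand(4, 49), torch.randn(4, 768))
```

The gaze mask here acts only as a pooling weight; a full system would additionally feed the aligned features into a report decoder and combine this alignment objective with the generation loss.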

2.
Vis Comput Ind Biomed Art ; 7(1): 9, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38647624

ABSTRACT

With recent advances in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate where the relevant content is located within the image. This limitation restricts the interpretability of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery that localizes a specific region while predicting the answer. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enable precise multimodal information interactions. Specifically, two complementary prompters were introduced to integrate visual and textual prompts into the model's encoding process. A visual complementary prompter merges visual prompt knowledge with visual features to guide accurate localization, and a textual complementary prompter aligns visual information with textual prompt knowledge and textual features, guiding the textual stream toward a more accurate answer inference. Additionally, a multiple-iteration fusion strategy was adopted for comprehensive answer reasoning, ensuring high-quality generation of both textual and grounded answers. Experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
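As a rough illustration of dual-modality prompting for grounded VQA, the sketch below mixes learnable prompt tokens into the visual and textual streams, fuses the two streams with iterative cross-attention, and predicts both an answer label and a bounding box. Class counts, dimensions, module names, and the fusion scheme are all hypothetical assumptions; this is not the paper's architecture or code.

```python
# Minimal sketch of dual-modality prompting for grounded surgical VQA
# (illustrative only; names, shapes, and class counts are assumptions).
import torch
import torch.nn as nn

class ComplementaryPrompter(nn.Module):
    """Mixes learnable prompt tokens into a stream of input features."""
    def __init__(self, dim=512, n_prompts=8, n_heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, N, dim) visual patch features or text token features
        p = self.prompts.unsqueeze(0).expand(feats.size(0), -1, -1)
        mixed, _ = self.attn(p, feats, feats)   # prompts attend to the features
        return torch.cat([mixed, feats], dim=1)

class GroundedVQAHead(nn.Module):
    """Fuses prompted visual/textual streams and predicts answer + box."""
    def __init__(self, dim=512, n_answers=18, n_fusion_iters=3):
        super().__init__()
        self.vis_prompter = ComplementaryPrompter(dim)
        self.txt_prompter = ComplementaryPrompter(dim)
        self.fusion = nn.ModuleList(
            nn.MultiheadAttention(dim, 8, batch_first=True)
            for _ in range(n_fusion_iters))
        self.answer_head = nn.Linear(dim, n_answers)  # textual answer logits
        self.box_head = nn.Linear(dim, 4)             # grounded answer: (cx, cy, w, h)

    def forward(self, vis_feats, txt_feats):
        v = self.vis_prompter(vis_feats)   # (B, P + Nv, dim)
        t = self.txt_prompter(txt_feats)   # (B, P + Nt, dim)
        q = t
        for attn in self.fusion:           # iterative cross-modal fusion
            q, _ = attn(q, v, v)
        pooled = q.mean(dim=1)
        return self.answer_head(pooled), self.box_head(pooled).sigmoid()

# usage with random tensors, just to show the expected shapes
model = GroundedVQAHead()
logits, box = model(torch.randn(2, 49, 512), torch.randn(2, 16, 512))
```

The two prompters share a structure but hold separate prompt parameters, so each modality can learn its own complementary cues before the iterative fusion step produces a single representation for both prediction heads.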
