A Unified Multi-Dataset Framework for Medical Visual Question Answering via Pretrained Transformers and Contrastive Learning

Authors

Bao-Nguyen Quoc, Huy-Ho Huu, Khang-Nguyen Hoang Duy and Thu-Le Vo Minh, FPT University, Vietnam

Abstract

Medical Visual Question Answering (Med-VQA) aims to generate accurate answers to clinical questions grounded in medical images. However, existing models often struggle with limited generalization across datasets and insufficient understanding of specialized medical terminology. In this work, we propose a unified multi-dataset Med-VQA framework that integrates general-purpose vision-language models (e.g., BLIP) with domain-specific language models such as BioGPT to better capture biomedical semantics. Our architecture introduces a novel Mixture-of-Experts (Med-MoE) module that fuses knowledge across modalities and datasets, and it is jointly optimized with contrastive, image-text matching, and language modeling objectives. By combining cross-dataset supervision with domain-aware components, our approach achieves improved reasoning and generalization. Experimental results on VQA-RAD and PathVQA demonstrate state-of-the-art performance, validating the effectiveness of our unified framework.
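The abstract names three training objectives (contrastive, image-text matching, and language modeling) that are combined for joint optimization. The sketch below illustrates one plausible way such a composite loss could be assembled; all function names, weights, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a joint objective combining contrastive (ITC),
# image-text matching (ITM), and language modeling (LM) losses.
# Names and weights are assumptions for illustration only.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over L2-normalized image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def joint_loss(img_emb, txt_emb, itm_logits, itm_labels, lm_logits, lm_labels,
               w_itc=1.0, w_itm=1.0, w_lm=1.0):
    """Weighted sum of the three objectives named in the abstract."""
    l_itc = contrastive_loss(img_emb, txt_emb)
    l_itm = F.cross_entropy(itm_logits, itm_labels)          # matched vs. mismatched pairs
    l_lm = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                           lm_labels.view(-1), ignore_index=-100)  # answer tokens
    return w_itc * l_itc + w_itm * l_itm + w_lm * l_lm

# Toy usage with random tensors (batch of 4, 256-d embeddings, 30k-token vocab).
B, D, V, T = 4, 256, 30000, 8
loss = joint_loss(torch.randn(B, D), torch.randn(B, D),
                  torch.randn(B, 2), torch.randint(0, 2, (B,)),
                  torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(float(loss))
```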

Keywords

Vision Transformer (ViT), Medical VQA, Transformer, PathVQA, VQA-RAD

Volume 15, Number 7