ACM Transactions on Multimedia Computing, Communications and Applications, volume 21, issue 4, pages 1-21

Language-guided Bias Generation Contrastive Strategy for Visual Question Answering

Publication type: Journal Article
Publication date: 2025-03-10
Scimago quartile: Q1
SJR: 1.399
CiteScore: 8.5
Impact factor: 5.2
ISSN: 1551-6857, 1551-6865
Abstract

Visual question answering (VQA) is a challenging task that requires models to understand both visual and linguistic inputs and produce accurate answers. However, VQA models often exploit dataset biases to make predictions rather than reasoning over the inputs. Prior debiasing approaches introduce an auxiliary model that is deliberately designed to be biased and then use it to guide the training of a robust target model. However, such techniques measure the model's deviation only against the statistical label distribution of the training set or against unimodal branches. In this work, we propose LEGO, a novel method that generates bias from the target model itself in order to combat language guidance bias. Specifically, the LEGO framework employs a generative network that absorbs the biases inherent in the target model by combining adversarial objectives with knowledge distillation. We then apply a debiased contrastive learning strategy to model the language guidance bias of the caption and the question. During this modelling, to obtain robust semantic coreference, multi-modal representations at two semantic granularities are modelled through mutual-information fusion and contrastive difference modelling. We evaluate our method on several biased VQA benchmarks, including VQA-CP2, GQA-OOD, and RSICD, and show that it outperforms comparable methods.
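The abstract names two generic building blocks, knowledge distillation and contrastive learning, without giving formulas. As an illustration only, here is a minimal NumPy sketch of standard forms of both: a temperature-scaled KL distillation term and an InfoNCE-style contrastive loss. The function names, temperatures, and batch shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kd_kl(student_logits, teacher_logits, tau=2.0):
    # Generic knowledge-distillation term: KL(teacher || student)
    # with both distributions softened by temperature tau.
    p = softmax(teacher_logits / tau)
    q = softmax(student_logits / tau)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

def info_nce(anchors, positives, tau=0.07):
    # Standard InfoNCE contrastive loss: row i of `positives` is the
    # positive for row i of `anchors`; the other rows act as negatives.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                              # (B, B) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return float(-np.mean(log_probs[idx, idx]))         # diagonal = positives
```

In a LEGO-style setup one would combine such terms, e.g. distilling the target model's biased predictions into the bias generator while contrasting caption and question representations; the exact weighting and adversarial objective are specified in the paper, not here.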
