ACM Transactions on Multimedia Computing, Communications and Applications, volume 21, issue 4, pages 1-21

Language-guided Bias Generation Contrastive Strategy for Visual Question Answering

Publication type: Journal Article
Publication date: 2025-03-10
Scimago quartile: Q1
SJR: 1.399
CiteScore: 8.5
Impact factor: 5.2
ISSN: 1551-6857, 1551-6865
Abstract

Visual question answering (VQA) is a challenging task that requires models to understand both visual and linguistic inputs and produce accurate answers. However, VQA models often exploit dataset biases to make predictions rather than reasoning over the inputs. Prior debiasing approaches introduce an auxiliary model that is deliberately designed to be biased and then use it to guide the training of a robust target model. However, such techniques measure the model's deviation only against the statistical label distribution of the training set or against unimodal branches. In this work, we propose LEGO, a novel method that generates bias from the target model itself in order to combat language guidance bias. Specifically, the LEGO framework employs a generative network that absorbs the biases inherent in the target model by combining adversarial objectives with knowledge distillation. We then apply a debiased contrastive learning strategy to model the language guidance bias of the caption and the question. During this modelling, to obtain robust semantic coreference, multi-modal representations at two semantic granularities are modelled through mutual-information fusion and contrastive difference modelling. We evaluate our method on several biased VQA benchmarks, including VQA-CP2, GQA-OOD, and RSICD, and show that it outperforms comparable methods.
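The abstract names two generic building blocks, knowledge distillation and contrastive learning, without giving formulas. As an illustration only, here is a minimal NumPy sketch of standard forms of both: a temperature-scaled KL distillation term and an InfoNCE-style contrastive loss. The function names, temperatures, and batch shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kd_kl(student_logits, teacher_logits, tau=2.0):
    # Generic knowledge-distillation term: KL(teacher || student)
    # with both distributions softened by temperature tau.
    p = softmax(teacher_logits / tau)
    q = softmax(student_logits / tau)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

def info_nce(anchors, positives, tau=0.07):
    # Standard InfoNCE contrastive loss: row i of `positives` is the
    # positive for row i of `anchors`; the other rows act as negatives.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                              # (B, B) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return float(-np.mean(log_probs[idx, idx]))         # diagonal = positives
```

In a LEGO-style setup one would combine such terms, e.g. distilling the target model's biased predictions into the bias generator while contrasting caption and question representations; the exact weighting and adversarial objective are specified in the paper, not here.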
