Language-guided Bias Generation Contrastive Strategy for Visual Question Answering
Visual question answering (VQA) is a challenging task that requires models to understand both visual and linguistic inputs and produce accurate answers. However, VQA models often exploit dataset biases to make predictions rather than reasoning over the inputs. Prior debiasing approaches introduce a supplementary model that is deliberately designed to be biased and then use it to guide the training of a robust target model. Nevertheless, such techniques measure a model's deviation only against the statistical distribution of labels in the training set or against unimodal branches. In this work, we propose LEGO, a novel method that generates bias from the target model itself in order to combat language-guidance bias. Specifically, the LEGO framework employs a generative network that absorbs the biases of the target model by combining adversarial objectives with knowledge distillation. We then use a debiased contrastive learning strategy to model the language-guidance bias of captions and questions. During this modelling, to obtain robust semantic coreference, multimodal representations at two semantic granularities are modelled through mutual-information fusion and contrastive difference modelling. We evaluate our method on several biased VQA benchmarks, including VQA-CP2, GQA-OOD, and RSICD, and show that it outperforms comparable debiasing methods.
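To make the two components named in the abstract concrete, the following is a minimal PyTorch-style sketch of (i) training a bias generator with knowledge distillation from the target model plus an adversarial term, and (ii) an InfoNCE-style contrastive loss between question- and caption-guided representations. This is an illustrative assumption of how such losses could look, not the authors' implementation; all module names, shapes, loss weights, and the exact form of the adversarial and debiased contrastive terms are hypothetical.

```python
# Hypothetical sketch of the bias-generation and contrastive losses described
# in the abstract. Names, shapes and hyper-parameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasGenerator(nn.Module):
    """Small network intended to reproduce the target model's biased behaviour."""

    def __init__(self, feat_dim: int, num_answers: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, num_answers)
        )

    def forward(self, question_feat: torch.Tensor) -> torch.Tensor:
        return self.net(question_feat)


def bias_generation_loss(bias_logits, target_logits, labels, temperature=2.0, adv_weight=0.5):
    """Knowledge distillation from the (frozen) target model, plus an adversarial
    term that pushes the generator toward shortcut predictions rather than the
    ground-truth answer. This is one plausible reading of the abstract, not the
    paper's exact objective."""
    # Distillation: match the detached target model's answer distribution.
    kd = F.kl_div(
        F.log_softmax(bias_logits / temperature, dim=-1),
        F.softmax(target_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Adversarial term: maximise error on the true labels (class indices),
    # encouraging the generator to capture biases instead of real reasoning.
    adv = -F.cross_entropy(bias_logits, labels)
    return kd + adv_weight * adv


def contrastive_guidance_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrast between question- and caption-guided representations,
    with bias-generated samples supplying extra negatives. A simplification of a
    debiased contrastive objective."""
    anchor = F.normalize(anchor, dim=-1)        # (B, D)
    positive = F.normalize(positive, dim=-1)    # (B, D)
    negatives = F.normalize(negatives, dim=-1)  # (N, D)
    pos = torch.exp((anchor * positive).sum(-1) / tau)     # (B,)
    neg = torch.exp(anchor @ negatives.t() / tau).sum(-1)  # (B,)
    return -torch.log(pos / (pos + neg)).mean()


if __name__ == "__main__":
    # Toy shapes to show the losses run end to end.
    B, D, A, N = 8, 256, 100, 16
    gen = BiasGenerator(D, A)
    q_feat = torch.randn(B, D)
    target_logits = torch.randn(B, A)
    labels = torch.randint(0, A, (B,))
    print(bias_generation_loss(gen(q_feat), target_logits, labels))
    print(contrastive_guidance_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(N, D)))
```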