Exploring Semantic Attributes for Image Caption Synthesis in Low-Resource Assamese Language
Research on image caption generation has predominantly focused on resource-rich languages like English, leaving resource-poor languages (such as Assamese and several others) largely understudied. In this context, this paper leverages both visual and semantic attribute-based features for generating captions in the Assamese language. Semantic attributes refer to significant words that represent higher-level knowledge about the image content. The first contribution of this work is the effective use of features derived from semantic words in the low-resource Assamese language. The second contribution is a Visual-Semantic Self-Attention (VSSA) module that combines features derived from images and semantic attributes. The VSSA module enables the image captioning model to dynamically attend to relevant regions of the image as well as to important semantic attributes, leading to more contextually relevant and linguistically accurate Assamese captions. Moreover, the VSSA module is incorporated into a Transformer model to leverage stacked attention for further performance improvement. The model is trained using both cross-entropy loss optimization and a reinforcement learning approach. The effectiveness of the proposed model is evaluated through both qualitative and quantitative analyses (using BLEU-n and CIDEr metrics). The proposed model shows significant performance improvement in Assamese caption synthesis compared to previous methods, achieving a CIDEr score of 93.7% on the COCO-Assamese Caption (COCO-AC) dataset.
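To make the fusion idea concrete, the following is a minimal sketch of how visual region features and semantic attribute embeddings could be combined through joint self-attention. It is an illustrative assumption, not the paper's actual implementation: the module name, feature dimensions, and the choice of concatenating the two modalities along the token axis before multi-head self-attention are all hypothetical.

```python
# Hedged sketch of a visual-semantic self-attention fusion step, assuming
# visual features are region vectors from a CNN/detector and semantic
# attributes are embedded attribute-word vectors; all names and dimensions
# are illustrative, not the paper's reported architecture.
import torch
import torch.nn as nn


class VisualSemanticSelfAttention(nn.Module):
    """Attends jointly over visual region features and semantic attribute
    embeddings by concatenating them along the token axis and applying
    multi-head self-attention with a residual connection."""

    def __init__(self, vis_dim=2048, sem_dim=300, model_dim=512, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, model_dim)   # project region features
        self.sem_proj = nn.Linear(sem_dim, model_dim)   # project attribute embeddings
        self.attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(model_dim)

    def forward(self, vis_feats, sem_feats):
        # vis_feats: (batch, num_regions, vis_dim)
        # sem_feats: (batch, num_attributes, sem_dim)
        tokens = torch.cat([self.vis_proj(vis_feats),
                            self.sem_proj(sem_feats)], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention over both modalities
        return self.norm(tokens + attended)              # residual + layer norm


# Example usage with random tensors standing in for real features.
if __name__ == "__main__":
    vssa = VisualSemanticSelfAttention()
    vis = torch.randn(2, 36, 2048)   # e.g., 36 detected regions per image
    sem = torch.randn(2, 10, 300)    # e.g., top-10 predicted attribute words
    fused = vssa(vis, sem)
    print(fused.shape)               # torch.Size([2, 46, 512])
```

The fused token sequence could then be fed to a Transformer decoder as the encoder memory, which is one plausible way the stacked attention mentioned above could consume both modalities.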