ACM Transactions on Asian and Low-Resource Language Information Processing, volume 24, issue 4, pages 1-21

Exploring Semantic Attributes for Image Caption Synthesis in Low-Resource Assamese Language

Pankaj Choudhury ¹, Prithwijit Guha ², Sukumar Nandi ³

¹ Centre for Linguistic Science and Technology, Indian Institute of Technology Guwahati, Guwahati, India
² Electronics & Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, India
Publication type: Journal Article
Publication date: 2025-03-23
SCImago quartile: Q2
SJR: 0.535
CiteScore: 3.6
Impact factor: 1.8
ISSN: 2375-4699, 2375-4702
Abstract

Research on image caption generation has predominantly focused on resource-rich languages like English, leaving resource-poor languages (such as Assamese) largely understudied. In this context, this paper leverages both visual and semantic attribute-based features for generating captions in the Assamese language. Semantic attributes are the significant words that encode higher-level knowledge about the image content. The first contribution of this work is the effective use of features derived from semantic words in the low-resource Assamese language. The second contribution is a Visual-Semantic Self-Attention (VSSA) module that combines features derived from images and semantic attributes. The VSSA module enables the image captioning model to dynamically attend to relevant image regions as well as important semantic attributes, leading to more contextually relevant and linguistically accurate Assamese captions. Moreover, the VSSA module is incorporated into a Transformer model to exploit stacked attention for further performance improvement. The model is trained using both cross-entropy loss optimization and a reinforcement learning approach. The effectiveness of the proposed model is evaluated through qualitative and quantitative analyses (using BLEU-n and CIDEr metrics). The proposed model significantly outperforms previous methods in Assamese caption synthesis, achieving a CIDEr score of 93.7% on the COCO-Assamese Caption (COCO-AC) dataset.
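To make the VSSA idea concrete, below is a minimal PyTorch sketch of joint self-attention over visual region features and embedded semantic attribute words. It is not the authors' implementation: the class name, dimensions, the simple concatenate-then-attend fusion, and the residual normalization are all illustrative assumptions based only on the abstract's description.

# Hypothetical sketch of a VSSA-style fusion block; all names and
# hyperparameters are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class VSSASketch(nn.Module):
    def __init__(self, visual_dim=2048, attr_vocab=1000, d_model=512, n_heads=8):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)    # project CNN region features
        self.attr_embed = nn.Embedding(attr_vocab, d_model)  # embed detected attribute words
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions, attr_ids):
        # regions: (B, R, visual_dim) image region features
        # attr_ids: (B, K) indices of predicted semantic attributes
        v = self.visual_proj(regions)
        s = self.attr_embed(attr_ids)
        tokens = torch.cat([v, s], dim=1)                 # joint visual-semantic sequence
        out, _ = self.self_attn(tokens, tokens, tokens)   # attend across both modalities
        return self.norm(tokens + out)                    # fused memory for the decoder

fused = VSSASketch()(torch.randn(2, 36, 2048), torch.randint(0, 1000, (2, 5)))
print(fused.shape)  # torch.Size([2, 41, 512])

In a setup like this, the fused output would serve as the encoder memory of a Transformer caption decoder, and the stacked decoder attention layers described in the abstract would attend over it; the reinforcement learning stage would then fine-tune the decoder with a sequence-level reward such as CIDEr.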
