Open Access
Volume 2, Issue 1, Article 31

ViTGaze: gaze following with interaction features in vision transformers

Publication type: Journal Article
Publication date: 2024-11-21
CiteScore: 4.0
ISSN: 2731-9008, 2097-3330
Abstract

Gaze following aims to interpret human-scene interactions by predicting the focal point of a person's gaze. Prevailing approaches often adopt a two-stage framework in which multi-modal information is extracted in the first stage for gaze target prediction; the efficacy of these methods therefore depends heavily on the precision of the preceding modality extraction. Other approaches use a single modality with complex decoders, which increases the computational load of the network. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it builds the gaze following framework mainly upon a powerful encoder, with the decoder accounting for less than 1% of the parameters. Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this insight, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViTs with self-supervised pre-training have an enhanced ability to extract correlation information. Extensive experiments demonstrate the performance of the proposed method: it achieves state-of-the-art results among all single-modality methods (a 3.4% improvement in the area under the curve (AUC) score and a 5.1% improvement in average precision (AP)) and performance comparable to multi-modality methods with 59% fewer parameters.
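The abstract's core idea, reading human-scene interaction out of ViT self-attention maps and decoding it with a very small head, can be sketched in a few lines of PyTorch. The snippet below is an illustrative sketch, not the authors' implementation: the module name AttentionInteractionHead, the tensor shapes, and the soft head-position prior are all assumptions standing in for the paper's 4D interaction encoder and 2D spatial guidance module.

# Minimal sketch (assumed, not the authors' code): treat ViT self-attention
# maps as human-scene interaction features and decode them with a tiny head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionInteractionHead(nn.Module):
    """Predict a gaze heatmap from ViT self-attention maps.

    Assumes access to per-layer, per-head attention tensors of shape
    (B, heads, N, N) for N = H*W patch tokens, plus a soft 2D prior over
    patch tokens marking the observed person's head position.
    """
    def __init__(self, num_layers: int, num_heads: int):
        super().__init__()
        # Tiny decoder: the abstract stresses that the decoder holds <1%
        # of the parameters, so we keep this to a few convolutions.
        self.fuse = nn.Conv2d(num_layers * num_heads, 64, kernel_size=3, padding=1)
        self.out = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, attn_maps, head_prior):
        # attn_maps: (B, L, heads, N, N); head_prior: (B, N) weights over
        # patch tokens, peaked at the person's head location (a stand-in
        # for the paper's 2D spatial guidance).
        B, L, Hd, N, _ = attn_maps.shape
        side = int(N ** 0.5)
        # Weight each token's attention row by the head prior: this reads
        # out "where the person-related tokens attend to" in the scene.
        sel = torch.einsum('blhnm,bn->blhm', attn_maps, head_prior)
        feat = sel.reshape(B, L * Hd, side, side)
        heat = self.out(F.relu(self.fuse(feat)))  # (B, 1, side, side)
        return torch.sigmoid(heat)

# Toy usage with random tensors (14 x 14 = 196 patch tokens).
B, L, Hd, N = 2, 12, 6, 196
attn = torch.rand(B, L, Hd, N, N).softmax(dim=-1)
prior = torch.rand(B, N).softmax(dim=-1)
heatmap = AttentionInteractionHead(L, Hd)(attn, prior)
print(heatmap.shape)  # torch.Size([2, 1, 14, 14])

The einsum selects, for each layer and head, the attention rows of tokens near the person's head, so the small convolutional decoder only has to fuse already-computed interaction maps rather than learn them, which is consistent with the abstract's claim that the decoder accounts for under 1% of the parameters.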


Top-30

Journals

  • Lecture Notes in Computer Science: 3 publications, 42.86%
  • Visual Intelligence: 1 publication, 14.29%

Publishers

  • Springer Nature: 4 publications, 57.14%
  • Institute of Electrical and Electronics Engineers (IEEE): 3 publications, 42.86%
  • We do not take into account publications without a DOI.
  • Statistics recalculated weekly.

Metrics

Citations: 7
Cite this

GOST
Song Y. et al. ViTGaze: gaze following with interaction features in vision transformers // Visual Intelligence. 2024. Vol. 2. No. 1. 31
GOST (all authors)
Song Y., Wang X., Yao J., Liu W., Zhang J., Xu X. ViTGaze: gaze following with interaction features in vision transformers // Visual Intelligence. 2024. Vol. 2. No. 1. 31
RIS
TY - JOUR
DO - 10.1007/s44267-024-00064-9
UR - https://link.springer.com/10.1007/s44267-024-00064-9
TI - ViTGaze: gaze following with interaction features in vision transformers
T2 - Visual Intelligence
AU - Song, Yuehao
AU - Wang, Xinggang
AU - Yao, Jingfeng
AU - Liu, Wen-Yu
AU - Zhang, Jinglin
AU - Xu, Xiangmin
PY - 2024
DA - 2024/11/21
PB - Springer Nature
IS - 1
VL - 2
SN - 2731-9008
SN - 2097-3330
ER -
BibTeX
@article{2024_Song,
  author = {Yuehao Song and Xinggang Wang and Jingfeng Yao and Wen-Yu Liu and Jinglin Zhang and Xiangmin Xu},
  title = {ViTGaze: gaze following with interaction features in vision transformers},
  journal = {Visual Intelligence},
  year = {2024},
  volume = {2},
  publisher = {Springer Nature},
  month = {nov},
  url = {https://link.springer.com/10.1007/s44267-024-00064-9},
  number = {1},
  pages = {31},
  doi = {10.1007/s44267-024-00064-9}
}