TY - JOUR
T1 - ESC-ZSAR
T2 - Expanded Semantics from Categories with Cross-Attention for Zero-Shot Action Recognition
AU - Lee, Jeong Cheol
AU - Lee, Dong Gyu
N1 - Publisher Copyright: © 2024 Elsevier Ltd
PY - 2024/12/1
Y1 - 2024/12/1
N2 - Zero-shot action recognition endeavors to identify novel action categories not encountered during training by aligning a joint semantic space. However, despite advancements, zero-shot action recognition still grapples with the inadequate semantic representation of seen data, hindering the transfer of diverse action videos. This study introduces a novel framework combining video, optical flow, and expanded label descriptions via a cross-attention mechanism. This integration facilitates the capture of low- and high-level motion dynamics, effectively bridging the domain gap between the video and text modalities. The proposed approach of generating expanded label descriptions efficiently enhances semantic information, thus improving zero-shot transferability and providing a comprehensive grasp of semantics and motion. The temporal shuffle and alignment module is designed to enhance the generalization ability of image sequences by capturing discriminative high-level motions through frame sorting. The efficacy of the proposed method is validated through extensive experiments on three benchmark datasets, namely Kinetics-600, UCF-101, and HMDB-51. Notably, our model achieves state-of-the-art results in the zero-shot action recognition task.
AB - Zero-shot action recognition endeavors to identify novel action categories not encountered during training by aligning a joint semantic space. However, despite advancements, zero-shot action recognition still grapples with the inadequate semantic representation of seen data, hindering the transfer of diverse action videos. This study introduces a novel framework combining video, optical flow, and expanded label descriptions via a cross-attention mechanism. This integration facilitates the capture of low- and high-level motion dynamics, effectively bridging the domain gap between the video and text modalities. The proposed approach of generating expanded label descriptions efficiently enhances semantic information, thus improving zero-shot transferability and providing a comprehensive grasp of semantics and motion. The temporal shuffle and alignment module is designed to enhance the generalization ability of image sequences by capturing discriminative high-level motions through frame sorting. The efficacy of the proposed method is validated through extensive experiments on three benchmark datasets, namely Kinetics-600, UCF-101, and HMDB-51. Notably, our model achieves state-of-the-art results in the zero-shot action recognition task.
KW - Cross-attention
KW - Semantics expansion
KW - Zero-shot action recognition
UR - http://www.scopus.com/inward/record.url?scp=85199381707&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2024.124786
DO - 10.1016/j.eswa.2024.124786
M3 - Article
AN - SCOPUS:85199381707
SN - 0957-4174
VL - 255
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 124786
ER -