ESC-ZSAR: Expanded Semantics from Categories with Cross-Attention for Zero-Shot Action Recognition

Jeong Cheol Lee, Dong Gyu Lee

Research output: Contribution to journal › Article › peer-review

Abstract

Zero-shot action recognition aims to identify novel action categories not encountered during training by aligning video and text in a joint semantic space. Despite recent advances, it is still limited by the inadequate semantic representation of seen data, which hinders transfer to diverse action videos. This study introduces a novel framework that combines video, optical flow, and expanded label descriptions via a cross-attention mechanism. This integration captures both low- and high-level motion dynamics and bridges the domain gap between the video and text modalities. Generating expanded label descriptions enriches the semantic information, improving zero-shot transferability and providing a more comprehensive grasp of semantics and motion. A temporal shuffle and alignment module is designed to enhance generalization over image sequences by capturing discriminative high-level motions through frame sorting. The efficacy of the proposed method is validated through extensive experiments on three benchmark datasets: Kinetics-600, UCF-101, and HMDB-51. Notably, our model achieves state-of-the-art results in the zero-shot action recognition task.
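
To make the two core ideas in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of scaled dot-product cross-attention, in which per-frame video features attend to the tokens of a label description, followed by zero-shot prediction as a nearest-class lookup by cosine similarity in a shared embedding space. All names, dimensions, and the random features are illustrative assumptions; the paper's actual encoders, optical-flow branch, and temporal shuffle and alignment module are not reproduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: (n_q, d) features from one modality (here: video frames).
    keys, values: (n_k, d) features from another modality (here: text tokens).
    Returns the attended features (n_q, d) and the attention weights (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


def zero_shot_predict(video_embedding, class_embeddings):
    """Pick the class whose embedding has the highest cosine similarity."""
    v = video_embedding / np.linalg.norm(video_embedding)
    c = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = c @ v
    return int(np.argmax(sims)), sims


# Hypothetical toy inputs: 8 frames and 5 description tokens, 16 dimensions each.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(8, 16))
desc_tokens = rng.normal(size=(5, 16))

# Frames (queries) attend to description tokens (keys and values).
fused, weights = cross_attention(frame_feats, desc_tokens, desc_tokens)

# Average over frames to get one video-level embedding (an assumed pooling choice).
video_embedding = fused.mean(axis=0)

# Three hypothetical unseen classes, each represented by a text embedding.
class_embeddings = rng.normal(size=(3, 16))
pred, sims = zero_shot_predict(video_embedding, class_embeddings)
```

Because classification reduces to comparing embeddings, a new action category can be added at test time by supplying only its text embedding, which is what makes the setting zero-shot.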

Original language: English
Article number: 124786
Journal: Expert Systems with Applications
Volume: 255
State: Published - 1 Dec 2024

Keywords

  • Cross-attention
  • Semantics expansion
  • Zero-shot action recognition
