Hierarchical Action Understanding: Fine-to-Coarse Reasoning Framework for Video Interpretation

  • Jun Beom Moon
  • Jiye Won
  • Ye Eun Joo
  • Sehwan Heo
  • Soon Ki Jung
  • Kyungpook National University

Research output: Contribution to journal › Conference article › peer-review

Abstract

Existing video action recognition methods often fail to bridge the semantic gap between fine-grained, frame-level labels and coarse, high-level actions. To address this, we propose a hierarchical reasoning framework that predicts coarse action labels from sequences of fine-grained labels. Using the Breakfast dataset, we apply rule-based preprocessing to align and pair fine and coarse labels, resolving temporal misalignment and background segments. Our model-agnostic framework supports integration with outputs from Temporal Action Segmentation (TAS) models. We evaluate four sequence models (LSTM, TCN, Transformer, and Mamba) under causal and non-causal settings using multiple loss functions, including cross-entropy and cosine similarity. Results show that causal models, particularly LSTM and Mamba, outperform the others in accuracy, edit distance, and F1-score, especially with hybrid losses. Our method is robust to noisy fine labels and preserves interpretability through explicit fine-to-coarse mapping. This work offers a scalable and modular solution for multi-level action understanding across diverse video domains.
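
The abstract mentions hybrid losses built from cross-entropy and cosine similarity but does not give their exact form. The sketch below shows one plausible combination for a single coarse-label prediction, and it is not the authors' implementation. The function names, the weighting parameter `alpha`, and the choice to compare the softmax output against a one-hot target are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-likelihood of the true coarse class."""
    return -math.log(probs[target])


def cosine_loss(probs, target):
    """1 - cosine similarity between probs and a one-hot target vector.

    For a one-hot target the dot product is probs[target] and the
    target norm is 1, so only the norm of probs is needed.
    """
    norm = math.sqrt(sum(p * p for p in probs))
    return 1.0 - probs[target] / norm


def hybrid_loss(logits, target, alpha=0.5):
    """Weighted sum of the two terms; alpha is a hypothetical mixing weight."""
    probs = softmax(logits)
    ce = cross_entropy(probs, target)
    cos = cosine_loss(probs, target)
    return alpha * ce + (1.0 - alpha) * cos


if __name__ == "__main__":
    # Hypothetical logits over three coarse classes; class 0 is correct.
    print(hybrid_loss([5.0, 0.0, 0.0], target=0))
    print(hybrid_loss([5.0, 0.0, 0.0], target=1))
```

With `alpha=1.0` the function reduces to plain cross-entropy, and with `alpha=0.0` to the cosine term alone, so the same code covers the single-loss and hybrid settings under these assumptions.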
