Abstract
Existing video action recognition methods often fail to bridge the semantic gap between fine-grained, frame-level labels and coarse, high-level actions. To address this, we propose a hierarchical reasoning framework that predicts coarse action labels from sequences of fine-grained labels. Using the Breakfast dataset, we apply rule-based preprocessing to align and pair fine and coarse labels, resolving temporal misalignment and background segments. Our model-agnostic framework supports integration with outputs from Temporal Action Segmentation (TAS) models. We evaluate sequence models (LSTM, TCN, Transformer, and Mamba) under causal and non-causal settings using multiple loss functions, including cross-entropy and cosine similarity. Results show that causal models, particularly LSTM and Mamba, outperform others in accuracy, edit distance, and F1-score, especially with hybrid losses. Our method is robust to noisy fine labels and preserves interpretability through explicit fine-to-coarse mapping. This work offers a scalable and modular solution for multi-level action understanding across diverse video domains.
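The abstract names edit distance as one of the evaluation metrics but does not define it. As a rough illustration, the sketch below computes the segmental edit score commonly used in temporal action segmentation: frame-wise labels are collapsed into segments, and the Levenshtein distance between the two segment sequences is normalized by the longer sequence. The function names (`collapse_segments`, `levenshtein`, `edit_score`) and the 0 to 100 scaling are assumptions made for this example, not details taken from the paper.

```python
from itertools import groupby


def collapse_segments(frame_labels):
    """Collapse runs of identical frame labels into one label per segment."""
    return [label for label, _ in groupby(frame_labels)]


def levenshtein(a, b):
    """Edit distance between two label sequences (each edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if labels match
            ))
        prev = curr
    return prev[-1]


def edit_score(predicted, reference):
    """Segmental edit score in [0, 100]; higher is better (assumed scaling)."""
    p = collapse_segments(predicted)
    r = collapse_segments(reference)
    longest = max(len(p), len(r))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(p, r) / longest)
```

Because segments are compared rather than frames, a prediction whose boundaries are slightly shifted still scores 100, while a spurious extra segment is penalized regardless of how few frames it covers.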
| Original language | English |
|---|---|
| Journal | Proceedings - IEEE International Conference on Advanced Video and Signal-Based Surveillance, AVSS |
| Issue number | 2025 |
| DOIs | |
| State | Published - 2025 |
| Event | 2025 IEEE International Conference on Advanced Visual and Signal-Based Systems, AVSS 2025 - Tainan, Taiwan, Province of China; Duration: 11 Aug 2025 → 13 Aug 2025 |
Fingerprint
Dive into the research topics of 'Hierarchical Action Understanding: Fine-to-Coarse Reasoning Framework for Video Interpretation'. Together they form a unique fingerprint.