TY - JOUR
T1 - Masked Kinematic Continuity-aware Hierarchical Attention Network for pose estimation in videos
AU - Jin, Kyung Min
AU - Lee, Gun Hee
AU - Nam, Woo Jeoung
AU - Kang, Tae Kyung
AU - Kim, Hyun Woo
AU - Lee, Seong Whan
N1 - Publisher Copyright:
© 2023 Elsevier Ltd
PY - 2024/1
Y1 - 2024/1
N2 - Existing methods for estimating human poses from video content exploit the temporal features of the video sequences and have shown impressive results. However, most methods address spatiotemporal issues separately: they compromise on accuracy to reduce jitter, or require high-resolution images to deal with occlusion, preventing full consideration of temporal features. Unfortunately, these two issues are interrelated. For example, occlusion causes uncertainty between successive frames, leading to unsmoothed results. To address these issues, we propose the Masked Kinematic Continuity-aware Hierarchical Attention Network (M-HANet), a novel framework that exploits masked kinematic keypoint features by extending our HANet framework. First, we randomly select and mask a keypoint, treating the masked keypoint as if it were occluded, which makes the network resilient to occlusion. We also use the velocity and acceleration of each individual keypoint to effectively capture temporal features. Second, the proposed hierarchical transformer encoder refines a 2D or 3D input pose derived from existing estimators by aggregating the masked continuity of the spatiotemporal dependencies of human motion. Finally, to facilitate collaborative optimization, we perform online cross-supervision between the final pose from our decoder and the refined input pose produced by our encoder. We validate the effectiveness of our model, demonstrating that our proposed approach improves [email protected] by 14.1% and MPJPE by 8.7 mm compared to existing methods on a variety of tasks, including 2D and 3D pose estimation, body mesh recovery, and sparsely annotated multi-human pose estimation.
AB - Existing methods for estimating human poses from video content exploit the temporal features of the video sequences and have shown impressive results. However, most methods address spatiotemporal issues separately: they compromise on accuracy to reduce jitter, or require high-resolution images to deal with occlusion, preventing full consideration of temporal features. Unfortunately, these two issues are interrelated. For example, occlusion causes uncertainty between successive frames, leading to unsmoothed results. To address these issues, we propose the Masked Kinematic Continuity-aware Hierarchical Attention Network (M-HANet), a novel framework that exploits masked kinematic keypoint features by extending our HANet framework. First, we randomly select and mask a keypoint, treating the masked keypoint as if it were occluded, which makes the network resilient to occlusion. We also use the velocity and acceleration of each individual keypoint to effectively capture temporal features. Second, the proposed hierarchical transformer encoder refines a 2D or 3D input pose derived from existing estimators by aggregating the masked continuity of the spatiotemporal dependencies of human motion. Finally, to facilitate collaborative optimization, we perform online cross-supervision between the final pose from our decoder and the refined input pose produced by our encoder. We validate the effectiveness of our model, demonstrating that our proposed approach improves [email protected] by 14.1% and MPJPE by 8.7 mm compared to existing methods on a variety of tasks, including 2D and 3D pose estimation, body mesh recovery, and sparsely annotated multi-human pose estimation.
KW - Body mesh recovery
KW - Pose estimation
KW - Transformer
KW - Video understanding
UR - http://www.scopus.com/inward/record.url?scp=85175335892&partnerID=8YFLogxK
U2 - 10.1016/j.neunet.2023.10.038
DO - 10.1016/j.neunet.2023.10.038
M3 - Article
C2 - 37918271
AN - SCOPUS:85175335892
SN - 0893-6080
VL - 169
SP - 282
EP - 292
JO - Neural Networks
JF - Neural Networks
ER -