TY - JOUR
T1 - 4G-VOS
T2 - Video Object Segmentation using guided context embedding
AU - Fiaz, Mustansar
AU - Zaheer, Muhammad Zaigham
AU - Mahmood, Arif
AU - Lee, Seung Ik
AU - Jung, Soon Ki
N1 - Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021/11/14
Y1 - 2021/11/14
N2 - Video Object Segmentation (VOS) is a fundamental task required in many high-level real-world computer vision applications. VOS becomes challenging due to the presence of background distractors as well as object appearance variations. Many existing VOS approaches use online model updates to capture appearance variations, which incurs a high computational cost. Template matching and propagation-based VOS methods, although cost-effective, suffer from performance degradation under challenging scenarios such as occlusion and background clutter. To tackle these challenges, we propose a network architecture, dubbed 4G-VOS, that encodes video context for improved VOS performance. To preserve long-term semantic information, we propose a guided transfer embedding module. We employ a global instance matching module to generate similarity maps from the initial image and the mask. In addition, we use a generative directional appearance module to estimate and dynamically update the foreground/background class probabilities in a spherical embedding space. Moreover, existing approaches may lose contextual information during feature refinement. Therefore, we propose a guided pooled decoder to exploit global and local contextual information during feature refinement. The proposed framework is an end-to-end learning architecture that is trained in an offline fashion. Evaluations on three VOS benchmark datasets, DAVIS2016, DAVIS2017, and YouTube-VOS, demonstrate the outstanding performance of the proposed algorithm compared to 40 existing state-of-the-art methods.
AB - Video Object Segmentation (VOS) is a fundamental task required in many high-level real-world computer vision applications. VOS becomes challenging due to the presence of background distractors as well as object appearance variations. Many existing VOS approaches use online model updates to capture appearance variations, which incurs a high computational cost. Template matching and propagation-based VOS methods, although cost-effective, suffer from performance degradation under challenging scenarios such as occlusion and background clutter. To tackle these challenges, we propose a network architecture, dubbed 4G-VOS, that encodes video context for improved VOS performance. To preserve long-term semantic information, we propose a guided transfer embedding module. We employ a global instance matching module to generate similarity maps from the initial image and the mask. In addition, we use a generative directional appearance module to estimate and dynamically update the foreground/background class probabilities in a spherical embedding space. Moreover, existing approaches may lose contextual information during feature refinement. Therefore, we propose a guided pooled decoder to exploit global and local contextual information during feature refinement. The proposed framework is an end-to-end learning architecture that is trained in an offline fashion. Evaluations on three VOS benchmark datasets, DAVIS2016, DAVIS2017, and YouTube-VOS, demonstrate the outstanding performance of the proposed algorithm compared to 40 existing state-of-the-art methods.
KW - Channel convolutional neural networks
KW - Encoder–decoder
KW - Feature refinement
KW - Feature transfer and matching
KW - Spherical embedding
KW - Video Object Segmentation
UR - http://www.scopus.com/inward/record.url?scp=85114296612&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2021.107401
DO - 10.1016/j.knosys.2021.107401
M3 - Article
AN - SCOPUS:85114296612
SN - 0950-7051
VL - 231
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 107401
ER -