TY - GEN
T1 - STXD
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
AU - Jang, Sujin
AU - Jo, Dae Ung
AU - Hwang, Sung Ju
AU - Lee, Dongwook
AU - Ji, Daehyun
N1 - Publisher Copyright:
© 2023 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2023
Y1 - 2023
N2 - 3D object detection (3DOD) from multi-view images is an economically appealing alternative to expensive LiDAR-based detectors, but also an extremely challenging task due to the absence of precise spatial cues. Recent studies have leveraged the teacher-student paradigm for cross-modal distillation, where a strong LiDAR-modality teacher transfers useful knowledge to a multi-view-based image-modality student. However, prior approaches have only focused on minimizing global distances between cross-modal features, which may lead to suboptimal knowledge distillation results. Based on these insights, we propose a novel structural and temporal cross-modal knowledge distillation (STXD) framework for multi-view 3DOD. First, STXD reduces redundancy of the feature components of the student by regularizing the cross-correlation of cross-modal features, while maximizing their similarities. Second, to effectively transfer temporal knowledge, STXD encodes temporal relations of features across a sequence of frames via similarity maps. Lastly, STXD also adopts a response distillation method to further enhance the quality of knowledge distillation at the output level. Our extensive experiments demonstrate that STXD significantly improves the NDS and mAP of the base student detectors by 2.8% ∼ 4.5% on the nuScenes testing dataset.
AB - 3D object detection (3DOD) from multi-view images is an economically appealing alternative to expensive LiDAR-based detectors, but also an extremely challenging task due to the absence of precise spatial cues. Recent studies have leveraged the teacher-student paradigm for cross-modal distillation, where a strong LiDAR-modality teacher transfers useful knowledge to a multi-view-based image-modality student. However, prior approaches have only focused on minimizing global distances between cross-modal features, which may lead to suboptimal knowledge distillation results. Based on these insights, we propose a novel structural and temporal cross-modal knowledge distillation (STXD) framework for multi-view 3DOD. First, STXD reduces redundancy of the feature components of the student by regularizing the cross-correlation of cross-modal features, while maximizing their similarities. Second, to effectively transfer temporal knowledge, STXD encodes temporal relations of features across a sequence of frames via similarity maps. Lastly, STXD also adopts a response distillation method to further enhance the quality of knowledge distillation at the output level. Our extensive experiments demonstrate that STXD significantly improves the NDS and mAP of the base student detectors by 2.8% ∼ 4.5% on the nuScenes testing dataset.
UR - https://www.scopus.com/pages/publications/85191191057
M3 - Conference contribution
AN - SCOPUS:85191191057
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
A2 - Oh, A.
A2 - Naumann, T.
A2 - Globerson, A.
A2 - Saenko, K.
A2 - Hardt, M.
A2 - Levine, S.
PB - Neural Information Processing Systems Foundation
Y2 - 10 December 2023 through 16 December 2023
ER -