TY - GEN
T1 - Vision Transformer Compression and Architecture Exploration with Efficient Embedding Space Search
AU - Kim, Daeho
AU - Kim, Jaeil
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - This paper addresses theoretical and practical problems in the compression of vision transformers for resource-constrained environments. We found that deep feature collapse and gradient collapse can occur during the search process for vision transformer compression. Deep feature collapse rapidly diminishes feature diversity as layer depth increases, and gradient collapse causes gradient explosion during training. To address these issues, we propose a novel framework, called VTCA, for jointly accomplishing vision transformer compression and architecture exploration with embedding space search using Bayesian optimization. In this framework, we formulate block-wise removal, shrinkage, and cross-block skip augmentation to prevent deep feature collapse, and Res-Post layer normalization to prevent gradient collapse under a knowledge distillation loss. In the search phase, we adopt training speed estimation for a large-scale dataset and propose a novel elastic reward function that can represent a generalized manifold of rewards. Experiments were conducted with DeiT-Tiny/Small/Base backbones on ImageNet, and our approach achieved accuracy competitive with recent patch reduction and pruning methods. The code is available at https://github.com/kdaeho27/VTCA.
AB - This paper addresses theoretical and practical problems in the compression of vision transformers for resource-constrained environments. We found that deep feature collapse and gradient collapse can occur during the search process for vision transformer compression. Deep feature collapse rapidly diminishes feature diversity as layer depth increases, and gradient collapse causes gradient explosion during training. To address these issues, we propose a novel framework, called VTCA, for jointly accomplishing vision transformer compression and architecture exploration with embedding space search using Bayesian optimization. In this framework, we formulate block-wise removal, shrinkage, and cross-block skip augmentation to prevent deep feature collapse, and Res-Post layer normalization to prevent gradient collapse under a knowledge distillation loss. In the search phase, we adopt training speed estimation for a large-scale dataset and propose a novel elastic reward function that can represent a generalized manifold of rewards. Experiments were conducted with DeiT-Tiny/Small/Base backbones on ImageNet, and our approach achieved accuracy competitive with recent patch reduction and pruning methods. The code is available at https://github.com/kdaeho27/VTCA.
UR - http://www.scopus.com/inward/record.url?scp=85151066533&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-26313-2_32
DO - 10.1007/978-3-031-26313-2_32
M3 - Conference contribution
AN - SCOPUS:85151066533
SN - 9783031263125
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 524
EP - 540
BT - Computer Vision – ACCV 2022 - 16th Asian Conference on Computer Vision, Proceedings
A2 - Wang, Lei
A2 - Gall, Juergen
A2 - Chin, Tat-Jun
A2 - Sato, Imari
A2 - Chellappa, Rama
PB - Springer Science and Business Media Deutschland GmbH
T2 - 16th Asian Conference on Computer Vision, ACCV 2022
Y2 - 4 December 2022 through 8 December 2022
ER -