Vision Transformer Compression and Architecture Exploration with Efficient Embedding Space Search

Daeho Kim, Jaeil Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper addresses theoretical and practical problems in compressing vision transformers for resource-constrained environments. We found that deep feature collapse and gradient collapse can occur during the search process for vision transformer compression. Deep feature collapse rapidly diminishes feature diversity as layer depth increases, and gradient collapse causes gradient explosion during training. To address these issues, we propose a novel framework, called VTCA, that performs vision transformer compression and architecture exploration jointly, with an embedding space search based on Bayesian optimization. In this framework, we formulate block-wise removal, shrinkage, and cross-block skip augmentation to prevent deep feature collapse, and Res-Post layer normalization to prevent gradient collapse, under a knowledge distillation loss. In the search phase, we adopt training speed estimation for a large-scale dataset and propose a novel elastic reward function that can represent a generalized manifold of rewards. Experiments were conducted with DeiT-Tiny/Small/Base backbones on ImageNet, and our approach achieved accuracy competitive with recent patch reduction and pruning methods. The code is available at https://github.com/kdaeho27/VTCA.
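
As a rough illustration of the normalization placement mentioned in the abstract, the sketch below contrasts a pre-norm residual block, x + F(LN(x)), with a residual-post-norm block, x + LN(F(x)). It assumes that "Res-Post layer normalization" refers to this second arrangement, in which the output of the residual branch is normalized before it is added back to the main path. The function names and the toy sublayer are hypothetical and are not taken from the VTCA code; learned scale and shift parameters are omitted.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-norm residual block: x + F(LN(x))."""
    branch = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, branch)]


def res_post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Residual-post-norm block (assumed form): x + LN(F(x))."""
    branch = layer_norm(sublayer(x))
    return [a + b for a, b in zip(x, branch)]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or an MLP that amplifies activations."""
    return [100.0 * v for v in x]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print("pre-norm:     ", pre_norm_block(x, toy_sublayer))
    print("res-post-norm:", res_post_norm_block(x, toy_sublayer))
```

In this toy setting the residual-post-norm branch stays at unit scale no matter how strongly the sublayer amplifies its input, whereas the pre-norm branch passes the amplification on to the main path. This only illustrates why such a placement can keep activation magnitudes bounded; it does not reproduce the paper's analysis of gradient collapse.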

Original language: English
Title of host publication: Computer Vision – ACCV 2022 - 16th Asian Conference on Computer Vision, Proceedings
Editors: Lei Wang, Juergen Gall, Tat-Jun Chin, Imari Sato, Rama Chellappa
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 524-540
Number of pages: 17
ISBN (Print): 9783031263125
State: Published - 2023
Event: 16th Asian Conference on Computer Vision, ACCV 2022 - Macao, China
Duration: 4 Dec 2022 – 8 Dec 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13843 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th Asian Conference on Computer Vision, ACCV 2022
Country/Territory: China
City: Macao
Period: 4/12/22 – 8/12/22
