Abstract
We find empirically that the Vision Transformer (ViT) fails to extract object-centric features when applied to out-of-distribution (OOD) detection. To obtain object-centric attention, we design an additional module that applies cross-attention between class-wise token proxies and the feature token sequence of an input image. For inference suited to this cross-attention structure with multiple class-wise token proxies, we propose a score ensemble that can be applied to any scoring function. The proposed inference scheme works in synergy with our cross-attention structure and outperforms ViT. Through experiments, we demonstrate that the proposed cross-attention structure with score-ensemble inference substantially improves near-OOD detection performance: FPR95 in near-OOD detection improves over the state-of-the-art method by 2.55% on CIFAR-10 and 2.67% on CIFAR-100, while classification accuracy remains competitive.
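
The two ideas in the abstract, class-to-image cross-attention and a score ensemble over class-wise proxies, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the projections, the classifier head, or the ensemble rule. The sketch assumes no learned query/key/value projections, a random linear head, maximum softmax probability (MSP) as the example scoring function, and a plain mean as the ensemble. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def class_to_image_cross_attention(class_proxies, image_tokens):
    """Class proxies act as queries; image feature tokens act as keys and values.

    class_proxies: (C, D), image_tokens: (N, D).
    Assumption: learned Q/K/V projections are omitted for brevity.
    """
    d = class_proxies.shape[-1]
    attn = softmax(class_proxies @ image_tokens.T / np.sqrt(d), axis=-1)  # (C, N)
    return attn @ image_tokens, attn  # attended features: (C, D)

def msp_score(logits):
    # Example scoring function: maximum softmax probability per row.
    return softmax(logits, axis=-1).max(axis=-1)

def score_ensemble(per_proxy_logits, score_fn=msp_score):
    """Apply score_fn to each proxy's logits and average the results.

    Assumption: a simple mean; the paper's actual ensemble rule may differ.
    """
    return float(np.mean(score_fn(per_proxy_logits)))

rng = np.random.default_rng(0)
C, N, D = 10, 16, 32                 # classes, image tokens, feature dimension
proxies = rng.normal(size=(C, D))    # stand-in for class-wise token proxies
tokens = rng.normal(size=(N, D))     # stand-in for ViT feature tokens
head = rng.normal(size=(D, C))       # hypothetical linear classifier head

attended, attn = class_to_image_cross_attention(proxies, tokens)
logits = attended @ head             # one logit vector per class proxy: (C, C)
score = score_ensemble(logits)       # higher means more in-distribution under MSP
```

Because `score_fn` is a parameter, another scoring function that maps a `(C, K)` logit array to `C` scores can be substituted without changing the ensemble step.
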
| Original language | English |
|---|---|
| Pages (from-to) | 62793-62803 |
| Number of pages | 11 |
| Journal | IEEE Access |
| Volume | 12 |
| DOIs | |
| State | Published - 2024 |
Keywords
- Near out-of-distribution (OOD) detection
- class-wise cross attention
- vision transformer
Fingerprint
Dive into the research topics of 'C2I-CAT: Class-to-Image Cross Attention Transformer for Out-of-Distribution Detection'. Together they form a unique fingerprint.