TY - JOUR
T1 - Convolutional Neural Networks or Vision Transformers
T2 - Who Will Win the Race for Action Recognitions in Visual Data?
AU - Moutik, Oumaima
AU - Sekkat, Hiba
AU - Tigani, Smail
AU - Chehri, Abdellah
AU - Saadane, Rachid
AU - Tchakoucht, Taha Ait
AU - Paul, Anand
N1 - Publisher Copyright:
© 2023 by the authors.
PY - 2023/1
Y1 - 2023/1
AB - Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the last decades. Convolutional neural networks (CNNs) are a central component of this topic and have played a crucial role in the rise of deep learning. Inspired by the human visual system, CNNs have been applied to visual data and have addressed a wide range of computer vision and video/image analysis tasks, including action recognition (AR). More recently, following the success of the transformer in natural language processing (NLP), transformer architectures have begun to set new trends in vision tasks, prompting a debate over whether Vision Transformer models (ViT) will replace CNNs for action recognition in video clips. This paper examines this trending topic in detail: it studies CNNs and Transformers for action recognition separately and then presents a comparative study of the accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, the question of whether CNNs or Vision Transformers will win the race is discussed.
KW - action recognition
KW - conversational systems
KW - convolutional neural networks
KW - natural language understanding
KW - recurrent neural networks
KW - vision transformers
UR - http://www.scopus.com/inward/record.url?scp=85146493380&partnerID=8YFLogxK
U2 - 10.3390/s23020734
DO - 10.3390/s23020734
M3 - Review article
C2 - 36679530
AN - SCOPUS:85146493380
SN - 1424-8220
VL - 23
JO - Sensors
JF - Sensors
IS - 2
M1 - 734
ER -