Visual–auditory learning network for construction equipment action detection

Seunghoon Jung, Jaewon Jeoung, Dong Eun Lee, Hyounseung Jang, Taehoon Hong

Research output: Contribution to journal › Article › peer-review

14 Scopus citations

Abstract

In construction site monitoring, action detection of construction equipment is critical for tracking project performance, facilitating construction automation, and fostering construction efficiency. In particular, auditory signals can supplement computer vision-based action detection with additional information about various types of construction equipment. Therefore, this study develops a visual–auditory learning network model that detects construction equipment actions from two modalities (i.e., vision and audition). To this end, a multi-modal feature extractor extracts both visual and auditory features. In addition, a multi-head attention and detection module performs the localization and classification tasks in separate heads, applying a different attention mechanism to each task: a content-based attention mechanism provides spatial attention in the localization head, and a dot-product attention mechanism provides channel attention in the classification head. The evaluation results show that the proposed model reaches a precision of 86.92% and a recall of 84.00% with the multi-head attention and detection module, which improves overall detection performance by exploiting different correlations between visual and auditory features for localization and for classification.
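
The two attention mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring functions, so the example assumes scaled dot-product scoring for the dot-product mechanism and cosine-similarity scoring as one common form of content-based attention. All function names and the toy vectors are hypothetical, and the sketch omits the learned projections, feature maps, and detection heads a real model would use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, values):
    """Combine value vectors using the attention weights."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def dot_product_attention(query, keys, values):
    """Assumed form: score_i = (q . k_i) / sqrt(d), then softmax."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return weights, weighted_sum(weights, values)


def content_based_attention(query, keys, values):
    """Assumed form: score_i = cosine(q, k_i), then softmax."""
    def cosine(a, b):
        norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
        return dot(a, b) / norm if norm else 0.0

    weights = softmax([cosine(query, k) for k in keys])
    return weights, weighted_sum(weights, values)


# Hypothetical toy data: one auditory query attending over three visual features.
audio_query = [1.0, 0.0]
visual_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

spatial_w, spatial_out = content_based_attention(audio_query, visual_feats, visual_feats)
channel_w, channel_out = dot_product_attention(audio_query, visual_feats, visual_feats)
```

In both cases the visual feature most similar to the auditory query receives the largest weight. The two functions differ only in how similarity is scored, which is the distinction the abstract draws between the localization and classification heads.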

Original language: English
Pages (from-to): 1916-1934
Number of pages: 19
Journal: Computer-Aided Civil and Infrastructure Engineering
Volume: 38
Issue number: 14
State: Published - 15 Sep 2023
