TY - JOUR
T1 - EENet: embedding enhancement network for compositional image-text retrieval using generated text
T2 - Multimedia Tools and Applications
AU - Hur, Chan
AU - Park, Hyeyoung
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023.
PY - 2024/5
Y1 - 2024/5
AB - In this paper, we consider the compositional image-text retrieval task, which searches for appropriate target images given a reference image with feedback text as a query. For instance, when a user finds a dress on an E-commerce site that meets all their needs except for the length and decoration, the user can give sentence-form feedback, e.g., "I like this dress, but I wish it was a little shorter and had no ribbon," to the system. This is a practical scenario for advanced retrieval systems and is applicable to interactive search systems or E-commerce systems. To tackle this task, we propose a model, the Embedding Enhancement Network (EENet), which includes a text generation module and an image feature enhancement module that uses the generated text. While conventional works mainly focus on developing an efficient composition module for a given image and text query, EENet actively generates an additional textual description to enhance the image feature vector in the embedding space, inspired by the human ability to recognize an object using both visual input and prior textual information. In addition, a new training loss is introduced to ensure that the images and the additional generated texts are well combined. The experimental results show that EENet achieves considerable improvement in retrieval performance; on the Recall@1 metric, it improves by 3.4% on Fashion200k and 1.4% on MIT-States over the baseline model.
KW - Compositional Image-Text Retrieval
KW - Image Captioning
KW - Joint Embedding
KW - Textual Feature Generation
KW - Visual Feature Enhancement
UR - http://www.scopus.com/inward/record.url?scp=85175623196&partnerID=8YFLogxK
DO - 10.1007/s11042-023-17531-y
M3 - Article
AN - SCOPUS:85175623196
SN - 1380-7501
VL - 83
SP - 49689
EP - 49705
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 16
ER -