Image Retrieval with Text Feedback based on
Transformer Deep Model
Truc Luong-Phuong Huynh1, Ngoc Quoc Ly2
1Faculty of Information Technology, 2Computer Vision & Cognitive Cybernetics Dept.
VNUHCM-University of Science
Ho Chi Minh City, Vietnam
11712842@student.hcmus.edu.vn,2lqngoc@fit.hcmus.edu.vn
Abstract—Image retrieval with text feedback has great potential when applied to product retrieval for e-commerce platforms. Given an input image and text feedback, the system needs to retrieve images that not only look visually similar to the input image but also have some modified details mentioned in the text feedback. This is a tricky task as it requires a good understanding of the image, the text, and also their combination. In this paper, we propose a novel framework called Image-Text Modify Attention (ITMA) and a Transformer-based combining function that performs preservation and transformation of the input image features based on the text feedback and captures important features of database images. By using multiple image features at different Convolutional Neural Network (CNN) depths, the combining function can use multi-level visual information to achieve an impressive representation for effective image retrieval. We conduct quantitative and qualitative experiments on two datasets: CSS and FashionIQ. ITMA outperforms existing approaches on these datasets and can deal with many types of text feedback such as object attributes and natural language. We are also the first to discover an exceptional behavior of the attention mechanism in this task: it ignores input image regions that the text feedback wants to remove or change.
Index Terms—Image Retrieval with Text Feedback, Convolutional Neural Network, Attention Mechanism, Transformer Deep Model
I. INTRODUCTION

Image retrieval is a well-known problem in computer vision that has been researched and applied in industry for a long time. The major problem in studying this task is how to capture the intention of humans when they retrieve something. Most retrieval systems are based on image-to-image matching [7] or text-to-image matching [5]; therefore, it is difficult for people to express their ideas in a single image or a couple of words.
To overcome these limitations, people are trying to create models with suitable inputs that not only allow the models to give accurate predictions but also retain the whole idea of the user. Some of them use the user's desired changes to the input image to describe the target image. These are usually expressed in text form as certain attributes [8, 18] or relative attributes [13]. More recently, researchers have used a more general type of text feedback, natural language [28], but it is still an unusual area of research. Being able to deal with a more general type of language feedback is essential for creating a system that can run in applications for real users.
Fig. 1. Examples of customers using image retrieval with text feedback in a fashion e-commerce application. Customers upload their clothes (red outline) and type text feedback describing the changes they want (red sentence). The system gives suggestions about the customers' target clothes (green outline).
In this work, the inputs are an image and text feedback describing modifications of the input image. Unlike previous approaches, we consider more general language input types, namely object attributes [26] and natural language [28], and propose a simpler approach that gives better performance. The core ideas behind our method are: (1) using text features to modify image features, and (2) the modified image features need to "live in" the same space as the database image features so that they can be matched.
Regarding the first idea, we use the attention mechanism to take charge of both preservation and transformation of image features according to the text features at the same time. As a result, we not only obtain a more compact model than previous methods but also are the first, through our visualizations, to discover a very special behavior of the attention mechanism: it avoids retaining features that the text semantics wants to transform. As for letting them interact in the same space, we perform attention learning on features from various CNN depths for both input and database images. Using the attention mechanism for database images helps select the best features for image matching so that they achieve their best representations.
ablation experiments.
• We show a distinctive behavior of the attention mechanism on image features according to text feedback when applied in this retrieval task, through visualization experiments.
II. RELATED WORK

A. Image Retrieval with Various Types of Text Feedback
Many efforts have been made to find a method that can improve the accuracy of retrieval systems. Some of them use user feedback as an interactive signal to navigate the model to targets. In general, user feedback comes in various forms such as relative attributes [13], attributes [8, 18], natural language [28], and sketches [32]. However, text is the most used form of communication between humans and computers, as it can hold an adequate amount of information to specify people's complicated ideas about the target.
As the first attempt at tackling image retrieval with text feedback, Text Image Residual Gating (TIRG) [26] proposed two separate functions for transformation and preservation of image features. Visiolinguistic Attention Learning (VAL) [2] is more advanced in that it uses image features at multiple CNN depths. Besides, these approaches use only a CNN to learn the representation of images and the two functions above to modify input image features to resemble the CNN target image features. Since the representation learned from the CNN was not good enough, VAL added another loss function to bring the image features closer to the semantics of their text descriptions.
We share with them the idea of transformation and preservation of image features according to text semantics and of using image features at multiple convolution layers. However, we improve on them by building one combining function for both transformation and preservation of image features and combining once on multiple CNN image features. We use a CNN and the combining function for all images to get their best representation with one loss function and no need for image descriptions during training. Our ultimate aim is to improve results without using an overcomplicated framework.
B. Attention Mechanism

The attention mechanism [1] is widely used in tasks related to image and language. The purpose of this mechanism is to imitate the human sense of focusing on important information which stands out from the background [3]. To attend to specific positions in the image, the attention mechanism assigns different weights to image spatial information, and the values of these weights learn the latent relations among the spatial information.
C. Compositional Learning

Compositional Learning is considered an essential function when developing an intelligent machine. Its target is to learn encoded features that can encompass multiple primitives [27]. Although CNNs can learn the composition of visual information, they still cannot learn a clear composition of language and image information. Recently, extended research [23] from the pre-training strategies of BERT [4] has been proposed to learn a compositional representation of image and text to solve VQA, Image Captioning, and Image-Text Matching. Unfortunately, these works mainly fix the feature extraction in complicated object detection [21] and recognition [29] models. This not only limits their application to a variety of problems but also leads to an overcomplicated and heavy framework. We propose to use image features at varying depths inside a CNN and combine them with text features. This is an effective way to combine image and text features into a compositional representation through a simpler and lighter model.
III. METHOD

Details of the training process are illustrated in Fig. 2. During training, the system predicts a representation φxt which is most similar to the target representation φy. We begin as an image-to-image retrieval system that matches features of x and target y. Then, we gradually learn meaningful modifications to the features of x according to the features of t. Meanwhile, target images y are not available during testing, so the system predicts images from the database whose representation is most similar to the representation φxt.
Fig. 2. An overview of the training pipeline. Given an input pair of an image x (red outline) and text feedback t (red sentence), a target image y (green outline), and images from the database (gray outline). Red arrows show directions for the input pair (x and t), green arrows show directions for database images including the target y, and blue arrows show directions for both the input pair and database images. We have 4 modules: (a) Image Encoder, (b) Text Encoder, (c) Combining Function, and (d) Loss Function between combined features and features from the database.

A. Image Encoder

Taking the idea from VAL [2], we use a standard CNN to encode images and take features from multiple layers inside it. In essence, the CNN gradually filters image features through its layers to retain the most representative components. However, in doing so it also removes some features that could be important in our modification step for achieving an impressive representation. Therefore, we use features from two different layers of the CNN, which we call the Middle Layer φM ∈ R^(w×h×c) and the High Layer φH ∈ R^(w′×h′×c′), to prevent important details from being removed:

{φM, φH} = fCNN(x)     (1)

Actually, the number of CNN layers is a hyperparameter, and using more than two layers in our case does not benefit the performance. After this, φM and φH are given the same number of channels, C = 512, through a learned linear projection.
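As a concrete illustration, the following PyTorch sketch pulls Middle Layer and High Layer features from a torchvision ResNet-50 and projects both to C = 512 channels. Using layer3/layer4 as the two depths and a 1 × 1 convolution as the learned linear projection are our assumptions; the paper only states that two layers are used and that both are projected to 512 channels.

```python
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """Sketch of the two-level image encoder: Middle Layer (14x14) and High Layer (7x7)."""
    def __init__(self, channels=512):
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")  # ImageNet pre-training, as in Sec. IV
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                                  resnet.layer1, resnet.layer2)
        self.middle_layer = resnet.layer3   # assumed Middle Layer: 1024 x 14 x 14 for 224x224 input
        self.high_layer = resnet.layer4     # assumed High Layer:   2048 x 7 x 7
        # learned linear projections to a common channel size C = 512
        self.proj_m = nn.Conv2d(1024, channels, kernel_size=1)
        self.proj_h = nn.Conv2d(2048, channels, kernel_size=1)

    def forward(self, x):
        feats = self.stem(x)                          # (B, 512, 28, 28)
        mid = self.middle_layer(feats)                # (B, 1024, 14, 14)
        high = self.high_layer(mid)                   # (B, 2048, 7, 7)
        return self.proj_m(mid), self.proj_h(high)    # phi_M, phi_H, both with 512 channels
```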
B. Text Encoder

To get the text representation, we need to define a function ftext(t) that encodes the text feedback t into a representative vector φt of size d = 512:

φt = fLSTM(t) ∈ R^d     (2)

We use a standard Long Short-Term Memory (LSTM) [10], followed by max-pooling and a linear transformation, as the text encoder. φt is obtained by passing each word of t into the text encoder and taking the output from the last timestep.
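A minimal sketch of such a text encoder is given below; the word-embedding size and the use of max-pooling over timesteps (rather than only the last hidden state) are our reading of the description above, since the paper mentions both max-pooling and the last timestep.

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the LSTM text encoder producing the 512-d feedback vector phi_t."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, out_dim)       # linear transformation after pooling

    def forward(self, token_ids):
        # token_ids: (B, T) integer word indices of the text feedback t
        hidden, _ = self.lstm(self.embed(token_ids))   # (B, T, hidden_dim)
        pooled, _ = hidden.max(dim=1)                  # max-pooling over timesteps
        return self.fc(pooled)                         # phi_t: (B, 512)
```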
C. Combining Function

To obtain a composite representation of image and text, we transform and preserve image features according to the text feedback semantics. Inspired by the Transformer in Multimodal Learning [12], we create a composite Transformer using multi-level features of the CNN.
1) Image-Text Representation: As the information flows through the visual and linguistic domains, the input image features from multiple layers of the CNN, φM and φH, and the text features φt are fused to obtain the image-text representation:

φ = [φM, φH, φt]     (3)

To be more specific, we reshape φM to R^(n×C) (n = w × h) and φH to R^(n′×C) (n′ = w′ × h′) and concatenate all of them. This has a similar spirit to the Relation Network [22]: the relation between the input image and the text is captured in φ. As for features from database images, φ has contributions only from φM and φH:

φ = [φM, φH]     (4)
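The fusion in Eqs. (3)–(4) amounts to flattening the spatial maps into token sequences and concatenating them along the token dimension; a sketch (the helper name is ours):

```python
import torch

def image_text_representation(phi_m, phi_h, phi_t=None):
    """Build phi as in Eq. (3), or Eq. (4) when phi_t is None (database images).

    phi_m: (B, C, h, w) Middle Layer features, phi_h: (B, C, h', w') High Layer features,
    phi_t: (B, C) text features or None."""
    tokens_m = phi_m.flatten(2).transpose(1, 2)   # (B, n,  C) with n  = h * w
    tokens_h = phi_h.flatten(2).transpose(1, 2)   # (B, n', C) with n' = h' * w'
    parts = [tokens_m, tokens_h]
    if phi_t is not None:
        parts.append(phi_t.unsqueeze(1))          # (B, 1, C) text token
    return torch.cat(parts, dim=1)                # (B, n + n' (+ 1), C)
```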
2) Image-Text Self-Attention: To figure out the latent connections between image regions needed for learning the preservation and transformation, we pass the image-text representation φ through a multi-head Transformer. The core idea is to capture important visual and linguistic information through self-attention learning. First, we project φ into the latent space as query, key, and value (i.e., Q, K, V):

Q = φWQ,   K = φWK,   V = φWV,     (5)

where WQ, WK, and WV are 1 × 1 convolutions. The self-attention is followed by fully-connected layers as in the Transformer encoder [24]. Here self-attention refers to the equation:

Attn(Q, K, V) = f(QK^T)V,     (6)

where f is the softmax function as in [24]. Basically, this self-attention exploits interactions between the components formed in the image-text representation. For each one, it generates an attention mask to highlight the spatial information that is needed for learning the feature transformation and preservation, and for visual matching.
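The sketch below implements one such block; it follows Eq. (6) literally (no 1/√d scaling), treats the 1 × 1 convolutions as per-token linear maps, and assumes residual connections around the attention and fully-connected sub-layers, which the paper does not spell out.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Sketch of one image-text self-attention block (Eqs. 5-6) with a small FC sub-layer."""
    def __init__(self, dim=512, num_heads=8, fc_units=256):
        super().__init__()
        self.num_heads = num_heads
        # 1x1 convolutions over the token sequence act as per-token linear maps
        self.to_q = nn.Conv1d(dim, dim, kernel_size=1)
        self.to_k = nn.Conv1d(dim, dim, kernel_size=1)
        self.to_v = nn.Conv1d(dim, dim, kernel_size=1)
        self.fc = nn.Sequential(nn.Linear(dim, fc_units), nn.ReLU(), nn.Linear(fc_units, dim))

    def forward(self, phi):
        # phi: (B, N, C) image-text representation from Eq. (3) or (4)
        b, n, c = phi.shape
        x = phi.transpose(1, 2)                                  # (B, C, N) for the 1x1 convolutions
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)       # each (B, C, N), Eq. (5)
        def split_heads(t):                                      # -> (B, heads, N, C // heads)
            return t.reshape(b, self.num_heads, c // self.num_heads, n).transpose(2, 3)
        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)    # f(QK^T), Eq. (6)
        out = (attn @ v).transpose(2, 3).reshape(b, c, n).transpose(1, 2)  # back to (B, N, C)
        h = phi + out                                            # residual connections are assumed
        return h + self.fc(h)
```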
3) Embeddings: Due to the absence of φt in the image features from the database, to be able to match input image and target features, they must be embedded in the same space. We first average-pool within each feature type. Specifically, we do this separately on each of φM, φH, and φt for the image-text features and on φM and φH for the database features, giving two representations of shape 512 × 3 and 512 × 2, respectively. They are averaged and then normalized to become two vectors with 512 elements each. Finally, we multiply each vector by a learned scale weight initialized to 4.
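A sketch of this embedding step; the way tokens are grouped back into their feature types and the use of a single scalar scale are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Sketch of the embedding step: per-type average pooling, averaging, L2-normalization, scaling."""
    def __init__(self, init_scale=4.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))  # learned scale weight initialized as 4

    def forward(self, tokens, type_lengths):
        # tokens: (B, N, C) output of the attention blocks;
        # type_lengths: token count per feature type, e.g. [196, 49, 1] for (phi_M, phi_H, phi_t)
        groups = torch.split(tokens, type_lengths, dim=1)
        per_type = torch.stack([g.mean(dim=1) for g in groups], dim=1)  # (B, num_types, C)
        pooled = per_type.mean(dim=1)                                   # average over feature types
        return self.scale * F.normalize(pooled, dim=-1)                 # (B, C), normalized then scaled
```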
In the training process, the representations φxt and φy (Fig. 2d) are passed to the loss function described in the next section. In the testing process, the images from the database whose representations are among the top k with the highest cosine similarity to the representation φxt are returned as results.
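Testing therefore reduces to a cosine-similarity ranking over precomputed database representations, roughly:

```python
import torch.nn.functional as F

def retrieve_top_k(query_repr, db_reprs, k=10):
    """Return indices of the k database images whose representations are closest to phi_xt.

    query_repr: (C,) combined query representation phi_xt; db_reprs: (M, C) database representations."""
    sims = F.normalize(db_reprs, dim=-1) @ F.normalize(query_repr, dim=-1)  # (M,) cosine similarities
    return sims.topk(k).indices
```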
D. Loss Function

Our training target is to bring the representation φxt closer to the target representation φy while pushing it far away from other representations. We adopt the classification-based loss from TIRG [26] for our training. Specifically, we train on a batch of B queries; in the i-th query, we have the representation φxiti that needs to get closer to its target representation φyi. The other representations, φyj with j ≠ i, need to be pushed far away.
does not work well for the R@k metric.
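A minimal sketch of this batch-based classification loss (softmax cross-entropy over in-batch similarities, as in TIRG [26]), assuming the representations have already been normalized and scaled by the embedding step:

```python
import torch
import torch.nn.functional as F

def batch_classification_loss(query_reprs, target_reprs):
    """Sketch of the classification-based loss adopted from TIRG [26].

    query_reprs:  (B, C) combined representations phi_xt for a batch of queries.
    target_reprs: (B, C) representations phi_y of the corresponding target images.
    Each query treats its own target as the positive class and the other targets
    in the batch as negatives."""
    logits = query_reprs @ target_reprs.t()                    # (B, B) similarity matrix
    labels = torch.arange(query_reprs.size(0), device=query_reprs.device)
    return F.cross_entropy(logits, labels)
```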
IV. EXPERIMENTS

A. Implementation Details
We conduct all experiments in PyTorch. For the image encoder, we use ResNet-50 [9] (output feature size 512) pre-trained on ImageNet. For the text encoder, we use an LSTM [10] (hidden size 512) with randomly initialized parameters. Our model is trained with an initial learning rate of 0.001 for the parameters of the image encoder and 0.01 for the remaining parameters. The learning rate is decreased by a factor of 10 every 50K iterations, and training ends after 150K iterations for FashionIQ and 60K iterations for CSS. Our model has 2 attention blocks with 512 units for each of Q, K, and V, 8 attention heads, and 256 units for the Fully Connected Layer of the attention blocks. We use a batch size of 32, which is the same as previous papers.
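The optimization setup can be sketched as follows; the optimizer itself (SGD with momentum here) is an assumption, as this section only specifies the learning rates and the schedule.

```python
import torch
import torch.nn as nn

def build_optimizer(image_encoder: nn.Module, rest: nn.Module):
    """Sketch of the setup above: lr 0.001 for the image encoder, 0.01 for the remaining
    parameters, decayed by a factor of 10 every 50K iterations (scheduler.step() per iteration)."""
    optimizer = torch.optim.SGD(
        [{"params": image_encoder.parameters(), "lr": 0.001},   # image encoder
         {"params": rest.parameters(), "lr": 0.01}],            # remaining parameters
        momentum=0.9)                                           # momentum value is an assumption
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.1)
    return optimizer, scheduler
```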
B. Results

1) CSS [26]: (short for Color, Shape, and Size) is an attribute-based retrieval dataset including 32K queries (16K for the train set and 16K for the test set). Text feedback, e.g., "make yellow sphere small", modifies synthesized images of a 3-by-3 grid scene. Although CSS looks relatively simple, it allows us to carry out carefully controlled experiments.
Unlike on most other datasets, models trained on CSS are likely to overfit with large configurations. Therefore, we limit our model to using only features from the High Layer of ResNet-50 and one attention block. In addition, replacing softmax with the identity function in the attention block improves the R@1 results. Besides, using sinusoidal encodings as in [24] not only does not improve the performance when using softmax but also impairs the performance when using the identity function.
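The CSS variant only changes the choice of f in Eq. (6); a sketch of the switch (the use_softmax flag is our naming):

```python
import torch

def attention(q, k, v, use_softmax=True):
    """Eq. (6) with a switchable f: softmax for FashionIQ, identity for CSS (improves R@1)."""
    scores = q @ k.transpose(-2, -1)                                    # QK^T
    weights = torch.softmax(scores, dim=-1) if use_softmax else scores  # f = softmax or identity
    return weights @ v                                                  # f(QK^T) V
```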
We compare against the R@1 performances reported in [11, 26] for these and other recent methods [19, 20, 22, 25]. We also used the CSS 3D images provided by [26] for our experiments, and ITMA outperforms them on the retrieval task, as shown in Table I.
2) FashionIQ [28]: is a fashion retrieval dataset with natural language text feedback. It consists of 77,648 images collected from the e-commerce site Amazon.com in 3 categories: Dresses, Toptees, and Shirts. Among the 46,609 training images, there are 18,000 pairs of query-target images along with 2 text feedback sentences in natural language that describe one to two modified properties, e.g., "longer more dressy". We used the R@50 and R@100 results and the same FashionIQ dataset from [2]. As FashionIQ is a relatively new dataset,
there have not been many experiments on it, and we outperform the other competitors, as shown in Table II.
Fig. 3 presents our qualitative results on FashionIQ. Although there is much semantics hidden behind the natural language text feedback, ITMA is still able to capture almost every aspect of the text, including content related to fashion, e.g., color, gloss, and printing. We also found that our model is able to understand not only global descriptions such as the overall colors and patterns of the outfit but also local details like a logo in a specific location.
TABLE II
COMBINING FUNCTION PERFORMING BOTH FEATURE PRESERVATION AND TRANSFORMATION HELPS THE MODEL GENERALIZE BETTER THAN THE
C. Ablation Studies

We first experiment with the influence of the Middle Layer (14 × 14) and the High Layer (7 × 7) of ResNet-50 on our model. Table III shows that using both of them substantially improves the results. The results in Table IV show the sensitivity of our model to the number of units in the Fully Connected Layer behind the attention blocks. Table V shows the relatively small effect of using different numbers of attention blocks in the architecture. Overall, we use the combination that achieves the best results for our architecture: 2 different layers of ResNet-50, 256 units in the Fully Connected Layer of the attention blocks, and 2 attention blocks.
D. Visualization

The attention maps in Fig. 4 emphasize image regions according to the bold words in the text feedback, which explains the behavior of attention in this retrieval task. For additive changes like "longer dress" or "longer sleeves", the model puts positive weights on the corresponding areas of the items (red areas).
Fig. 3. Top 10 results of ITMA for FashionIQ validation set queries. Images with a green outline are "correct" images.
TABLE III
7×7  14×14  Dresses R@10 (R@50)  Top & Tees R@10 (R@50)  Shirts R@10 (R@50)  (R@10 + R@50)/2
X - 22.9 ± 1.1 (47.7 ± 1.0) 26.4 ± 0.6 (52.4 ± 0.5) 19.5 ± 0.9 (42.0 ± 0.6) 35.1 ± 0.6
X X 23.8 ± 0.6 (48.6 ± 1.0) 27.9 ± 0.8 (53.6 ± 0.6) 21.3 ± 0.7 (44.2 ± 0.3) 36.6 ± 0.4
TABLE IV
BLOCKS. WE USE 2 ATTENTION BLOCKS AND IMAGE FEATURES FROM HIGH LAYER (7×7) AND MIDDLE LAYER (14×14).
Width  Dresses R@10 (R@50)  Top & Tees R@10 (R@50)  Shirts R@10 (R@50)  (R@10 + R@50)/2
64 23.9 ± 0.8 (48.6 ± 0.8) 27.7 ± 1.0 (53.7 ± 0.8) 21.1 ± 0.5 (43.3 ± 0.6) 36.4 ± 0.3
128 24.7 ± 0.6 (49.0 ± 0.7) 27.8 ± 0.6 (54.1 ± 0.3) 21.1 ± 0.4 (43.9 ± 0.9) 36.8 ± 0.2
256 23.8 ± 0.6 (48.6 ± 1.0) 27.9 ± 0.8 (53.6 ± 0.6) 21.3 ± 0.7 (44.2 ± 0.3) 36.6 ± 0.4
512 24.2 ± 0.4 (48.3 ± 0.4) 27.7 ± 0.8 (53.8 ± 1.1) 21.2 ± 1.1 (43.2 ± 0.3) 36.4 ± 0.5
TABLE V
CONNECTED LAYERS AND IMAGE FEATURES FROM HIGH LAYER (7×7) AND MIDDLE LAYER (14×14).
# Attention Blocks  Dresses R@10 (R@50)  Top & Tees R@10 (R@50)  Shirts R@10 (R@50)  (R@10 + R@50)/2
1 24.2 ± 1.2 (48.3 ± 1.3) 27.9 ± 0.8 (54.0 ± 0.9) 20.6 ± 0.7 (43.3 ± 0.6) 36.4 ± 0.5
2 23.8 ± 0.6 (48.6 ± 1.0) 27.9 ± 0.8 (53.6 ± 0.6) 21.3 ± 0.7 (44.2 ± 0.3) 36.6 ± 0.4
3 24.1 ± 0.3 (48.3 ± 0.4) 27.3 ± 0.7 (53.6 ± 0.9) 20.6 ± 0.2 (43.5 ± 0.7) 36.2 ± 0.4
4 23.7 ± 0.8 (47.9 ± 1.0) 27.1 ± 0.8 (52.9 ± 1.0) 20.0 ± 0.5 (42.6 ± 0.7) 35.7 ± 0.6
On the contrary, the model puts negative weights on the corresponding areas of the items (purple areas) for diminishing changes such as "shorter dress".
The behavior of a word ignoring what it refers to seems to contradict the attention mechanism in Image Captioning [31], which shows that attention uses words to attend to the relevant regions of the image. However, our text feedback describes modifications to the image, and the model tries to learn a final representation that is close to the target representation. Therefore, a word avoids referring to the items it references because that representation tends not to carry any information about the target.
Fig. 4. Attention visualization on FashionIQ. We use attention maps from 2 of the 8 attention heads in a block of our best model and obtain this visualization by averaging over the first 200 images from the FashionIQ validation set.
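As a rough sketch of how such averaged maps can be computed, assuming the attention weights of a block are exposed as (heads, N, N) tensors and that the tokens are ordered as [φM tokens, φH tokens, text token]; how the published figure normalizes or centers the weights is not specified, so this is only illustrative:

```python
import torch

def averaged_text_attention(per_image_attn, head_ids=(0, 1), grid=(7, 7)):
    """Average, over images, the attention that the text token pays to the High Layer tokens.

    per_image_attn: list of (heads, N, N) attention tensors, one per validation image,
    with the assumed token order [phi_M tokens, phi_H tokens, text token]."""
    n_spatial = grid[0] * grid[1]
    maps = []
    for attn in per_image_attn:
        text_row = attn[list(head_ids), -1, -n_spatial - 1:-1]   # (selected heads, 49)
        maps.append(text_row.reshape(len(head_ids), *grid))      # (selected heads, 7, 7)
    return torch.stack(maps).mean(dim=0)                         # average over the images
```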
includes categories, attributes, colors, textures, shapes, etc., to assist text feedback and our training model. We will also develop a smarter system to recommend the most suitable clothes for the customer.
ACKNOWLEDGMENT
We would like to thank Hong-Huan Do, Vinh-Loi Ly, and
Sy-Tuyen Ho for their helpful discussions.
REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011, 2020.
[3] Maurizio Corbetta and Gordon L. Shulman. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3):201–215, 2002.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[5] Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. 2013.
[6] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6639–6648, 2019.
[7] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016.
[8] Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S. Davis. Automatic spatially-aware fashion concept discovery. In Proceedings of the IEEE International Conference on Computer Vision, pages 1463–1471, 2017.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] Mehrdad Hosseinzadeh and Yang Wang. Composed query image retrieval using locally bounded features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3596–3605, 2020.
[12] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4634–4643, 2019.
[13] Adriana Kovashka, Devi Parikh, and Kristen Grauman. WhittleSearch: Image search with relative attribute feedback. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2973–2980. IEEE, 2012.
[14] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
ontology and deep neuron network. In Asian Conference on Intelligent Information and Database Systems, pages 539–549. Springer, 2018.
[19] Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 30–38, 2016.
[20] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28:91–99, 2015.
[22] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017.
[23] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[25] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[26] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - an empirical odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6439–6448, 2019.
[27] Kun Wei, Muli Yang, Hao Wang, Cheng Deng, and Xianglong Liu. Adversarial fine-grained composition learning for unseen attribute-object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3741–3749, 2019.
[28] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317, 2021.
[29] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851, 1(2):5, 2017.
[30] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
[31] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057. PMLR, 2015.
[32] Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, and Chen-Change Loy. Sketch me that shoe. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 799–807, 2016.
[33] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6281–6290, 2019.