
Image Retrieval with Text Feedback based on Transformer Deep Model

Truc Luong-Phuong Huynh1, Ngoc Quoc Ly2

1Faculty of Information Technology, 2Computer Vision & Cognitive Cybernetics Dept.

VNUHCM-University of Science

Ho Chi Minh City, Vietnam

11712842@student.hcmus.edu.vn,2lqngoc@fit.hcmus.edu.vn

Abstract—Image retrieval with text feedback has many potentials when applied in product retrieval for e-commerce platforms. Given an input image and text feedback, the system needs to retrieve images that not only look visually similar to the input image but also have some modified details mentioned in the text feedback. This is a tricky task as it requires a good understanding of image, text, and also their combination. In this paper, we propose a novel framework called Image-Text Modify Attention (ITMA) and a Transformer-based combining function that performs preservation and transformation of the input image features based on the text feedback and captures important features of database images. By using multiple image features at different Convolution Neural Network (CNN) depths, the combining function can have multi-level visual information to achieve an impressive representation that suffices for effective image retrieval. We conduct quantitative and qualitative experiments on two datasets: CSS and FashionIQ. ITMA outperforms existing approaches on these datasets and can deal with many types of text feedback such as object attributes and natural language. We are also the first ones to discover the exceptional behavior of the attention mechanism in this task, which ignores input image regions where the text feedback wants to remove or change.

Index Terms—Image Retrieval with Text Feedback, Convolution Neural Network, Attention Mechanism, Transformer Deep Model

I. INTRODUCTION

Image retrieval is a well-known problem in computer vision that has been researched and applied in industry for a long time. The major problem in studying this task is how to exploit the intention of humans when they retrieve something. Most retrieval systems are based on image-to-image matching [7] and text-to-image matching [5]. Therefore, it is difficult for people to express their ideas in a single image or a couple of words.

To overcome these limitations, people are trying to create models with suitable inputs that not only allow models to give accurate predictions but also retain the whole ideas of the user in them. Some of them use desired changes to the user's input image to describe the target image. These changes are usually expressed in text form as certain attributes [8, 18] or relative attributes [13]. More recently, researchers can use a more general type of text feedback, namely natural language [28], but it is still an unusual area of research. Being able to deal with a more general type of language feedback means creating a system that can run in applications for real users.

Fig. 1. Examples of customers using image retrieval with text feedback in a fashion e-commerce application. Customers upload their clothes (red outline) and type text feedback describing the changes they want (red sentence). The system gives suggestions about customers' target clothes (green outline).

In this work, the inputs are an image and text feedback describing modifications of the input image. Unlike previous approaches, we consider more general language input types, which are object attributes [26] and natural language [28], and propose a simpler approach giving better performance. The core ideas behind our method are: (1) using text features to modify image features, and (2) modified image features need to "live in" the same space as the database image features to match between them.

Regarding the first idea, we use the attention mechanism to take charge of preservation and transformation of image features according to text features at the same time. As a result, we not only obtain a more compact model than previous methods but also are the first ones to discover, through our visualizations, a very special behavior of the attention mechanism: it avoids retaining features that the text semantics wants to transform. As for letting them interact in the same space, we perform the attention learning on features from various CNN depths for both input and database images. Using the attention mechanism for database images assists in selecting the best features through matching image features to achieve their best representations.


ablation experiments.

• We show a distinctive behavior of the attention mechanism on image features according to text feedback when applied in this retrieval task through visualization experiments.

II. RELATED WORK

A. Image Retrieval with various types of text feedback

Many efforts have been made to find a method that can improve the accuracy of retrieval systems. Some of them use user feedback as an interactive signal to navigate the model toward targets. In general, user feedback comes in various forms such as relative attributes [13], attributes [8, 18], natural language [28], and sketches [32]. However, text is the most used form of communication between humans and computers, and it can hold an adequate amount of information to specify people's complicated ideas about the target.

As the first attempt at tackling image retrieval with text feedback, Text Image Residual Gating (TIRG) [26] proposed two separate functions for transformation and preservation of image features. Visiolinguistic Attention Learning (VAL) [2] is more advanced by using image features at multiple CNN depths. Besides, these approaches use only a CNN to learn the representation of images and the two functions above to modify input image features to resemble the CNN target image features. Since the representation learned from the CNN was not good enough, VAL added another loss function to bring the image features closer to the semantics of their text descriptions.

We share the same idea with them in transformation and preservation of image features according to text semantics and in using image features at multiple convolution layers. However, we improve on them by building a combining function for both transformation and preservation of image features and combining once on multiple CNN image features. We use the CNN and the combining function for all images to get their best representation with one loss function and no need for image descriptions during training. Our ultimate aim is to improve results without using an overcomplicated framework.

B. Attention Mechanism

The attention mechanism [1] is widely used in tasks related to image and language. The purpose of this mechanism is to imitate the human sense of focusing on important information which stands out from the background [3]. To attend to specific positions in the image, the attention mechanism assigns different weights to image spatial information and learns the values of these weights to capture the latent relations among spatial information.

C. Compositional Learning

Compositional Learning is considered an essential function when developing an intelligent machine. Its target is to learn encoded features that can encompass multiple primitives [27]. Although CNNs can learn the composition of visual information, they still cannot learn a clear composition of language and image information. Recently, research [23] extending the pre-training strategies of BERT [4] has been proposed to learn a compositional representation of image and text to solve VQA, Image Captioning, and Image-Text Matching. Unfortunately, these works mainly fix the feature extraction to complicated object detection [21] and recognition [29] models. This not only limits their application to a variety of problems but also leads to an overcomplicated and heavy framework. We propose to use image features at varying depths inside a CNN and combine them with text features. This is an effective method to combine image and text features into a compositional representation through a simpler and lighter model.

III. METHOD

Details of the training process are illustrated in Fig. 2. During training, the system predicts a representation φ_xt which is most similar to the target representation φ_y. We begin as an image-to-image retrieval system that matches between features of x and target y. Then, we gradually learn meaningful modifications to the features of x according to the features of t. Meanwhile, target images y are not available during testing, so the system predicts images from the database whose representation is most similar to the representation φ_xt.

A. Image Encoder

Taking the idea from VAL [2], we use a standard CNN to encode images and take features from multiple layers inside it. In essence, the CNN gradually filters image features through its layers to retain the most representative components. However, it also removes some features that could be important in our modification step for achieving an impressive representation. Therefore, we use features from two different layers of the CNN, which we call the Middle Layer $\phi_M \in \mathbb{R}^{w \times h \times c}$ and the High Layer $\phi_H \in \mathbb{R}^{w' \times h' \times c'}$, to prevent important details from being removed:

$\{\phi_M, \phi_H\} = f_{CNN}(x)$ (1)

Actually, the number of CNN layers used is a hyperparameter, and using more than two layers in our case does not benefit the performance. After this, φ_M and φ_H are given the same number of channels, C = 512, through a learned linear projection.

Fig. 2. An overview of the training pipeline. Given an input pair of an image x (red outline) and text feedback t (red sentence), a target image y (green outline), and images from the database (gray outline). Red arrows show directions for the input pair (x and t), green arrows show directions for database images including the target y, and blue arrows show directions for both the input pair and database images. We have 4 modules: (a) Image Encoder, (b) Text Encoder, (c) Combining Function, and (d) Loss Function between combined features and features from the database.
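To make the encoder concrete, here is a minimal PyTorch sketch, assuming the Middle Layer and High Layer correspond to ResNet-50's layer3 (14 × 14) and layer4 (7 × 7) outputs for a 224 × 224 input and that the learned linear projection is a 1 × 1 convolution; the class and variable names are our own.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """Sketch of the image encoder: a ResNet-50 backbone returning features
    from two depths, each projected to C = 512 channels. Mapping Middle Layer
    -> layer3 (14x14) and High Layer -> layer4 (7x7) is an assumption."""
    def __init__(self, out_channels: int = 512):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Trunk up to layer3 produces the "Middle Layer" feature map.
        self.stem = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,   # (B, 1024, 14, 14)
        )
        self.high = resnet.layer4                          # (B, 2048, 7, 7)
        # Learned linear projections (1x1 convolutions) to a common width.
        self.proj_mid = nn.Conv2d(1024, out_channels, kernel_size=1)
        self.proj_high = nn.Conv2d(2048, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor):
        phi_mid = self.stem(x)
        phi_high = self.high(phi_mid)
        return self.proj_mid(phi_mid), self.proj_high(phi_high)

# Usage: phi_M, phi_H = ImageEncoder()(torch.randn(2, 3, 224, 224))
```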

B. Text Encoder

To get the text representation, we need to define a function f_text(t) that encodes the text feedback t into a representative vector φ_t of size d = 512:

$\phi_t = f_{LSTM}(t) \in \mathbb{R}^d$ (2)

We use a standard Long Short-Term Memory (LSTM) [10], followed by a max-pooling and a linear transformation, as the text encoder. φ_t is obtained by passing each word of t into the text encoder and taking the output from the last timestep.
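A minimal sketch of such a text encoder is shown below; the vocabulary size and embedding width are placeholders, and since the text mentions both max-pooling and the last timestep, we pool over timesteps here as one plausible reading.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the LSTM text encoder producing a 512-d vector phi_t."""
    def __init__(self, vocab_size: int = 10000, embed_dim: int = 512, out_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, out_dim, batch_first=True)
        self.fc = nn.Linear(out_dim, out_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) integer word ids of the text feedback t.
        h, _ = self.lstm(self.embed(tokens))   # (B, T, 512)
        pooled, _ = h.max(dim=1)               # max-pool over timesteps
        return self.fc(pooled)                 # phi_t in R^512
```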

C. Combining Function

To obtain a composite representation of image and text, we transform and preserve image features according to the semantics of the text feedback. Inspired by the use of Transformers in Multimodal Learning [12], we create a composite Transformer using multi-level features of the CNN.

1) Image-Text Representation: As the information flows through the visual and linguistic domains, input image features from multiple layers of the CNN, φ_M and φ_H, and text features φ_t are fused to obtain the image-text representation:

$\phi = [\phi_M, \phi_H, \phi_t]$ (3)

To be more specific, we reshape φ_M to $\mathbb{R}^{n \times C}$ (n = w × h), φ_H to $\mathbb{R}^{n' \times C}$ (n' = w' × h') and concatenate all of them. This has a similar spirit to Relation Network [22]: the relation between the input image and text is captured in φ. As for features from database images, φ has contributions only from φ_M and φ_H:

$\phi = [\phi_M, \phi_H]$ (4)
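The fusion in Eqs. (3) and (4) amounts to flattening each spatial grid into a token sequence and concatenating; a small sketch under assumed tensor shapes:

```python
import torch

def image_text_representation(phi_M, phi_H, phi_t=None):
    """Build phi as in Eq. (3); database images omit phi_t as in Eq. (4).
    Assumed shapes: phi_M (B, C, 14, 14), phi_H (B, C, 7, 7), phi_t (B, C)."""
    tokens_M = phi_M.flatten(2).transpose(1, 2)   # (B, n, C),  n  = 14*14
    tokens_H = phi_H.flatten(2).transpose(1, 2)   # (B, n', C), n' = 7*7
    parts = [tokens_M, tokens_H]
    if phi_t is not None:
        parts.append(phi_t.unsqueeze(1))          # (B, 1, C) text token
    return torch.cat(parts, dim=1)                # (B, n + n' (+1), C)
```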

2) Image-Text Self-attention: To figure out the latent connections between image regions needed for learning the preservation and transformation, we pass the image-text representation φ through a multi-head Transformer. The core idea is to capture important visual and linguistic information through self-attention learning. First, we project φ into the latent space as query, key, and value (i.e., Q, K, V):

$Q = \phi W_Q,\quad K = \phi W_K,\quad V = \phi W_V,$ (5)

where $W_Q$, $W_K$, and $W_V$ are 1 × 1 convolutions. The self-attention is followed by fully-connected layers as in the Transformer encoder [24]. Here self-attention refers to the equation:

$\mathrm{Attn}(Q, K, V) = f(QK^T)V,$ (6)

where f is the softmax function as in [24]. Basically, this self-attention exploits interactions between the components formed in the image-text representation. For each one, it generates an attention mask to highlight the spatial information that is needed for learning the feature transformation and preservation, and for visual matching.
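A hedged sketch of one such attention block is given below; it follows Eqs. (5) and (6) with 1 × 1 convolutions for Q, K, V and a small fully connected layer, is shown single-head for clarity (the full model uses 8 heads and 2 blocks), and adds residual connections, scaling, and layer normalization as assumptions in the spirit of [24].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """Single-head sketch of one attention block over the token sequence phi."""
    def __init__(self, dim: int = 512, fc_units: int = 256):
        super().__init__()
        # 1x1 convolutions over tokens, i.e. per-token linear projections.
        self.to_q = nn.Conv1d(dim, dim, kernel_size=1)
        self.to_k = nn.Conv1d(dim, dim, kernel_size=1)
        self.to_v = nn.Conv1d(dim, dim, kernel_size=1)
        self.fc = nn.Sequential(nn.Linear(dim, fc_units), nn.ReLU(),
                                nn.Linear(fc_units, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, phi: torch.Tensor) -> torch.Tensor:
        # phi: (B, N, C) token sequence from Eq. (3) or Eq. (4).
        x = phi.transpose(1, 2)                          # (B, C, N) for conv
        Q = self.to_q(x).transpose(1, 2)                 # (B, N, C)
        K = self.to_k(x).transpose(1, 2)
        V = self.to_v(x).transpose(1, 2)
        attn = F.softmax(Q @ K.transpose(1, 2) / Q.shape[-1] ** 0.5, dim=-1)
        h = self.norm1(phi + attn @ V)                   # Eq. (6) + residual
        return self.norm2(h + self.fc(h))                # FC layers as in [24]
```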

3) Embeddings: Due to the absence of φ_t in image features from the database, to be able to match input image and target features, they must be embedded in the same space. We first average pool within each feature type. Specifically, we do it separately on each of φ_M, φ_H, and φ_t for the image-text features, and on φ_M and φ_H for the database features, to get two representations with shapes 512 × 3 and 512 × 2, respectively. They are averaged and then normalized to become two vectors with 512 elements each. With the final representations, we multiply each vector by a learned scale weight initialized as 4.

In the training process, the representations φ_xt and φ_y (Fig. 2d) are passed to the loss function described in the next section. In the testing process, the images from the database whose representations are in the top k with the highest cosine similarity to the representation φ_xt are predicted as results.
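The pooling, normalization, scaling, and top-k cosine retrieval described above can be sketched as follows; the function names and the way the learned scale is passed in are our own choices.

```python
import torch
import torch.nn.functional as F

def embed(tokens_by_type, scale):
    """Pool a list of token groups (one per feature type, each (B, n_i, C)) into
    a single 512-d embedding: average within each type, average across types,
    L2-normalize, then multiply by a learned scale (initialized to 4 in the
    paper). `scale` would be an nn.Parameter in the real model."""
    per_type = torch.stack([t.mean(dim=1) for t in tokens_by_type], dim=1)  # (B, k, C)
    v = F.normalize(per_type.mean(dim=1), dim=-1)                           # (B, C)
    return scale * v

def retrieve_top_k(phi_xt, database_embeddings, k=10):
    """Test-time retrieval: rank database images by cosine similarity to the
    combined representation and return the indices of the top k."""
    sims = F.normalize(phi_xt, dim=-1) @ F.normalize(database_embeddings, dim=-1).T
    return sims.topk(k, dim=-1).indices
```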

D. Loss Function

Our training target is to bring the representation φ_xt closer to the target representation φ_y while pushing it far away from other representations. We adopt the classification-based loss from TIRG [26] for our training. Specifically, we train on a batch of B queries; for the i-th query, we have the representation φ_{x_i t_i}, which needs to get closer to its target representation φ_{y_i}. The other representations that need to be pushed far away are φ_{y_j}, where j ≠ i,


does not work well for the R@k metric.
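Although the full description of the loss is truncated here, the classification-based loss adopted from TIRG [26] can be sketched as a batch softmax cross-entropy in which each query's own target is the positive among the B targets in the batch; using the dot product as the similarity logit is an assumption.

```python
import torch
import torch.nn.functional as F

def batch_classification_loss(phi_xt, phi_y):
    """Sketch of the batch-based classification loss: for each of the B queries,
    treat its paired target as the correct class among all B targets and apply
    softmax cross-entropy over the similarity matrix."""
    logits = phi_xt @ phi_y.T                                   # (B, B) similarities
    labels = torch.arange(phi_xt.shape[0], device=phi_xt.device)  # i-th target is positive
    return F.cross_entropy(logits, labels)
```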

IV. EXPERIMENTS

A. Implementation details

We conduct all the experiments in PyTorch. For the image encoder, we use ResNet-50 [9] (output feature size is 512) pre-trained on ImageNet. For the text encoder, we use an LSTM [10] (hidden size is 512) with randomly initialized parameters. Our model is trained with an initial learning rate of 0.001 for the parameters of the image encoder and 0.01 for the remaining parameters. The learning rate decreases by a factor of 10 every 50K iterations, and training ends after 150K iterations for FashionIQ and 60K iterations for CSS. Our model has 2 attention blocks with 512 units for each of Q, K, and V, 8 attention heads, and 256 units for the Fully Connected Layer of the attention blocks. We use a batch size of 32, the same as previous papers.
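A sketch of this optimization setup is shown below, assuming an SGD optimizer with momentum (the optimizer type is not stated in the text) and using placeholder modules in place of the real image encoder and the rest of the model.

```python
import torch
import torch.nn as nn

# Two placeholder modules stand in for the pretrained image encoder and the
# remaining parameters of the model.
image_encoder = nn.Conv2d(3, 512, kernel_size=3)   # placeholder for ResNet-50
rest_of_model = nn.Linear(512, 512)                # placeholder for the rest

# Smaller learning rate for the pretrained encoder, larger for everything else,
# decayed by a factor of 10 every 50K iterations.
optimizer = torch.optim.SGD(
    [{"params": image_encoder.parameters(), "lr": 0.001},
     {"params": rest_of_model.parameters(), "lr": 0.01}],
    momentum=0.9,
)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.1)
# In training, call optimizer.step() then scheduler.step() once per iteration,
# stopping after 150K iterations (FashionIQ) or 60K iterations (CSS).
```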

B. Results

1) CSS [26]: (short for Color, Shape, and Size) is an attribute-based retrieval dataset including 32K queries (16K for the train set and 16K for the test set). Text feedback, e.g., "make yellow sphere small", modifies synthesized images of a 3-by-3 grid scene. Although CSS looks relatively simple, we can carry out carefully controlled experiments on it.

Unlike on most other datasets, models trained on CSS are likely to be overfitted by large configurations. Therefore, we limit our model by using only features from the High Layer of ResNet-50 and one attention block. In addition, replacing softmax with the identity function in the attention block improves results in R@1. Besides, using sinusoidal encodings as in [24] not only does not improve the performance when using softmax but also impairs the performance when using the identity function.

We compared with the R@1 performances reported in [11, 26] for these and other recent methods [19, 20, 22, 25]. We also used the CSS 3D images provided by [26] for our experiments, and ITMA outperforms them on the retrieval task, as shown in Table I.

2) FashionIQ [28]: is a fashion retrieval dataset with natural language text feedback. It consists of 77,648 images collected from the e-commerce site Amazon.com in 3 categories: Dresses, Toptees, and Shirts. Among the 46,609 training images, there are 18,000 pairs of query-target images along with 2 text feedback sentences in natural language describing one to two modified properties, e.g., "longer more dressy".

We used R@50 and R@100 results and the same FashionIQ dataset from [2]. As FashionIQ is a relatively new dataset,


there have not been many experiments on it, and we outperform the other competitors, as shown in Table II.

Fig. 3 presents our qualitative results on FashionIQ. Although there is much semantics hidden behind the natural language text feedback, ITMA is still able to capture almost every aspect of the text, including contents relating to fashion, e.g., color, gloss, and printing. We also found that our model is able to understand not only global descriptions, such as the overall colors and patterns of the outfit, but also local details like a logo in a specific location.

TABLE II. Combining function performing both feature preservation and transformation helps the model generalize better than the …

C. Ablation studies

We first experiment with the influence of the Middle Layer (14 × 14) and High Layer (7 × 7) of ResNet-50 on our model. Table III shows that using both of them substantially improves results. The results in Table IV show the sensitivity of our model to the number of units in the Fully Connected Layer behind the attention blocks. Table V shows the relatively small effect of using different numbers of attention blocks in the architecture. Overall, we use the combination that achieves the best results for our architecture: 2 different layers of ResNet-50, 256 units in the Fully Connected Layer of the attention blocks, and 2 attention blocks.

Fig. 3. Top 10 results of ITMA for FashionIQ validation set queries. Images with a green outline are "correct" images.

TABLE III. Influence of image features from the High Layer (7 × 7) and Middle Layer (14 × 14) of ResNet-50.

7x7 | 14x14 | Dresses R10 (R50) | Top & Tees R10 (R50) | Shirts R10 (R50) | (R@10 + R@50)/2
X | - | 22.9 ± 1.1 (47.7 ± 1.0) | 26.4 ± 0.6 (52.4 ± 0.5) | 19.5 ± 0.9 (42.0 ± 0.6) | 35.1 ± 0.6
X | X | 23.8 ± 0.6 (48.6 ± 1.0) | 27.9 ± 0.8 (53.6 ± 0.6) | 21.3 ± 0.7 (44.2 ± 0.3) | 36.6 ± 0.4

TABLE IV. Effect of the number of units in the Fully Connected Layer of the attention blocks. We use 2 attention blocks and image features from the High Layer (7 × 7) and Middle Layer (14 × 14).

Width | Dresses R10 (R50) | Top & Tees R10 (R50) | Shirts R10 (R50) | (R10 + R50)/2
64 | 23.9 ± 0.8 (48.6 ± 0.8) | 27.7 ± 1.0 (53.7 ± 0.8) | 21.1 ± 0.5 (43.3 ± 0.6) | 36.4 ± 0.3
128 | 24.7 ± 0.6 (49.0 ± 0.7) | 27.8 ± 0.6 (54.1 ± 0.3) | 21.1 ± 0.4 (43.9 ± 0.9) | 36.8 ± 0.2
256 | 23.8 ± 0.6 (48.6 ± 1.0) | 27.9 ± 0.8 (53.6 ± 0.6) | 21.3 ± 0.7 (44.2 ± 0.3) | 36.6 ± 0.4
512 | 24.2 ± 0.4 (48.3 ± 0.4) | 27.7 ± 0.8 (53.8 ± 1.1) | 21.2 ± 1.1 (43.2 ± 0.3) | 36.4 ± 0.5

TABLE V. Effect of the number of attention blocks. We use 256 units in the Fully Connected Layers and image features from the High Layer (7 × 7) and Middle Layer (14 × 14).

# Attention Blocks | Dresses R10 (R50) | Top & Tees R10 (R50) | Shirts R10 (R50) | (R10 + R50)/2
1 | 24.2 ± 1.2 (48.3 ± 1.3) | 27.9 ± 0.8 (54.0 ± 0.9) | 20.6 ± 0.7 (43.3 ± 0.6) | 36.4 ± 0.5
2 | 23.8 ± 0.6 (48.6 ± 1.0) | 27.9 ± 0.8 (53.6 ± 0.6) | 21.3 ± 0.7 (44.2 ± 0.3) | 36.6 ± 0.4
3 | 24.1 ± 0.3 (48.3 ± 0.4) | 27.3 ± 0.7 (53.6 ± 0.9) | 20.6 ± 0.2 (43.5 ± 0.7) | 36.2 ± 0.4
4 | 23.7 ± 0.8 (47.9 ± 1.0) | 27.1 ± 0.8 (52.9 ± 1.0) | 20.0 ± 0.5 (42.6 ± 0.7) | 35.7 ± 0.6

D. Visualization

Attention maps in Fig. 4 emphasize image regions according to the bold words in the text feedback, which explains the behaviors of attention in this retrieval task. For additive changes like "longer dress" or "longer sleeves", the model puts positive weights on the corresponding areas of the items (red areas). On the contrary, for diminishing changes such as "shorter dress", the model puts negative weights on the corresponding areas of the items (purple areas).

The behavior of a word ignoring what it refers to seems to contradict the attention mechanism in Image Captioning [31], which shows that attention uses words to attend to relevant regions of the image. Actually, our text feedback describes modifications to the image, and the model tries to learn a final representation which is close to the target representation. Therefore, a word avoids referring to the items it references because that representation tends not to support any information about the target.

Fig. 4. Attention visualization on FashionIQ. We use attention maps from 2 of the 8 attention heads in a block of our best model and obtain this visualization by averaging over the first 200 images from the FashionIQ validation set.


includes categories, attributes, colors, textures, shape, etc. to assist the text feedback and our training model. We will also develop a smarter system to recommend the most suitable clothes for the customer.

ACKNOWLEDGMENT

We would like to thank Hong-Huan Do, Vinh-Loi Ly, and Sy-Tuyen Ho for their helpful discussions.

REFERENCES

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011, 2020.
[3] Maurizio Corbetta and Gordon L Shulman. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3):201–215, 2002.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[5] Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. 2013.
[6] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6639–6648, 2019.
[7] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016.
[8] Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S Davis. Automatic spatially-aware fashion concept discovery. In Proceedings of the IEEE International Conference on Computer Vision, pages 1463–1471, 2017.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] Mehrdad Hosseinzadeh and Yang Wang. Composed query image retrieval using locally bounded features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3596–3605, 2020.
[12] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4634–4643, 2019.
[13] Adriana Kovashka, Devi Parikh, and Kristen Grauman. Whittlesearch: Image search with relative attribute feedback. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2973–2980. IEEE, 2012.
[14] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.

ontology and deep neuron network. In Asian Conference on Intelligent Information and Database Systems, pages 539–549. Springer, 2018.
[19] Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 30–38, 2016.
[20] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28:91–99, 2015.
[22] Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017.
[23] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[25] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[26] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - an empirical odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6439–6448, 2019.
[27] Kun Wei, Muli Yang, Hao Wang, Cheng Deng, and Xianglong Liu. Adversarial fine-grained composition learning for unseen attribute-object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3741–3749, 2019.
[28] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317, 2021.
[29] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851, 1(2):5, 2017.
[30] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
[31] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057. PMLR, 2015.
[32] Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M Hospedales, and Chen-Change Loy. Sketch me that shoe. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 799–807, 2016.
[33] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6281–6290, 2019.
