A Lightweight Model for Remote Sensing
Image Retrieval with Knowledge Distillation and
Mining Interclass Characteristics
Khanh-An C. Quan∗, Vinh-Tiep Nguyen†, Minh-Triet Tran‡
∗† University of Information Technology, Ho Chi Minh City, Vietnam
∗‡ John von Neumann Institute, Ho Chi Minh City, Vietnam
∗‡ University of Science, Ho Chi Minh City, Vietnam
∗†‡ Vietnam National University, Ho Chi Minh City, Vietnam
Email: anqck@uit.edu.vn, tiepnv@uit.edu.vn, tmtriet@fit.hcmus.edu.vn
Abstract—There are more and more practical applications of remote sensing image retrieval in a wide variety of areas, such as land-cover analysis, ecosystem monitoring, and agriculture. It is essential to have a solution to this problem with both high accuracy and efficiency, e.g., small model size and low computational cost. This motivates us to propose a lightweight model for remote sensing image retrieval. We first employ interclass characteristic mining to train a cumbersome but robust model, aiming to boost the quality of the retrieval results. Then, from the complex model, we apply knowledge distillation to significantly reduce the neural network's size. Our experiments conducted on the UC Merced Land Use dataset demonstrate the advantage of our method. Our lightweight model achieves an mAP of 0.9680 with only 3.8M parameters. This model has a higher mAP and a lower number of parameters than the EDML method proposed by Cao et al.
Index Terms—remote-sensing, image retrieval, deep metric learning, knowledge distillation
I. INTRODUCTION
High-resolution remote-sensing photos have become widely available due to the advancement of technologies and remote sensors. This has opened up new opportunities for exploiting the data in a range of essential applications, such as land-cover analysis [1], ecosystem monitoring [2], and agriculture [3]. In fact, visual interpretation of remote-sensing scenes is still a challenging task because researchers need to deal with high intra-class and low inter-class variability [4].
Despite deep neural networks' excellent performance, the successful management of an extensive remote-sensing database is complicated by numerous issues caused by temporal differences, viewpoints, high resolution, and differing contents [5]. The ability to retrieve vast amounts of remote-sensing images is a crucial first step toward adequately managing enormous volumes of remote-sensing data [6]. Deep metric learning methods, in particular, have demonstrated remarkable success in characterizing complex remote-sensing data [7].
Fig. 1. Top-5 nearest neighbors of a query image in a triplet deep metric learning network. In the first query, we can see quite similar characteristics among the returned results, although some results belong to classes different from the query image (green left border if the returned image is in the same class as the query image; red left border otherwise).

It should be noticed that remote-sensing data have some common characteristics shared between classes. With a triplet deep metric learning network, although achieving relatively high results, when visualizing some failure cases we can see that there are quite similar characteristics between the returned results and the query, as shown in Figure 1. This shows that the ability to distinguish between two classes that share the same characteristics is not well solved by the triplet deep metric learning network for remote-sensing images. Thus, it is necessary to improve the capability to discriminate classes for remote-sensing images. Furthermore, it is crucial to create a lightweight network that can achieve both high accuracy for image retrieval and low computational cost, in terms of reducing the number of parameters in the learned model.
In this paper, our objective is to propose a lightweight retrieval solution for remote sensing images with two criteria: achieving high mAP while using only a small number of parameters compared to other existing methods. We inherit the EDML method proposed by Cao et al. [8] and propose enhancements to this method. Because we have two criteria for our work, our approach consists of two stages. First, we learn the shared features between classes that improve the discriminative features for the remote-sensing image retrieval problem, adopting the idea from Roth et al. [9]. Then, we reduce the model complexity by applying knowledge distillation [10].
We conduct extensive experiments on a popular dataset for remote-sensing image retrieval (the UCM dataset [11]). First, by mining the shared features between classes, adopted from [9], the retrieval performance with ResNet-101 and Margin loss [12] improves from 0.9750 to 0.9762 in mAP. These models have higher mAP results than the original EDML method [8] with 0.9663. Then, we reduce the complexity of the robust ResNet-101 model with 43.3M parameters by transferring the learned knowledge to the compact MobileNet v2 model with only 3.8M parameters. It is worth mentioning that, although we significantly reduce the number of parameters from 43.3M to 3.8M (more than 11 times), our model only slightly decreases in retrieval performance (mAP drops from 0.9762 to 0.9680).
The content of our paper is as follows. In Section II, we briefly present remote-sensing image retrieval, metric learning, and knowledge distillation. Our proposed method is presented in Section III. We present experimental results and ablation studies on our proposed method in Sections IV and V, respectively. Finally, conclusions and future work are discussed in Section VI.
II. RELATED WORKS
1) Remote-sensing image retrieval: The main objective of content-based image retrieval is to find powerful discriminative features from images. Previously, remote-sensing image retrieval methods relied on handcrafted features, which necessitate specialist expertise and take time. Global features such as color, texture, and shape features, as well as local features such as bag of visual words (BoVW) [11], vector of locally aggregated descriptors (VLAD) [13], and Fisher vector (FV) [14], are widely used as image representations in remote-sensing image retrieval works. The development of deep learning has significantly advanced content-based image retrieval. Based on the learning capacity of Convolutional Neural Networks (CNNs), the obtained semantic and robust feature representations have shown better performance in remote-sensing image retrieval than traditional handcrafted features [15, 16]. The development of large-scale remote-sensing image classification and retrieval datasets, such as UCM [11], AID [17], and PatternNet [16], has also boosted the development of content-based remote-sensing image retrieval.
2) Metric learning: Deep metric learning is a technique that combines deep learning and metric learning. A deep neural network aims to learn a mapping from the input image to a feature vector in the metric space. The metric loss function is one of the most important components in deep metric learning; such losses can be categorized into two types: contrastive loss and triplet loss. Given a triplet $(x_i, x_j, x_k)$, where $x_j$ is a similar sample to the reference $x_i$ and $x_k$ is a dissimilar sample to $x_i$, the triplet loss is defined as $l = \max(d(x_i, x_j) - d(x_i, x_k) + \alpha, 0)$, where $\alpha$ is a margin parameter and $d(x_i, x_j)$ is the Euclidean distance.
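As an illustration, the triplet loss can be written in a few lines of PyTorch (a minimal sketch; the batched-tensor interface and function name are ours):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # d(x_i, x_j): Euclidean distance between anchor and positive embeddings
    d_pos = F.pairwise_distance(anchor, positive)
    # d(x_i, x_k): Euclidean distance between anchor and negative embeddings
    d_neg = F.pairwise_distance(anchor, negative)
    # l = max(d_pos - d_neg + alpha, 0), averaged over the batch
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()
```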
Many studies have shown the effectiveness of applying deep metric learning (DML) to many problems such as image retrieval, visual search, and image classification. Recently, Roth et al. and Lin et al. showed that learning the shared features between classes improves the discriminative ability of the model [9, 18]. In the remote-sensing field, DML has also shown effectiveness in problems like classification [19] and image retrieval [20]. Cao et al. [8] showed that applying DML with triplet loss enhances results on remote-sensing retrieval tasks compared to traditional deep learning methods. Cao et al. [20] showed that combining DML and a GAN can achieve promising results on a small training dataset.

3) Knowledge distillation: Knowledge distillation is a process to distill the knowledge from a cumbersome model into a lightweight model without significant performance loss. The idea is that, instead of learning from labeled data in the traditional way, the student network (the lightweight model) tries to learn to predict like the teacher network (the powerful model with a heavy architecture, or an ensemble of models). Hinton et al. [10] propose a distillation method in which a student model trains with the objective of matching the distribution of the softmax output of the teacher model in classification problems. Tang et al. [21] demonstrated the efficiency of using the Mean Square Error (MSE) loss between the student's logits and the teacher's logits for knowledge distillation. Prior research has demonstrated that knowledge distillation is effective for semi-supervised learning [22], domain adaptation [23], and many other applications.
III. APPROACH FOR REMOTE SENSING IMAGE RETRIEVAL
The overview of our method is shown in Figure 2. Our approach contains two stages: first, we train a robust teacher model that has good discriminative ability on remote-sensing retrieval problems; then, we distill the knowledge of the teacher network, which is a cumbersome network, into a compact student network to reduce the complexity of the final model. As a result, we obtain a compact network that is able to predict good discriminative features for the remote-sensing image retrieval problem.
1) Stage 1: Training a robust teacher network: In the first stage, our primary goal is to train a robust network capable of extracting highly discriminative features. We follow the deep metric learning approach to achieve this goal.

For the first stage, we propose to replace the deep metric learning used in [8] with the mining interclass characteristics (MIC) method [9]. The overview of this stage's pipeline is shown in Figure 2 (left). We evaluate the advantage of this enhancement with different backbones in Table I. In this approach, there are three main components: a feature extractor, an encoder $E_\alpha$, and an encoder $E_\beta$. For an image $x \in \mathbb{R}^{H \times W \times 3}$, the feature extractor extracts the image representation $f(x) \in \mathbb{R}^d$. Then, the class-discriminative encoder $E_\alpha$ and the intra-class encoder $E_\beta$ learn from the shared features $f(x)$ with different purposes. These three components are trained jointly by the standard back-propagation algorithm.
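For concreteness, the three components can be sketched in PyTorch as follows (a minimal sketch: the class name, the 2048-d ResNet-50 pooled-feature size, and the single-linear-layer encoders are our assumptions, not details given in the paper):

```python
import torch.nn as nn
import torchvision.models as models

class TeacherModel(nn.Module):
    """A shared feature extractor f(x), a class-discriminative encoder
    E_alpha, and an intra-class (auxiliary) encoder E_beta."""
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()   # expose the 2048-d pooled features f(x)
        self.features = backbone
        self.e_alpha = nn.Linear(2048, embed_dim)  # class-discriminative encoder
        self.e_beta = nn.Linear(2048, embed_dim)   # intra-class encoder

    def forward(self, x):
        f = self.features(x)          # f(x) in R^d
        return self.e_alpha(f), self.e_beta(f)
```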
Fig. 2. Overview of our approach compared to conventional deep metric learning.
The class-discriminative encoder $E_\alpha$ aims to learn how to distinguish objects between different classes. This refers to a fully connected layer acting as a classifier. The $E_\alpha$ representation satisfies the properties of metric learning through the metric loss function $l_\alpha$. $E_\alpha$ can be trained on the provided ground-truth labels of the training dataset. The intra-class encoder $E_\beta$ aims to learn the characteristics shared between classes. Samples of the same class usually share many common features, such as color, context, and shape. To remove the characteristics shared within classes, a normalization guided by the ground-truth classes is applied. For each class $y$ in the training set, this approach computes the mean $\mu_y$ and standard deviation $\sigma_y$ based on the features $f(x_i), \forall x_i : y_i = y$. A new standardized image representation $Z = [z_1, \cdots, z_N]$ is then obtained with

$z_i = \frac{f(x_i) - \mu_{y_i}}{\sigma_{y_i}}$,

where the class influence is now reduced. Afterwards, the auxiliary encoder $E_\beta$ can be trained using the surrogate labels $[c_1, \cdots, c_N]$ produced by clustering the space $Z$ in the surrogate task. The intra-class encoder $E_\beta$ is also learned with the metric loss function $l_\beta$.
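A minimal NumPy/scikit-learn sketch of this class standardization and surrogate-label construction is given below (the function name is ours; the paper performs the clustering with faiss, as described in Section IV):

```python
import numpy as np
from sklearn.cluster import KMeans

def surrogate_labels(feats, labels, n_clusters=30):
    # Standardize per ground-truth class: z_i = (f(x_i) - mu_y) / sigma_y
    z = np.empty_like(feats)
    for y in np.unique(labels):
        idx = labels == y
        mu = feats[idx].mean(axis=0)
        sigma = feats[idx].std(axis=0) + 1e-8  # avoid division by zero
        z[idx] = (feats[idx] - mu) / sigma
    # Cluster the class-standardized space Z; the assignments [c_1, ..., c_N]
    # become the surrogate labels used to train E_beta
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z)
```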
Many variations of the metric learning loss have been proposed recently; among them, the Margin loss [12], which adds an additional margin $\beta$, has shown promising results and is used for $l_\alpha$ and $l_\beta$ in our approach. The Margin loss is expressed as

$l_{\text{margin}}(x_i, x_j) = \left[\alpha + \mu_{ij}\left(d\left(E(f(x_i)), E(f(x_j))\right) - \beta\right)\right]_+$,

where $\mu_{ij} = 1$ if the samples in the pair are similar and $\mu_{ij} = -1$ if they are different.
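A PyTorch sketch of this Margin loss, with $\beta$ treated as a fixed value at the initialization reported in Section IV (names are ours):

```python
import torch
import torch.nn.functional as F

def margin_loss(emb_i, emb_j, mu_ij, alpha=0.2, beta=1.2):
    # d(E(f(x_i)), E(f(x_j))): Euclidean distance between the embeddings
    d = F.pairwise_distance(emb_i, emb_j)
    # [alpha + mu_ij * (d - beta)]_+ with mu_ij = +1 (similar) or -1 (dissimilar)
    return torch.clamp(alpha + mu_ij * (d - beta), min=0).mean()
```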
As the two encoders share the same input $f(x)$, they will learn some similar characteristics. To reduce the similar characteristics shared between them, and to restrict the discriminative and shared characteristics to their own encoding spaces, the mutual information loss is applied:

$l_d = -\left\| E_\alpha^r(f(x)) \odot R\left(E_\beta^r(f(x))\right) \right\|^2$,

where $R$ is a function, learned as a two-layer fully-connected neural network, that maps $E_\alpha$ to the encoder space of $E_\beta$, $\odot$ is the element-wise (Hadamard) product, and $r$ stands for a gradient reversal layer. The main objective of this loss is to transfer non-discriminative characteristics to the intra-class encoder $E_\beta$. Finally, the total loss function to train all components in this method is computed as $L = l_\alpha + l_\beta + \gamma l_d$, where $\gamma$ weights the contribution of the mutual information loss in relation to the class metric loss $l_\alpha$ and the auxiliary metric loss $l_\beta$.
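Under the reconstruction of $l_d$ above, the gradient reversal layer and the mutual information loss could be sketched as follows (the 128-d sizes and module names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer r: identity in the forward pass,
    negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

# R: a two-layer fully connected network translating between the two
# encoding spaces (sizes are illustrative)
R = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

def mutual_info_loss(e_alpha, e_beta):
    # l_d = -|| r(E_alpha(f(x))) ⊙ R(r(E_beta(f(x)))) ||^2
    a = GradReverse.apply(e_alpha)
    b = GradReverse.apply(e_beta)
    return -(a * R(b)).pow(2).sum(dim=1).mean()
```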
2) Stage 2: Distilling knowledge to the student network: In this stage, our primary goal is to transfer the representation ability of the teacher network to the student network. The overview of this stage's pipeline is shown in Figure 2 (right). For the teacher network, we use the feature extractor and the class-discriminative encoder $E_\alpha$ adopted from the first stage. The student network is a combination of a lightweight network (i.e., ResNet-18 [24], MobileNet v2 [25]) and a fully connected layer.
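The surviving text does not spell out the distillation objective; the sketch below assumes a feature-matching MSE loss between teacher and student embeddings, in the spirit of the logit-matching MSE of Tang et al. [21] (the step function and its interface are ours):

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer):
    # The frozen teacher provides the target embeddings
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)
    # The lightweight student learns to reproduce them (MSE objective)
    pred = student(images)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```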
IV. EXPERIMENTS
A. Datasets and Evaluation Metrics
For the evaluation, we use the UC Merced Land Use (UCM) dataset [11], which is the most widely used benchmark for remote-sensing image retrieval problems. The UCM dataset contains 21 classes; each class has 100 images. All the images have a size of 256 × 256 pixels, and the spatial resolution of each pixel is 0.3 m. We follow the data splitting that yields the best performance in [26], which randomly selects 50% of the images of each class for training and the remaining 50% for performance evaluation.
For the similarity measurement, we use the Euclidean distance between the feature vectors corresponding to the images. The Euclidean distance is one of the most effective and widely used measures for similarity in image retrieval.
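As an illustration, this nearest-neighbor search amounts to the following (a sketch; the function and variable names are ours):

```python
import numpy as np

def retrieve(query_feat, db_feats, top_k=5):
    # Euclidean distance between the query feature and every database feature
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    # Indices of the top-k nearest images
    return np.argsort(dists)[:top_k]
```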
We use mean average precision (mAP) and precision at K (P@K) to evaluate retrieval performance; both are widely used for evaluating image retrieval models. The mAP is defined as follows:

$\mathrm{mAP} = \frac{\sum_{q=1}^{Q} \mathrm{AveP}(q)}{Q}$ (1)

The definition of AveP is:

$\mathrm{AveP} = \frac{\sum_{k=1}^{n} \left(P(k) \times \mathrm{rel}(k)\right)}{\text{number of relevant images}}$ (2)

where $Q$ is the number of all images in the dataset, $P(k)$ is the precision at cut-off $k$, and $\mathrm{rel}(k)$ is an indicator function: the precision is calculated for every returned image and weighted by the coefficient $\mathrm{rel}(k)$, which is 1 if the returned image at rank $k$ is relevant and 0 otherwise.
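These definitions translate directly into code (a sketch, with each query's ranked results given as a binary relevance list):

```python
import numpy as np

def average_precision(rel):
    # rel[k] = 1 if the image at rank k+1 is relevant, else 0
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # P(k): precision at cut-off k
    p_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    # AveP = sum_k P(k) * rel(k) / (number of relevant images)
    return (p_at_k * rel).sum() / rel.sum()

def mean_average_precision(rankings):
    # mAP: mean of AveP(q) over all Q queries
    return float(np.mean([average_precision(r) for r in rankings]))
```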
B. Experiment setup
For the experiment environment, we use Google Colab Pro with NVIDIA Tesla P100 and NVIDIA Tesla V100 GPUs for training and testing. The maximum number of training iterations is 100 epochs. For training the teacher model, we follow the setup of MIC described in the original paper [9]. Specifically, we train the model using Adam with a learning rate of 1e−5 and decrease the learning rate to 3e−5 when the training epoch reaches 50. We set the triplet parameters following [9], initializing $\beta = 1.2$ for the Margin loss and $\alpha = 0.2$ as a fixed triplet margin. For $\gamma$, we utilize values in the range [250, 2000]. During training, we randomly crop images to a size of 224 × 224 after resizing them to 256 × 256, followed by random horizontal flips.
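The described augmentation pipeline corresponds to the following torchvision transforms (the ImageNet normalization constants are our assumption; the paper does not state them):

```python
from torchvision import transforms

# Resize to 256 x 256, take a random 224 x 224 crop, flip horizontally
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```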
After class standardization, the clustering is performed via standard k-means using the faiss framework [27]. For efficiency, the clustering can be computed on the GPU using faiss [27]. The number of clusters is fixed before training to 30 for the UCM dataset [11]. We update the cluster labels every other epoch. The model is robust to both parameters, since many settings give comparable results. Later, in Section V, we study the effect of the number of clusters and the cluster label update frequency in more detail to motivate the chosen numbers. Finally, class assignments by clustering, especially in the initial training stages, become near arbitrary for samples far away from the cluster centers. To ensure that we do not reinforce such a strong initial bias, we follow the MIC method and ease the class constraint by randomly switching samples with samples from different cluster classes (with probability $p \leq 0.2$).

Fig. 3. Qualitative nearest neighbor evaluation for the UCM dataset based on $E_\alpha$ and $E_\beta$ encodings and their combination. The results show that $E_\beta$ leverages class-independent information (direction, color) while $E_\alpha$ becomes independent of those features and focuses on the class detection. The combination of the two reintroduces both.
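The GPU clustering can be done with the faiss k-means wrapper (a sketch; the iteration count and helper name are our choices):

```python
import numpy as np
import faiss

def cluster_assignments(z, n_clusters=30, use_gpu=True):
    # k-means over the class-standardized features Z with faiss [27]
    z = np.ascontiguousarray(z, dtype="float32")
    kmeans = faiss.Kmeans(z.shape[1], n_clusters, niter=20, gpu=use_gpu)
    kmeans.train(z)
    # Assign every sample to its nearest centroid; these assignments are
    # the surrogate cluster labels, refreshed every other epoch
    _, labels = kmeans.index.search(z, 1)
    return labels.ravel()
```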
For the results in Table II, $E_\alpha$ and $E_\beta$ have the same dimension, which varies among 128, 256, 512, and 1024. The backbone architecture is a ResNet-50 [24] model pretrained on ImageNet. For comparison, the features extracted by a conventional triplet deep metric learning network are used as the baseline.
For the knowledge distillation stage, we use the Adam optimizer with a learning rate of 1e−4. The maximum number of training iterations is 25 epochs. We compare both ResNet-18 [24] and MobileNet v2 [25] as the backbone of the student network.
C. Results and analysis

1) Training the teacher model: The overall results on the UCM dataset with different backbones are shown in Table I. For the baseline result of each backbone (denoted as DML in Table I), we use conventional deep metric learning with the same Margin loss [12]. Our enhanced method (denoted as +MIC) achieves better performance than the baseline solution with conventional triplet deep learning networks. As the network architecture becomes more complex, the results increase noticeably. Our approach with the ResNet-101 backbone achieves the best mAP result of 0.9762.
TABLE I
Results on the UCM dataset with different backbones.

Backbone      Method  mAP     P@5     P@10    P@50
ResNet-18     DML     0.9188  0.9558  0.9535  0.8802
              +MIC    0.9226  0.9589  0.9815  0.9389
ResNet-34     DML     0.9577  0.9714  0.9692  0.9214
              +MIC    0.9586  0.9733  0.9716  0.9225
ResNet-50     DML     0.9712  0.9819  0.9800  0.9381
              +MIC    0.9720  0.9846  0.9815  0.9389
ResNet-101    DML     0.9750  0.9811  0.9799  0.9444
              +MIC    0.9762  0.9823  0.9815  0.9450
MobileNet v2  DML     0.8883  0.9440  0.9344  0.8431
              +MIC    0.9186  0.9621  0.9541  0.8723
TABLE II
Results on the UCM dataset with different embedding dimensions (ResNet-50 backbone).

Method                  Dim  mAP     P@5     P@10    P@50
EDML (Cao et al. [20])  —    0.9663  0.9775  0.9757  0.9320
The overall results on the UCM dataset with different embedding dimensions and the ResNet-50 backbone are shown in Table II. For the baseline of each dimension, we use conventional deep metric learning with the same Margin loss [12] and ResNet-50 as the backbone as a reference. Compared to the deep metric learning with triplet loss by Cao et al. [20], our baseline with an enhanced version of the triplet loss (Margin loss [12]) achieves an mAP of 0.9712, higher than the original method using the same ResNet-50 architecture for feature extraction. In general, MIC-based features achieve the best performance for each dimension compared to conventional triplet deep learning networks. In addition, the higher the dimension, the better the image retrieval performance. There is a considerable difference in performance between the EDML and MIC methods at 256 dimensions, whereas at 512 dimensions there is only a slight difference. It can also be seen that MIC at 1024 dimensions achieves the best results, outperforming the others on all the evaluation metrics.
Qualitative results are shown in Figure 3: the class encoder $E_\alpha$ retrieves images sharing class-specific characteristics, while the auxiliary encoder $E_\beta$ finds intrinsic, class-independent object properties (e.g., direction, context). The combination retrieves images with both characteristics. To investigate in detail, qualitative results for several difficult query cases are presented in Figure 4, which shows the top-5 retrieved images that are similar to the query images, using features extracted by the conventional triplet deep metric learning network and by MIC, respectively. MIC-based features noticeably improve performance compared to conventional triplet deep metric learning on these cases. The results indicate that learning the shared features between classes can enhance remote-sensing retrieval performance.
2) Distilling knowledge to the student network: In this experiment, we distill the knowledge from the ResNet-101 backbone (our best model in terms of mAP, at an embedding size of 128) to the student network. The overall results on the UCM dataset with different student backbones are shown in Table III. For the baseline of each backbone, we use the result with that backbone obtained in the first stage. The student models distilled from the ResNet-101 teacher outperformed the baseline results. The MobileNet v2 backbone achieved the highest mAP result of 0.9680, 5.38% higher than its baseline. It is worth noting that, with 11.39 times fewer parameters, the MobileNet v2 backbone gives results only 0.82% lower than the teacher model with the ResNet-101 backbone.

Fig. 4. Top-5 retrieval results for UCM. Each figure part consists of two rows. The first image in each row is the query image; the first and second rows show the results using features extracted by the conventional triplet deep metric learning network and by MIC, respectively. Green and red left borders indicate correct and false results, respectively.
V. ABLATION STUDIES
In this section, we investigate the properties of our model and evaluate its components. Specifically, we study the first-stage model on the UCM dataset, including the evaluation of $E_\alpha$ as a function of the $E_\beta$ capacity, the influence of the number of clusters, and the influence of the cluster label update frequency.
TABLE III
Knowledge distillation results on the UCM dataset with different student backbones, compared with other methods.

Backbone      Method      #Params  mAP     P@5     P@10    P@50
ResNet-101    Baseline    43.3M    0.9762  0.9823  0.9815  0.9450
ResNet-18     Baseline    11.2M    0.9226  0.9589  0.9535  0.8802
              Distilled   11.2M    0.9672  0.9751  0.9725  0.9340
MobileNet v2  Baseline    3.8M     0.9186  0.9621  0.9541  0.8723
              Distilled   3.8M     0.9680  0.9751  0.9736  0.9365
              GCN [28]             0.6481  0.8712
              SGCN [28]            0.6989  0.9363
              MiLaN [28]           0.9040
VGG-16        EDML [8]             0.9487  0.9841  0.9687  0.9057
ResNet-50     EDML [8]             0.9663  0.9775  0.9757  0.9320
Fig. 5. Influence of the number of clusters on mAP. A fixed cluster label update period of 1 was used, with an equal learning rate and consistent scheduling.

Fig. 6. Influence of the cluster label update frequency on mAP.
To examine the number-of-clusters hyperparameter, Figure 5 compares the performance over a range of cluster numbers. The chart depicts how the number of clusters affects the final results, implying that the quality of the latent structure recovered by the auxiliary encoder $E_\beta$ is critical for improved classification. The MIC model performs best on the UCM dataset when the number of clusters is set to 30.

Figure 6 illustrates how the update frequency of the auxiliary labels affects the retrieval results. Frequently updating the auxiliary labels of the auxiliary encoder $E_\beta$ yields good results.
VI. CONCLUSION
Content-based remote-sensing image retrieval is key to the effective use of the ever-growing volume of remote-sensing images. In this paper, we show that learning the shared features between classes can enhance the retrieval performance for remote-sensing images. We evaluate the MIC method on the UCM dataset and achieve promising results compared to the conventional triplet deep metric learning network.

We also achieve the second objective of this work, significantly reducing the number of parameters in the learned models by applying the knowledge distillation approach. After training a robust teacher network with ResNet-101 as the backbone, which achieves an mAP of up to 0.9762, we train a lightweight student network with ResNet-18 or MobileNet v2 as the backbone. Our best model, with MobileNet v2, achieves an mAP of 0.9680 on the UCM dataset, even higher than the mAP of EDML [8], while the model has only 3.8M parameters.

However, our approach has some disadvantages, such as the number of hyperparameters and the training cost. The number of hyperparameters that need to be tuned is larger than that of deep metric learning with triplet loss: the number of clusters, the frequency of cluster updates, and the weight of the adversarial loss, and these parameters highly depend on the data. Some of these hyperparameters have a high impact on the final result. Although the number of model parameters does not increase much, the training time increases due to the clustering process.

For future work, evaluating other datasets (e.g., AID [17], PatternNet [16]), different network architectures (e.g., EfficientNet [29]), and other metric loss functions and sampling methods may give more comprehensive insight.
ACKNOWLEDGMENT
Khanh-An C. Quan was funded by Vingroup Joint Stock Company and supported by the Domestic Master/PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), code VINIF.2020.ThS.JVN.07.
REFERENCES
[1] J. Kang, D. Hong, J. Liu, G. Baier, N. Yokoya, and B. Demir, "Learning convolutional sparse coding on complex domain for interferometric phase restoration," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 826–840, 2020.
[2] R. Fernandez-Beltran, F. Pla, and A. Plaza, "Sentinel-2 and Sentinel-3 intersensor vegetation estimation via constrained topic modeling," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 10, pp. 1531–1535, 2019.
[3] J. Segarra, M. L. Buchaillot, J. L. Araus, and S. C. Kefauver, "Remote sensing for precision agriculture: Sentinel-2 improved features and applications," Agronomy, vol. 10, no. 5, p. 641, 2020.
[4] Z. Gong, P. Zhong, W. Hu, and Y. Hua, "Joint learning of the center points and deep metrics for land-use classification in remote sensing," Remote Sensing, vol. 11, no. 1, 2019.
[5] Z. Gong, P. Zhong, Y. Yu, and W. Hu, "Diversity-promoting deep structural metric learning for remote sensing scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 1, pp. 371–390, 2017.
[6] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, "When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 5, pp. 2811–2821, 2018.
[7] J. Kang, R. Fernandez-Beltran, Z. Ye, X. Tong, P. Ghamisi, and A. Plaza, "Deep metric learning based on scalable neighborhood components for remote sensing scene characterization," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8905–8918, 2020.
[8] R. Cao, Q. Zhang, J. Zhu, Q. Li, Q. Li, B. Liu, and G. Qiu, "Enhancing remote sensing image retrieval using a triplet deep metric learning network," International Journal of Remote Sensing, vol. 41, no. 2, pp. 740–751, 2020.
[9] K. Roth, B. Brattoli, and B. Ommer, "MIC: Mining interclass characteristics for improved metric learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8000–8009.
[10] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning and Representation Learning Workshop, 2015.
[11] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010, pp. 270–279.
[12] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, "Sampling matters in deep embedding learning," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
[13] S. Özkan, T. Ateş, E. Tola, M. Soysal, and E. Esen, "Performance analysis of state-of-the-art representation methods for geographical image retrieval and categorization," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 11, pp. 1996–2000, 2014.
[14] P. Napoletano, "Visual descriptors for content-based retrieval of remote-sensing images," International Journal of Remote Sensing, vol. 39, no. 5, pp. 1343–1376, 2018.
[15] Y. Li, Y. Zhang, X. Huang, and A. L. Yuille, "Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 146, pp. 182–196, 2018.
[16] W. Zhou, S. Newsam, C. Li, and Z. Shao, "PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 145, pp. 197–209, 2018.
[17] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, "AID: A benchmark data set for performance evaluation of aerial scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981, 2017.
[18] X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou, "Deep variational metric learning," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 689–704.
[19] G. Cheng, Z. Li, J. Han, X. Yao, and L. Guo, "Exploring hierarchical convolutional features for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 11, pp. 6712–6722, 2018.
[20] Y. Cao, Y. Wang, J. Peng, L. Zhang, L. Xu, K. Yan, and L. Li, "DML-GANR: Deep metric learning with generative adversarial network regularization for high spatial resolution remote sensing image retrieval," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8888–8904, 2020.
[21] R. Tang, Y. Lu, L. Liu, L. Mou, O. Vechtomova, and J. Lin, "Distilling task-specific knowledge from BERT into simple neural networks," arXiv preprint arXiv:1903.12136, 2019.
[22] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," arXiv preprint arXiv:1703.01780, 2017.
[23] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, "Adversarial teacher-student learning for unsupervised domain adaptation," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5949–5953.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[26] F. Ye, H. Xiao, X. Zhao, M. Dong, W. Luo, and W. Min, "Remote sensing image retrieval using convolutional neural network features and weighted distance," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 10, pp. 1535–1539, 2018.
[27] J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, 2019.
[28] U. Chaudhuri, B. Banerjee, and A. Bhattacharya, "Siamese graph convolutional network for content based remote sensing image retrieval," Computer Vision and Image Understanding, vol. 184, pp. 22–30, 2019.
[29] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.