A Lightweight Model for Remote Sensing
Image Retrieval with Knowledge Distillation and
Mining Interclass Characteristics
Khanh-An C. Quan∗, Vinh-Tiep Nguyen†, Minh-Triet Tran‡
∗† University of Information Technology, Ho Chi Minh City, Vietnam
∗‡ John von Neumann Institute, Ho Chi Minh City, Vietnam
∗‡ University of Science, Ho Chi Minh City, Vietnam
∗†‡ Vietnam National University, Ho Chi Minh City, Vietnam
Email: anqck@uit.edu.vn, tiepnv@uit.edu.vn, tmtriet@fit.hcmus.edu.vn
Abstract—There are more and more practical applications of remote sensing image retrieval in a wide variety of areas, such as land-cover analysis, ecosystem monitoring, and agriculture. It is essential to have a solution to this problem with both high accuracy and efficiency, e.g., small model size and low computational cost. This motivates us to propose a lightweight model for remote sensing image retrieval. We first employ interclass characteristic mining to train a cumbersome but robust model, aiming to boost the quality of the retrieval results. Then, from the complex model, we apply knowledge distillation to significantly reduce the neural network's size. Our experiments conducted on the UC Merced Land Use dataset demonstrate the advantage of our method. Our lightweight model achieves an mAP of 0.9680 with only 3.8M parameters. This model has a higher mAP and a lower number of parameters than the EDML method proposed by Cao et al.
Index Terms—remote-sensing, image retrieval, deep metric learning, knowledge distillation
I. INTRODUCTION
High-resolution remote-sensing photos have become widely available due to the advancement of technologies and remote sensors. This has opened up new opportunities for exploiting the data in a range of essential applications, such as land-cover analysis [1], ecosystem monitoring [2], and agriculture [3]. In fact, visual interpretation of remote-sensing scenes is still a challenging task because researchers need to deal with high intra-class and low inter-class variability [4].
Despite deep neural networks' excellent performance, the successful management of an extensive remote-sensing database is complicated by numerous issues caused by temporal differences, viewpoints, high resolution, and differing contents [5]. The ability to retrieve vast amounts of remote-sensing images is a crucial first step toward adequately managing enormous volumes of remote-sensing data [6]. Deep metric learning methods, in particular, have demonstrated remarkable success in characterizing complex remote-sensing data [7].
Fig. 1. Top-5 nearest neighbors of a query image in a triplet deep metric learning network. In the first query, we can see quite similar characteristics among the returned results, although some results belong to classes different from the query image (green left border if the returned image is in the same class as the query image; red left border otherwise).

It should be noticed that remote-sensing data have some common characteristics shared between classes. With a triplet deep metric learning network, although achieving relatively high results, when visualizing some failure cases we can see that there are quite similar characteristics between the returned results and the query, as shown in Figure 1. This shows that the ability to distinguish between two classes that share the same characteristics is not well solved by the triplet deep metric learning network for remote-sensing images. Thus, it is necessary to improve the capability to discriminate classes for remote-sensing images. Furthermore, it is crucial to create a lightweight network that can achieve both high accuracy for image retrieval and low computational cost, in terms of reducing the number of parameters in the learned model.
In this paper, our objective is to propose a lightweight retrieval solution for remote sensing images with two criteria: achieving high mAP while using only a small number of parameters compared to other existing methods. We inherit the EDML method proposed by Cao et al. [8] and propose enhancements to this method. Because we have two criteria for our work, our approach consists of two stages. First, we learn the shared features between classes that improve the discriminative features for the remote-sensing image retrieval problem, adopting the idea from Roth et al. [9]. Then, we reduce the model complexity by applying knowledge distillation [10].
We conduct extensive experiments on a popular dataset for remote-sensing image retrieval (the UCM dataset [11]). First, by mining the shared features between classes, adopted from [9], the retrieval performance with ResNet-101 and Margin loss [12] improves from 0.9750 to 0.9762 in mAP. These models have higher mAP results than the original EDML method [8] with 0.9663. Then, we reduce the complexity of the robust ResNet-101 model with 43.3M parameters by transferring the learned knowledge to the compact MobileNet v2 model with only 3.8M parameters. It is worth mentioning that, although we significantly reduce the number of parameters from 43.3M to 3.8M (more than 11 times), our model only slightly decreases in retrieval performance (mAP drops from 0.9762 to 0.9680).
The content of our paper is as follows. In Section II, we briefly present remote-sensing image retrieval, metric learning, and knowledge distillation. Our proposed method is presented in Section III. We present experimental results and ablation studies on our proposed method in Sections IV and V, respectively. Finally, conclusions and future work are discussed in Section VI.
II. RELATED WORKS
1) Remote-sensing image retrieval: The main objective of content-based image retrieval is to find powerful discriminative features from images. Previously, remote-sensing image retrieval methods relied on handcrafted features, which necessitate specialist expertise and take time. Global features such as color, texture, and shape features, as well as local features such as bag of visual words (BoVW) [11], vector of locally aggregated descriptors (VLAD) [13], and Fisher vector (FV) [14], are widely used as image representations in remote-sensing image retrieval works. The development of deep learning has significantly advanced content-based image retrieval. Based on the learning capacity of Convolutional Neural Networks (CNNs), the obtained semantic and robust feature representations have shown better performance in remote-sensing image retrieval than traditional handcrafted features [15, 16]. The development of large-scale remote-sensing image classification and retrieval datasets, such as UCM [11], AID [17], and PatternNet [16], has also boosted the development of content-based remote-sensing image retrieval.
2) Metric learning: Deep metric learning is a technique that combines deep learning and metric learning. A deep neural network aims to learn a mapping from the input image to a feature vector in the metric space. The metric loss function is one of the most important components in deep metric learning; such losses can be categorized into two types: contrastive loss and triplet loss. Given a triplet $(x_i, x_j, x_k)$, where $x_j$ is a similar sample to the reference $x_i$ and $x_k$ is a dissimilar sample to $x_i$, the triplet loss is defined as $l = \max(d(x_i, x_j) - d(x_i, x_k) + \alpha, 0)$, where $\alpha$ is a margin parameter and $d(x_i, x_j)$ is the Euclidean distance.
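As an illustration, the triplet loss can be written in a few lines of PyTorch (a minimal sketch; the batched-tensor interface and function name are ours):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # d(x_i, x_j): Euclidean distance between anchor and positive embeddings
    d_pos = F.pairwise_distance(anchor, positive)
    # d(x_i, x_k): Euclidean distance between anchor and negative embeddings
    d_neg = F.pairwise_distance(anchor, negative)
    # l = max(d_pos - d_neg + alpha, 0), averaged over the batch
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()
```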
Many studies have shown the effectiveness of applying deep metric learning (DML) to many problems such as image retrieval, visual search, and image classification. Recently, Roth et al. and Lin et al. showed that learning the shared features between classes improves the discriminative ability of the model [9, 18]. In the remote-sensing field, DML has also shown effectiveness in problems like classification [19] and image retrieval [20]. Cao et al. [8] showed that applying DML with triplet loss enhances results on remote-sensing retrieval tasks compared to traditional deep learning methods. Cao et al. [20] showed that combining DML and a GAN can achieve promising results on a small training dataset.

3) Knowledge distillation: Knowledge distillation is a process to distill the knowledge from a cumbersome model into a lightweight model without significant performance loss. The idea is that, instead of learning from labeled data in the traditional way, the student network (the lightweight model) tries to learn to predict like the teacher network (the powerful model with a heavy architecture, or an ensemble of models). Hinton et al. [10] propose a distillation method in which a student model trains with the objective of matching the distribution of the softmax output of the teacher model in classification problems. Tang et al. [21] demonstrated the efficiency of using the Mean Square Error (MSE) loss between the student's logits and the teacher's logits for knowledge distillation. Prior research has demonstrated that knowledge distillation is effective for semi-supervised learning [22], domain adaptation [23], and many other applications.
III. APPROACH FOR REMOTE SENSING IMAGE RETRIEVAL
The overview of our method is shown in Figure 2. Our approach contains two stages: first, we train a robust teacher model that has good discriminative ability on remote-sensing retrieval problems; then, we distill the knowledge of the teacher network, which is a cumbersome network, into a compact student network to reduce the complexity of the final model. As a result, we obtain a compact network that is able to predict good discriminative features for the remote-sensing image retrieval problem.
1) Stage 1: Training a robust teacher network: In the first stage, our primary goal is to train a robust network capable of extracting highly discriminative features. We follow the deep metric learning approach to achieve this goal.

For the first stage, we propose to replace the deep metric learning used in [8] with the mining interclass characteristics (MIC) method [9]. The overview of this stage's pipeline is shown in Figure 2 (left). We evaluate the advantage of this enhancement with different backbones in Table I. In this approach, there are three main components: a feature extractor, an encoder $E_\alpha$, and an encoder $E_\beta$. For an image $x \in \mathbb{R}^{H \times W \times 3}$, the feature extractor extracts the image representation $f(x) \in \mathbb{R}^d$. Then, the class-discriminative encoder $E_\alpha$ and the intra-class encoder $E_\beta$ learn from the shared features $f(x)$ with different purposes. These three components are trained jointly by the standard back-propagation algorithm.
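For concreteness, the three components can be sketched in PyTorch as follows (a minimal sketch: the class name, the 2048-d ResNet-50 pooled-feature size, and the single-linear-layer encoders are our assumptions, not details given in the paper):

```python
import torch.nn as nn
import torchvision.models as models

class TeacherModel(nn.Module):
    """A shared feature extractor f(x), a class-discriminative encoder
    E_alpha, and an intra-class (auxiliary) encoder E_beta."""
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()   # expose the 2048-d pooled features f(x)
        self.features = backbone
        self.e_alpha = nn.Linear(2048, embed_dim)  # class-discriminative encoder
        self.e_beta = nn.Linear(2048, embed_dim)   # intra-class encoder

    def forward(self, x):
        f = self.features(x)          # f(x) in R^d
        return self.e_alpha(f), self.e_beta(f)
```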
Fig. 2. Overview of our approach compared to conventional deep metric learning.
The class-discriminative encoder $E_\alpha$ aims to learn how to distinguish objects between different classes. This refers to a fully connected layer acting as a classifier. The $E_\alpha$ representation satisfies the properties of metric learning through the metric loss function $l_\alpha$. $E_\alpha$ can be trained on the provided ground-truth labels of the training dataset. The intra-class encoder $E_\beta$ aims to learn the characteristics shared between classes. Samples of the same class usually share many common features, such as color, context, and shape. To remove the characteristics shared within classes, a normalization guided by the ground-truth classes is applied. For each class $y$ in the training set, this approach computes the mean $\mu_y$ and standard deviation $\sigma_y$ based on the features $f(x_i), \forall x_i : y_i = y$. A new standardized image representation $Z = [z_1, \cdots, z_N]$ is then obtained with

$z_i = \frac{f(x_i) - \mu_{y_i}}{\sigma_{y_i}}$,

where the class influence is now reduced. Afterwards, the auxiliary encoder $E_\beta$ can be trained using the surrogate labels $[c_1, \cdots, c_N]$ produced by clustering the space $Z$ in the surrogate task. The intra-class encoder $E_\beta$ is also learned with the metric loss function $l_\beta$.
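A minimal NumPy/scikit-learn sketch of this class standardization and surrogate-label construction is given below (the function name is ours; the paper performs the clustering with faiss, as described in Section IV):

```python
import numpy as np
from sklearn.cluster import KMeans

def surrogate_labels(feats, labels, n_clusters=30):
    # Standardize per ground-truth class: z_i = (f(x_i) - mu_y) / sigma_y
    z = np.empty_like(feats)
    for y in np.unique(labels):
        idx = labels == y
        mu = feats[idx].mean(axis=0)
        sigma = feats[idx].std(axis=0) + 1e-8  # avoid division by zero
        z[idx] = (feats[idx] - mu) / sigma
    # Cluster the class-standardized space Z; the assignments [c_1, ..., c_N]
    # become the surrogate labels used to train E_beta
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z)
```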
Many variations of the metric learning loss have been proposed recently; among them, the Margin loss [12], which adds an additional margin $\beta$, has shown promising results and is used for $l_\alpha$ and $l_\beta$ in our approach. The Margin loss is expressed as

$l_{\text{margin}}(x_i, x_j) = \left[\alpha + \mu_{ij}\left(d\left(E(f(x_i)), E(f(x_j))\right) - \beta\right)\right]_+$,

where $\mu_{ij} = 1$ if the samples in the pair are similar and $\mu_{ij} = -1$ if they are different.
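A PyTorch sketch of this Margin loss, with $\beta$ treated as a fixed value at the initialization reported in Section IV (names are ours):

```python
import torch
import torch.nn.functional as F

def margin_loss(emb_i, emb_j, mu_ij, alpha=0.2, beta=1.2):
    # d(E(f(x_i)), E(f(x_j))): Euclidean distance between the embeddings
    d = F.pairwise_distance(emb_i, emb_j)
    # [alpha + mu_ij * (d - beta)]_+ with mu_ij = +1 (similar) or -1 (dissimilar)
    return torch.clamp(alpha + mu_ij * (d - beta), min=0).mean()
```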
As the two encoders share the same input $f(x)$, they will learn some similar characteristics. To reduce the similar characteristics shared between them, and to restrict the discriminative and shared characteristics to their own encoding spaces, the mutual information loss is applied:

$l_d = -\left\| E_\alpha^r(f(x)) \odot R\left(E_\beta^r(f(x))\right) \right\|^2$,

where $R$ is a function, learned as a two-layer fully-connected neural network, that maps $E_\alpha$ to the encoder space of $E_\beta$, $\odot$ is the element-wise (Hadamard) product, and $r$ stands for a gradient reversal layer. The main objective of this loss is to transfer non-discriminative characteristics to the intra-class encoder $E_\beta$. Finally, the total loss function to train all components in this method is computed as $L = l_\alpha + l_\beta + \gamma l_d$, where $\gamma$ weights the contribution of the mutual information loss in relation to the class metric loss $l_\alpha$ and the auxiliary metric loss $l_\beta$.
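Under the reconstruction of $l_d$ above, the gradient reversal layer and the mutual information loss could be sketched as follows (the 128-d sizes and module names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer r: identity in the forward pass,
    negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

# R: a two-layer fully connected network translating between the two
# encoding spaces (sizes are illustrative)
R = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

def mutual_info_loss(e_alpha, e_beta):
    # l_d = -|| r(E_alpha(f(x))) ⊙ R(r(E_beta(f(x)))) ||^2
    a = GradReverse.apply(e_alpha)
    b = GradReverse.apply(e_beta)
    return -(a * R(b)).pow(2).sum(dim=1).mean()
```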
2) Stage 2: Distilling knowledge to the student network: In this stage, our primary goal is to transfer the representation ability of the teacher network to the student network. The overview of this stage's pipeline is shown in Figure 2 (right). For the teacher network, we use the feature extractor and the class-discriminative encoder $E_\alpha$ adopted from the first stage. The student network is a combination of a lightweight network (i.e., ResNet-18 [24], MobileNet v2 [25]) and a fully connected layer.
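The surviving text does not spell out the distillation objective; the sketch below assumes a feature-matching MSE loss between teacher and student embeddings, in the spirit of the logit-matching MSE of Tang et al. [21] (the step function and its interface are ours):

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer):
    # The frozen teacher provides the target embeddings
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)
    # The lightweight student learns to reproduce them (MSE objective)
    pred = student(images)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```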
IV. EXPERIMENTS
A. Datasets and Evaluation Metrics
For the evaluation, we use the UC Merced Land Use (UCM) dataset [11], which is the most widely used benchmark for remote-sensing image retrieval problems. The UCM dataset contains 21 classes; each class has 100 images. All the images have a size of 256 × 256 pixels, and the spatial resolution of each pixel is 0.3 m. We follow the data splitting that yields the best performance in [26], which randomly selects 50% of the images of each class for training and the remaining 50% for performance evaluation.
For the similarity measurement, we use the Euclidean distance between the feature vectors corresponding to the images. The Euclidean distance is one of the most effective and widely used measures for similarity in image retrieval.
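As an illustration, this nearest-neighbor search amounts to the following (a sketch; the function and variable names are ours):

```python
import numpy as np

def retrieve(query_feat, db_feats, top_k=5):
    # Euclidean distance between the query feature and every database feature
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    # Indices of the top-k nearest images
    return np.argsort(dists)[:top_k]
```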
We use mean average precision (mAP) and precision at K (P@K) to evaluate retrieval performance; both are widely used for evaluating image retrieval models. The mAP is defined as follows:

$\mathrm{mAP} = \frac{\sum_{q=1}^{Q} \mathrm{AveP}(q)}{Q}$ (1)

The definition of AveP is:

$\mathrm{AveP} = \frac{\sum_{k=1}^{n} \left(P(k) \times \mathrm{rel}(k)\right)}{\text{number of relevant images}}$ (2)

where $Q$ is the number of all images in the dataset, $P(k)$ is the precision at cut-off $k$, and $\mathrm{rel}(k)$ is an indicator function: the precision is calculated for every returned image and weighted by the coefficient $\mathrm{rel}(k)$, which is 1 if the returned image at rank $k$ is relevant and 0 otherwise.
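These definitions translate directly into code (a sketch, with each query's ranked results given as a binary relevance list):

```python
import numpy as np

def average_precision(rel):
    # rel[k] = 1 if the image at rank k+1 is relevant, else 0
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # P(k): precision at cut-off k
    p_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    # AveP = sum_k P(k) * rel(k) / (number of relevant images)
    return (p_at_k * rel).sum() / rel.sum()

def mean_average_precision(rankings):
    # mAP: mean of AveP(q) over all Q queries
    return float(np.mean([average_precision(r) for r in rankings]))
```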
B. Experiment setup
For the experiment environment, we use Google Colab Pro with NVIDIA Tesla P100 and NVIDIA Tesla V100 GPUs for training and testing. The maximum number of training iterations is 100 epochs. For training the teacher model, we follow the setup of MIC described in the original paper [9]. Specifically, we train the model using Adam with a learning rate of 1e−5 and decrease the learning rate to 3e−5 when the training epoch reaches 50. We set the triplet parameters following [9], initializing $\beta = 1.2$ for the Margin loss and $\alpha = 0.2$ as a fixed triplet margin. For $\gamma$, we utilize values in the range [250, 2000]. During training, we randomly crop images to a size of 224 × 224 after resizing them to 256 × 256, followed by random horizontal flips.
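The described augmentation pipeline corresponds to the following torchvision transforms (the ImageNet normalization constants are our assumption; the paper does not state them):

```python
from torchvision import transforms

# Resize to 256 x 256, take a random 224 x 224 crop, flip horizontally
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```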
After class standardization, the clustering is performed via standard k-means using the faiss framework [27]. For efficiency, the clustering can be computed on the GPU using faiss [27]. The number of clusters is fixed before training to 30 for the UCM dataset [11]. We update the cluster labels every other epoch. The model is robust to both parameters, since many settings give comparable results. Later, in Section V, we study the effect of the number of clusters and the cluster label update frequency in more detail to motivate the chosen numbers. Finally, class assignments by clustering, especially in the initial training stages, become near arbitrary for samples far away from the cluster centers. To ensure that we do not reinforce such a strong initial bias, we follow the MIC method and ease the class constraint by randomly switching samples with samples from different cluster classes (with probability $p \leq 0.2$).

Fig. 3. Qualitative nearest neighbor evaluation for the UCM dataset based on $E_\alpha$ and $E_\beta$ encodings and their combination. The results show that $E_\beta$ leverages class-independent information (direction, color) while $E_\alpha$ becomes independent of those features and focuses on the class detection. The combination of the two reintroduces both.
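The GPU clustering can be done with the faiss k-means wrapper (a sketch; the iteration count and helper name are our choices):

```python
import numpy as np
import faiss

def cluster_assignments(z, n_clusters=30, use_gpu=True):
    # k-means over the class-standardized features Z with faiss [27]
    z = np.ascontiguousarray(z, dtype="float32")
    kmeans = faiss.Kmeans(z.shape[1], n_clusters, niter=20, gpu=use_gpu)
    kmeans.train(z)
    # Assign every sample to its nearest centroid; these assignments are
    # the surrogate cluster labels, refreshed every other epoch
    _, labels = kmeans.index.search(z, 1)
    return labels.ravel()
```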
For the results in Table II, $E_\alpha$ and $E_\beta$ have the same dimension, which varies among 128, 256, 512, and 1024. The backbone architecture is a ResNet-50 [24] model pretrained on ImageNet. For comparison, the features extracted by a conventional triplet deep metric learning network are used as the baseline.
For the knowledge distillation stage, we use the Adam optimizer with a learning rate of 1e−4. The maximum number of training iterations is 25 epochs. We compare both ResNet-18 [24] and MobileNet v2 [25] as the backbone of the student network.
C. Results and analysis

1) Training the teacher model: The overall results on the UCM dataset with different backbones are shown in Table I. For the baseline result of each backbone (denoted as DML in Table I), we use conventional deep metric learning with the same Margin loss [12]. Our enhanced method (denoted as +MIC) achieves better performance than the baseline solution with conventional triplet deep learning networks. As the network architecture becomes more complex, the results increase noticeably. Our approach with the ResNet-101 backbone achieves the best mAP result of 0.9762.
TABLE I
Results on the UCM dataset with different backbones.

Backbone      Method  mAP     P@5     P@10    P@50
ResNet-18     DML     0.9188  0.9558  0.9535  0.8802
              +MIC    0.9226  0.9589  0.9815  0.9389
ResNet-34     DML     0.9577  0.9714  0.9692  0.9214
              +MIC    0.9586  0.9733  0.9716  0.9225
ResNet-50     DML     0.9712  0.9819  0.9800  0.9381
              +MIC    0.9720  0.9846  0.9815  0.9389
ResNet-101    DML     0.9750  0.9811  0.9799  0.9444
              +MIC    0.9762  0.9823  0.9815  0.9450
MobileNet v2  DML     0.8883  0.9440  0.9344  0.8431
              +MIC    0.9186  0.9621  0.9541  0.8723
TABLE II
Results on the UCM dataset with different embedding dimensions (ResNet-50 backbone).

Method                  Dim  mAP     P@5     P@10    P@50
EDML (Cao et al. [20])  —    0.9663  0.9775  0.9757  0.9320
The overall results on the UCM dataset with different embedding dimensions and the ResNet-50 backbone are shown in Table II. For the baseline of each dimension, we use conventional deep metric learning with the same Margin loss [12] and ResNet-50 as the backbone as a reference. Compared to the deep metric learning with triplet loss by Cao et al. [20], our baseline with an enhanced version of the triplet loss (Margin loss [12]) achieves an mAP of 0.9712, higher than the original method using the same ResNet-50 architecture for feature extraction. In general, MIC-based features achieve the best performance for each dimension compared to conventional triplet deep learning networks. In addition, the higher the dimension, the better the image retrieval performance. There is a considerable difference in performance between the EDML and MIC methods at 256 dimensions, whereas at 512 dimensions there is only a slight difference. It can also be seen that MIC at 1024 dimensions achieves the best results, outperforming the others on all the evaluation metrics.
Qualitative results are shown in Figure 3: the class encoder $E_\alpha$ retrieves images sharing class-specific characteristics, while the auxiliary encoder $E_\beta$ finds intrinsic, class-independent object properties (e.g., direction, context). The combination retrieves images with both characteristics. To investigate in detail, qualitative results for several difficult query cases are presented in Figure 4, which shows the top-5 retrieved images that are similar to the query images, using features extracted by the conventional triplet deep metric learning network and by MIC, respectively. MIC-based features noticeably improve performance compared to conventional triplet deep metric learning on these cases. The results indicate that learning the shared features between classes can enhance remote-sensing retrieval performance.
2) Distilling knowledge to the student network: In this experiment, we distill the knowledge from the ResNet-101 backbone (our best model in terms of mAP, at an embedding size of 128) to the student network. The overall results on the UCM dataset with different student backbones are shown in Table III. For the baseline of each backbone, we use the result with that backbone obtained in the first stage. The student models distilled from the ResNet-101 teacher outperformed the baseline results. The MobileNet v2 backbone achieved the highest mAP result of 0.9680, 5.38% higher than its baseline. It is worth noting that, with 11.39 times fewer parameters, the MobileNet v2 backbone gives results only 0.82% lower than the teacher model with the ResNet-101 backbone.

Fig. 4. Top-5 retrieval results for UCM. Each figure part consists of two rows. The first image in each row is the query image; the first and second rows show the results using features extracted by the conventional triplet deep metric learning network and by MIC, respectively. Green and red left borders indicate correct and false results, respectively.
V. ABLATION STUDIES
In this section, we investigate the properties of our model and evaluate its components. Specifically, we study the first-stage model on the UCM dataset, including the evaluation of $E_\alpha$ as a function of the $E_\beta$ capacity, the influence of the number of clusters, and the influence of the cluster label update frequency.
TABLE III
Knowledge distillation results on the UCM dataset with different student backbones, compared with other methods.

Backbone      Method      #Params  mAP     P@5     P@10    P@50
ResNet-101    Baseline    43.3M    0.9762  0.9823  0.9815  0.9450
ResNet-18     Baseline    11.2M    0.9226  0.9589  0.9535  0.8802
              Distilled   11.2M    0.9672  0.9751  0.9725  0.9340
MobileNet v2  Baseline    3.8M     0.9186  0.9621  0.9541  0.8723
              Distilled   3.8M     0.9680  0.9751  0.9736  0.9365
              GCN [28]             0.6481  0.8712
              SGCN [28]            0.6989  0.9363
              MiLaN [28]           0.9040
VGG-16        EDML [8]             0.9487  0.9841  0.9687  0.9057
ResNet-50     EDML [8]             0.9663  0.9775  0.9757  0.9320
Fig. 5. Influence of the number of clusters on mAP. A fixed cluster label update period of 1 was used, with an equal learning rate and consistent scheduling.

Fig. 6. Influence of the cluster label update frequency on mAP.
To examine the number-of-clusters hyperparameter, Figure 5 compares the performance over a range of cluster numbers. The chart depicts how the number of clusters affects the final results, implying that the quality of the latent structure recovered by the auxiliary encoder $E_\beta$ is critical for improved classification. The MIC model performs best on the UCM dataset when the number of clusters is set to 30.

Figure 6 illustrates how the update frequency of the auxiliary labels affects the retrieval results. Frequently updating the auxiliary labels of the auxiliary encoder $E_\beta$ yields good results.
VI. CONCLUSION
Content-based remote-sensing image retrieval is key to the effective use of the ever-growing volume of remote-sensing images. In this paper, we show that learning the shared features between classes can enhance the retrieval performance for remote-sensing images. We evaluate the MIC method on the UCM dataset and achieve promising results compared to the conventional triplet deep metric learning network.

We also achieve the second objective of this work, significantly reducing the number of parameters in the learned models by applying the knowledge distillation approach. After training a robust teacher network with ResNet-101 as the backbone, which achieves an mAP of up to 0.9762, we train a lightweight student network with ResNet-18 or MobileNet v2 as the backbone. Our best model, with MobileNet v2, achieves an mAP of 0.9680 on the UCM dataset, even higher than the mAP of EDML [8], while the model has only 3.8M parameters.

However, our approach has some disadvantages, such as the number of hyperparameters and the training cost. The number of hyperparameters that need to be tuned is larger than that of deep metric learning with triplet loss: the number of clusters, the frequency of cluster updates, and the weight of the adversarial loss, and these parameters highly depend on the data. Some of these hyperparameters have a high impact on the final result. Although the number of model parameters does not increase much, the training time increases due to the clustering process.

For future work, evaluating other datasets (e.g., AID [17], PatternNet [16]), different network architectures (e.g., EfficientNet [29]), and other metric loss functions and sampling methods may give more comprehensive insight.
ACKNOWLEDGMENT
Khanh-An C. Quan was funded by Vingroup Joint Stock Company and supported by the Domestic Master/PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), code VINIF.2020.ThS.JVN.07.
REFERENCES
[1] J. Kang, D. Hong, J. Liu, G. Baier, N. Yokoya, and B. Demir, "Learning convolutional sparse coding on complex domain for interferometric phase restoration," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 826–840, 2020.
[2] R. Fernandez-Beltran, F. Pla, and A. Plaza, "Sentinel-2 and Sentinel-3 intersensor vegetation estimation via constrained topic modeling," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 10, pp. 1531–1535, 2019.
[3] J. Segarra, M. L. Buchaillot, J. L. Araus, and S. C. Kefauver, "Remote sensing for precision agriculture: Sentinel-2 improved features and applications," Agronomy, vol. 10, no. 5, p. 641, 2020.
[4] Z. Gong, P. Zhong, W. Hu, and Y. Hua, "Joint learning of the center points and deep metrics for land-use classification in remote sensing," Remote Sensing, vol. 11, no. 1, 2019.
[5] Z. Gong, P. Zhong, Y. Yu, and W. Hu, "Diversity-promoting deep structural metric learning for remote sensing scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 1, pp. 371–390, 2017.
[6] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, "When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 5, pp. 2811–2821, 2018.
[7] J. Kang, R. Fernandez-Beltran, Z. Ye, X. Tong, P. Ghamisi, and A. Plaza, "Deep metric learning based on scalable neighborhood components for remote sensing scene characterization," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8905–8918, 2020.
[8] R. Cao, Q. Zhang, J. Zhu, Q. Li, Q. Li, B. Liu, and G. Qiu, "Enhancing remote sensing image retrieval using a triplet deep metric learning network," International Journal of Remote Sensing, vol. 41, no. 2, pp. 740–751, 2020.
[9] K. Roth, B. Brattoli, and B. Ommer, "MIC: Mining interclass characteristics for improved metric learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8000–8009.
[10] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning and Representation Learning Workshop, 2015.
[11] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010, pp. 270–279.
[12] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, "Sampling matters in deep embedding learning," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
[13] S. Özkan, T. Ateş, E. Tola, M. Soysal, and E. Esen, "Performance analysis of state-of-the-art representation methods for geographical image retrieval and categorization," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 11, pp. 1996–2000, 2014.
[14] P. Napoletano, "Visual descriptors for content-based retrieval of remote-sensing images," International Journal of Remote Sensing, vol. 39, no. 5, pp. 1343–1376, 2018.
[15] Y. Li, Y. Zhang, X. Huang, and A. L. Yuille, "Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 146, pp. 182–196, 2018.
[16] W. Zhou, S. Newsam, C. Li, and Z. Shao, "PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 145, pp. 197–209, 2018.
[17] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, "AID: A benchmark data set for performance evaluation of aerial scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981, 2017.
[18] X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou, "Deep variational metric learning," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 689–704.
[19] G. Cheng, Z. Li, J. Han, X. Yao, and L. Guo, "Exploring hierarchical convolutional features for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 11, pp. 6712–6722, 2018.
[20] Y. Cao, Y. Wang, J. Peng, L. Zhang, L. Xu, K. Yan, and L. Li, "DML-GANR: Deep metric learning with generative adversarial network regularization for high spatial resolution remote sensing image retrieval," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8888–8904, 2020.
[21] R. Tang, Y. Lu, L. Liu, L. Mou, O. Vechtomova, and J. Lin, "Distilling task-specific knowledge from BERT into simple neural networks," arXiv preprint arXiv:1903.12136, 2019.
[22] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," arXiv preprint arXiv:1703.01780, 2017.
[23] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, "Adversarial teacher-student learning for unsupervised domain adaptation," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5949–5953.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[26] F. Ye, H. Xiao, X. Zhao, M. Dong, W. Luo, and W. Min, "Remote sensing image retrieval using convolutional neural network features and weighted distance," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 10, pp. 1535–1539, 2018.
[27] J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, 2019.
[28] U. Chaudhuri, B. Banerjee, and A. Bhattacharya, "Siamese graph convolutional network for content based remote sensing image retrieval," Computer Vision and Image Understanding, vol. 184, pp. 22–30, 2019.
[29] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.