U-Net Semantic Segmentation of Digital Maps
Using Google Satellite Images
Loi Nguyen-Khanh1,2, Vy Nguyen-Ngoc-Yen1,2, Hung Dinh-Quoc1,2
1Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
2Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
(nkloi, vy.nguyen2711, hung.dinh)@hcmut.edu.vn
Abstract—Satellite images constitute an enormous data warehouse and give us a detailed, general perspective of what is happening on the Earth's surface. These images are essential for agricultural development research, urban planning, surveying and, especially, for evaluating the location design of broadcast stations, where they serve as the input of coverage simulation and signal-quality evaluation in telecommunications. Analyzing large amounts of complex satellite imagery is challenging, but evolving semantic segmentation approaches based on convolutional neural networks (CNNs) can assist in analyzing this amount of data. In this paper, we introduce an approach for constructing digital maps from a dataset provided by Google. We utilize the efficient U-Net architecture, an efficient combination that uses EfficientNet, namely EfficientNet-B0, as the encoder to extract geographic features, with U-Net as the decoder to reconstruct the detailed feature maps. We evaluate our models on Google satellite images, which demonstrates their efficiency in terms of Dice Loss and Categorical Cross-Entropy.
Index Terms—Satellite Images, Digital Maps, Image Segmentation, Semantic Segmentation, EfficientNet, U-Net
I. INTRODUCTION
Digital maps store information on different types of terrain and are used to analyze map elements for road detection, forests, buildings, forestry research, and urban planning [1].
The authors in [2] performed satellite image segmentation and classification using convolutional neural networks (CNNs) with five labels: trees, vacant land, roads, buildings, and water. The pixel-by-pixel CNN methods they proposed are a single CNN and multiple CNNs. In addition, the study incorporated an averaged classification method to improve accuracy. With a dataset taken from DeepGlobe data, reference [3] proposed stacked U-Nets for line detection, using a hybrid loss function to address the problem of unbalanced classes in the training data. Another approach in [4] proposed the attention dilation LinkNet (AD-LinkNet) neural network, using an encoder-decoder structure, parallel-serial conjugate convolution, channel-wise attention, and a pre-trained encoder for semantic segmentation. Alternatively, Lim et al. [5] proposed CNN sets with an encoding-decoding architecture: a single short network (SSN), a single long network (SLN), and a double long network (DLN) differentiate between ground and background, implementing a comparison of topographic changes between two images. Kuo et al. [6] proposed a deep aggregation network to solve the task of land cover classification, which extracts and combines multi-layered features in the image partitioning process, introducing soft graph-based semantics to improve segmentation performance.
Although there are many approaches to satellite image analysis and the results have been very positive, in general, most of the subjects identified are single-class objects. Reference [2] performed five-class identification but did not combine the results to produce a complete digital map, and the classifications are quite simple. Moreover, using available datasets, almost all of which are not updated, reduces the significance of the analysis results for practical applications. The concern is therefore to find a source of data with high image quality that is regularly updated, along with processing methods that aggregate the analysis results into a digital map with high accuracy that meets the needs of the applications.
Most recent studies use datasets provided by DeepGlobe [1], [4], [6], [7], [12] for surface segmentation tasks: detection of roads, buildings, ships, grass, and water, and detecting topographic changes between different times. Audebert et al. [8] exploited data from OpenStreetMap and proved that this data source can be effectively integrated into deep learning models. Pascal Kaiser et al. [9] used OpenStreetMap as a reference for the semantic segmentation of images to classify buildings and roads using CNN architectures. Although OpenStreetMap supplies free data, in general its updating is limited because it depends on user contributions [10]. DeepGlobe provides a ready-made dataset of satellite imagery, which can be used to study specific tasks such as road extraction, building detection, and land cover classification [1], [3], [4], [6], [7], but it is not frequently updated. These data sources are therefore hardly suitable for building a digital map.
U-Net architectures are normally considered among the most powerful tools for image segmentation [14]. To further improve segmentation accuracy, Weng et al. [15] proposed a U-Net variant, NAS-UNet, which stacks several downsampling and upsampling cells on a U-like backbone network. There are many other approaches inspired by the U-Net network architecture. The authors in [16] designed the Res-UNet network model based on ResNet's ability to
(a) Fast-growing areas. Imagery date: 6/28/2021. (b) Slow-growing areas. Imagery date: 6/4/2020.
Fig. 1: Satellite imagery provided by Google with regular updates.
process complex images. The authors in [17] built a U-Net network with a VGG11 encoder to segment images. Reference [14] compared the U-Net architecture with the encoders VGG11, VGG13, VGG16, VGG19, ResNet18, DenseNet121, InceptionV3, and InceptionResNetV2. Mingxing Tan and Quoc V. Le [18] studied a scaled-up model from a ConvNet baseline called EfficientNet and determined that balancing the depth, width, and resolution would increase accuracy and improve performance compared to previous ConvNets in image classification. This network model has various versions ranging from B0 to B7 with different coefficients, and EfficientNet-B7 achieves the highest accuracy. Baheti et al. [19] proposed the efficient U-Net architecture, which combines EfficientNet as the encoder with the U-Net decoder to create a detailed segmentation map; EfficientNet-B7 achieved the highest accuracy in their test suite.
Inspired by the above discussions, the present paper aims at developing an efficient classification architecture to classify satellite images. To this end, we first collect image data from Google and perform image labeling, then propose an effective approach to segment satellite images and initially build a digital map. To obtain high classification accuracy, we develop an efficient model to classify satellite images into 12 classes by invoking the EfficientNet [20] and U-Net segmentation architectures [23].
II. SYSTEM DEVELOPMENT
A. Data Collection and Manual Labeling
Due to the ever-changing nature of human activities and the laws of nature, satellite imagery changes constantly, which makes it pointless to use existing, not regularly updated datasets in the construction of digital maps. Meanwhile, Google maintains an enormous and regularly updated set of satellite orthoimagery; examples are shown in Fig. 1 and Fig. 2. Taking advantage of these data, this work focuses on the segmentation of satellite images from Google to build digital maps.
Fig. 2: Examples of tiler layouts and zoom coefficients.
We download the satellite images in JPEG format from the Google server via the Google tiler. We manually label the satellite images with the classes: street, tree, water, residential, urban, buildings, industrial and commercial, vacant land in urban, sparse forest park, grass, agricultural, and sparse urban.
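For illustration, the mapping from geographic coordinates to tile indices in such a tiler follows the standard Web Mercator (XYZ, "slippy-map") tiling scheme; the sketch below assumes that scheme, and the URL template is a hypothetical placeholder, not the actual Google endpoint.

```python
import math

def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Convert WGS84 coordinates to XYZ tile indices at a given zoom level."""
    n = 2 ** zoom                            # tiles per axis at this zoom
    x = int((lon_deg + 180.0) / 360.0 * n)   # longitude -> tile column
    lat_rad = math.radians(lat_deg)
    # Web Mercator projection: latitude -> tile row
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Hypothetical URL template for illustration only; the real tile endpoint
# and its terms of use must be checked against Google's documentation.
def tile_url(x, y, zoom):
    return f"https://tile-server.example/vt?x={x}&y={y}&z={zoom}"
```

At zoom level z the world is divided into 2^z × 2^z tiles, which is why larger zoom coefficients in Fig. 2 yield finer tiles over the same area.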
B. Description of Applied Architectures
In this section, we summarize the encoder-decoder architecture for semantic segmentation, with EfficientNet-B0 as the encoder and U-Net as the decoder.

1) Encoder-Decoder Architecture: The encoder-decoder architecture includes a CNN that extracts the features from the input image, typically a modern neural network such as ResNet [16] or VGG [17]. However, these network models reduce the width and height of the input image to obtain the final feature map, and it is challenging to rebuild the segmentation map to the size of the
Fig. 3: A simple encoder-decoder network for semantic segmentation.
2021 8th NAFOSTED Conference on Information and Computer Science (NICS)
Fig. 4: Architecture of the efficient U-Net.
original image. The decoder section contains a set of layers that upsamples the feature map of the encoder to restore spatial information. A simple encoder-decoder network for semantic segmentation is shown in Fig. 3.
2) Feature Extraction: Convolutional neural networks are usually developed under a fixed resource budget and then scaled up to improve model performance. Depth scaling is the most common way to capture more complex features [20]. However, arbitrarily increasing the depth makes training more difficult and may not increase model performance, or may even decrease it [21]; the same holds for width and resolution. Tan et al. [20] proposed a new scaling method that uniformly scales all dimensions of depth, width, and resolution. They used a neural architecture search to design a new baseline network and scaled it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. The family includes models from B0 to B7, each with different equalization ratios and numbers of parameters.
The basic building block of EfficientNet is the Mobile Inverted Bottleneck Convolution (MBConv) [22], shown in Fig. 5. The architecture is divided into seven blocks
Fig. 5: Architecture of EfficientNet-B0 with MBConv as basic building blocks.
based on filter size, stride, and number of channels. Different EfficientNet models have different numbers of MBConv blocks. From EfficientNet-B0 to EfficientNet-B7, increasing the depth, width, resolution, and model dimension leads to an increase in the number of parameters, which makes the model stronger and gradually improves accuracy [20]. However, due to limited tool support, as well as the cost of computing a large number of parameters, which takes considerable work and processing time, our research restricts the encoder test to the EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2 architectures.
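The compound scaling idea behind this family can be sketched as follows (a sketch, not code from the paper): Tan and Le report base coefficients α = 1.2, β = 1.1, γ = 1.15 for depth, width, and resolution, chosen so that α·β²·γ² ≈ 2, i.e. each unit increase of the compound coefficient φ roughly doubles FLOPS.

```python
# Compound scaling sketch: a single coefficient phi scales network depth,
# width, and input resolution together. ALPHA, BETA, GAMMA are the base
# coefficients reported by Tan and Le for the EfficientNet family.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    depth_mult = ALPHA ** phi    # multiplier on the number of layers
    width_mult = BETA ** phi     # multiplier on the number of channels
    res_mult = GAMMA ** phi      # multiplier on the input resolution
    return depth_mult, width_mult, res_mult
```

EfficientNet-B0 corresponds to φ = 0 (all multipliers equal to 1); larger φ values yield the B1-B7 variants with their growing parameter counts.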
3) Network Architecture: U-Net is one of the most powerful integrated network architectures for fast and precise image segmentation, first published in 2015 for biomedical image segmentation [23]. It consists of an encoder and a decoder that together make the 'U' shape. The encoder, or contraction path, is a typical convolutional network that has convolution, activation, and pooling layers to capture the features of the input image. During encoding, spatial dimension information (height and width) is decreased while feature information is increased. The decoder, or expansion path, combines the features and spatial information through a series of convolution structures and joins the high-resolution features from the contracting path.

In the original U-Net, the expansion path is almost symmetrical with the contracting path [23]. In our research, we propose to use EfficientNet as the encoder instead of a set of conventional convolution layers; the decoder module is similar to the original U-Net. Details of the proposed architecture are illustrated in Fig. 4. The input image size is 1024x1024; the detailed architecture of the blocks in the encoder can be found in Fig. 5. First, we bilinearly upsample the feature map of the last logits in the encoder by a factor of two, then append the feature map from the encoder with the same spatial resolution. This is followed by 3 × 3 convolution layers before the result is again upsampled by a factor of two. This process is
repeated until a segmentation map of the same size as the original input image is recovered. The proposed architecture is asymmetric, unlike the original U-Net: the contracting path is deeper than the expansion path. Putting a powerful CNN like EfficientNet as the encoder improves the overall performance of the algorithm [19].
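One decoder step described above (upsample by two, concatenate the encoder skip feature of matching resolution) can be sketched shape-wise in NumPy. Nearest-neighbour upsampling stands in for the bilinear interpolation of the actual model, and the subsequent 3 × 3 convolutions are elided, to keep the sketch dependency-free.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling over (channels, height, width);
    # the actual decoder uses bilinear interpolation here.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(deep_feat, skip_feat):
    """Upsample the deeper feature map and append the encoder feature
    map with the same spatial resolution (channel-wise concatenation).
    A 3x3 convolution block would follow in the real decoder."""
    up = upsample2x(deep_feat)
    assert up.shape[1:] == skip_feat.shape[1:], "skip must match upsampled size"
    return np.concatenate([up, skip_feat], axis=0)
```

Repeating this step through the expansion path restores the segmentation map to the spatial size of the 1024x1024 input.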
C. Loss Functions

Loss functions play an essential role in determining model performance, and different loss functions can be used under various circumstances [13]. In this study, we select three loss functions suitable for the model:
1) Dice Loss: a measure of the overlap between corresponding pixel values of the prediction and the ground truth, widely used to assess segmentation performance [13]. The Dice Loss is defined as:

L_{DL}(y, \hat{y}) = 1 - \frac{2\sum_{i=1}^{n} y_i \hat{y}_i + 1}{\sum_{i=1}^{n} y_i + \sum_{i=1}^{n} \hat{y}_i + 1}. \quad (1)

Here \hat{y} is the predicted set of pixels and y is the ground truth. A 1 is added to the numerator and denominator to ensure that the function is not undefined in edge-case scenarios such as y = \hat{y} = 0 [13].
2) Categorical Cross-Entropy: a measure of the difference between two probability distributions for a given random variable or set of events, widely used for classification, especially pixel-level classification [13]:

L_{CCE}(y, \hat{y}) = -\sum_{i=1}^{n} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic}), \quad (2)

where C is the number of classes, y_{ic} is 1 if and only if sample i belongs to class c, and \hat{y}_{ic} is the output probability that sample i belongs to class c.
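The losses of Eqs. (1) and (2), together with their equally weighted average, can be sketched in NumPy as follows; `y` is a one-hot ground-truth array, `y_hat` the predicted probabilities, and the small `eps` (our addition, not part of the paper's formulas) guards the logarithm numerically.

```python
import numpy as np

def dice_loss(y, y_hat):
    # Eq. (1): the +1 in numerator and denominator keeps the loss
    # defined even when prediction and ground truth are both empty.
    intersection = np.sum(y * y_hat)
    return 1.0 - (2.0 * intersection + 1.0) / (np.sum(y) + np.sum(y_hat) + 1.0)

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # Eq. (2): y is one-hot of shape (n, C); y_hat holds class probabilities.
    return -np.sum(y * np.log(y_hat + eps))

def average_loss(y, y_hat):
    # Equally weighted combination of the two losses above.
    return 0.5 * (dice_loss(y, y_hat) + categorical_cross_entropy(y, y_hat))
```

A perfect prediction drives all three losses to (numerically) zero, which is the sanity check one would apply before training.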
3) Average Loss: the equally weighted combination of the two loss functions above:

L = \frac{1}{2}(L_{DL} + L_{CCE}). \quad (3)

III. EXPERIMENTAL RESULTS
We tested the original U-Net decoder with different backbones used for the encoder: VGG11 [17], ResNet18 [16], EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2. The results are shown in Table I. We use the loss functions outlined above to evaluate the models. It can easily be observed that EfficientNet-B0 gives the best results: 1.110 categorical cross-entropy loss, 0.731 Dice loss, and 0.997 average loss. At the same time, EfficientNet-B0 has only about 4M parameters, much fewer than the other models, which makes the computation simpler and minimizes effort and processing time.
To test the efficient U-Net B0 network model, we use 1,317 images for training and 304 images for validation. The data annotation tool we use is CVAT, which is provided by the OpenVINO Toolkit. In the training process, we set the learning rate to 0.0001, as shown in Fig. 6. The test result is shown in Fig. 7.
Fig. 6: Test graph of the efficient U-Net B0 network model.

TABLE I: Results for comparison of various encoder architectures with loss functions.
U-Net backbone | Total params | Categorical cross-entropy loss | Dice loss | Average loss
IV. CONCLUSION

Developing a semantic segmentation architecture to analyze the geographic structures in satellite imagery is a very challenging but meaningful task for real-world applications. This paper has conducted the segmentation of satellite images with 12 classes. In our research, we have considered a segmentation method, the efficient U-Net architecture, which makes use of the efficiency of EfficientNet as an encoder to extract features, with U-Net as a decoder to rebuild detailed feature maps. Although it has fewer parameters than other structures, EfficientNet-B0 still gives very positive results in the results table.
ACKNOWLEDGEMENTS

This research is funded by Ho Chi Minh City University of Technology - VNU-HCM under grant number T-ÐÐT-2020-45. We acknowledge the support of time and facilities from Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for this study.
REFERENCES
[1] I. Demir et al., "DeepGlobe 2018: A challenge to parse the earth through satellite images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), May 2018, pp. 172–209.
[2] M. Längkvist, A. Kiselev, M. Alirezaie, and A. Loutfi, "Classification and segmentation of satellite orthoimagery using convolutional neural networks," Remote Sensing, vol. 8, no. 4, p. 329, Apr. 2016.
Fig. 7: Results of semantic segmentation on the Google dataset with the proposed architecture. The first column shows the input images depicting different scenarios from an unstructured environment. The second and third columns show the ground truth and predicted segmentation maps, respectively, where different colors signify different classes.
[3] T. Sun, Z. Chen, W. Yang, and Y. Wang, "Stacked U-Nets with multi-output for road extraction," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 187–1874.
[4] M. Wu, C. Zhang, J. Liu, L. Zhou, and X. Li, "Towards accurate high resolution satellite image semantic segmentation," IEEE Access, vol. 7, pp. 55609–55619, 2019.
[5] K. Lim, D. Jin, and C. Kim, "Change detection in high resolution satellite images using an ensemble of convolutional neural networks," in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 509–515.
[6] T. Kuo, K. Tseng, J. Yan, Y. Liu, and Y. F. Wang, "Deep aggregation net for land cover classification," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 247–2474.
[7] S. Aich, W. van der Kamp, and I. Stavness, "Semantic binary segmentation using convolutional networks without decoders," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
[8] N. Audebert, B. Le Saux, and S. Lefèvre, "Joint learning from earth observation and OpenStreetMap data to get faster better semantic maps," in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[9] P. Kaiser, J. D. Wegner, A. Lucchi, M. Jaggi, T. Hofmann, and K. Schindler, "Learning aerial image segmentation from online maps," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 11, pp. 6054–6068, 2017.
[10] J.-F. Girres and G. Touya, "Quality assessment of the French OpenStreetMap dataset," Trans. GIS, vol. 14, no. 4, pp. 435–459, 2010.
[11] N. Baghdadi, C. Mallet, and M. Zribi, QGIS and Generic Tools. London, England: ISTE, 2018.
[12] K. Zhao, J. Kang, J. Jung, and G. Sohn, "Building extraction from satellite images using mask R-CNN with building boundary regularization," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
[13] S. Jadon, "A survey of loss functions for semantic segmentation," in 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2020.
[14] S. W. Chang and S. W. Liao, "KUnet: Microscopy image segmentation with deep UNet based convolutional networks," in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), 2019.
[15] Y. Weng, T. Zhou, Y. Li, and X. Qiu, "NAS-Unet: Neural architecture search for medical image segmentation," IEEE Access, vol. 7, pp. 44247–44257, 2019.
[16] Z. Chu, T. Tian, R. Feng, and L. Wang, "Sea-land segmentation with Res-UNet and fully connected CRF," in IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, 2019.
[17] V. Iglovikov and A. Shvets, "TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation," arXiv [cs.CV], 2018.
[18] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv [cs.LG], 2019.
[19] B. Baheti, S. Innani, S. Gajre, and S. Talbar, "Eff-UNet: A novel architecture for semantic segmentation in unstructured environment," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[21] S. Zagoruyko and N. Komodakis, "Wide residual networks," in Proceedings of the British Machine Vision Conference 2016, 2016.
[22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[23] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015, pp. 234–241.