U-Net Semantic Segmentation of Digital Maps
Using Google Satellite Images
Loi Nguyen-Khanh1,2, Vy Nguyen-Ngoc-Yen1,2, Hung Dinh-Quoc1,2
1Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
2Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
(nkloi, vy.nguyen2711, hung.dinh)@hcmut.edu.vn
Abstract—Satellite images constitute an enormous data warehouse and give us a detailed, general perspective of what is happening on the Earth's surface. These images are essential for agricultural development research, urban planning, surveying and, especially, for evaluating the location design of broadcast stations, where they serve as the input of coverage simulation and signal-quality evaluation in telecommunications. Analyzing large amounts of complex satellite imagery is challenging, but evolving semantic segmentation approaches based on convolutional neural networks (CNNs) can assist in analyzing this amount of data. In this paper, we introduce an approach for constructing digital maps from a dataset provided by Google. We utilize the efficient U-Net architecture, an efficient combination that uses EfficientNet, namely EfficientNet-B0, as the encoder to extract geographic features, with U-Net as the decoder to reconstruct the detailed feature maps. We evaluate our models on Google satellite images, which demonstrates their efficiency in terms of Dice Loss and Categorical Cross-Entropy.
Index Terms—Satellite Images, Digital Maps, Image Segmentation, Semantic Segmentation, EfficientNet, U-Net
I. INTRODUCTION
Digital maps store information on different types of terrain and are used to analyze map elements for road detection, forests, buildings, forestry research, and urban planning [1].
The authors in [2] performed satellite image segmentation and classification using convolutional neural networks (CNNs) with five labels: trees, vacant land, roads, buildings, and water. The pixel-by-pixel CNN methods they proposed are a single CNN and multiple CNNs. In addition, the study incorporated an averaged classification method to improve accuracy. With a dataset taken from DeepGlobe data, reference [3] proposed stacked U-Nets for line detection, using a hybrid loss function to address the problem of unbalanced classes in the training data. Another approach in [4] proposed the attention dilation LinkNet (AD-LinkNet) neural network, using an encoder-decoder structure, parallel-serial conjugate convolution, channel-wise attention, and a pre-trained encoder for semantic segmentation. Alternatively, Lim et al. [5] proposed CNN sets with an encoding-decoding architecture: a single short network (SSN), a single long network (SLN), and a double long network (DLN) differentiate between ground and background, implementing a comparison of topographic changes between two images. Kuo et al. [6] proposed a deep aggregation network to solve the task of land cover classification, which extracts and combines multi-layered features in the image partitioning process, introducing soft graph-based semantics to improve segmentation performance.
Although there are many approaches to satellite image analysis and the results have been very positive, in general, most of the subjects identified are single-class objects. Reference [2] performed five-class identification but did not combine the results to produce a complete digital map, and the classifications are quite simple. Moreover, using available datasets, almost all of which are not updated, reduces the significance of the analysis results for practical applications. The concern is therefore to find a source of data with high image quality that is regularly updated, along with processing methods that aggregate the analysis results into a digital map with high accuracy that meets the needs of the applications.
Most recent studies use datasets provided by DeepGlobe [1], [4], [6], [7], [12] for surface segmentation tasks: detection of roads, buildings, ships, grass, and water, and detecting topographic changes between different times. Audebert et al. [8] exploited data from OpenStreetMap and proved that this data source can be effectively integrated into deep learning models. Pascal Kaiser et al. [9] used OpenStreetMap as a reference for the semantic segmentation of images to classify buildings and roads using CNN architectures. Although OpenStreetMap supplies free data, in general its updating is limited because it depends on user contributions [10]. DeepGlobe provides a ready-made dataset of satellite imagery, which can be used to study specific tasks such as road extraction, building detection, and land cover classification [1], [3], [4], [6], [7], but it is not frequently updated. These data sources are therefore hardly suitable for building a digital map.
U-Net architectures are normally considered among the most powerful tools for image segmentation [14]. To further improve segmentation accuracy, Weng et al. [15] proposed a U-Net variant, NAS-UNet, which stacks several downsampling and upsampling cells on a U-like backbone network. There are many other approaches inspired by the U-Net network architecture. The authors in [16] designed the Res-UNet network model based on ResNet's ability to
(a) Fast-growing areas. Imagery date: 6/28/2021. (b) Slow-growing areas. Imagery date: 6/4/2020.
Fig. 1: Satellite imagery provided by Google with regular updates.
process complex images. The authors in [17] built a U-Net network with a VGG11 encoder to segment images. Reference [14] compared the U-Net architecture with the encoders VGG11, VGG13, VGG16, VGG19, ResNet18, DenseNet121, InceptionV3, and InceptionResNetV2. Mingxing Tan and Quoc V. Le [18] studied a scaled-up model from a ConvNet baseline called EfficientNet and determined that balancing the depth, width, and resolution would increase accuracy and improve performance compared to previous ConvNets in image classification. This network model has various versions ranging from B0 to B7 with different coefficients, and EfficientNet-B7 achieves the highest accuracy. Baheti et al. [19] proposed the efficient U-Net architecture, which combines EfficientNet as the encoder with the U-Net decoder to create a detailed segmentation map; EfficientNet-B7 achieved the highest accuracy in their test suite.
Inspired by the above discussions, the present paper aims at developing an efficient classification architecture to classify satellite images. To this end, we first collect image data from Google and perform image labeling, then propose an effective approach to segment satellite images and initially build a digital map. To obtain high classification accuracy, we develop an efficient model to classify satellite images into 12 classes by invoking the EfficientNet [20] and U-Net segmentation architectures [23].
II. SYSTEM DEVELOPMENT
A. Data Collection and Manual Labeling
Due to the ever-changing nature of human activities and the laws of nature, satellite imagery changes constantly, which makes it pointless to use existing, not regularly updated datasets in the construction of digital maps. Meanwhile, Google maintains an enormous and regularly updated set of satellite orthoimagery; examples are shown in Fig. 1 and Fig. 2. Taking advantage of these data, this work focuses on the segmentation of satellite images from Google to build digital maps.
Fig. 2: Examples of tiler layouts and zoom coefficients.
We download the satellite images in JPEG format from the Google server via the Google tiler. We manually label the satellite images with the classes: street, tree, water, residential, urban, buildings, industrial and commercial, vacant land in urban, sparse forest park, grass, agricultural, and sparse urban.
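For illustration, the mapping from geographic coordinates to tile indices in such a tiler follows the standard Web Mercator (XYZ, "slippy-map") tiling scheme; the sketch below assumes that scheme, and the URL template is a hypothetical placeholder, not the actual Google endpoint.

```python
import math

def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Convert WGS84 coordinates to XYZ tile indices at a given zoom level."""
    n = 2 ** zoom                            # tiles per axis at this zoom
    x = int((lon_deg + 180.0) / 360.0 * n)   # longitude -> tile column
    lat_rad = math.radians(lat_deg)
    # Web Mercator projection: latitude -> tile row
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Hypothetical URL template for illustration only; the real tile endpoint
# and its terms of use must be checked against Google's documentation.
def tile_url(x, y, zoom):
    return f"https://tile-server.example/vt?x={x}&y={y}&z={zoom}"
```

At zoom level z the world is divided into 2^z × 2^z tiles, which is why larger zoom coefficients in Fig. 2 yield finer tiles over the same area.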
B. Description of Applied Architectures
In this section, we summarize the encoder-decoder architecture for semantic segmentation, with EfficientNet-B0 as the encoder and U-Net as the decoder.

1) Encoder-Decoder Architecture: The encoder-decoder architecture includes a CNN that extracts the features from the input image, typically a modern neural network such as ResNet [16] or VGG [17]. However, these network models reduce the width and height of the input image to obtain the final feature map, and it is challenging to rebuild the segmentation map to the size of the
Fig. 3: A simple encoder-decoder network for semantic segmentation.
2021 8th NAFOSTED Conference on Information and Computer Science (NICS)
Fig. 4: Architecture of the efficient U-Net.
original image. The decoder section contains a set of layers that upsamples the feature map of the encoder to restore spatial information. A simple encoder-decoder network for semantic segmentation is shown in Fig. 3.
2) Feature Extraction: Convolutional neural networks are usually developed under a fixed resource budget and then scaled up to improve model performance. Depth scaling is the most common way to capture more complex features [20]. However, arbitrarily increasing the depth makes training more difficult and may not increase model performance, or may even decrease it [21]; the same holds for width and resolution. Tan et al. [20] proposed a new scaling method that uniformly scales all dimensions of depth, width, and resolution. They used a neural architecture search to design a new baseline network and scaled it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. The family includes models from B0 to B7, each with different equalization ratios and numbers of parameters.
The basic building block of EfficientNet is the Mobile Inverted Bottleneck Convolution (MBConv) [22], shown in Fig. 5. The architecture is divided into seven blocks
Fig. 5: Architecture of EfficientNet-B0 with MBConv as basic building blocks.
based on filter size, stride, and number of channels. Different EfficientNet models have different numbers of MBConv blocks. From EfficientNet-B0 to EfficientNet-B7, increasing the depth, width, resolution, and model dimension leads to an increase in the number of parameters, which makes the model stronger and gradually improves accuracy [20]. However, due to limited tool support, as well as the cost of computing a large number of parameters, which takes considerable work and processing time, our research restricts the encoder test to the EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2 architectures.
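The compound scaling idea behind this family can be sketched as follows (a sketch, not code from the paper): Tan and Le report base coefficients α = 1.2, β = 1.1, γ = 1.15 for depth, width, and resolution, chosen so that α·β²·γ² ≈ 2, i.e. each unit increase of the compound coefficient φ roughly doubles FLOPS.

```python
# Compound scaling sketch: a single coefficient phi scales network depth,
# width, and input resolution together. ALPHA, BETA, GAMMA are the base
# coefficients reported by Tan and Le for the EfficientNet family.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    depth_mult = ALPHA ** phi    # multiplier on the number of layers
    width_mult = BETA ** phi     # multiplier on the number of channels
    res_mult = GAMMA ** phi      # multiplier on the input resolution
    return depth_mult, width_mult, res_mult
```

EfficientNet-B0 corresponds to φ = 0 (all multipliers equal to 1); larger φ values yield the B1-B7 variants with their growing parameter counts.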
3) Network Architecture: U-Net is one of the most powerful integrated network architectures for fast and precise image segmentation, first published in 2015 for biomedical image segmentation [23]. It consists of an encoder and a decoder that together make the 'U' shape. The encoder, or contraction path, is a typical convolutional network that has convolution, activation, and pooling layers to capture the features of the input image. During encoding, spatial dimension information (height and width) is decreased while feature information is increased. The decoder, or expansion path, combines the features and spatial information through a series of convolution structures and joins the high-resolution features from the contracting path.

In the original U-Net, the expansion path is almost symmetrical with the contracting path [23]. In our research, we propose to use EfficientNet as the encoder instead of a set of conventional convolution layers; the decoder module is similar to the original U-Net. Details of the proposed architecture are illustrated in Fig. 4. The input image size is 1024x1024; the detailed architecture of the blocks in the encoder can be found in Fig. 5. First, we bilinearly upsample the feature map of the last logits in the encoder by a factor of two, then append the feature map from the encoder with the same spatial resolution. This is followed by 3 × 3 convolution layers before the result is again upsampled by a factor of two. This process is
repeated until a segmentation map of the same size as the original input image is recovered. The proposed architecture is asymmetric, unlike the original U-Net: the contracting path is deeper than the expansion path. Putting a powerful CNN like EfficientNet as the encoder improves the overall performance of the algorithm [19].
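One decoder step described above (upsample by two, concatenate the encoder skip feature of matching resolution) can be sketched shape-wise in NumPy. Nearest-neighbour upsampling stands in for the bilinear interpolation of the actual model, and the subsequent 3 × 3 convolutions are elided, to keep the sketch dependency-free.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling over (channels, height, width);
    # the actual decoder uses bilinear interpolation here.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(deep_feat, skip_feat):
    """Upsample the deeper feature map and append the encoder feature
    map with the same spatial resolution (channel-wise concatenation).
    A 3x3 convolution block would follow in the real decoder."""
    up = upsample2x(deep_feat)
    assert up.shape[1:] == skip_feat.shape[1:], "skip must match upsampled size"
    return np.concatenate([up, skip_feat], axis=0)
```

Repeating this step through the expansion path restores the segmentation map to the spatial size of the 1024x1024 input.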
C. Loss Functions

Loss functions play an essential role in determining model performance, and different loss functions can be used under various circumstances [13]. In this study, we select three loss functions suitable for the model:
1) Dice Loss: a measure of the overlap between corresponding pixel values of the prediction and the ground truth, widely used to assess segmentation performance [13]. The Dice Loss is defined as:

L_{DL}(y, \hat{y}) = 1 - \frac{2\sum_{i=1}^{n} y_i \hat{y}_i + 1}{\sum_{i=1}^{n} y_i + \sum_{i=1}^{n} \hat{y}_i + 1}. \quad (1)

Here \hat{y} is the predicted set of pixels and y is the ground truth. A 1 is added to the numerator and denominator to ensure that the function is not undefined in edge-case scenarios such as y = \hat{y} = 0 [13].
2) Categorical Cross-Entropy: a measure of the difference between two probability distributions for a given random variable or set of events, widely used for classification, especially pixel-level classification [13]:

L_{CCE}(y, \hat{y}) = -\sum_{i=1}^{n} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic}), \quad (2)

where C is the number of classes, y_{ic} is 1 if and only if sample i belongs to class c, and \hat{y}_{ic} is the output probability that sample i belongs to class c.
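The losses of Eqs. (1) and (2), together with their equally weighted average, can be sketched in NumPy as follows; `y` is a one-hot ground-truth array, `y_hat` the predicted probabilities, and the small `eps` (our addition, not part of the paper's formulas) guards the logarithm numerically.

```python
import numpy as np

def dice_loss(y, y_hat):
    # Eq. (1): the +1 in numerator and denominator keeps the loss
    # defined even when prediction and ground truth are both empty.
    intersection = np.sum(y * y_hat)
    return 1.0 - (2.0 * intersection + 1.0) / (np.sum(y) + np.sum(y_hat) + 1.0)

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # Eq. (2): y is one-hot of shape (n, C); y_hat holds class probabilities.
    return -np.sum(y * np.log(y_hat + eps))

def average_loss(y, y_hat):
    # Equally weighted combination of the two losses above.
    return 0.5 * (dice_loss(y, y_hat) + categorical_cross_entropy(y, y_hat))
```

A perfect prediction drives all three losses to (numerically) zero, which is the sanity check one would apply before training.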
3) Average Loss: the equally weighted combination of the two loss functions above:

L = \frac{1}{2}(L_{DL} + L_{CCE}). \quad (3)

III. EXPERIMENTAL RESULTS
We tested the original U-Net decoder with different backbones used for the encoder: VGG11 [17], ResNet18 [16], EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2. The results are shown in Table I. We use the loss functions outlined above to evaluate the models. It can easily be observed that EfficientNet-B0 gives the best results: 1.110 categorical cross-entropy loss, 0.731 Dice loss, and 0.997 average loss. At the same time, EfficientNet-B0 has only about 4M parameters, much fewer than the other models, which makes the computation simpler and minimizes effort and processing time.
To test the efficient U-Net B0 network model, we use 1,317 images for training and 304 images for validation. The data annotation tool we use is CVAT, which is provided by the OpenVINO Toolkit. In the training process, we set the learning rate to 0.0001, as shown in Fig. 6. The test result is shown in Fig. 7.
Fig. 6: Test graph of the efficient U-Net B0 network model.

TABLE I: Results for comparison of various encoder architectures with loss functions.
U-Net backbone | Total params | Categorical cross-entropy loss | Dice loss | Average loss
IV. CONCLUSION

Developing a semantic segmentation architecture to analyze the geographic structures in satellite imagery is a very challenging but meaningful task for real-world applications. This paper has conducted the segmentation of satellite images with 12 classes. In our research, we have considered a segmentation method, the efficient U-Net architecture, which makes use of the efficiency of EfficientNet as an encoder to extract features, with U-Net as a decoder to rebuild detailed feature maps. Although it has fewer parameters than other structures, EfficientNet-B0 still gives very positive results in the results table.
ACKNOWLEDGEMENTS

This research is funded by Ho Chi Minh City University of Technology - VNU-HCM under grant number T-ÐÐT-2020-45. We acknowledge the support of time and facilities from Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for this study.
REFERENCES
[1] I. Demir et al., "DeepGlobe 2018: A challenge to parse the earth through satellite images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), May 2018, pp. 172–209.
[2] M. Längkvist, A. Kiselev, M. Alirezaie, and A. Loutfi, "Classification and segmentation of satellite orthoimagery using convolutional neural networks," Remote Sensing, vol. 8, no. 4, p. 329, Apr. 2016.
Fig. 7: Results of semantic segmentation on the Google dataset with the proposed architecture. The first column shows the input images depicting different scenarios from an unstructured environment. The second and third columns show the ground truth and predicted segmentation maps, respectively, where different colors signify different classes.
[3] T. Sun, Z. Chen, W. Yang, and Y. Wang, "Stacked U-Nets with multi-output for road extraction," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 187–1874.
[4] M. Wu, C. Zhang, J. Liu, L. Zhou, and X. Li, "Towards accurate high resolution satellite image semantic segmentation," IEEE Access, vol. 7, pp. 55609–55619, 2019.
[5] K. Lim, D. Jin, and C. Kim, "Change detection in high resolution satellite images using an ensemble of convolutional neural networks," in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 509–515.
[6] T. Kuo, K. Tseng, J. Yan, Y. Liu, and Y. F. Wang, "Deep aggregation net for land cover classification," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 247–2474.
[7] S. Aich, W. van der Kamp, and I. Stavness, "Semantic binary segmentation using convolutional networks without decoders," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
[8] N. Audebert, B. Le Saux, and S. Lefèvre, "Joint learning from earth observation and OpenStreetMap data to get faster better semantic maps," in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[9] P. Kaiser, J. D. Wegner, A. Lucchi, M. Jaggi, T. Hofmann, and K. Schindler, "Learning aerial image segmentation from online maps," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 11, pp. 6054–6068, 2017.
[10] J.-F. Girres and G. Touya, "Quality assessment of the French OpenStreetMap dataset," Trans. GIS, vol. 14, no. 4, pp. 435–459, 2010.
[11] N. Baghdadi, C. Mallet, and M. Zribi, QGIS and Generic Tools. London, England: ISTE, 2018.
[12] K. Zhao, J. Kang, J. Jung, and G. Sohn, "Building extraction from satellite images using mask R-CNN with building boundary regularization," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
[13] S. Jadon, "A survey of loss functions for semantic segmentation," in 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2020.
[14] S. W. Chang and S. W. Liao, "KUnet: Microscopy image segmentation with deep UNet based convolutional networks," in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), 2019.
[15] Y. Weng, T. Zhou, Y. Li, and X. Qiu, "NAS-Unet: Neural architecture search for medical image segmentation," IEEE Access, vol. 7, pp. 44247–44257, 2019.
[16] Z. Chu, T. Tian, R. Feng, and L. Wang, "Sea-land segmentation with Res-UNet and fully connected CRF," in IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, 2019.
[17] V. Iglovikov and A. Shvets, "TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation," arXiv [cs.CV], 2018.
[18] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv [cs.LG], 2019.
[19] B. Baheti, S. Innani, S. Gajre, and S. Talbar, "Eff-UNet: A novel architecture for semantic segmentation in unstructured environment," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[21] S. Zagoruyko and N. Komodakis, "Wide residual networks," in Proceedings of the British Machine Vision Conference 2016, 2016.
[22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[23] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015, pp. 234–241.