An Application Improving the Accuracy of Image Classification

Pham Tuan Dat
Faculty of Information Technology, Vietnam Maritime University
Hai Phong, Vietnam
datpt@vimaru.edu.vn

Nguyen Kim Anh
Faculty of Information Technology, Vietnam Maritime University
Hai Phong, Vietnam
anhnk@vimaru.edu.vn
Abstract—There have been various research approaches to the problem of image classification so far. For image data containing kinds of objects in the wild, many machine learning algorithms give unreliable results. Meanwhile, deep learning networks are appropriate for big data, and they can deal with the problem effectively. Therefore, this paper aims to build an application combining a ResNet model and image manipulation to improve the accuracy of classification. The classifier performs the training phases on CIFAR-10 in a feasible time. In addition, it achieves around 93% accuracy on the test data. This result is better than that of some recently published studies.
Keywords—augmentation, cutmix, normalization
I. INTRODUCTION

Social networks have stored and managed a massive volume of information on the Internet. To meet the needs of users, social networks have to build useful applications. From a given keyword, the search services need to find relevant information on the same subject exactly and fast. Obviously, relevant information does not just contain text but also includes images. A challenge for such applications is that they must develop an effective mechanism that can classify patterns into the same subject if they represent the same kind of object.
In fact, the problem of image classification is not a new issue. Machine learning algorithms have long been applied to cope with this problem. For instance, K-Nearest Neighbor and Support Vector Machines solve the problem of handwritten digit classification on MNIST very well [12]. But many conventional algorithms only achieve poor performance on data sets such as CIFAR-10 and CIFAR-100, which contain various kinds of objects in the wild [12,13].
In recent years, deep learning networks have overcome the weaknesses of machine learning algorithms. Deep learning networks can be trained on big data and obtain optimal training results. A problem of deep learning networks is that when the number of layers increases, they generate more training errors, which makes the accuracy saturate. ResNet is a typical deep learning network; its key point is the residual block, which may cope with the degradation problem [1,2]. Residual blocks reduce the above drawback and allow ResNet to achieve impressive accuracy even when more layers are added.
On the other hand, the effectiveness of a classifier does not just depend on the network architecture but also comes from the data. The lack of data diversity makes deep learning networks work inefficiently. By modifying patterns of the training data, augmented images represent a more comprehensive data set [6]. Consequently, image augmentation minimizes the difference between patterns in the training data and those in the validation data, as well as the test data.
Therefore, the objective of this paper is to propose an application combining a ResNet model and image manipulation to improve the accuracy of classification on CIFAR-10. The estimated accuracy of the classifier is around 93% on the test set, and this result is better than that of the CNN and Attentive CutMix ResNet-34.
II. THEORETICAL BACKGROUND
A. Image Augmentation
In some cases, deep learning networks may give very high accuracy on the training data but achieve unreliable results on the test data. Image augmentation is a solution to this situation. It generates new data from the original data, while new patterns still keep the original nature of the patterns. On the basis of data diversity, deep learning networks decrease validation errors and increase test accuracy.
There are two practical approaches to image augmentation: image manipulation and deep learning. However, the experiments in this paper and the published studies [7,8,9] apply image manipulation to the problem of image classification. Thus, the paper only presents an overview of image manipulation.
Image manipulation needs a small amount of memory to transform and store data. It takes a lower computational cost than the deep learning approach. Generally, image manipulation [6] includes geometric transformations, color jitter, mixing images, and several other techniques.
Typical geometric transformations are shifting, flipping, cropping, and rotation. When images are taken in the wild, they do not just contain the informative regions of objects, so a classifier sometimes predicts the labels of patterns incorrectly. Cropping can reduce the possibility of misclassification for such images. The use of geometric transformations does not guarantee effectiveness for every data set. For a data set including patterns of letters and digits, rotation or flipping changes the shapes of patterns, so the labels of patterns are incorrectly classified. Nevertheless, for images of objects in the wild, rotation or flipping does not lose the labels of patterns. In Fig. 1, observers can see the same kind of object in the images after a series of transformations.
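As an illustration only, the following sketch applies a few such geometric transformations with the torchvision library; the chosen transforms, parameters, and the input file name are assumptions for demonstration, not settings taken from the paper.

    # Minimal illustration of geometric transformations (assumed parameters).
    from PIL import Image
    from torchvision import transforms

    geometric = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),   # random left-right flip
        transforms.RandomRotation(degrees=10),    # small random rotation
        transforms.RandomCrop(32, padding=4),     # pad the borders, then crop
    ])

    img = Image.open("example.png")               # hypothetical 32x32 input image
    augmented = geometric(img)                    # a new training instance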
Color jitter is another technique of image manipulation. For the problem of letter classification, images of letters are relatively simple, and they are usually converted into binary images, so color jitter is not really necessary. By contrast, images of objects in the wild are much more sophisticated, and the poor quality of images reduces the effectiveness of classification. In this case, color jitter may bring noticeable effects for data augmentation. Color jitter consists of brightness change and hue and saturation adjustment. Brightness change makes dark images brighter. Over-saturated images look artificial, whereas many actual images often give impure colors. Hence, the brightness, saturation, and hue of such images need to be adjusted.

Fig. 1. A series of translations and rotations for a pattern.
Mixing images has been seen as a potential technique for data augmentation. It combines patterns into new training instances. CutMix [7] is a typical example of this technique. For each pair of images, it replaces a removed region on the first image with a patch from the second image. The ground truth labels are mixed proportionally to the area of the patches. New training instances of CutMix do not lose their nature, compared with a few regional dropout strategies [10,11]. But CutMix is unable to capture the most informative regions of images. Attentive CutMix [8] adjusts the strategy of CutMix: it takes a 7×7 grid map from the first image and picks the top N (the optimal value is in the range of 1 to 15) attentive patches. These patches are pasted onto the second image at their respective original locations (the images have the same size).
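For concreteness, a minimal sketch of the basic CutMix operation on a batch of PyTorch tensors is given below. It follows the uniform box sampling of [7] and mixes the labels in proportion to the pasted area; the function name and the Beta parameter are illustrative assumptions.

    import numpy as np
    import torch

    def cutmix_batch(images, labels, alpha=1.0):
        # Sketch of CutMix [7]: paste a random patch from a shuffled copy of the
        # batch and mix the labels in proportion to the area of the patch.
        lam = np.random.beta(alpha, alpha)            # initial mixing ratio
        perm = torch.randperm(images.size(0))         # pair each image with another
        _, _, h, w = images.shape

        # Sample a box whose area is roughly (1 - lam) of the image.
        cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
        cy, cx = np.random.randint(h), np.random.randint(w)
        y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
        x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)

        images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
        lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)     # exact replaced-area ratio
        return images, labels, labels[perm], lam

The training loss is then computed as lam times the cross-entropy on the original labels plus (1 - lam) times the cross-entropy on the permuted labels.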
B. Batch Normalization
Training neural networks might become ineffective if they encounter high learning rates or too small weights [14,16] when carrying out back-propagation. This destroys the learning ability and does not enhance the performance of networks. An ordinary solution to the problem of vanishing gradients is using ReLU and choosing small learning rates, but this way is not good enough. Batch Normalization (BN) [5] is a better alternative to this approach: it normalizes the input data and speeds up the convergence of learning networks. In fact, BN stabilizes the growth of parameters during training phases, so networks are able to work with a broader range of learning rates without the risk of divergence.
There are opposing viewpoints about the link between BN and internal covariate shift (ICS) [17], or the link between BN and the exploding gradient problem [14]. One viewpoint indicates that the use of BN improves the accuracy of networks, but it does not decrease ICS in several test cases. Another opinion shows that adding BN layers may exacerbate the problem of exploding gradients [14]. Nonetheless, the experiments in [17] do not deny a clear improvement in terms of gradient change and loss variation for VGG networks. Furthermore, BN allows a VGG network (with different learning rates) to achieve acceptable results on the test data.
In practice, BN does not normalize the entire training set at a time. Instead, it splits the training set into mini-batches. Next, BN calculates the mean and the variance over each mini-batch, as described in (1) and (2). Afterward, BN normalizes each activation as in (3), and each normalized activation becomes the input of the scale-and-shift transformation in (4):

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (1)

\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \qquad (2)

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad (3)

y_i = \gamma \hat{x}_i + \beta \qquad (4)

For deep learning networks such as CNN [3], BN operates as a layer, which usually goes with ReLU functions and convolutional layers. In learning networks, one convolutional layer can receive BN(x) as its input data instead of x.
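A minimal sketch of the training-time transform in (1)-(4), written directly with PyTorch tensor operations, is shown below; in a real network one would normally use torch.nn.BatchNorm2d, which also tracks running statistics for inference.

    import torch

    def batch_norm_train(x, gamma, beta, eps=1e-5):
        # x: mini-batch of shape (m, num_features); gamma, beta: learnable vectors.
        mu = x.mean(dim=0)                          # (1) mini-batch mean
        var = x.var(dim=0, unbiased=False)          # (2) mini-batch variance
        x_hat = (x - mu) / torch.sqrt(var + eps)    # (3) normalization
        return gamma * x_hat + beta                 # (4) scale and shift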
C. The Overview of ResNet
As mentioned above, the problem of vanishing gradients in learning networks can be addressed by a solution such as BN. However, there are still difficulties in optimizing deep learning networks. The degradation problem is exposed when the depth of networks increases: networks generate more training errors and the accuracy gets saturated. In this situation, over-fitting [15] is not the reason.
ResNet is a deep learning network overcoming the degradation problem. It shares the idea of LSTM and the components of CNN. Nevertheless, it does not have gates controlling the data flow in its units. ResNet builds residual blocks in which the activation of any deeper block is the sum of the activation of a shallower block and a residual function. Kaiming He and his colleagues investigate the benefits of identity shortcuts [1,2], which make ResNet achieve higher accuracy. ResNet consists of residual blocks, and each one has the overall structure illustrated in Fig. 2. In one residual block, ReLU activations and weight layers are placed alternately. To accelerate the convergence of ResNet, batch normalization may be inserted into each block. Moreover, ResNet also includes pooling layers.
Let x_l and f(F(x_l) + h(x_l)) denote the input of the l-th residual block and the output of this block, respectively. F is defined as a residual function, which includes two or three convolutional layers; if F contains only one layer, it brings fewer advantages. The identity mapping is h(x_l) = x_l, and f is a ReLU function.
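With this notation, a residual block with two 3×3 convolutional layers, BN, an identity shortcut h(x) = x, and f = ReLU can be sketched in PyTorch as follows; the exact layer ordering and filter counts of the blocks in Fig. 2 and Fig. 3b are not fully specified in the paper, so this is an approximation rather than the authors' implementation.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # Computes y = f(F(x) + h(x)) with h(x) = x and f = ReLU.
        def __init__(self, channels):
            super().__init__()
            self.residual = nn.Sequential(          # F: two conv layers with BN
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)       # f, applied after the addition

        def forward(self, x):
            return self.relu(self.residual(x) + x)  # identity shortcut h(x) = x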
From these hypotheses, the authors indicate that the output of the L-th unit is the summation of the output of the l-th unit (l from 0 to L−1) and the outputs of all intermediate residual functions. In an extremely deep learning network, when the identity mapping in the l-th layer is replaced with h(x_l) = λ_l x_l, the authors obtain the following equation:
x_L = \left(\prod_{i=l}^{L-1} \lambda_i\right) x_l + \sum_{i=l}^{L-1} \hat{F}(x_i, W_i) \qquad (5)

The factor \prod_{i=l}^{L-1} \lambda_i in (5) can be exponentially large if all λ_i > 1, and it can be exponentially small if all λ_i < 1 (i from l to L−1).
(i from l to L-1) This result will cause exploding or vanishing
Trang 3Fig 2 A residual block
λi = 1, the gradient will not vanish in each layer when the
weights are arbitrarily small [2]
Other shortcut techniques do not perform better than identity shortcuts. For example, the use of exclusive gating generates more test errors than identity shortcuts in ResNet-110. The authors also investigate 1×1 convolutional shortcuts, which give poor performance for ResNet-110.
Identity shortcuts take no extra parameters and do not increase the computational complexity too much. ResNets are able to be trained by common optimization algorithms, and they are easy to implement with basic libraries without much modification.
III. EXPERIMENT AND COMPARISON
A. The Application and Network Model
To build the ResNet and the experimental application, this paper uses the Python language and the necessary libraries, such as PyTorch and Keras. As shown in Fig. 3a, the application consists of the data augmentation, training, and classification functions. Before performing the training phase, the patterns in the training set are augmented to minimize the difference between the patterns of the training and validation data. After finishing the training phase, the classifier can predict the output for the test set.
The function of image manipulation combines transformations including horizontal flipping, random cropping, random rotation, and color jitter. Geometric transformations change the directions and shapes of patterns, while color jitter is used to adjust the brightness, saturation, and hue of patterns. Patterns are randomly rotated by small angles in the range of −5° to 5°. The application uses the vision library of PyTorch to implement image manipulations on the training data, and this function takes a short time to finish the task.
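A plausible torchvision composition of this augmentation function is sketched below; only the transform types and the ±5° rotation range come from the paper, while the flip probability, crop padding, and jitter strengths are assumed values.

    from torchvision import transforms

    # Training-time image manipulation: flip, crop, small rotation, color jitter.
    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomCrop(32, padding=4),
        transforms.RandomRotation(degrees=5),       # rotate between -5 and 5 degrees
        transforms.ColorJitter(brightness=0.2, saturation=0.2, hue=0.05),
        transforms.ToTensor(),
    ])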
Fig. 3b represents the ResNet containing six convolutional blocks and three residual blocks; it can be seen as an abridged version of ResNet-18. Although this ResNet has a smaller number of residual blocks, the two models have an insignificant difference in the number of filters in each convolutional layer. Besides, when training on CIFAR-10, this ResNet takes less time than ResNet-18.
In the ResNet, each convolutional block includes one convolutional layer, while each residual block includes two convolutional layers and an identity shortcut. Every block has at least a BN layer and a ReLU activation, but only several blocks have pooling layers. In each block, convolutional layers and BN layers are placed alternately. The 3×3 convolutional layers have from 64 to 512 filters. The last layer of the model acts as one fully connected layer, which converts the data of the previous layers into one-dimensional data. From that, the classifier estimates the output labels.
Fig. 3. (a) The functions of the application; (b) the ResNet model.

Like other learning networks, the ResNet needs to integrate with an optimizer, which allows the training process to decrease the number of training errors and validation errors. This leads to an increase in accuracy on the test set. In this model, the application chooses Adam [4].
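Under the assumption of a standard PyTorch training loop, integrating the model with the Adam optimizer might look like the following sketch; the learning rate, device, and loss function are placeholder choices rather than the paper's exact settings.

    import torch
    import torch.nn as nn

    def train(model, train_loader, val_loader, epochs=30, lr=1e-3, device="cuda"):
        # Hypothetical training loop: Adam [4] with cross-entropy loss.
        model = model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)

        for epoch in range(epochs):
            model.train()
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()

            # Validation accuracy, used for tuning hyper-parameters.
            model.eval()
            correct = total = 0
            with torch.no_grad():
                for images, labels in val_loader:
                    images, labels = images.to(device), labels.to(device)
                    correct += (model(images).argmax(dim=1) == labels).sum().item()
                    total += labels.size(0)
            print(f"epoch {epoch + 1}: validation accuracy {correct / total:.3f}")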
B. The Experiments and Comparison
The application experiments with the ResNet classifier on CIFAR-10. This data set contains 60000 samples, which are divided into three sets (the training, validation, and test data) in the ratio of 4:1:1. The validation set is used for tuning hyper-parameters in training phases and making the performance on the test set better. The effectiveness of the classifier is evaluated by the loss and the accuracy on both the training set and the test set.
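One way to obtain this 4:1:1 split (40000/10000/10000 samples) is sketched below; pooling the official CIFAR-10 train and test parts before re-splitting, as well as the random seed, are assumptions about how the data may have been partitioned.

    import torch
    from torch.utils.data import ConcatDataset, random_split
    from torchvision import datasets, transforms

    # Pool all 60000 CIFAR-10 samples, then split them in the ratio 4:1:1.
    to_tensor = transforms.ToTensor()
    full = ConcatDataset([
        datasets.CIFAR10("data", train=True, download=True, transform=to_tensor),
        datasets.CIFAR10("data", train=False, download=True, transform=to_tensor),
    ])
    train_set, val_set, test_set = random_split(
        full, [40000, 10000, 10000], generator=torch.Generator().manual_seed(0))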
The position of the BN layers in the blocks makes slightly different outcomes on the validation set: if the BN layers are executed first in the blocks, the accuracy of the ResNet increases stably during the training phase; if the BN layers are executed after the convolutional layers, the ResNet produces fluctuating accuracy in the middle epochs, as illustrated in Fig. 5. Nonetheless, the first choice does not give better overall results on the validation data, and the accuracy on the test data also decreases by a slight amount.
As shown in Fig. 4, the training error reduces quickly, so after finishing the phase, the training loss is approximately 6%. In other words, the ResNet gives very high classification accuracy on the training set (over 98%). This result does not reflect the benefits of image manipulation, because the non-augmentation ResNet also gives an extremely low loss (below 0.5%). In Table I, the results show a small increase in the accuracy of the ResNet on the validation data. Furthermore, the loss of the ResNet is much lower than that of the non-augmentation ResNet on the validation set (0.26 compared with 0.45), and the accuracy of the ResNet increases by 3.4% on the test data (from 89.6% to 93.0%).

Fig. 4. The loss of the ResNet during a training phase.

Fig. 5. The accuracy of the ResNet on the validation set.

Fig. 6. The confusion matrix of the ResNet on the test data.
According to the confusion matrix in Fig. 6, the misclassification rates of the ten classes are quite low. Generally, the maximal correct classification rate belongs to the automobile class (over 0.96). In contrast, the minimal correct classification rate belongs to the cat class (over 0.84), because the ResNet confuses many cat objects with dog objects.
To make the comparison fair, the application compares the ResNet with the CNN (with unchanged proportions for the three sets of CIFAR-10). Both classifiers take the augmented images as the training data. The CNN has 8 convolutional layers, 4 max-pooling layers, 1 fully connected layer, and some BN layers. The 3×3 convolutional layers of this network also have from 64 to 512 filters.

TABLE I. IMAGE MANIPULATION IMPROVES THE ACCURACY

TABLE II. COMPARING THE RESNET WITH THE CNN

TABLE III. APPLYING MIXING IMAGES TO RESNET-34
Method              Accuracy on test data
Attentive CutMix    0.9040
CutMix              0.8875
In the experiment, the ResNet outperforms the CNN on both the training and test data, as shown in Table II. Although the CNN classifier converges quickly in the first half of the training phase, its accuracy on the validation data gets saturated in the last epochs. Finally, it achieves 87% accuracy on the test data. Meanwhile, after 30 epochs, the ResNet gets around 93% accuracy.
Applying mixing images to ResNet improves the accuracy of classification on CIFAR-10. From the reports in a recent study [8], the method of Mixup has the least effective performance, but it still gains a 1.58% accuracy improvement over the baseline method. Attentive CutMix is able to capture the most informative regions of images, so its accuracy improvement exceeds that of CutMix (3.28% compared with 1.63%). Consequently, CutMix ResNet-34 only gains 88.75% accuracy, while Attentive CutMix ResNet-34 gains 90.40% accuracy (Table III). However, mixing images does not bring more advantages than geometric transformations and color jitter.
IV. CONCLUSION

This paper aims at building an application combining a ResNet model and image manipulation to improve the accuracy of classification on CIFAR-10. The experiments in the paper and the reports from recently published studies show that the use of geometric transformations and color jitter is a suitable alternative to mixing images. The ResNet achieves high accuracy of image classification, with around 93% on the test data. The classifier obtains an accuracy increase of 3.4% over the non-augmentation ResNet. Additionally, this growth outweighs that of Attentive CutMix ResNet-34.
REFERENCES

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition", Conference on Computer Vision and Pattern Recognition, June 2016.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Identity Mappings in Deep Residual Networks", European Conference on Computer Vision, September 2016.
[3] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Li Wang, Gang Wang, Jianfei Cai, and Tsuhan Chen, "Recent Advances in Convolutional Neural Networks", Elsevier, October 2017.
[4] Diederik P. Kingma and Jimmy Lei Ba, "Adam: A Method for Stochastic Optimization", ICLR, 2015.
[5] Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", vol. 37, Proceedings of the 32nd International Conference on Machine Learning, July 2015.
[6] Connor Shorten and Taghi M. Khoshgoftaar, "A Survey on Image Data Augmentation for Deep Learning", Journal of Big Data, 2019.
[7] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo, "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features", International Conference on Computer Vision, August 2019.
[8] Devesh Walawalkar, Zhiqiang Shen, Zechun Liu, and Marios Savvides, "Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification", International Conference on Acoustics, Speech and Signal Processing, May 2020.
[9] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz, "mixup: Beyond Empirical Risk Minimization", ICLR, April 2018.
[10] Terrance DeVries and Graham W. Taylor, "Improved Regularization of Convolutional Neural Networks with Cutout", arxiv.org/abs/1708.04552, November 2017.
[11] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang, "Random Erasing Data Augmentation", arxiv.org/abs/1708.04896, November 2017.
[12] Sonika Dahiya, Rohit Tyagi, and Nishchal Gaba, "Comparison of ML", https://easychair.org/publications/preprint_open/KnC4, July 2020.
[13] Karttikeya Mangalam and Vinay Prabhu, "Do Deep Neural Networks Learn Shallow Learnable Examples First?", Proceedings of the Workshop on Identifying and Understanding Deep Learning Phenomena at the 36th International Conference on Machine Learning, 2019.
[14] George Philipp, Dawn Song, and Jaime G. Carbonell, "Gradients Explode - Deep Networks are Shallow - ResNet Explained", International Conference on Learning Representations, 2018.
[15] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, 2014.
[16] Yoshua Bengio, Patrice Simard, and Paolo Frasconi, "Learning Long-Term Dependencies with Gradient Descent is Difficult", IEEE Transactions on Neural Networks, February 1994.
[17] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry, "How Does Batch Normalization Help Optimization?", 32nd Conference on Neural Information Processing Systems, 2018.