MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY OPEN UNIVERSITY
MASTER OF SCIENCE IN COMPUTER SCIENCE
A computer vision-based method for breast cancer histopathological image classification
by deep learning approach
Student: Mai Bui Thuy Huynh
Thesis Advisor: Dr. Vinh Truong Hoang
Ho Chi Minh City, October 2019
Contents

Acknowledgment
Abstract
Notations
Abbreviations

1 Literature review of breast cancer histopathological image classification
  1.1 Introduction and general considerations
  1.2 Goals of the thesis
  1.3 Contribution of the thesis
  1.4 Structure of the thesis
  1.5 Methodology
2 Foundational theory
  2.1 Deep neural network
    2.1.1 Introduction to deep neural networks
    2.1.2 Techniques of neural network training
    2.1.3 Popular deep network models
  2.2 Generative Adversarial Networks (GAN)
    2.2.1 Introduction to GAN
    2.2.2 Popular GAN models
[...]
    3.2.2 BACH dataset
Acknowledgment
Abstract
The computer vision field has become more active in recent decades as scientists have found ways to apply mathematical and quantitative analysis. Various applications use computer vision techniques to improve their productivity, such as visual surveillance, robotics, autonomous vehicles, and especially medical image processing. The techniques became prominent when Geoffrey Hinton and Yann LeCun, both known as "Godfathers of deep learning", used neural networks and back-propagation for character and handwriting prediction, giving the best results compared to previous works.
In this thesis, we focus on detecting breast cancer with high accuracy in order to decrease the examination cost within an acceptable time. Therefore, we choose deep learning and evaluate our approach on three datasets: BreaKHis, BACH, and IDC. Due to some limitations of deep learning and of the dataset sizes, we propose a composition of popular techniques to boost classification efficiency: transfer learning, Generative Adversarial Networks (GAN), and neural networks. VGG16 and VGG19 are the base models used to extract a high-level feature space from patch-cropped images, termed multi deep features, before training by neural nets. So far, no works have leveraged the power of GAN to generate synthetic BreaKHis images; in this thesis, we use the Pix2Pix and StyleGAN models as generators. With the proposed approach, the cancer detection results achieve better performance than some existing works, with 98% accuracy for BreaKHis, 96% for BACH, and 86% for IDC.
Notations
Number of blocks of layers stacked together
The hypothesis space in traditional machine learning
Loss function for each hypothesis
Activation function in deep learning
Mapping function in deep learning
Input feature
Feature's weight
Output feature
Loss function in the GAN model
Discriminator model
Generator model
Noise input
Mean
Variance
Abbreviations

GLOBOCAN: Global Cancer Incidence, Mortality and Prevalence
CBE: Clinical Breast Exam
CLBP: Completed Local Binary Pattern
LPQ: Local Phase Quantization
GLCM: Gray Level Co-Occurrence Matrices
PFTAS: Free Threshold Adjacency Statistics
ORB: Oriented FAST and Rotated BRIEF
k-NN: k-Nearest Neighbor
SVM: Support Vector Machines
RF: Random Forest
QDA: Quadratic Discriminant Analysis
GPU: Graphic Processing Unit
CNN: Convolutional Neural Network
CONV: Convolutional layer
FC: Fully connected layer
MAE: Manifold Preserving Autoencoder
DT: Decision Tree
LR: Logistic Regression
GAN: Generative Adversarial Network
MRI: Magnetic Resonance Image
SIFT: Scale Invariant Feature Transform
SURF: Speeded Up Robust Features
SGD: Stochastic Gradient Descent
Chapter 1
Literature review of breast cancer
histopathological image classification
Cancer is a public health problem in the world today. Among cancers, breast cancer is the most common invasive cancer in women, with a significant impact on 2.1 million people yearly. In 2018, the World Health Organization (WHO) estimated 627,000 deaths from breast cancer, about 15% of cancer deaths. According to the 2018 report of Global Cancer Incidence, Mortality and Prevalence (GLOBOCAN) [1], covering new cases and deaths for 36 cancer types from 185 countries across 5 continents, shown in Table 1.1, new breast cancer cases account for 11.6% of all sites, and breast cancer is the second leading cause of cancer death.
[...] with low per capita income of about $3,200/year and $20/year for voluntary medical expenses, the breast cancer incidence was 23/100,000 and showed a rising trend [2].
Early cancer detection offers many chances for treatment and increases the survival rate for patients. WHO notes effective diagnostic methods such as X-ray and the Clinical Breast Exam (CBE), but these require professional physicians or experts. Besides, the diagnostic result is not always 100% accurate, owing to factors such as subjective experiments, expertise, and emotional state.
In recent years, trends in image processing and machine learning have shown that physicians can employ this technology to make diagnoses from medical images. Medical image processing methods have been applied widely to cancer diagnosis [3] and other diseases [4] with high accuracy in a short time. Image diagnosis by machine learning is a cost-efficient method in regions of Vietnam where no professional medical teams are available.
For the most part, research has demonstrated improvements in breast cancer classification accuracy [5, 6, 7, 8, 9], but it has not reached a significantly high rate. A main reason is the limited training datasets, as collecting and annotating sufficient data by pathological experts is time-consuming and expensive.
Nowadays there are open-access breast cancer databases for research, such as BreaKHis, the ICIAR 2018 BACH Challenge, the Kaggle breast histopathology images, and the Tumor Proliferation Assessment Challenge 2016. But almost all works experimented on BreaKHis [8], built in collaboration with the P&D Laboratory - Pathological Anatomy and Cytopathology, Parana, Brazil, which means those results may not reach the same accuracy on a new dataset.
Deep learning is a branch of machine learning that represents data characteristics by layers, from simple symbols such as points and lines to complex, abstract structures such as polygons. In 1986, Rina Dechter first introduced the term deep learning to the machine learning community. In the 1970s, the multilayer perceptron algorithm simulated the capacity of the human brain to recognize and discriminate objects, with many applications in computer vision. In the late 1980s, Yann LeCun then achieved good results in handwritten digit classification using back-propagation in deep learning. Nowadays, deep learning has been developing quickly and widely, with applications in many fields.
Indeed, although BreaKHis is a breast cancer benchmark database, it is not as large as ImageNet, which was built in collaboration with Stanford University, Princeton University, Google, and A9 Research. ImageNet includes 14,197,122 images in roughly 20,000 categories and is used extensively in deep learning.

Recent machine learning research has achieved high-accuracy breast cancer classification with various supervised and unsupervised learning algorithms. For the literature review from 2016 to May 2019, this thesis studies three main techniques: handcrafted and/or deep features, transfer learning, and generative adversarial networks.
Handcrafted features or deep features: Spanhol et al. [10] and Badejo et al. [11] compared handcrafted feature extractors such as Local Binary Patterns (LBP), the LBP variant Completed Local Binary Pattern (CLBP), Local Phase Quantization (LPQ), Gray Level Co-Occurrence Matrices (GLCM), Free Threshold Adjacency Statistics (PFTAS), and Oriented FAST and Rotated BRIEF (ORB), with classifiers such as 1-Nearest Neighbor (1-NN), Support Vector Machines (SVM), and Random Forest (RF). To improve accuracy to a range of 98.5%-100%, Spanhol combined boosting over 1-NN, QDA, RF, and SVM, but the best results hold only for the 40x and 400x magnifications; the authors concluded that the PFTAS feature is suitable for medical images. With the development of Graphic Processing Units (GPU) for big-data processing, Spanhol et al. [5] proposed a deep learning algorithm, a convolutional neural network (CNN) of the form CONV-MaxPool-CONV-AveragePool-FC-FC, with 32x32 and 64x64 window patch sizes [...] explored joint color-texture information (RGB and HSV color spaces), with and without stain normalization, and various contemporary classifiers used in Spanhol's work [...] popular in general computer vision, but this descriptor has high dimensionality. Two years later, the authors presented different encoding methods to obtain more discriminative features, such as an intra-embedding algorithm and a Fisher Vector descriptor of size 2 x 512 x N extracted from VGG19 and a GMM model with N Gaussian components. Qi et al. [15] proposed entropy-based and confidence-boosting strategies as a deep active learning method for classification with small training datasets, which reduces annotation costs by up to 66.67% while keeping accuracy between 88.29% and 91.61%. Mukkamala et al. [16] built a deep learning technique based on principal component analysis for each channel of the LAB color space, with SVM, reaching accuracy from 85.85% to 96.12%. Kumar et al. [17] built a CNN model of the form 3CONV[5x5]-3CONV[3x3]-ReLU-Pool-FC to extract deep features from medical images. Gupta et al. [18] found that histopathological stain normalization before handcrafted feature extraction makes cancer classification more efficient than using grayscale images. Feng et al. [19] exploited unsupervised learning capacity using an autoencoder network, the manifold preserving autoencoder (MAE), to learn encoded features from the input and then decode the hidden representations to the output; with this new algorithm, Feng et al. achieved accuracy from 82% to 99.16%. Reza et al. [20] experimented with sampling techniques such as under-sampling, over-sampling, ADASYN, and SMOTE with a CNN
network and found that unbalanced data affects accuracy; deploying the over-sampling method on the unbalanced BreaKHis dataset gives better performance. Angara et al. [21] and Guillén-Rondon et al. [22] proposed CNN networks of the form 3-[Conv-ReLU-Pool]-2FC-Softmax. Alom et al. [6] combined the strengths of Inception, ResNet, and Recurrent Convolutional Neural Networks, reaching 95% and 97% classification accuracy with and without augmentation across the 4 magnification factors. The two core ideas of Zhang et al. [23] are to use the skip connections of ResNet to solve the optimization issues when the network becomes deeper, and CBAM to refine the ResNet features; this method gained its highest accuracy of 92.6% at 200x and its lowest of 88.9% at 400x. Sudharshan et al. [5] compared various Multiple Instance Learning (MIL) approaches and concluded that non-parametric MIL, which extracts the MIL feature space using a Parzen window technique and a k-NN classifier, achieves higher accuracy than MILCNN and Single Instance Learning; at 40x magnification its accuracy is 92.1%. Roy et al. [24] proposed a patch-based classifier using a CNN network consisting of 6CONV-5POOL-3FC; the authors experimented with this model on 512x512 patches, which contained more information at an efficient size, and gained 92% accuracy on ICIAR 2018. Alirezazadeh et al. [25] learned feature spaces from two different domains, in this case benign and malignant, using LBP, LPQ, and PFTAS, and then formed a projection matrix; this method gave better performance than using each separate LBP, PFTAS, LPQ, or CNN feature with a classifier, as in Spanhol's work. Fondón et al. [26] extracted 3 feature types: nuclei-based features obtained by transforming to the CMYK color space with K-means clustering; region-based vectors of pink/violet, pink/white, and white/violet; and texture features consisting of a first-order statistics vector, LBP, and a sparse texture descriptor. Fondón used 9 classifiers to detect cancer tumors on the dataset of the Bioimaging 2015 Grand Challenge.
Transfer learning technique: Weiss et al. [27] evaluated different feature extractors based on VGG, ResNet, and Xception when training on a limited number of samples and achieved state-of-the-art results on the BACH dataset; this method downsized the BACH images to 1024x768 in order to train the classification model. Vo et al. [7] applied augmentation methods such as rotating, cutting, and transforming images to increase the training data volume before extracting deep features from an Inception-ResNet-v2 model in order to avoid over-fitting. They trained the model with multi-scale input images of 600x600, 450x450, and 300x300 to extract local and global features; a Gradient Boosting Trees model was then trained to detect breast cancer, with a fusion model voting for the higher-accuracy classifier. The accuracy reached 93.8%-96.9% at low computation cost. Murtaza et al. [28] used AlexNet as a feature extractor in a hierarchical classification model combining 6 algorithms (kNN, SVM, NB, DT, LDA, and LR), and finally feature
reduction to increase the overall accuracy from 92.45% to 95.48%. Li et al. [29] deployed the transfer-learning Xception network to avoid model over-fitting; Li applied the ResNet technique to transfer prior knowledge to later layers in order to achieve accurate and precise classification. Cascianelli et al. [30] proposed new dimension reductions (Principal Component Analysis, Gaussian Random Projection, Correlation-based Feature Selection) applied after the pre-trained VGG-F, VGG-S, and VGG-VeryDeep networks on a limited dataset such as BreaKHis, to overcome over-fitting issues, with accuracy from 84% to 94.7%. Brancati et al. [31] chose a fine-tuning ResNet strategy with 3 different configurations of 34, 50, and 101 layers, then voted for the classification with the highest class probability among these configurations; this work achieves 97.2% accuracy for benign and malignant tumors on the BACH dataset. Awan et al. [32] used ResNet-50 to extract descriptors from overlapping patch-based images and then applied PCA dimension reduction. Shallu et al. [33] showed that transfer learning from VGG16, VGG19, or ResNet50 is better than fully-from-scratch training, because these networks provide discriminative features, with VGG16 being the better feature generator. Gandomkar et al. [34] used ResNet-152 to extract features from five overlapping patches in a stain-normalized image; the technique is applied at each magnification rate (40x, 100x, 200x, 400x) to detect malignant/benign tumors and cancer subtypes, with 97.66%-98.52% and 94.60%-95.40% accuracy respectively.
Generative Adversarial Network (GAN) technique: Shin et al. [35] used the image-to-image conditional GAN model (pix2pix) to generate synthetic data and discriminate the T1 brain tumor class on the ADNI dataset; the authors then applied this model to another dataset, BRATS, to classify T1. This GAN yielded 10% higher accuracy compared to training on the real image dataset. Iqbal et al. [36] proposed a new Generative Adversarial Network for Medical Imaging (MI-GAN) to generate synthetic retinal vessel images from the STARE and DRIVE datasets; this method generated precisely segmented images better than existing techniques, and the authors state that the synthetic images preserve the content and structure of the original images. Senaras et al. [37] employed a conditional Generative Adversarial Network (cGAN) to generate synthetic histopathological breast cancer images; the G model used a modified version of U-Net, and the D model used a CNN-based PatchGAN classifier. The authors' experiments showed that the synthetic images are indistinguishable from real ones: six readers (three pathologists and image analysts) tried to differentiate 15 real from 15 synthetic images, and the probability that the average reader would correctly classify an image as synthetic or real more than 50% of the time was only 44.7%. Mahapatra et al. [38] proposed the P-GANs network to generate a high-resolution image at defined scaling factors from a low-resolution image; this research suggests a multi-stage
network with a correction mechanism based on a triple loss function. The output from the previous stage serves as the baseline to improve the next stage's output; this technique helps recover the degradation of image quality at each stage. The final super-resolution image achieved accuracy close to the original magnetic resonance image (MRI) in landmark and pathology detection. Cai et al. [39] studied a cross-modal volume-to-volume translation technique, transferring from pancreas classification to the breast lesion segmentation domain across two different medical image types. Frid-Adar et al. [40] followed the DC-GAN and AC-GAN networks to synthesize high-quality liver lesion ROIs and then used a CNN of the form CONV-SUBSAMPLING-CONV-SUBSAMPLING-CONV-SUBSAMPLING-FC-DROPOUT to classify liver lesions. Wu et al. [41] proposed a conditional infilling GAN (ciGAN) to generate fully contextual in-filled images of breast lesions; this work observed that a ResNet-50 classifier trained with the GAN-augmented dataset produced a higher AUROC than traditional augmentation with the same classifier.
Both handcrafted and deep features demonstrate good cancer detection capability. Various studies combine numerous color features and local texture descriptors to improve performance [42, 43]. Modak et al. [43] performed a comparative analysis of several multi-biometric fusions at different levels: feature level (mostly feature concatenation), score level, and rules/algorithms level. The authors showed statistically that the fusion approach has many advantages over a single mode, such as accuracy improvement, reduction of noisy data and spoof attacks, and more convenience. The authors of [42] exploited powerful transfer-learning techniques from popular models such as AlexNet, VGGNet-16, VGGNet-19, GoogLeNet, and ResNet to design a fusion schema at the feature level for satellite image classification. They report that fusing many ConvNet layers is better than features extracted from a single layer. Features extracted from a CNN are less affected by varying conditions such as the edge of view or color space; they are invariant features with better generalization. However, data augmentation methods might hurt accuracy if applied inadequately. To avoid the high computation cost of training from scratch, transfer learning can be employed in the medical field; it requires retraining or fine-tuning some layers so that these networks can detect cancer features. Furthermore, GAN is an effective data augmentation method in computer vision, but GAN training is still a difficult problem; these methods have been investigated intensively for common data and rarely for medical data. To overcome these limitations, we propose a composition of three techniques to boost breast cancer classification accuracy with limited training data.
1.2 Goals of the thesis

The objectives of this study are:
* This thesis will use Generative Adversarial Networks (GAN) to build synthetic breast cancer images. Goodfellow et al. [44] [45] proposed a new generative model trained by an adversarial process. GAN includes 2 separate models, a generative model G and a discriminative model D, trained concurrently: G learns the distribution of the training dataset while D tries to discriminate true images from fake images generated by G. D estimates the conditional probability p(y|x); G tries to optimize the conditional probability p(x|y) in order to fool D. We can understand that D and G play a two-player minimax game with the value function in equation 1.1:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (1.1)$$

The discriminator loss $\theta(d)$ maximizes D(x) to reach 100% probability on true images while pushing D(G(z)) toward 0% on fake images; conversely, the generator loss $\theta(g)$ is minimized so that D(G(z)) reaches 100% on fake images.
Figure 1.1: GAN network. Noise z feeds the generator G; the generated image from G feeds the discriminator D. D drives D(G(z)) toward 0 while G drives D(G(z)) toward 1.
In Figure 1.1, noise z, drawn from a Gaussian or uniform distribution, is the input for training the G model; conceptually, z is the latent feature from which the image is generated. The output from G is used as input to train the D model to discriminate real from fake images. Mini-batch stochastic gradient descent (SGD) trains the GAN model to optimize $\theta(d)$ and $\theta(g)$; to speed up the training process, GAN can use the Adam algorithm as well.
* In previous years, much research [46] [47] addressed the efficiency of handcrafted features such as the Scale Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF), and of deep networks and/or deep features, such as features extracted from VGG16 and VGG19 (developed by a research group at Oxford University) or ResNet (developed by a research group at Microsoft). The thesis applies basic deep network models to extract breast cancer features, instead of handcrafted features, which cannot capture the complex cancer characteristics in medical images.
Figure 1.2: Illustration of the BreaKHis database at different magnification factors: benign cells at 40x (a), 100x (b), 200x (c), 400x (d), and malignant cells at 40x (e), 100x (f), 200x (g), 400x (h).
* This thesis proposes a new algorithm to classify breast cancer images in three databases (BreaKHis, the Breast Cancer Classification Challenge 2018, and Kaggle) in order to improve classification performance.
WHO notes that many image types are used in cancer diagnosis: X-ray images find abnormal regions but cannot determine whether a region is cancerous; biopsy images can determine whether a region is cancerous but cannot identify the cancer subtype, shape, or other characteristics such as the distribution or balance of cells. From histopathological images, experts can classify the cancer region and its levels. This work proposes a method to detect cancer from histopathological images in three databases.
1. BreaKHis: BreaKHis is the benchmark database for studying the breast cancer classification problem. It contains 7,909 images from 82 patients at 4 magnifications (40x, 100x, 200x, 400x). The dataset is divided into 2 main groups, benign and malignant tumors, with 8 cancer subtypes, and its total size is 4 GB. It was built in collaboration with the P&D Laboratory - Pathological Anatomy and Cytopathology, Parana, Brazil.
Table 1.2: Image distribution per magnification, class, and subclass in BreaKHis.

2. Breast Cancer Classification Challenge 2018: BACH 2018 was built in collaboration with the Universidade do Porto, the Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência (INESC TEC), and the Instituto de Investigação e Inovação em Saúde (i3S), Portugal. The dataset consists of 400 images divided into four groups, with a total size of 13.2 GB. Each image is also classified into one of two main groups, non-carcinoma or carcinoma, by grouping the normal and benign classes into non-carcinoma and the in situ and invasive classes into carcinoma.
Figure 1.3: Illustration of the BACH database for tumor types: (a) Normal, (b) Benign, (c) In-situ, (d) Invasive cells.
3. Kaggle (IDC): [...] Breast Cancer (BCa) specimens scanned at 40x at the Hospital of the University of Pennsylvania and The Cancer Institute of New Jersey. From these, 277,524 patches of size 50x50 were extracted (198,738 Invasive Ductal Carcinoma (IDC) negative and 78,786 IDC positive).

Table 1.4: Image distribution per magnification and class in the BCa (Kaggle) database.
1.3 Contribution of the thesis
The study proposes a composition of three techniques (transfer learning, deep learning, and GAN) to boost breast cancer classification accuracy on a limited training dataset.
1.4 Structure of the thesis
The thesis consists of 5 main chapters: chapter 1 is to literature review of breast cancer
histopathological image, chapter 2 is to foundational theory about deep neuron and
generative adversarial network, chapter 3 is to propose the combination method of
three techniques, chapter 4 is to setup experiment Finally, the achievement, drawback
and future works is on chapter 5
1.5 Methodology

The whole slide image is divided into patch images. The patch images are then normalized to the [0,1] scale and resized to 256x256 pixels. The VGG16 & VGG19 base networks are used as feature extraction techniques to extract the discriminative characteristics of benign or malignant tumors. Our classification model is a CNN network of 7 layers.
Chapter 2

Foundational theory

2.1 Deep neural network

With the innovation of high-performance computing systems such as GPUs or grids of massive clusters, forward and backward propagation applied in neural networks proved that this technique improves the classification error rate over machine learning approaches such as SVM, Random Forest, Bayesian networks, etc. These networks compose many layers into a deep neural network architecture to learn features from low to high level via a stack of layers. Nowadays deep learning is a remarkable technique, widely considered for application in many fields such as computer vision, natural language processing, and video.
2.1.1 Introduction to deep neural networks
In the machine learning approach, we have to collect a dataset, analyze it, and understand what the data is and how it is distributed. Feature extraction and selection, such as feature ranking and dimension reduction, are applied to shape the dataset before building a model. We pose various questions to select the hypothesis space $\Phi(x)$ and the corresponding loss function $L(\Phi(x))$ that generalizes our data best. During the training process, minimizing the loss function is very important so that the predicted result reaches the target value. But a neural network is driven in a definitely different way: instead of choosing the best hypothesis, the method learns to find it. Figure 2.1 shows the differences between the learning approaches.
Figure 2.1: Summary of learning approaches by Goodfellow, Bengio, and Courville, comparing rule-based systems, machine learning, representation learning, and deep learning; green boxes are the components learned from data.

Conceptually, neural nets are inspired by how the human brain works. Figure 2.2 describes the physical brain structure, which has mainly three components: the dendritic tree, the axon hillock, and the axon. The dendritic tree collects input information from other neurons via their axons; an axon contacts the dendritic trees of other neurons
at synapses; the axon hillock receives the output from the dendritic tree and generates an outgoing spike of activity at the synapses into the post-synaptic neuron. To summarize, each neuron receives input from other neurons, and the flow of information on each input line is controlled by a synaptic weight; this connection weight can be adjusted efficiently at the receiving side during the cognition process. The main principle of a neuron is simulated in computer science as in Figure 2.3, defined as $y = \sigma(x_1 w_1 + x_2 w_2 + x_3 w_3 + b)$. Input signals from other neurons are transferred, and their weights $w$ can be adjusted accordingly; the final output information is summarized at the output node. The activation function $\sigma$ enables the neuron to do complicated computation: mathematically, the activation function turns the affine transformation from linear into non-linear. A neural network comprises thousands of these simple nodes, or neurons, computing for its tasks.
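As an illustration of the neuron computation above, here is a minimal sketch in Python with NumPy; the input, weight, and bias values are arbitrary examples, not values from the thesis:

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Example inputs from three upstream neurons and their synaptic weights
x = np.array([0.5, -1.2, 0.3])   # inputs x1, x2, x3
w = np.array([0.4, 0.7, -0.2])   # weights w1, w2, w3
b = 0.1                          # bias

# y = sigma(x1*w1 + x2*w2 + x3*w3 + b)
y = sigmoid(np.dot(x, w) + b)
print(y)
```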
So deep learning is an algorithm with many layers processing together to resolve particular tasks such as classification, object detection, etc. Each layer consists of many neurons (nodes or units), as demonstrated by the network in Figure 2.3b. The deep learning technique generates its mapping functions by studying the relations among features; it is not a definitely fixed function as in traditional machine learning. The function f maps an input x to an intermediate output y, defined as $y = f(x; w)$, and the parameter values $w$ are then learned to get the best approximating function $f$. The model in Figure 2.4 is also called feed-forward and is extremely important in deep learning networks. A feed-forward deep network composes many different functions together in a chain structure to learn abstract features, defined as $u = g(h(f(x)))$.

Figure 2.3: Simulation of (a) a neuron and (b) a deep learning network in computer science.
Figure 2.4: Feed-forward neural network (input x, mapping function, output y).
Convolutional network is a common term in neural network architecture. A convolutional network consists of three typical stages: the first stage is a combination of convolutional layers performing an affine transform on the input layer; the next stage runs a nonlinear activation to detect complex objects; and the final stage is a pooling layer. These stages are described below (a minimal sketch in Keras follows this list).
* Convolutional layer: in its general form, convolution is an operation on two real-valued functions x and w that measures a weighted average, denoted as $s(t) = (x * w)(t)$. In neural network terminology, it is a matrix multiplication between the input of the processed image and a weight (kernel), producing an output feature map (Figure 2.5). Convolution improves learning through sparse interaction and parameter sharing. With sparse connections, one input unit can affect only some output units and vice versa, because the kernel is rather small compared to the input image, so some connections are zeroed out; in the case of full connection, the layer is called a dense or fully connected layer. It has been shown that when processing large images with millions of pixels, small meaningful characteristics such as edges and important points can be detected with a small number of parameters and efficient computation. Second, with the parameter-sharing idea, a single parameter can be used for many inputs. Composing both ideas, convolution can greatly improve object detection.
* Pooling layer: a pooling layer adjusts a unit's value by a statistical summary of its neighboring units, such as the maximum or average. Its purpose is to reduce small variances across the neighborhood, which makes the pooling operator a good candidate for defining features regardless of their particular position or of variable input size. A pooling layer is normally placed after a convolutional layer to produce invariance to translations and transformations such as rotation. On the other hand, the pooling layer can be used as a downsampling technique that reduces the representation size, and thus the computation, of the next layer. Currently TensorFlow supports many pooling types: max pooling 1D, 2D, 3D and average pooling 1D, 2D, 3D.
* Activation layer: several activation functions are popular in neural networks. The softmax function uses the exponential function to normalize an input vector into a probability distribution over K components, denoted as

$$\mathrm{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{K} e^{y_j}}$$

The softmax function is used in the final output to classify among multiple classes. Meanwhile, the sigmoid function scales a real number into the continuous range of values between 0 and 1. The rectified linear activation function (ReLU), denoted as

$$\mathrm{ReLU}(x) = \max(0, x)$$

zeroes out all negative numbers and keeps positive numbers. ReLU trains faster because of its simple math; furthermore, it does not suffer from the vanishing gradient problem like sigmoid or softmax, and it converges quickly.
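To make the three stages concrete, here is a minimal Keras sketch; the filter counts, input shape, and output size are illustrative assumptions, not the thesis architecture:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Stage 1: affine transform via convolution (32 filters of size 3x3)
    layers.Conv2D(32, (3, 3), padding="same", input_shape=(256, 256, 3)),
    # Stage 2: nonlinear detector stage (ReLU)
    layers.Activation("relu"),
    # Stage 3: pooling summarizes each 2x2 neighborhood by its maximum
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    # Softmax normalizes the outputs into a probability distribution
    layers.Dense(2, activation="softmax"),
])
model.summary()
```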
2.1.2 Techniques of neural network training
* Activation function selection in each layer has to be considered carefully because it impacts computational efficiency and how quickly training converges to a local/global minimum. In practice, ReLU is the advised choice for neural networks. Over the years, researchers have studied activation functions that improve on ReLU's limitations, such as

$$\mathrm{LeakyReLU}(x) = \max(0.01x, x) \quad (2.4)$$

which can return a small negative number for negative inputs, and the parametric ReLU, $\mathrm{PReLU}(x) = \max(\alpha x, x)$ with a learnable slope $\alpha$, which is a composition of ReLU and LeakyReLU.
* Batch normalization: batch norm is a kind of regularization technique that normalizes each output dimension by its expectation and variance, defined as:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

It is proven to speed up convergence. Ioffe and Szegedy suggested using batch norm after a dense or convolutional layer and before the nonlinearity function. Batch norm brings several benefits to neural nets: it allows a higher learning rate to quickly reach a local/global minimum, and it reduces the dependence on weight initialization from a standard distribution or on dropout. Recently, additional normalization techniques have appeared: layer, instance, and group norm.
* Dropout: dropout's principle is to turn some neurons off randomly by multiplying their output values by zero so that they take no part in forward propagation. The dropout technique is rather similar to the bagging approach in that it composes various different networks with shared parameters, instead of training many large architectures at high computational cost. The probability hyper-parameter p selects over a wide range of networks; the higher the probability, the less dropout there is. This technique makes the network less prone to over-fitting. According to the authors' experiments, applying 20% dropout to the input units and 50% to the hidden layers is the optimal selection.
* Data augmentation: deep learning always needs a huge training set to reduce over-fitting, but in practice collecting large volumes is a heavy task, as with medical images, so data augmentation is a considered option. Traditionally, horizontal or vertical flips and rotations, randomly cropped images, scaling, resizing, changing color spaces, or combinations of all of these are common techniques; recently, GAN has become a nominated candidate for data augmentation, learning the distribution of the input data and then generating fake output whose features approximate the input. In our work, we combine both types.
* Transfer learning: the volume and domain of the dataset decide how many layers are frozen or retrained. Most vision models are trained on the ImageNet dataset, so the extracted features are generic, and sometimes the model needs retraining for a particular problem. In principle, the bottom layers extract generic features, while layers toward the top of the model extract more specialized features. As a statistical summary: with a small dataset in a similar domain, we can freeze many bottom layers and train a few top layers; when the dataset becomes bigger or the domain differs, more layers have to be retrained.
* Optimization: to reach a global minimum and avoid local minima or saddle points when training a neural network, there are many optimization approaches, such as stochastic gradient descent (SGD), momentum, Nesterov momentum, AdaGrad, RMSProp, and Adam. These techniques are used widely in machine learning, but Adam is the one mostly used in training neural networks. Adam is a composition of momentum with AdaGrad or RMSProp. Momentum builds up velocity to accelerate SGD and step over local minima or saddle points. Selecting an efficient learning rate is not an easy task, as either a large or a small value can make training take a long time to achieve the best loss; AdaGrad therefore calculates an adaptive learning rate by summing the historical squared gradients in each dimension. Kingma and Ba advise the Adam hyper-parameters beta1 = 0.9, beta2 = 0.999, and a learning rate of about $10^{-3}$ to start training a model. The Adam optimizer is used in our experiments.
* Early stopping: when, after many training iterations, the validation accuracy gradually decreases or stops changing, the training process has to stop at that point; this is called early stopping. While training a model through many loops, we often find that the training loss keeps going down and the training accuracy goes up, but the validation accuracy suddenly drops at some iteration; we should stop the process there and store the model weights. The approach is readily available in both TensorFlow and Keras. (A combined sketch of several of these techniques follows this list.)
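A minimal Keras sketch combining several of the techniques above: transfer learning from a frozen VGG16 base, batch normalization before the nonlinearity, dropout, the Adam optimizer, and early stopping. The layer sizes, dropout rates, and callback settings are illustrative assumptions, not the thesis configuration:

```python
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import VGG16

# Transfer learning: reuse generic bottom-layer features learned on ImageNet
base = VGG16(weights="imagenet", include_top=False, input_shape=(256, 256, 3))
for layer in base.layers:
    layer.trainable = False  # freeze the generic feature extractor

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512),
    layers.BatchNormalization(),   # batch norm before the nonlinearity
    layers.Activation("relu"),
    layers.Dropout(0.5),           # 50% dropout in the hidden layer
    layers.Dense(1, activation="sigmoid"),
])

# Adam with the hyper-parameters advised by Kingma and Ba
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Early stopping: halt training when validation accuracy stops improving
stopper = callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                  restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=32, callbacks=[stopper])
```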
In summary, the deep learning training process breaks into these basic steps:

* Pre-processing data: the first step that must be done in computer vision or machine learning. Data is normalized to zero mean and unit variance over the whole image or per channel; a normalized dataset helps the model train quickly with high accuracy.

* Selecting an architecture: we can design a simple small network, such as two layers with few neurons, or, depending on the specific problem, choose popular pre-trained models such as the R-CNN family for object detection or AlexNet, VGG, and ResNet for classification.

* Training the model: we train with default hyper-parameters such as learning rate, batch size, learning rate decay, or small regularization, then check whether the loss is decreasing and the validation accuracy is good. If the loss barely goes down or explodes, we adjust the learning rate up or down accordingly.

* Optimizing hyper-parameters: to get better hyper-parameters, we use grid search. After some tries, we adjust the ranges of the learning rate or regularization rate and train a model with each set of parameters; each set yields a validation accuracy, and we select the one giving the best accuracy. In high-level libraries such as Keras or PyTorch, this optimization process is ready for use.
[...] By the combination principle, some popular blocks are [(CONV-ReLU)xl - POOL/NORM] (l <= 5), [(FC-ReLU)xl] (l <= 2), [ResNet x l], [Inception x l], and [CONV-BATCHNORM-RELU]; alternatively, one can apply a fusion approach at the level of feature extraction, evaluation metrics, or algorithms.
2.1.3 Popular deep network models
Training a neural network needs a lot of data, but most medical datasets are too small for deep learning techniques; transfer learning is a considerable approach to resolve this matter. So far there are several popular networks, such as VGG16 [48], Inception [49], ResNet [50], Inception-ResNet [51], and DenseNet [52], which are rather efficient in medical classification in general and particularly in cancer detection.
* VGG network: VGG was the first deep architecture after the success of AlexNet. The VGG team stacked many convolutional and fully connected layers together and achieved better performance by utilizing the smallest receptive filter, the 3x3 convolutional filter. They proved that a deeper network increases classification accuracy on the large ImageNet dataset. Table 2.1 summarizes the architectures of the 16-layer and 19-layer VGG networks.

Table 2.1: VGG16 & VGG19 network architectures

Net   | Input       | Block 1                | Block 2                 | Block 3                 | Block 4                 | Block 5                 | Layers
VGG16 | Image input | 2 x conv3-64, maxpool  | 2 x conv3-128, maxpool  | 3 x conv3-256, maxpool  | 3 x conv3-512, maxpool  | 3 x conv3-512, maxpool  | FC-4096, FC-4096, FC-1000, soft-max
VGG19 | Image input | 2 x conv3-64, maxpool  | 2 x conv3-128, maxpool  | 4 x conv3-256, maxpool  | 4 x conv3-512, maxpool  | 4 x conv3-512, maxpool  | FC-4096, FC-4096, FC-1000, soft-max
The ReLU activation function is used throughout the VGG nets. The technique of 3x3 filters with stride 1 pixel is better than 7x7 or 5x5 filters with stride 2 pixels on two counts, discrimination capability and the number of weighted parameters; this is an important contribution of the work. The 3x3 convolutional filters learn local features, and after many stacked layers combining the localized low-level space, the nets synthesize higher feature spaces without losing characteristics. The incorporation of 1x1 convolution layers is another approach that increases the discrimination function while keeping the receptive fields of the layer. In recent years, the VGG16 and VGG19 nets have been used in transfer learning techniques because of their shared low-level feature extraction and medium-sized architecture. The two top fully connected layers of 4096 units provide good discriminative deep features that can be used, in combination with or independently of handcrafted features, in a classification network.
* Inception network: Inception nets concentrate on efficient deep neural nets. The authors used 1x1 convolutional operators to increase depth and reduce high-dimensional spaces. The Inception module's idea is to concatenate many optimal local structures with high correlation analyzed from the previous layer, as shown in Figure 2.6. The authors combined convolutional operators of various sizes, such as 1x1, 3x3, and 5x5; these act as a kind of multi-scale representation in a pyramid scheme. The reduction design of the Inception module allows increasing the number of nodes at each layer without affecting the computation of the next layer. In total, the Inception network has 22 layers with trained parameters.
* ResNet network: when a neural network becomes deeper, the accuracy begins to saturate and, beyond that, it faces the degradation problem. The ResNet authors proposed stacking additional identity mappings, as shown in Figure 2.7. They note that originally $H(x)$ is the predicted mapping function, which learns a mapping from input to output. Alternatively, we define another mapping $F(x) = H(x) - x$, so that $H(x) = F(x) + x$; the residual function is now easier to optimize with reference to the layer input. This formula is also a type of shortcut connection, borrowed from the long short-term memory (LSTM) network. The residual block brings a flow of memory from the input layer to the output layer (a minimal sketch follows).
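As a concrete illustration of the shortcut $H(x) = F(x) + x$, here is a minimal residual block sketch in Keras; the two-convolution form and the filter count are illustrative assumptions, not the exact ResNet block:

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # F(x): stacked convolutions that learn the residual mapping
    f = layers.Conv2D(filters, (3, 3), padding="same")(x)
    f = layers.BatchNormalization()(f)
    f = layers.Activation("relu")(f)
    f = layers.Conv2D(filters, (3, 3), padding="same")(f)
    f = layers.BatchNormalization()(f)
    # H(x) = F(x) + x: the identity shortcut carries the input through
    # (assumes x already has `filters` channels so the shapes match)
    out = layers.Add()([f, x])
    return layers.Activation("relu")(out)
```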
In our experiments, although Inception and ResNet achieved better results than VGG on ImageNet classification, on the BreaKHis medical images VGG transfer learning gives more discriminative features (a feature-extraction sketch follows).
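As an illustration of using the top 4096-unit fully connected layers as deep feature extractors, here is a minimal sketch with the Keras VGG16 implementation; the layer name "fc2" follows the Keras model, and the random placeholder image is an assumption of this sketch:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model

# Full VGG16 with its classifier head, trained on ImageNet
vgg = VGG16(weights="imagenet", include_top=True)
# Cut the network at the second 4096-unit fully connected layer
extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)

image = np.random.rand(1, 224, 224, 3) * 255.0  # placeholder for a real image
features = extractor.predict(preprocess_input(image))
print(features.shape)  # (1, 4096): a deep feature vector
```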
2.2 Generative Adversarial Networks (GAN)
2.2.1 Introduction to GAN

Basically, GAN composes two networks (see Figure 2.8), codenamed the generator network G(x) and the discriminator network D(G(x)). G generates fake images
Figure 2.6: (a) naive version of the Inception module; (b) reduction version of the Inception module.
by studying the input data distribution, while D discriminates real images from the training dataset against fakes from the G model. In a GAN, G solves a harder task than D, which recognizes correlations or distributions between nearly similar objects and categorizes them into the correct feature space. From the initial stage, the input dataset consists of random noise z and real data x used to train the G network. Recently, GAN has attracted much focus from the research community, and GAN variants now generate increasingly realistic images of faces, animals, natural scenes, etc. The dominant techniques among GANs are the conditional GAN and style-transfer GAN.
As shown in Figure 2.8, both the generator and the discriminator are neural networks trained simultaneously. The discriminator loss function is optimized by back-propagation to adjust the discriminator's weights. Training the generator is rather more complex: it incorporates D's feedback on the output classification and is penalized if a fake image is classified as non-real. Throughout GAN training, either G or D is frozen while the other trains, each optimizing its loss in a two-player game, denoted by the algorithm below (a TensorFlow sketch of one training step follows it):
* Loop through the training iterations:
  - Loop through the batches:
    + Sample m noise samples {z^(1), ..., z^(m)} from the noise distribution p_g(z)
    + Sample m examples {x^(1), ..., x^(m)} from the data distribution p_data(x)
    + Update the discriminator by ascending its stochastic gradient:

$$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D_{\theta_d}\big(x^{(i)}\big) + \log\Big(1 - D_{\theta_d}\big(G_{\theta_g}(z^{(i)})\big)\Big) \right]$$

  - Sample m noise samples {z^(1), ..., z^(m)} from the noise distribution p_g(z)
  - Update the generator by descending its stochastic gradient:

$$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\Big(1 - D_{\theta_d}\big(G_{\theta_g}(z^{(i)})\big)\Big)$$
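A sketch of one alternating training step implementing the algorithm above in TensorFlow; the generator and discriminator models, the batch of real images, and the batch size are assumed to be supplied by the caller:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
d_opt = tf.keras.optimizers.Adam(1e-4)
g_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(generator, discriminator, real_images, batch_size, z_dim=100):
    # Update D: ascend log D(x) + log(1 - D(G(z)))
    z = tf.random.normal([batch_size, z_dim])
    with tf.GradientTape() as tape:
        fake = generator(z, training=True)
        d_real = discriminator(real_images, training=True)
        d_fake = discriminator(fake, training=True)
        d_loss = bce(tf.ones_like(d_real), d_real) + \
                 bce(tf.zeros_like(d_fake), d_fake)
    grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))

    # Update G while D is held fixed: push D(G(z)) toward 1
    # (the common non-saturating variant of minimizing log(1 - D(G(z))))
    z = tf.random.normal([batch_size, z_dim])
    with tf.GradientTape() as tape:
        d_out = discriminator(generator(z, training=True), training=True)
        g_loss = bce(tf.ones_like(d_out), d_out)
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))
    return d_loss, g_loss
```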
From the generator's side, the most important technique is upsampling: a process of learning from a sequence of data and generating approximate sequences by capturing the input's density function. Upsampling convolutions are available in the TensorFlow library. Beyond GAN's common principles, Radford et al. [53] suggested some guidelines for developing a stable GAN architecture (a generator sketch following these guidelines appears after the list):
* Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
* Use batch norm in both the generator and the discriminator.
* Remove fully connected hidden layers for deeper architectures.
* Use ReLU activation in the generator for all layers except the output, which uses Tanh.
* Use LeakyReLU activation in the discriminator for all layers.
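A minimal generator sketch following these guidelines, with fractional-strided (transposed) convolutions for upsampling, batch norm, ReLU in all layers, and Tanh at the output; the spatial sizes and filter counts are illustrative assumptions:

```python
from tensorflow.keras import layers, models

def build_generator(z_dim=100):
    return models.Sequential([
        layers.Dense(8 * 8 * 256, input_shape=(z_dim,)),
        layers.Reshape((8, 8, 256)),
        # Fractional-strided convolutions upsample 8x8 -> 16x16 -> 32x32 -> 64x64
        layers.Conv2DTranspose(128, (4, 4), strides=2, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Conv2DTranspose(64, (4, 4), strides=2, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        # Output layer uses Tanh, per the guideline
        layers.Conv2DTranspose(3, (4, 4), strides=2, padding="same",
                               activation="tanh"),
    ])
```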
To understand how GAN networks are designed, we introduce some popular GAN models.

2.2.2 Popular GAN models
* Pix2Pix: Pix2Pix was published in 2016 and is used widely in many applications, including artistic ones such as converting edge maps to cat photos or translating sketches to Pokemon or portraits. The model's concept is to translate image to image. The generator is a U-Net combined with skip connections between layer i and layer n-i, as in ResNet, and the discriminator is a convolutional PatchGAN classifier. The loss function is denoted as:

- L1 distance:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x,z) \rVert_1\big] \quad (2.9)$$

- Conditional GAN loss:
$$\mathcal{L}_{cGAN}(G,D) = \mathbb{E}_{x,y}[\log D(x,y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x,z)))]$$

- Final loss:
$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G,D) + \lambda \mathcal{L}_{L1}(G)$$

The authors chose the L1 distance instead of L2 because D easily detects blurred images as fake. The PatchGAN discriminator's idea is to return the average over all patches' outputs, where each NxN patch of the image is classified as fake or real. Our chosen classification evaluation metric comes from PatchGAN's concept.
* CycleGAN: the pix2pix framework performs well when the training set consists of aligned image pairs, such as translating sketches to shoes, where the characteristics of y exist in input x. CycleGAN is proposed to translate images from domain A to domain B when no paired images are available. The authors assume there are hidden relations between the two domains, and instead of learning from pairs of images, they discover the relation from sets of images in both domains. The state of the art in CycleGAN is the definition of new loss functions, the cycle consistency loss and the adversarial loss. Define:

- Generator G: $G: X \rightarrow Y$
- Generator F: $F: Y \rightarrow X$
- To capture latent characteristics from domain A's image set and transform them to domain B's image collection, G and F should mathematically be inverses of each other: $F(G(x)) \approx x$ and $G(F(y)) \approx y$. This is the cycle consistency idea.
- The adversarial loss consists of $\mathcal{L}_{GAN}(G, D_Y, X, Y)$ for the mapping G and $\mathcal{L}_{GAN}(F, D_X, Y, X)$ for the mapping F.
- CycleGAN also uses the L1 distance, for the cycle consistency loss:
$$\mathcal{L}_{cyc}(G,F) = \mathbb{E}_x\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_y\big[\lVert G(F(y)) - y \rVert_1\big]$$
- Final loss:
$$\mathcal{L}(G,F,D_X,D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G,F) \quad (2.15)$$

For the GAN architecture, the authors use networks from neural style transfer and super-resolution as the generator and a 70x70 PatchGAN as the discriminator.
* StyleGAN: StyleGAN borrows ideas from style transfer work and the skip connections in ResNet to build the generator model shown in Figure 2.9. Instead of training on the full image directly, the authors encode the input latent code through 8 dense layers in a mapping network into an intermediate feature w, and then specialize w into style vectors A. The authors use the adaptive instance normalization technique to scale input features by mean and variance to match those of the style y, defined as

$$\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$

in which the normalized x is scaled by $\sigma(y)$ and shifted by $\mu(y)$. StyleGAN inherits all hyper-parameters from the Progressive GAN generator but replaces the nearest-neighbor layers with bilinear upsampling. Injecting noise B before the AdaIN operator adds a regularization factor to each layer. We choose StyleGAN to generate the fake benign and malignant images in our work (a sketch of AdaIN follows Figure 2.9).
Figure 2.9: StyleGAN generator network (the latent z in Z passes through the mapping network; the synthesis network g receives the styles and noise inputs).
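A sketch of the adaptive instance normalization operation described above; the channel-wise statistics over height and width and the epsilon for numerical stability are the only assumptions beyond the formula:

```python
import tensorflow as tf

def adain(x, y, eps=1e-5):
    """AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y).

    x: content feature maps, shape (batch, height, width, channels)
    y: style feature maps with the same shape convention
    """
    mu_x, var_x = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    mu_y, var_y = tf.nn.moments(y, axes=[1, 2], keepdims=True)
    normalized = (x - mu_x) / tf.sqrt(var_x + eps)
    return tf.sqrt(var_y + eps) * normalized + mu_y
```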
* Image preprocessing: a whole slide image is divided into many patch images for both the BreaKHis and BACH datasets; the IDC dataset keeps its original image size. For each patch, the image pixels in each channel are normalized to the [0,1] range to decrease the color intensity scale. Each patch image is then resized to 256x256 pixels using bilinear interpolation. Each image in the training set contributes all the patches of the original image so that our network can learn the multi deep features and increase performance. (A preprocessing sketch follows this list.)
* Feature extraction using deep features: the discriminative features extracted from fine-tuned VGG16, and from the concatenation of fine-tuned VGG16 & VGG19 transfer learning, are classified by our novel approach. In this work, all layers before the 17th layer of VGG16 & VGG19 are frozen, and the remaining layers are retrained.
* GAN and data augmentation: we choose two GAN architectures to test data augmentation capability on histopathological cancer images.

  - StyleGAN: [...] to generate cancer images for each magnification factor. Figure 2.9 described how this generator network improves on the traditional generator architecture for the style-transfer image problem. Synthesized images from StyleGAN are rather similar to the original images at first glance.

  - Conditional GAN: we choose Pix2Pix [55] as the other data augmentation method. First, we consider the combination of U-Net with skip connections in the generator network the best fit for histopathological images. Second, we expect to copy features from both the input and conditional images to increase the discriminative characteristics of benign and malignant tumors.

We applied StyleGAN and Pix2Pix to the BreaKHis dataset, and only Pix2Pix to the BACH dataset.
* Convolutional neural network for classification: in recent years, Convolutional Neural Networks (CNN) have proven an efficient approach in computer vision and have significantly improved cancer classification. Both VGG16 and VGG19 are proven good candidates for the transfer-learning technique. To discriminate benign from malignant tumor features, the base networks have to be retrained on the datasets and then used as input for the CNN.
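A sketch of the patch preprocessing and freeze/retrain setup described above; the exact freeze index, the placeholder input, and the helper name are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

def preprocess_patch(patch):
    # Normalize each channel to the [0, 1] range
    patch = tf.cast(patch, tf.float32) / 255.0
    # Resize to 256x256 pixels with bilinear interpolation
    return tf.image.resize(patch, (256, 256), method="bilinear")

# Fine-tuning: freeze all layers before the 17th, retrain the rest
base = VGG16(weights="imagenet", include_top=False, input_shape=(256, 256, 3))
for layer in base.layers[:17]:
    layer.trainable = False
for layer in base.layers[17:]:
    layer.trainable = True
```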
Figure 3.1: (a) Fine-tuning VGG16 with the CNN classifier; (b) fine-tuning VGG16 & VGG19 with concatenated features and the CNN classifier. The classifier head stacks fully connected layers (4096, 4096, 512, 512) with batch normalization, ReLU activations, and dropout (0.2), ending in a single-unit fully connected layer with a sigmoid output for the benign/malignant decision.
A combination of different feature extraction methods can increase the classification accuracy. This work uses the VGG16 network and then both VGG16 & VGG19