Techniques of Neural Network Training


• Activation Function

An activation function, also known as a transfer function, is applied after each convolutional layer to derive the output of a node. It is used to determine the output of the CNN, for instance a yes/no decision, and it maps the resulting values into a range such as 0 to 1 or -1 to 1, depending on the function. Activation functions fall into two main categories: linear activation functions and non-linear activation functions. Widely used examples include Sigmoid, tanh, Softmax, ReLU, and Leaky ReLU.

However, in this thesis, ReLU is the activation mainly implemented in the recommended CNN models, and it is discussed below.

As shown in Figure 2.20, ReLU is half-rectified (from the bottom): f(z) is zero when z is less than zero and equal to z when z is greater than or equal to zero. Its output ranges from zero to infinity, and both the function and its derivative are monotonic. A weakness of ReLU is that all negative inputs are mapped to zero immediately, which can reduce the model's ability to fit the training data properly: any negative input to the ReLU activation is turned into zero, so negative values are not mapped appropriately in the resulting output.

Figure 2.20: ReLU activation, R(z) = max(0, z).

The rectified linear activation function (ReLU) is denoted as:

\sigma(x) = \max(0, x) \quad (2.22)

Another activation worth discussing briefly is Softmax (also called Soft-argmax or the Normalized Exponential Function), a strong activation used in the last layer (the output layer) to take arbitrary real values and transform them into a probability distribution. Softmax normalizes an input vector into a probability distribution over K components, denoted as:

\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad (2.23)

• \vec{z} is the input vector to the softmax function, made up of (z_1, \ldots, z_K).

• z_i is the i-th element of the input vector.

• e^{z_i} is the standard exponential function applied to each element of the input vector.

• \sum_{j=1}^{K} e^{z_j} is the normalization term.

• K is the number of classes in the multi-class classifier.
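To make the two activations concrete, the following NumPy sketch implements Equations (2.22) and (2.23) directly; the max-subtraction inside softmax is a standard numerical-stability detail rather than part of the equation, and the example scores are arbitrary values chosen for illustration.

```python
import numpy as np

def relu(x):
    # Equation (2.22): sigma(x) = max(0, x), applied element-wise.
    return np.maximum(0.0, x)

def softmax(z):
    # Equation (2.23): sigma(z)_i = exp(z_i) / sum_j exp(z_j).
    # Subtracting max(z) keeps the exponentials numerically stable
    # without changing the resulting probabilities.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# Example: raw scores for a 4-class classifier.
scores = np.array([2.0, -1.0, 0.5, 0.1])
print(relu(scores))     # negative scores are clipped to 0
print(softmax(scores))  # probabilities that sum to 1
```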

• Batch Normalization

Proposed by Sergey Ioffe and Christian Szegedy in 2015, Batch Normalization (BN, also known as batch norm) is a technique used both to make ANN training faster and to keep it more stable by normalizing a layer's input, re-centering and re-scaling it. In other words, batch normalization reduces the amount by which the hidden unit values shift around (covariate shift), and is defined as:

\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}} \quad (2.24)

As a result of applying BN, each layer of the network is allowed to learn a little more independently of the others. Moreover, a higher-than-usual Learning Rate (LR) can be used without worrying that some activations will become too high or too low, so models that previously could not be trained start to train properly. Besides, similar to Dropout, BN has a slight regularization effect: it adds noise to each hidden layer's activations, which helps prevent overfitting.

As mentioned earlier, to enhance the stability of deep learning networks, BN normalizes the output of the previous activation layer by subtracting the batch mean and then dividing by the batch standard deviation.
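As an illustration, a convolutional block with batch normalization might look like the following Keras sketch; the filter count, kernel size, and 224x224x3 input shape are assumptions made only for the example, not the exact settings of the thesis's models.

```python
from tensorflow.keras import layers, models

# A minimal convolutional block: Conv -> BatchNorm -> ReLU.
# BatchNormalization normalizes the previous layer's output over the
# current mini-batch (Equation 2.24) before the activation is applied.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding="same", input_shape=(224, 224, 3)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
])
model.summary()
```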

• Dilution

As mentioned earlier, another effective regularization technique in ANNs for preventing overfitting, by preventing complex co-adaptations on the training data, is Dilution (also called Dropout). While the term dilution refers to the thinning of the weights, dropout refers to randomly "dropping out" or omitting units (both hidden and visible) during the training process of the ANN. By "dropping out", these units are not considered during a particular forward or backward pass. More technically, at each training stage, individual nodes are either dropped out of the net (the hidden layer) with probability 1 − p or kept with probability p, so that a simpler network is left for the subsequent layers.

Figure 2.21: A network (a) before and (b) after implementing dilution.
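A minimal sketch of how dropout is typically inserted between fully connected layers in Keras follows; the dropout rate of 0.5 (i.e. 1 − p = 0.5), the layer sizes, and the 4-class output are illustrative assumptions.

```python
from tensorflow.keras import layers, models

# Each Dropout layer omits units with probability 1 - p = 0.5 during
# training only; at inference time all units are kept.
model = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(1024,)),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),  # assumed 4 output classes
])
```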

• Optimization Algorithms

Optimization is a method used frequently when training an ANN. Optimizers are algorithms or techniques employed to modify the attributes of the ANN, such as its weights and Learning Rates (LR), in order to reduce the loss (loss function). Therefore, optimizers play a crucial role in determining suitable weights and learning rates that minimize the loss as much as possible and provide the most accurate results.

There are many robust optimizers, such as Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, Momentum, Nesterov Accelerated Gradient, Adagrad, AdaDelta, and especially Adam. Adam is the optimizer implemented in most of the algorithms in this thesis.

First published in 2014 at a very prestigious conference for deep learning practitioners, Adam (Adaptive Moment Estimation) is an extension of SGD that can be used in place of the original SGD to update ANN weights more effectively.

Here m_t and v_t are the estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively, and their bias-corrected versions are:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad (2.25)

\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad (2.26)

This bias correction ensures that E[\hat{m}_t] matches E[g_t], where E[f(x)] denotes the expected value of f(x). The parameters are then updated as:

\theta_{t+1} = \theta_t - \frac{\eta\,\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \quad (2.27)
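The following NumPy sketch spells out one Adam update for a single parameter vector, following Equations (2.25)–(2.27); the hyperparameter values are the commonly used defaults, not values prescribed by this thesis, and the toy gradient is made up for the example.

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters `theta` given gradient `grad`."""
    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-corrected estimates, Equations (2.25) and (2.26).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update, Equation (2.27).
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example usage with a toy 3-parameter vector.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([0.1, -0.2, 0.05])
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```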

• Data Augmentation

Usually, there are cases when we do not have enough data to train the models on. This problem can be solved by applying data augmentation, a technique that increases the size of the training data without collecting new data. Padding, cropping, rotating, and flipping are the most common operations applied to the given images to expand the amount of data.
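As an illustration, Keras' ImageDataGenerator can apply several of these transformations on the fly during training; the specific ranges below, and the x_train/y_train arrays referenced in the comments, are assumptions for the sketch rather than the exact settings used in this thesis.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, shift, zoom, and flip the training images so the
# model rarely sees exactly the same image twice.
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

# Assuming x_train and y_train hold the (images, labels) arrays:
# train_generator = datagen.flow(x_train, y_train, batch_size=32)
# model.fit(train_generator, epochs=50)
```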

• Model Checkpoint

Suppose that during training a power outage suddenly happens, causing the loss of information such as the training weights and the trained model; in such an unwanted situation, there is nothing to do but retrain the whole model from the beginning. Luckily, Model Checkpoint (MC) was created to combat this issue: it makes it possible to resume training from the point where the computer lost power, or to save the best-performing weights of the model during training. The ModelCheckpoint callback class has the following arguments (a usage sketch follows the list):

§ File-path: the path where the checkpoint file should be saved.

§ Monitor: the metric on which the saving decisions are based.

§ Verbosity: 0 for silent mode and 1 for printing a message when the callback saves the model.

§ Save-weights-only: if set to True, only the model weights are saved; otherwise the full model, including the architecture, weights, loss function, and optimizer state, is saved.

§ Save-best-only: if set to True, only the best model is saved, based on the quantity being monitored. For example, if accuracy is monitored and save-best-only is True, the model is saved every time the accuracy exceeds the previous best.

§ Mode: one of auto, min, or max. If accuracy is monitored, set it to max; if loss is monitored, set it to min. In auto mode, the direction is inferred automatically from the quantity being monitored.

§ Save-freq or period: the interval, in epochs, between checkpoints; after every such number of epochs the saving step is carried out again, subject to the save-best-only condition on the monitored value.
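A minimal sketch of the callback with the arguments listed above; the file path "best_model.h5" and the choice of monitoring validation accuracy are assumptions made for illustration.

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Save only the weights of the best model seen so far, judged by
# validation accuracy (higher is better, hence mode="max").
checkpoint = ModelCheckpoint(
    filepath="best_model.h5",   # File-path
    monitor="val_accuracy",     # Monitor
    verbose=1,                  # Verbosity
    save_weights_only=True,     # Save-weights-only
    save_best_only=True,        # Save-best-only
    mode="max",                 # Mode
)

# The callback is then passed to fit(), e.g.:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=[checkpoint])
```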

• Early Stopping

Widely used to tackle overfitting, Early Stopping (ES) is a quite different kind of regularization technique for machine learning models. It works by stopping training as soon as the validation error reaches its minimum.

As the epochs go by, the algorithm learns more, its error on the training set falls, and so does the error on the validation set. Nevertheless, instead of continuing to fall, the validation error eventually turns around and starts rising, which is a clear sign of overfitting. With the assistance of early stopping, training is stopped before this overfitting appears.
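A minimal sketch of the corresponding Keras callback; the patience value of 5 epochs is an illustrative assumption.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for
# 5 consecutive epochs, and roll back to the best weights seen.
early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
    verbose=1,
)

# model.fit(..., callbacks=[early_stopping])
```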

• Reduce Learning Rate On Plateau

It is well established that the Learning Rate (LR) is an extremely important parameter for the model's test accuracy. Adam updates every parameter during training with an individual learning rate, which means every ANN parameter has its own associated LR. However, each per-parameter LR is computed using lambda (the initial learning rate) as an upper limit, so the LRs can range from 0 (no update) to lambda (maximum update). Therefore, the LR chosen by the optimizer does not always perform well, and this is exactly when the model needs human intervention. If the learning rate should be lowered once the model stops improving, this can be done with a scheduling callback such as the ReduceLROnPlateau class, which reduces the LR by a given factor whenever the monitored metric has stopped improving for a number of epochs.
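A minimal sketch of the ReduceLROnPlateau callback; the factor, patience, and minimum LR below are assumed values for illustration, not the thesis's exact configuration.

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Multiply the learning rate by 0.2 whenever the validation loss has
# stopped improving for 3 epochs, but never go below 1e-6.
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.2,
    patience=3,
    min_lr=1e-6,
    verbose=1,
)

# model.fit(..., callbacks=[reduce_lr])
```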
