Convolutional Neural Network (CNN)

Part of the thesis Authentication via deep learning facial recognition with and without mask and timekeeping implementation at working spaces (Pages 21 - 28)

In deep learning, a Convolutional Neural Network (CNN, or ConvNet) is a class of Artificial Neural Network (ANN) most commonly applied to analyzing visual imagery [7]. CNNs are well known for the shared-weight architecture of their convolution kernels, or filters, which slide along input features and produce translation-equivariant responses known as feature maps. Counter-intuitively, most CNNs are not invariant to translation, due to the downsampling operations they apply to the input. CNNs have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series.

Thanks to its capacity for latent feature extraction, the CNN has been widely used for image classification problems. Consider the example below of how a human being uses the neurons in the brain to perceive, remember, and process an image, and how this compares with an artificial neural network.

Figure 2.2: The human brain processes and recognizes an image

Figure 2.2³ illustrates how a machine can reproduce the process native to the human brain to identify the correct face among others. The input parameters consist of multiple components of the face such as the eyes, nose, and mouth. However, these do not play the most important role in deciding the identity of the face; instead, hidden attributes ranging from the sinus to the jaw or hairstyle are decisive. This takes machines a step further, since

3 https://towardsdatascience.com/step-by-step-guide-to-building-your-own-neural-network-from-scratch-df64b1c5ab6e

human beings over time may suffer from memory loss or focus only on the main characteristics of the face. Therefore, the input parameters stated above, combined with the hidden attributes extracted by the machine learning model, make a strong step up in accuracy, authority, and security, thus supporting the effectiveness of face recognition in this thesis.

Figure 2.3: The process of extracting hidden attributes from the face

Figure 2.3 shows the basic learning process of an artificial neural network⁴. From the input values, the neural network performs processing operations to extract the latent features of the data. These hidden attributes become the input parameters of the subsequent layers until the final result emerges. A Convolutional Neural Network additionally uses shared weights (described in detail in section 2.2.1), which reduce the training time of the model by avoiding a huge number of parameters, a major advantage over fully-connected layers.

4 https://www.javatpoint.com/artificial-neural-network
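The layer-by-layer flow just described can be sketched with NumPy; the network shape and the random weights below are illustrative placeholders, not values from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input: a flattened 4-value "image" (placeholder values).
x = np.array([0.2, 0.8, 0.5, 0.1])

# Hidden layer: 3 neurons; its outputs are the hidden attributes
# that become the inputs of the next layer.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
h = sigmoid(W1 @ x + b1)

# Output layer: a single neuron producing the final result.
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
y_hat = sigmoid(W2 @ h + b2)
print(h.shape, y_hat.shape)  # (3,) (1,)
```

Each layer is just a weighted sum followed by a non-linearity; stacking such layers is what lets the network build up latent features.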

In a normal artificial neural network, a hidden layer is formed by neural nodes arranged in series, and through information processing these create the nodes of the next layer. This is done with fully-connected layers, where each node of the current layer connects to every node of the next layer. However, a problem occurs when the number of input parameters grows rapidly, which puts a burden on complexity and performance. For instance, inputting an image of size 64x64x3 to the network requires every pixel to become a node, meaning 64 x 64 x 3 = 12288 input nodes. Multiplying that by the number of neurons in the first hidden layer (take 1000 as an example) gives a weight count well surpassing 12 million. That is a huge number considering that the image is captured at a low resolution by today's standards, and that the network has only a single hidden layer. Consequently, specific mathematical operations are needed to mitigate this cost; outstanding among them are convolution and pooling.
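The parameter blow-up in this example can be checked with a few lines of arithmetic (the layer sizes follow the example above):

```python
# Weight count of one fully-connected hidden layer on a 64x64x3 image.
height, width, channels = 64, 64, 3
n_inputs = height * width * channels   # every pixel becomes an input node
n_hidden = 1000                        # neurons in the first hidden layer

n_weights = n_inputs * n_hidden        # one weight per input-neuron pair
print(n_inputs)   # 12288
print(n_weights)  # 12288000 -> over 12 million weights
```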

2.2.1. Convolution

The Convolutional Neural Network (illustrated in Figure 2.5) starts from the idea of performing the convolution calculation depicted in the figure below:

Figure 2.4: The sample calculation of convolution

Take the input image as a 7x7 matrix whose entries are either 0 or 1, and a filter matrix of size 3x3. The formula for the convolution calculation is as follows:

𝑆𝑖𝑗 = (𝐼 ∗ 𝐾)𝑖𝑗 = ∑𝑚 ∑𝑛 𝐼(𝑚, 𝑛)𝐾(𝑖 − 𝑚, 𝑗 − 𝑛) (2.1)

where 𝑖, 𝑗 address the position of the result element, and the summation indices 𝑚 and 𝑛 run over the rows and columns of the input matrix.
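As a sketch, the sliding-window computation of Figure 2.4 can be reproduced in NumPy. Note that deep-learning frameworks typically compute cross-correlation, i.e. equation (2.1) without flipping the kernel, which is what the loop below does; the 0/1 image and the filter values here are placeholders, not the figure's:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the elementwise products
    (cross-correlation, as used in CNN frameworks; no kernel flip)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# 7x7 binary image and a 3x3 filter (placeholder values).
image = (np.arange(49).reshape(7, 7) % 2).astype(float)
kernel = np.array([[1., 0., 1.],
                   [0., 1., 0.],
                   [1., 0., 1.]])

result = conv2d_valid(image, kernel)
print(result.shape)  # (5, 5): a 7x7 input and a 3x3 filter give a 5x5 feature map
```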

Figure 2.5: A sample Convolutional Neural Network for image classification

One of the most recognizable features of the Convolutional Neural Network is the use of shared weights: the same weights are reused at every position of a kernel, so the neurons of the first hidden layer discover similarities between different regions, in other words the latent features of the input (as shown in Figure 2.6). This reduces the number of parameters while encouraging the discovery of the main features of the image. For instance, for the 7x7 image matrix of Figure 2.4 with 2 kernels of size 3x3, each feature map needs only 3 x 3 = 9 weights and has 25 neurons in the second layer. The 2 feature maps therefore bring the total to 2 x 9 = 18 parameters, significantly fewer than for a fully-connected layer (assuming 10 neurons in the first layer, the number of parameters is 7 x 7 x 10 = 490 >> 18). This comparison indicates that the convolutional layer needs just a fraction of the parameters yet retrieves latent features as well as the fully-connected layer does.

Figure 2.6: A depiction of shared weights in Convolutional Neural Network
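The parameter counts of the shared-weight example above can be verified directly (the sizes follow the 7x7 input / 3x3 kernel example):

```python
# Shared weights: each 3x3 kernel is reused at every position,
# so a feature map needs only 9 weights regardless of input size.
kernel_h, kernel_w, n_kernels = 3, 3, 2
conv_params = kernel_h * kernel_w * n_kernels
print(conv_params)  # 18

# Fully-connected alternative: one weight per pixel per neuron.
input_h, input_w, fc_neurons = 7, 7, 10
fc_params = input_h * input_w * fc_neurons
print(fc_params)  # 490

# Each kernel sliding over the 7x7 input yields a 5x5 feature map,
# i.e. 25 neurons per map, yet only 9 learned weights per map.
feature_map_neurons = (input_h - kernel_h + 1) * (input_w - kernel_w + 1)
print(feature_map_neurons)  # 25
```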

2.2.2. Pooling

Since the convolutional layer outputs a matrix, the network needs to reduce its dimensions and suppress noise; pooling was introduced for this purpose. There are several pooling operations, such as average, max, or sum pooling, but max pooling has proved efficient in terms of noise reduction. The following example (Figure 2.7) shows how max pooling extracts the maximum value of each colored region:

Figure 2.7: A sample calculation of max pooling
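A minimal NumPy sketch of 2x2 max pooling over non-overlapping regions, as in Figure 2.7 (the input values below are placeholders, not the figure's):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Keep only the maximum of each non-overlapping size x size region."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [7., 2., 9., 8.],
              [1., 0., 3., 4.]])
pooled = max_pool2d(x)
print(pooled)
# [[6. 5.]
#  [7. 9.]]
```

Halving each spatial dimension discards the exact position of each maximum while keeping its value, which is what gives max pooling its noise-reduction effect.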

2.2.3. Cross-Entropy Loss

The Convolutional Neural Network outputs a classification for a given input, so the loss function is extremely important for assessing the closeness between an observation (𝑦, which is 0 or 1) and the outcome of the network itself (𝑦̂ = 𝜎(𝐰 ⋅ 𝐱 + 𝑏)). The loss, denoted ℒ, is as follows:

ℒ(𝑦̂, 𝑦) = How much 𝑦̂ differs from the true 𝑦 (2.2)

A loss function that might first be considered for the classification problem is the Least Squared Error (LSE) [8], with the equation below:

ℒ(𝑦̂, 𝑦) = (1/2)(𝑦̂ − 𝑦)² (2.3)

However, this squared error, combined with the sigmoid output, is hard to optimize for discrete labels (0 or 1 in the ground truth), so the loss function needs to be modified to measure the difference between probability distributions. In other words, the loss function should maximize the likelihood of the correct class label; this is called conditional maximum likelihood estimation: maximize the log probability of the true 𝑦 labels in the training data given the observations 𝑥. Since there are only two discrete outcomes (0 or 1), following the Bernoulli distribution [9], the probability 𝑝(𝑦 | 𝑥) for one observation can be expressed as follows:

𝑝(𝑦 ∣ 𝑥) = 𝑦̂^𝑦 (1 − 𝑦̂)^(1−𝑦) (2.4)

Taking the log of both sides of equation (2.4) gives the log probability:

log 𝑝(𝑦 ∣ 𝑥) = log[𝑦̂^𝑦 (1 − 𝑦̂)^(1−𝑦)] = 𝑦 log 𝑦̂ + (1 − 𝑦) log(1 − 𝑦̂) (2.5)

Flipping the sign of this log likelihood gives the Binary Cross-Entropy loss, which is minimized during back-propagation with respect to the parameters 𝑤 and 𝑏 to obtain a better outcome:

𝐿CE(𝑦̂, 𝑦) = −log 𝑝(𝑦 ∣ 𝑥) = −[𝑦 log 𝑦̂ + (1 − 𝑦) log(1 − 𝑦̂)] (2.6)
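Equation (2.6) can be evaluated directly; the weights, bias, and observation below are hypothetical values chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y, eps=1e-12):
    """Binary cross-entropy, equation (2.6); eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Hypothetical parameters and one observation x with true label y = 1.
w, b = np.array([0.5, -0.3]), 0.1
x, y = np.array([2.0, 1.0]), 1.0

y_hat = sigmoid(w @ x + b)   # network output as defined in section 2.2.3
loss = float(bce_loss(y_hat, y))
print(round(loss, 4))        # 0.3711
```

The closer 𝑦̂ is to the true label, the smaller the loss; a confident wrong prediction is penalized heavily because of the logarithm.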
