Authentication via deep learning facial recognition with and without mask and timekeeping implementation at working spaces

INTRODUCTION

Introduction

In the wake of the Covid-19 pandemic, businesses are adapting to new social distancing measures, including mask-wearing and contactless authentication. The resurgence of Covid-19 has highlighted the inefficiencies of manual timekeeping methods, which lead to delays in reporting. This has prompted a shift towards Artificial Intelligence (AI) and Machine Learning to address these challenges. A promising solution lies in face recognition technology, specifically the Siamese Neural Network, integrated with time-tracking systems. However, the effectiveness of this biometric method is hindered by mask-wearing, although recent studies show potential in masked face recognition with DeepMaskNet. Further experimentation is necessary to evaluate the true capabilities of the Siamese Neural Network in recognizing masked faces.

Problem statement

The face recognition model processes images of human faces, whether they are wearing masks or not, captured in formats such as PNG or JPG.

1 https://hbr.org/2020/09/adapt-your-business-to-the-new-reality

Workforce management software integrates various features such as scheduling, payroll, and HR functions into a single platform, enhancing operational efficiency. It offers tools for labor forecasting, task management, and real-time communications, ensuring that businesses can optimize staffing levels based on demand. Additionally, it streamlines onboarding processes and automates payroll, making it easier to manage time and attendance while ensuring compliance with labor laws. This comprehensive approach not only improves productivity but also helps in retaining talent by providing a transparent and fair work environment.

• The face recognition result – assert True or False and if True states the name of the person

• Potentially with or without mask result

The face recognition model is presented in detail below:

Figure 1.1: The face recognition and time-keeping application pipeline architecture

To ensure high-quality face recognition, it is essential that the input image captures the entire face, whether masked or unmasked. Extensive research has been conducted on face recognition models, leading to numerous proposals aimed at enhancing accuracy, even in challenging environments. Typically, the output of these models includes a critical assertion of True or False regarding identity verification. Additionally, some approaches utilize face matching percentages to measure, anticipate, and predict human facial recognition in future applications.

Consider the following motivating example, which starts with a human face image captured from the live camera and shows the output of each step:

Table 1.1: The output use cases of the face recognition model

Input Output at step 1 Output at step 2 Output at step 3

Mr John McCarthy: True (with mask)

10th, 2023 07:49:05AM Human face image

10th, 2023 09:16:23AM Human face image

Access Denied Timein: No record found

Objectives and missions

This thesis aims to understand face recognition models based on machine learning and deep learning, with the following objectives:

• Understand the fundamental principles of machine learning and deep learning

• Identify the problems with face recognition (especially with masks) and the ways to get it resolved based on recent face recognition articles

• Analyze all the procedures, assess the feasibility of each solution and draw a conclusion on the pros and cons of the proposed solution

• Research popular and appropriate human face datasets (with and without masks) and collect them in advance for later use

• Test the face recognition model in real conditions, and understand and suggest enhancements for better accuracy and performance

A further aim is to deepen the understanding of algorithmic logic, deep learning, and the face measurement and recognition strategies presented in the thesis, while also exploring future research opportunities and the potential for large-scale product deployment.

The thesis involves several key tasks. First, it requires the collection of recent papers on face recognition, focusing on challenges related to recognizing faces with masks. Next, it entails researching both historical and current obstacles in this area. The project will include experiments with various face recognition methods, particularly those effective for masked individuals, to identify the most suitable approach for recognizing faces both with and without masks, considering feasibility and scope. Additionally, it aims to define the model's expected outputs to assist in gathering relevant datasets. The model will be established using existing frameworks, libraries, and tools, followed by rigorous testing and validation of results. Finally, the thesis will conclude with insights and recommendations for future research directions.

Scope of the thesis

Face recognition has a long line of research and applications, hence the scope of the thesis is limited as follows:

• Build up the face recognition model using the Convolutional Neural Network for recognizing the human face with and without mask (stating the name of the user)

• The used datasets have to be popular for evaluation and include a variety of facial components

• Machine learning technology: representation using Euclidean distance, evaluation using Precision, Recall and F1, optimization using Adam and Stochastic Gradient Descent, and the relevant parameter configuration

• The basic Python UI web application for user experience using Flask and Streamlit.

Thesis contributions

In this thesis, the author proposes a solution with the machine learning model so that:

• It can be applied for face recognition with both masked and unmasked human face datasets

• The model acts as a baseline to improve security for face recognition with ensemble learning

• The quality of the model behind a ubiquitous Python web application is proved to be advanced and fulfill the needs of enterprises.

Thesis structure

The structure of the thesis consists of 5 chapters:

Chapter 1 provides an overview of face recognition technology and its current applications across various sectors of information technology. It highlights the pressing issue of time-keeping challenges faced by firms today, which compels developers to engage in multiple processes to leverage advanced neural networks and deep learning techniques. Additionally, this chapter outlines the plan, orientation, targets, and milestones that are essential for understanding the direction of this technology.

Chapter 2: Background Knowledge provides essential insights into neural networks and deep learning techniques, alongside an overview of Python web-based development and database management systems, all of which are crucial for the successful execution of the project.

In Chapter 3, we will explore related works that inform the implementation of face recognition for widespread use, focusing primarily on the contributions of esteemed authors and publishers. This section will adhere to strict citation methods to ensure credibility and accuracy in presenting these influential approaches and models.

Chapter 4 outlines the proposed model and its implementation, presenting the most suitable models and methods supported by clear justifications. This chapter emphasizes the potential for breakthrough findings and improvements, highlighting the adaptability of the proposed approach to accommodate future advancements in the field.

The effectiveness of a model hinges on the quality and quantity of the datasets used for training, necessitating a comprehensive guide on dataset formation and utilization. This chapter emphasizes the implementation phase as crucial, providing an in-depth exploration of the existing model to enhance the author's understanding of the project's objectives. It will cover the coding and compilation processes essential for model construction. The subsequent section will present the experimental results, comparing the applied model with other relevant models, while clearly defining loss functions and optimization strategies to pave the way for future research opportunities.

In the conclusion of Chapter 5, the author emphasizes the importance of reflecting on all phases of the project, highlighting both the merits and demerits observed during implementation. Additionally, it is crucial to provide recommendations for future improvements to enhance the project's effectiveness.

BACKGROUND KNOWLEDGE

Face recognition

A facial recognition system is a technology that identifies and verifies individuals by comparing their facial features from digital images or video frames against a database of known faces. This system is commonly used for user authentication in ID verification services, utilizing precise measurements of facial characteristics to ensure accurate identification.

Facial recognition technology identifies human faces in two-dimensional images by first isolating the face from a noisy background. The face is then cropped to a specific size and converted to grayscale, facilitating accurate landmark localization. This process is crucial for feature extraction, where multiple neural networks and filters work together to create a comprehensive representation of the face for comparison against existing database entries.

Traditional mathematical computations have been used to assess facial features like the eyes, nose, cheeks, and chin; however, these methods struggle with accurately identifying complete faces due to limitations in matching capabilities and vulnerability to variations caused by overfitting in pre-trained datasets. This challenge has prompted data scientists to explore new machine learning techniques known as deep-learning architectures, which will be discussed in detail in the following section.

Figure 2.1: The flowchart of Face Recognition

Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs), a type of Artificial Neural Network (ANN), are primarily used for analyzing visual images. They feature a shared-weight architecture of convolutional kernels or filters that slide across input data, generating translation-equivariant responses known as feature maps. Counter-intuitively, most CNNs are not translation-invariant, due to the downsampling operations applied during processing. CNNs have diverse applications, including image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain-computer interfaces, and financial time series analysis.

Convolutional Neural Networks (CNNs) are highly effective for image classification tasks due to their ability to extract latent features. This process mirrors how the human brain utilizes neurons to think, remember, and process visual information, drawing a comparison between biological and artificial neural networks.

Figure 2.2: Human brain processes the image and recognizes

Figure 2.2 illustrates how machines can mimic the human brain's process of identifying faces. While input parameters like eyes, nose, and mouth are essential, it is the hidden attributes, such as the shape of the jaw, sinus structure, and hairstyles, that play a crucial role in accurately recognizing faces. This advancement significantly enhances the capabilities of machines in facial recognition technology.

Human beings may experience memory loss and focus primarily on key facial features. By integrating these input parameters with hidden attributes derived from machine learning models, we can significantly enhance accuracy, authority, and security. This advancement will ultimately bolster the effectiveness of the face recognition technologies discussed in this thesis.

Figure 2.3 illustrates the fundamental learning process of an artificial neural network, where input values undergo processing to uncover latent features within the data. These hidden attributes serve as input parameters for subsequent layers, ultimately leading to the final output. Additionally, a shared-weight mechanism, detailed in section 2.2.1, is used in the Convolutional Neural Network. It reduces the training time of the model by avoiding a huge number of parameters, which is a major advantage over fully-connected layers.

4 https://www.javatpoint.com/artificial-neural-network

In a typical artificial neural network, hidden layers consist of interconnected nodes that process information and pass their outputs to subsequent layers. This structure relies on fully-connected layers, where each node in one layer is linked to every node in the next. However, as the number of input parameters increases, the number of weights grows quickly and causes problems in complexity and performance. For example, an image sized 64x64x3 results in 12,288 input nodes, and if the first hidden layer has 1,000 nodes, the number of weights between the two layers is 12,288 x 1,000, which exceeds 12 million. This is particularly concerning given that the image has a low resolution by modern standards and only a single hidden layer is considered. To address these issues, mathematical techniques, particularly convolution and pooling, are essential for reducing complexity.

The Convolutional Neural Network (illustrated in Figure 2.5 ) starts with an idea of performing convolution calculation as depicted in the figure below:

Figure 2.4: The sample calculation of convolution

Take the input image as a 7x7 matrix whose entries are either 0 or 1, and a 3x3 filter matrix. The formula to perform the convolution calculation is as follows:

\[ (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n) \qquad (2.1) \]

where \( i, j \) address the position of the result element, \( I \) is the input matrix, \( K \) is the filter matrix, and \( m \) and \( n \) range over the rows and columns of the input matrix.
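As an illustration of equation (2.1), the following is a minimal NumPy sketch, not the implementation used in this thesis. It assumes the output is computed only where the filter fits entirely inside the input, so a 7x7 input and a 3x3 filter yield a 5x5 result. The function name `conv2d_valid` and the filter values are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D convolution, computed only where the kernel fits inside the image."""
    # True convolution (as opposed to cross-correlation) flips the kernel
    # along both axes before sliding it over the input.
    flipped = kernel[::-1, ::-1]
    kh, kw = flipped.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * flipped)
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 2, size=(7, 7))      # 7x7 input of zeros and ones
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])               # hypothetical 3x3 filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)                     # (5, 5)
```

Deep learning libraries usually implement the unflipped variant (cross-correlation) under the name convolution. Because the filter weights are learned, the two conventions are interchangeable in practice.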

Figure 2.5: The sample Convolutional Neural Network for image classification

A key characteristic of Convolutional Neural Networks (CNNs) is the use of shared weights, where the same kernel weights are applied at every position of the input. This approach enables neurons in the first hidden layer to effectively identify similarities and latent features within different regions of the input. By doing so, CNNs significantly reduce the number of parameters while still capturing essential features of an image. For example, a 7x7 matrix image processed with two 3x3 kernels results in only 18 parameters, compared to 490 parameters required by a fully-connected layer with 10 neurons. This illustrates that convolutional layers achieve similar feature extraction capabilities as fully-connected layers while utilizing a fraction of the parameters, enhancing efficiency in image processing.

Figure 2.6: A depiction of shared weights in Convolutional Neural Network
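The parameter counts quoted in this section can be checked with a few lines of arithmetic. The sketch below counts weights only and ignores bias terms, which is an assumption consistent with the figures in the text. The helper names are hypothetical.

```python
def dense_weights(n_inputs, n_neurons):
    """Number of weights in a fully-connected layer (biases ignored)."""
    return n_inputs * n_neurons

def conv_weights(kernel_h, kernel_w, n_kernels, in_channels=1):
    """Number of weights in a convolutional layer with shared kernels (biases ignored)."""
    return kernel_h * kernel_w * in_channels * n_kernels

fc_small = dense_weights(7 * 7, 10)            # 7x7 image, 10 neurons
conv_small = conv_weights(3, 3, 2)             # two 3x3 kernels
fc_large = dense_weights(64 * 64 * 3, 1000)    # 64x64x3 image, 1,000 hidden nodes

print(fc_small, conv_small, fc_large)          # 490 18 12288000
```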

In convolutional neural networks, the pooling layer is essential for reducing dimensions and minimizing noise in the output matrix. Various pooling methods exist, including average, max, and sum pooling, with max pooling being particularly effective for noise reduction. For example, max pooling extracts the highest value from each designated region, as illustrated in Figure 2.7.

Figure 2.7: A sample calculation of max pooling
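Max pooling can be sketched in NumPy as follows. The example assumes non-overlapping 2x2 regions and an input whose height and width are even. The input values are made up for illustration and are not those of Figure 2.7.

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling over non-overlapping 2x2 regions of a 2D array."""
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split rows and columns into blocks of two, then take the maximum per block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[6 8]
#  [3 4]]
```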

The Convolutional Neural Network (CNN) generates classifications based on input data, making the loss function crucial for evaluating the accuracy of predictions. It measures the difference between the actual observation \( y \) (which can be 0 or 1) and the network's output \( \hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b) \). The loss function, denoted \( \mathcal{L} \), is as follows:

\[ \mathcal{L}(\hat{y}, y) = \text{How much } \hat{y} \text{ differs from the true } y \qquad (2.2) \]

The most commonly used loss function that can be taken into consideration in classification problem is the Least Squared Error (LSE) [8] with the equation as below:

The standard loss function frequently results in a loss of zero when comparing predicted labels \( \hat{y} \) to actual labels \( y \) (0 or 1), necessitating a modification to effectively measure the difference between probability distributions. This adjustment aims to make the correct class labels more likely through conditional maximum likelihood estimation, which seeks to maximize the log probability of the true \( y \) labels in the training dataset given the observations \( x \). Given the binary nature of the outcomes (0 or 1), which follow the Bernoulli distribution, the probability \( p(y \mid x) \) can be expressed for a single observation as follows:

\[ p(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y} \qquad (2.4) \]

Taking the log of both sides of equation (2.4) gives the log of the probability:

\[ \log p(y \mid x) = \log\left[ \hat{y}^{\,y} (1 - \hat{y})^{1 - y} \right] = y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \qquad (2.5) \]

To obtain a loss function that can be minimized during backpropagation, the Binary Cross-Entropy loss is derived by flipping the sign of the log likelihood:

\[ \mathcal{L}_{CE}(\hat{y}, y) = -\log p(y \mid x) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] \qquad (2.6) \]

This loss function, which depends on the parameters \( \mathbf{w} \) and \( b \) through \( \hat{y} \), must be minimized to achieve improved results.
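The Binary Cross-Entropy loss described above, the negative log likelihood of a Bernoulli outcome, can be sketched as below. The weights, input, and the clipping constant `eps` are hypothetical values chosen for illustration. Clipping is a common numerical safeguard and is an assumption here, not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Negative log likelihood of label y (0 or 1) under prediction y_hat."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w = np.array([0.5, -0.25])   # hypothetical weights
x = np.array([2.0, 4.0])     # hypothetical observation
b = 0.0
y_hat = sigmoid(np.dot(w, x) + b)            # w.x + b = 0, so y_hat = 0.5

loss_if_positive = binary_cross_entropy(y_hat, 1)
confident_right = binary_cross_entropy(0.9, 1)
confident_wrong = binary_cross_entropy(0.1, 1)
print(loss_if_positive, confident_right, confident_wrong)
```

A confident correct prediction yields a small loss, while a confident wrong prediction is penalized heavily. This is the behaviour needed to push \( \hat{y} \) towards the true label.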

Siamese Neural Network (SNN)

2.3.1 Overview of Siamese Neural Network

The Siamese Neural Network (SNN) is a neural network architecture that consists of two or more identical sub-networks, each sharing the same configuration, parameters, and weights. Any update made to the parameters of one sub-network is mirrored across all sub-networks, ensuring consistency in the learning process.

SNNs are primarily used to assess the similarity between inputs by comparing their feature vectors. Key applications of SNNs include face recognition, signature verification, and anti-spoofing measures.

Neural networks typically use hidden layers to predict problem classes, but adding or removing classes requires retraining the model on the entire dataset, including both new and existing data. Furthermore, deep neural networks often need a substantial amount of data for accurate predictions. In contrast, a Siamese Neural Network (SNN) learns the similarity between inputs, which makes it possible to classify new classes of data without retraining the entire network.

The processing steps of an SNN (as shown in Figure 2.8):

• Select a pair of images (or anything that needs to be classified) from the dataset

• Pass each image through one of the sub-networks of the SNN. The output of each sub-network is an embedding vector

• Calculate the Euclidean distance between those two embedding vectors

The Sigmoid function can be utilized to calculate a similarity score between two embedding vectors, producing a value within the range of [0,1]. A score nearing 1 indicates a high level of similarity between the vectors, while a score closer to 0 signifies greater dissimilarity.

Figure 2.8: The sample Siamese Neural Network for face recognition
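The steps listed above can be sketched in NumPy as follows. This is a toy illustration under several assumptions, not the network used in this thesis. The sub-network is a single shared linear layer with ReLU, the images are random 8x8 arrays, and the similarity score is a Sigmoid applied to `bias - scale * distance`. The `scale` and `bias` values are hypothetical and would normally be learned, chosen here so that a small distance maps to a score near 1.

```python
import numpy as np

def embed(image, weights):
    """Toy sub-network: flatten the image, apply one linear layer and ReLU."""
    return np.maximum(weights @ image.ravel(), 0.0)

def euclidean_distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return np.sqrt(np.sum((u - v) ** 2))

def similarity_score(distance, scale=1.0, bias=4.0):
    """Map a distance to (0, 1); scale and bias are hypothetical values."""
    return 1.0 / (1.0 + np.exp(-(bias - scale * distance)))

rng = np.random.default_rng(42)
shared_weights = rng.normal(size=(16, 64))   # the same weights serve both branches

face_a = rng.random((8, 8))
face_b = face_a + 0.01 * rng.normal(size=(8, 8))   # near-duplicate of face_a
face_c = rng.random((8, 8))                        # unrelated image

emb_a = embed(face_a, shared_weights)
emb_b = embed(face_b, shared_weights)
emb_c = embed(face_c, shared_weights)

score_same = similarity_score(euclidean_distance(emb_a, emb_b))
score_diff = similarity_score(euclidean_distance(emb_a, emb_c))
print(score_same, score_diff)
```

Both images go through the same `shared_weights`, which is what makes the two branches identical. Only the mapping from distance to score involves extra parameters.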

SNN operates by learning from pairs of input data, necessitating the consideration of two additional loss functions alongside Binary Cross-Entropy loss, as these alternatives are deemed more suitable for specific use cases.

The idea of Triplet Loss is to use a set of three inputs, Anchor (\( A \)), Positive (\( P \)) and Negative (\( N \)), where the distance from \( A \) to \( P \) is minimized while the distance from \( A \) to \( N \) is maximized during training.

\[ \mathcal{L}(A, P, N) = \max\left( \lVert f(A) - f(P) \rVert^2 - \lVert f(A) - f(N) \rVert^2 + \alpha,\ 0 \right) \qquad (2.7) \]

where \( A \) is the anchor input, \( P \) is a positive input from the same class as \( A \), \( N \) is a negative input from a different class, \( \alpha \) is the margin, and \( f(\cdot) \) denotes the embedding vector of an input.
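A minimal sketch of the Triplet Loss as described above is given below. The embedding vectors are hand-picked two-dimensional examples and the margin value 0.2 is an assumption for illustration.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2)   # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)   # squared distance anchor-negative
    return max(d_ap - d_an + alpha, 0.0)

anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])
far_negative = np.array([3.0, 0.0])    # already well separated from the anchor
near_negative = np.array([0.5, 0.0])   # closer to the anchor than the positive

easy = triplet_loss(anchor, positive, far_negative)    # 1 - 9 + 0.2 < 0, so 0
hard = triplet_loss(anchor, positive, near_negative)   # 1 - 0.25 + 0.2 = 0.95
print(easy, hard)
```

Triplets whose negative is already far enough away contribute no loss, so only the hard triplets drive the training.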

Contrastive Loss is a technique that, like Triplet Loss, focuses on comparing input data, but it operates on pairs of data rather than triplets. When the paired inputs are of the same type, the method aims to minimize the distance between their feature vectors, while it seeks to maximize the distance when the inputs are of different types during the training process.

\[ \mathcal{L}(Y, D_w) = (1 - Y)\,\tfrac{1}{2} D_w^2 + Y\,\tfrac{1}{2} \max(0,\ m - D_w)^2 \qquad (2.8) \]

where \( D_w \) is the Euclidean distance between the feature vectors of the two inputs, \( m \) is the margin, and \( Y \) is the pair label (0 for a pair of the same type, 1 for a pair of different types).
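The Contrastive Loss can be sketched as below. The sketch assumes the common formulation in which a pair of the same type is penalized by half the squared distance and a pair of different types by half the squared shortfall from the margin. The label convention (0 for same type, 1 for different types) and the margin value are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(emb_1, emb_2, label, margin=1.0):
    """Contrastive loss for one pair; label 0 = same type, 1 = different types."""
    d_w = np.sqrt(np.sum((emb_1 - emb_2) ** 2))   # Euclidean distance
    same_term = (1 - label) * 0.5 * d_w ** 2
    diff_term = label * 0.5 * max(0.0, margin - d_w) ** 2
    return same_term + diff_term

a = np.array([0.0, 0.0])
b = np.array([0.5, 0.0])   # distance 0.5 from a
c = np.array([2.0, 0.0])   # distance 2.0 from a, beyond the margin

print(contrastive_loss(a, b, label=0))   # same type, pulled together: 0.125
print(contrastive_loss(a, b, label=1))   # different types, too close: 0.125
print(contrastive_loss(a, c, label=1))   # different types, far enough: 0.0
```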

When tackling a specific problem with neural networks, it is essential to evaluate the advantages and disadvantages. This section presents the preliminary findings on the Siamese Neural Network (SNN) to guide effective decision-making.

Advantages:

• Training an SNN requires a smaller dataset than traditional neural networks, thanks to approaches such as One-Shot Learning and Few-Shot Learning

• Data imbalance is hardly a concern

• The learning mechanism of an SNN differs from that of ordinary classifiers, hence a hybrid model is possible by combining an SNN with other classification models

• It learns from semantic similarity, since an SNN focuses on learning features in deeper layers, where similar features are placed close to each other

Disadvantages:

• Training takes longer since the similarity has to be evaluated for each pair

• The probability for each class is unknown, since an SNN outputs a distance rather than class probabilities

The primary concern for the SNN project has been its performance; however, as the project progresses and accuracy is ensured through specific evaluation metrics, it will be crucial to focus on optimization to achieve the best possible solution.

To evaluate the Siamese Neural Network (SNN), which compares the similarity between each pair of input data, one can use either:

• Extrinsic evaluation [13]: Compare the output of the SNN model with other models in terms of accuracy, e.g. matched/not matched face

• Intrinsic evaluation: Measure and assess the outcome of the SNN model in each training epoch via validation checkpoints and a verification threshold (the threshold is a decisive factor)

However, with these types of evaluation, it remains ambiguous which method is applicable to this model or to other models. This is the main reason why Precision and Recall should be considered.

Precision, also known as positive predictive value, measures the proportion of relevant instances among the instances retrieved, while Recall assesses the fraction of relevant instances that were successfully retrieved. Both metrics are fundamentally rooted in the concept of relevance. In the context of classification tasks, the key terms are true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

The confusion matrix, illustrated in Figure 2.9, compares a classifier's results against trusted external judgments. In this context, "positive" and "negative" refer to the predictions made by the classifier, while "true" and "false" indicate whether these predictions align with actual observations. The key metrics for evaluating classifier performance, namely Precision, Recall, and Accuracy, are derived from these concepts.

Figure 2.9: A confusion matrix and its actual denotation
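Using the standard definitions, the metrics can be computed from the four confusion-matrix counts as sketched below. The counts in the example are invented for illustration and are not experimental results of this thesis.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, Recall, F1 and Accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Hypothetical counts: 8 faces correctly matched, 6 correctly rejected,
# 2 wrongly matched, 4 wrongly rejected.
metrics = classification_metrics(tp=8, tn=6, fp=2, fn=4)
print(metrics)
```

For a face verification system, a false positive (an impostor accepted) is usually the costlier error, which is why Precision is reported alongside Accuracy.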

Ensemble learning

Machine learning has become essential in addressing various challenges, including image classification, face recognition, and natural language processing. As the demand for advanced models grows, the need for optimal performance in producing accurate outputs from datasets becomes critical. However, the practicality of individual models, each with its own strengths and weaknesses, raises concerns about the efficiency of selecting the most suitable model for specific problems. In response, ensemble learning emerges as an appealing solution, combining multiple models to generate a final output based on the consensus of each model's predictions.

Ensemble learning demonstrates its effectiveness in various applications, as highlighted by de Condorcet's principle, which states that if each voter has a probability greater than 0.5 of being correct and the voters are independent, increasing the number of voters enhances the likelihood of the majority vote being accurate, ultimately approaching certainty.
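The principle above can be checked numerically. The sketch below computes the probability that a majority of independent voters is correct from the binomial distribution. It assumes an odd number of voters, so that ties cannot occur, and an individual accuracy of 0.6 chosen for illustration.

```python
from math import comb

def majority_correct_probability(n_voters, p):
    """Probability that more than half of n independent voters are correct."""
    majority = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p ** k * (1 - p) ** (n_voters - k)
               for k in range(majority, n_voters + 1))

sizes = [1, 3, 11, 101]
probs = [majority_correct_probability(n, 0.6) for n in sizes]
for n, prob in zip(sizes, probs):
    print(n, round(prob, 4))
```

The probability grows with the number of voters. The independence assumption is the demanding part in practice, since models trained on the same data tend to make correlated errors.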

Ensemble learning, while not originally a concept in machine learning, offers an intuitive approach to tackle complex problems by combining multiple models. Its success can be attributed to factors such as statistical advantages, computational efficiency, and enhanced representation learning, as well as the principles of bias-variance decomposition and strength-correlation.

Ensemble learning continues to generate concerns and speculation, particularly as machine learning and deep learning encounter issues such as vanishing and exploding gradients. These challenges may enhance the reliability of ensemble learning, as multiple models collaborate to support each other during instances of vanishing gradients. Numerous surveys in the literature have focused on ensemble learning, particularly in classification [19], regression [20], and clustering [21] problems. A thorough review of both classification and regression models has been conducted, alongside a comprehensive examination of ensemble methods and their associated challenges [22].

Ensemble learning, as depicted in Figure 2.10, involves the collaboration of multiple models (1, 2 to n) to analyze a single input. This approach aims to capture all aspects of the datasets, ultimately delivering the most relevant output tailored to address specific problem requirements.

Recent research has focused on the effectiveness of ensemble learning, which can be categorized into three main methods: bagging, boosting, and stacking.

Bagging, or bootstrap aggregating, is a key strategy for enhancing classification performance. It begins with a single input dataset and generates multiple sample datasets, known as bags, by sampling from it with replacement. Each of these bags is used to train a corresponding model, and the predictions of the models are then consolidated into a single set of predictions.
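A minimal sketch of bagging with majority voting is shown below. To keep the example self-contained, the base learner is a nearest-centroid classifier, which is an assumption made for brevity and not the model used in this thesis. The data are two synthetic, well-separated clusters.

```python
import numpy as np

def fit_centroids(X, y):
    """Base learner: store one centroid per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroids(model, X):
    """Assign each sample to the class of its nearest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def bagging_predict(X_train, y_train, X_test, n_bags=11, seed=0):
    """Train one base learner per bootstrap bag and combine by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)   # sample indices with replacement
        model = fit_centroids(X_train[idx], y_train[idx])
        votes.append(predict_centroids(model, X_test))
    votes = np.array(votes)                # shape: (n_bags, n_test)
    return np.array([np.bincount(column).argmax() for column in votes.T])

data_rng = np.random.default_rng(1)
X_train = np.vstack([data_rng.normal(loc=-2.0, size=(40, 2)),
                     data_rng.normal(loc=2.0, size=(40, 2))])
y_train = np.array([0] * 40 + [1] * 40)
X_test = np.array([[-2.0, -2.0], [2.0, 2.0]])

predictions = bagging_predict(X_train, y_train, X_test)
print(predictions)
```

Because each bag is trained independently of the others, the loop over bags can be run in parallel.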

Ensemble prediction outperforms single predictions across entire datasets, leading to improved results. The analysis of integrating multiple predictions will be discussed in relation to the voting capabilities of models later in this thesis.

Since its introduction in 1996, bagging has gained popularity for various applications. Notably, Kim applied bagging alongside support vector machines (SVM), training each dataset bag independently and determining the final decision through majority voting and least squares estimation. Additionally, another study utilized bagging with decision trees, combining outputs using the Kaplan-Meier curve for enhanced performance.

In 2006, the introduction of asymmetric bagging marked a significant advancement in addressing the imbalance issues faced by support vector machines, as highlighted by theoretical and experimental analyses of online bagging and boosting. By 2015, ensemble learning saw a shift towards majority voting, with notable contributions from researchers focusing on neural networks and neighborhood balanced bagging.

Table 2.1: The development of Bagging concept

[23] The idea of Bagging proposed

[30] Case study of bagging, boosting and basic ensembles

Decision trees and ensembling outputs via majority voting

[32] Study of Bayesian regularization, early stopping and Bagging

[24] Bagging with SVM’s and ensembling outputs via SVM’s, majority voting and least squares estimation

[33] Theoretical justification of Bagging, proposed subagging and half subagging

[25] Bagging with decision trees and ensembling outputs via Kaplan–Meier curve

[27] Theoretical and experimental analysis of online bagging and boosting

SVM’s and ensembling outputs SVM’s

[34] Roughly balanced bagging on decision trees and ensembling outputs via majority voting

[29] Bagging with neural networks and ensembling outputs via majority voting

[35] Neighbourhood balanced bagging ensembling outputs via majority voting

The bagging method in ensemble learning enables parallel computation, optimizing hardware usage to save time and costs while enhancing the accuracy and quality of the ensemble output.

The second method, known as boosting, aims to enhance the learning model by constructing a robust classifier from initially weak classifiers. This technique operates in a sequential manner, where each subsequent model learns from the errors of the previous one. The process continues until the performance exceeds a predefined threshold or a set limit of models is reached. Popular boosting methods such as AdaBoost and Gradient Boosting are widely used to improve machine learning performance. AdaBoost minimizes misclassification loss through a greedy approach by weighting predictors at each iteration, while Gradient Boosting applies a similar strategy to arbitrary differentiable loss functions.

Table 2.2: The development of Boosting concept

[38] Boosted deep belief network (DBN) as base classifiers for facial expression recognition

[39] Decision trees as base classifiers for binary class classification problems

[40] Decision trees as base classifiers for multiclass classification problems

[41] Ensemble of CNN and boosted forest for edge detection, object proposal generation, pedestrian and face detection

[43] CNN Boosting applied to bacteria cell images and crowd counting

[44] Boosted deep independent embedding model for online scenarios

[45] Transfer learning based deep incremental boosting

[46] Boosting based CNN with incremental approach for facial action unit recognition

[47] Deep boosting for image denoising with dense connections

[47] Deep boosting for image restoration and image denoising

[48] Hierarchical boosted deep metric learning with hierarchical label embedding

Stacking in ensemble learning is an integration technique that combines the outputs of baseline models to achieve optimal predictions. Originally proposed by [50], this method involves randomly splitting the input dataset into equal parts for training at different levels. Each level's predictions contribute to the creation of a meta-model, which ultimately makes the final decision. Notable network architectures utilizing stacking include [51], which features a stacking-based deep neural network (S-DNN) trained without backpropagation, and [52], which presents a model combining a conditionally restricted Boltzmann machine with a deep neural network, resulting in significant performance improvements with fewer training datasets.

Ensemble learning leverages a voting approach to arrive at final decisions, with the most common method being the averaging of predictions. This technique effectively addresses issues related to bias and variance, enhancing overall performance while balancing the contributions of different models. Predictions can be averaged using softmax or directly based on their respective probabilities or outputs.

The probability outcome \( P_{ij} \) for the i-th unit on the j-th base learner is determined by the formula \( P_{ij} = \frac{\exp(O_{ij})}{\sum_{k=1}^{K} \exp(O_{kj})} \), where \( O_{ij} \) represents the output of the i-th unit from the j-th base learner, and \( K \) denotes the total number of classes. This equation illustrates how the outputs from multiple learners contribute to the overall probability distribution across different classes.
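The softmax averaging described above can be sketched as follows. The output matrix is a made-up example with three classes and three base learners. Subtracting the column maximum before exponentiation is a standard numerical safeguard and does not change the result.

```python
import numpy as np

def softmax_over_classes(outputs):
    """outputs[i, j] is the output of unit (class) i from base learner j."""
    shifted = outputs - outputs.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

# Hypothetical raw outputs: rows are classes, columns are base learners.
outputs = np.array([[2.0, 1.0, 0.5],
                    [1.0, 3.0, 0.2],
                    [0.1, 0.2, 0.1]])

probs = softmax_over_classes(outputs)    # P_ij, each column sums to 1
ensemble = probs.mean(axis=1)            # average over the base learners
predicted_class = int(np.argmax(ensemble))
print(ensemble, predicted_class)
```

In this example the first and third learners favour class 0 and only the second favours class 1, yet the averaged distribution selects class 1 because the second learner is far more confident. This shows how averaging probabilities differs from a plain majority vote.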

RELATED WORKS

Global feature support

The initial method for face recognition, known as "Global Feature Support," emphasizes analyzing the entire face rather than its individual components. This approach focuses on key facial elements, such as the eyes, nose, and mouth, along with their predefined relationships, without dividing the face into separate parts. Within this framework, both appearance-based and model-based feature extraction techniques are defined.

Feature extraction at this level involves transforming input data into a lower-dimensional space through statistical methods, making it effective for handling variations in facial features, occlusions, and emotional changes without focusing on the structural aspects of the face. This appearance-based feature extraction can be categorized into three types: linear methods like Principal Component Analysis (PCA), non-linear methods such as kernel Principal Component Analysis, and multilinear approaches like generalized Principal Component Analysis.

The model-based approach is largely considered to make the most of global feature support, since it derives features from the geometrical characteristics of the face.

The model-based approach to facial recognition is less sensitive to changes in appearance over time, as it relies on the structural information of the face. This method can be categorized into two types: graph-based techniques, such as Elastic Bunch Graph Matching, which represent the face using a graph of nodes and edges, and shape-based techniques, like the 3D Morphable Model, which visualize the face in a three-dimensional format.

Local feature support

Local feature support differs from global feature support by segmenting the face or its components into multiple regions. This approach focuses on the size and topology of these areas while neglecting the relationships between them, utilizing both learning-based and hand-crafted methodologies.

The approach currently prevalent in the data science industry leverages the advancements in deep neural networks, particularly Convolutional Neural Networks (CNNs), which demonstrate enhanced robustness against facial variations. This method requires careful initialization, training, hyperparameter tuning, and optimization, establishing it as the state-of-the-art solution. Learning-based techniques in this domain can be classified into categories such as one-shot learning, exemplified by Siamese Neural Networks that predict unseen classes from very few examples, dictionary learning (Kernel Extended Dictionary), decision trees (Decision Pyramid) [67], regression (Logistic Regression) [68], and Bayesian methods (Bayesian Patch Representation) [69].

The hand-crafted solution to feature extraction derives features by processing the scales or frequencies of the visual information. One major example of this method is the Apriori algorithm [70], which performs the elementary extraction to gain profit for the supermarket.

Interestingly, this face recognition strategy has significantly lower computational complexity than the other methods, since no training is needed.

One-shot learning

One-shot learning emerged as a solution for scenarios with limited data and high computational costs in machine learning. This approach focuses on learning object categories from just a single training example, inspired by humans' remarkable ability to quickly acquire and recognize new patterns. Consequently, this has motivated researchers to further explore the potential of one-shot learning.

Table 3.1: The summary of recent works relating to one-shot learning

Reference Year Task Classification method

[71] 2012 Path planning algorithms for robots

One-shot learning is currently underutilized, but ongoing research efforts are paving the way for its optimization in deep learning applications. This approach holds promise for enhancing face recognition technologies, providing valuable insights for further advancements in the field.

Discussion

To decide whether the Global feature support or the Local feature support orientation is more suitable for the thesis, we first look at:

- General advantages and disadvantages of the Global feature support:

• Retains the most relevant information of the face

• Makes use of geometrical characteristics and dimensionality reduction techniques

• Balances structural and face variations

- General advantages and disadvantages of the Local feature support:

• Offers robustness against facial variations

• Includes the state-of-the-art face recognition method

• Identifies the relations of face components by modeling and learning (no presets required)

In selecting an effective method for the time-keeping application, the author concluded that utilizing local feature support is ideal, as recognizing faces from various angles relies more on the facial components than on overall structure. Since time-keeping typically requires only a single authentication attempt, employing one-shot learning through a Siamese Neural Network and Ensemble Learning presents a promising solution to address this challenge, which will be elaborated on in the following chapter of this thesis.

THE PROPOSED MODEL AND IMPLEMENTATION

Reference model

The outcome and success of the face recognition in this thesis are based on the pre-trained Siamese Convolutional Neural Network from the paper “Siamese Neural Networks for One-shot Image Recognition”. The reference model is built from twin neural networks. Each network comprises L layers, each with \( N_l \) units, where \( \mathbf{h}_{1,l} \) denotes the hidden vector in layer \( l \) of the first twin and \( \mathbf{h}_{2,l} \) denotes the same for the second twin.

The model employs the rectified linear (ReLU) activation function in the first L − 2 layers and the sigmoid function in the remaining layers. It consists of a sequence of convolutional layers, each using filters of varying size and a fixed stride of 1. The number of convolutional filters is specified as a multiple of 16 to enhance performance. Each convolutional layer applies the ReLU activation function to the output feature maps, optionally followed by a max-pooling layer with a designated filter size and a fixed stride of 2. Thus, the k-th filter map in each layer takes the following form:

\( a_{2}^{(k)} = \text{max-pool}\big(\max(0,\ \mathbf{W}_{l-1,l}^{(k)} \star \mathbf{h}_{2,(l-1)} + \mathbf{b}_{l}),\ 2\big) \) (4.1)

where \( \mathbf{W}_{l-1,l}^{(k)} \) is the 3-dimensional matrix representing the feature maps for layer \( l \) and \( \star \) is the convolutional operation. The output of the final convolutional layer is flattened into a single vector. This is followed by a fully-connected layer and then by the computation of the Euclidean distance metric between the Siamese twins. The resulting value is passed through a sigmoid function to produce the final output, as illustrated in Figure 4.1. The prediction vector is given by

\( p = \sigma\Big(\sum_{j} \alpha_{j} \big| \mathbf{h}_{1,L-1}^{(j)} - \mathbf{h}_{2,L-1}^{(j)} \big|\Big) \)

where \( \sigma \) is the sigmoid activation function and the parameters \( \alpha_{j} \) are learned during model training to weight the importance of the component-wise distances. This establishes a metric within the learned feature space of the (L − 1)-th hidden layer and scores the similarity between the two feature vectors.
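To make the prediction formula \( p = \sigma(\sum_j \alpha_j |\mathbf{h}_{1,L-1}^{(j)} - \mathbf{h}_{2,L-1}^{(j)}|) \) concrete, the following sketch evaluates it on two small feature vectors. It is only an illustration: the function name `similarity`, the feature values, and the weights are hypothetical, and in the real model the weights \( \alpha_j \) are learned during training.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def similarity(h1, h2, alpha):
    """Compute p = sigmoid(sum_j alpha_j * |h1_j - h2_j|) for two feature vectors."""
    if not (len(h1) == len(h2) == len(alpha)):
        raise ValueError("h1, h2 and alpha must have the same length")
    weighted = sum(a * abs(x - y) for a, x, y in zip(alpha, h1, h2))
    return sigmoid(weighted)

# Hypothetical feature vectors from the two twins and hypothetical weights.
# Negative weights make larger feature differences lower the score.
alpha = [-2.0, -1.5, -3.0]
same = similarity([0.2, 0.9, 0.4], [0.2, 0.9, 0.4], alpha)
different = similarity([0.2, 0.9, 0.4], [0.8, 0.1, 0.9], alpha)
```

With no bias term, as in the formula as written, identical feature vectors always give p = 0.5, so in practice the learned weights (and any bias in the final layer) determine where the decision threshold sits.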

Figure 4.1: The complete reference model for Face Recognition

Datasets and pre-process

4.2.1 Labeled Faces in the Wild (LFW) datasets

This thesis utilizes the Labeled Faces in the Wild database, which comprises over 13,000 images of faces sourced from the web, each labeled with the corresponding individual's name. This dataset is specifically designed to facilitate research in unconstrained face recognition.

The dataset includes 1680 individuals who are each represented by two or more distinct facial images, with each photo focused on a single face. Each pixel in the RGB color channels is encoded as a float within the range of [0, 1]. The images are cropped to a size of 100 x 100 pixels, so the surveillance camera images must be pre-processed (resized) before training. Additionally, the datasets are organized into designated folders for anchors, positives, and negatives.

• Anchors: a small and pre-defined number of the matched person images

• Positives: a dataset of the matched person images

• Negatives: a dataset of non-matched people images
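The anchor, positive, and negative organization above lends itself to labeled training pairs for the twin network. The sketch below is an assumption about how such pairs could be built, not the thesis code: the file names are hypothetical, and pairing every anchor with every positive and negative is only one possible scheme. Resizing to 100 x 100 would require an image library and is omitted here.

```python
from itertools import product

def scale_pixel(value):
    """Map an 8-bit channel value (0..255) to a float in [0, 1]."""
    return value / 255.0

def build_pairs(anchors, positives, negatives):
    """Label (anchor, positive) pairs with 1 and (anchor, negative) pairs with 0."""
    pos_pairs = [(a, p, 1) for a, p in product(anchors, positives)]
    neg_pairs = [(a, n, 0) for a, n in product(anchors, negatives)]
    return pos_pairs + neg_pairs

# Hypothetical file names standing in for the contents of the three folders.
anchors = ["anchor/a1.jpg", "anchor/a2.jpg"]
positives = ["positive/p1.jpg", "positive/p2.jpg", "positive/p3.jpg"]
negatives = ["negative/n1.jpg", "negative/n2.jpg"]
pairs = build_pairs(anchors, positives, negatives)
```

Each tuple carries the two image paths fed to the twins and the target label used by the binary cross-entropy loss.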

Figure 4.2: Labeled Faces in the Wild datasets

4.2.2 Masked Labeled Faces in the Wild (MLFW) datasets

Current face recognition datasets may be insufficient due to the widespread use of masks for public health safety. When adapting a model for this new challenge, it's crucial to focus on the dataset and its performance. Initially, I considered collecting my own masked datasets using a front-facing camera. However, the Siamese Neural Network requires a twin network setup, necessitating a robust dataset for learning false mask instances. Having previously tested my model against the LFW dataset, I recognized the potential to replicate this structure with a new dataset. This led to the adoption of the Masked LFW (MLFW) dataset, which includes 12,000 images, as illustrated in Figure 4.3.

The COVID-19 pandemic has led to widespread mask usage, posing challenges for existing face recognition systems that struggle with masked faces. To address this issue, researchers developed a tool to automatically generate masked faces from unmasked images, creating a new database known as Masked LFW (MLFW) based on the Cross-Age LFW (CALFW) database. This tool ensures that the masks visually align well with the original faces and includes a variety of mask templates that reflect common styles encountered in daily life, resulting in diverse generation effects. Additionally, the researchers designed three combinations of face pairs to simulate realistic scenarios, driven by three key motivations for establishing the MLFW benchmark.

• Establishing a relatively more difficult database to evaluate the performance of masked face verification so the effectiveness of several face verification methods can be fully justified

MLFW highlights that, despite age differences, individuals with the same identity may present varying appearances, while those with distinct identities can share similar appearances. This observation underscores the significant intra-class variance and minimal inter-class variance within the dataset.

The MLFW protocol enables effective evaluation of masked face verification by maintaining data size and providing consistent benchmarks for identity verification, similar to CALFW. This ensures reliable performance assessment in face verification tasks.

Figure 4.3: MLFW is constructed by adding masks to the images in LFW, with perturbation to achieve diverse generation effects

Application architecture

The model training process begins when a company admin requests a new hire to pose for multiple photos, which are then preprocessed by resizing and organizing into folders labeled as anchors and positives. These processed images form datasets that are utilized by the Siamese Neural Network for training. Once the model is trained specifically for that employee, it is saved in a PostgreSQL database. Importantly, each new hire is trained independently, ensuring that the Siamese Neural Network recognizes only that individual, separate from other employee models.

The verification process begins when an individual stands in front of the door and presses a button to capture their facial photo. This image undergoes preprocessing, including resizing, before being analyzed by an authentication job that utilizes pre-trained models to determine the highest verification score. Based on this evaluation, access is either granted or denied, and the door lock is adjusted accordingly. If the individual is successfully identified, their verification time is recorded in the database for tracking purposes. The door will remain open for a specified duration before automatically engaging a force lock once the time limit is reached.
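The decision step of the verification flow above can be sketched as follows. This is a simplified illustration under assumptions: the function name `verify` and the score dictionary are hypothetical, each score stands for the output of one employee's trained model on the captured photo, and the threshold of 0.9 is the detection threshold reported in the parameter settings of this thesis.

```python
def verify(scores, threshold=0.9):
    """Return (granted, name) given a mapping of employee name to verification score.

    The employee whose model yields the highest score is accepted only if
    that score reaches the threshold; otherwise access is denied.
    """
    if not scores:
        return False, None
    name = max(scores, key=scores.get)
    if scores[name] >= threshold:
        return True, name
    return False, None

# Hypothetical scores produced by three per-employee models for one photo.
granted, who = verify({"alice": 0.95, "bob": 0.40, "carol": 0.12})
denied, nobody = verify({"alice": 0.55, "bob": 0.40})
```

A granted result would then trigger the door unlock and the insertion of a time-in record, while a denied result leaves the lock engaged.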

Figure 4.6: The admin portal for time-keeping boards and visualizations

Administrators should have the capability to view or export tracking boards, collect photos from new hires, and initiate the training process. Time-keeping records will be calculated based on the difference between check-in and check-out times within a 24-hour period and will be maintained for a predetermined number of months.

In today's application landscape, databases play a vital role in securely storing and managing records, enhancing customer privacy. Enterprises depend on databases to understand data relationships and ensure authorized access to necessary information. Given these factors, my application requires a Database Management System (DBMS) to effectively manage time-keeping records, with PostgreSQL emerging as a promising database solution.

PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance (see footnote 5).

PostgreSQL is a unique relational database that operates without a dedicated owner, allowing enterprises to fully control their Database Management System (DBMS). This autonomy enables businesses to manage database construction, continuous deployment, and permission settings effectively. Additionally, essential PostgreSQL services like Superuser Access and Maintenance Windows are offered by the Global Development Group.

Community-driven development has fostered a vibrant PostgreSQL user community, leading to the creation of numerous high-quality extensions and applications that enhance the core software's functionality. These community-developed tools enable organizations to effectively manage PostgreSQL servers, generate business intelligence reports, handle diverse data types, and integrate PostgreSQL with various programming languages and platforms, including Linux, Mac OS X, Windows, and Ubuntu.

5 https://www.postgresql.org/about/

6 https://www.postgresql.org/community/contributors/

PostgreSQL is a powerful and feature-rich relational database management system (RDBMS) that excels in implementing core relational features while also expanding beyond traditional boundaries. Although no single database can fulfill every requirement, PostgreSQL is an excellent option that is versatile enough to suit many use cases.

Streamlit is a free, open-source framework that enables machine learning engineers to quickly build and share visually appealing web applications for data science. Designed specifically for those familiar with artificial intelligence and big data, Streamlit simplifies the process of creating applications with minimal coding, making it accessible for data scientists who may not have extensive web user interface experience.

Streamlit is the easiest way, especially for people with no front-end knowledge, to put their code into a web application:

• No front-end (HTML, Javascript, CSS) experience or knowledge is required

• Beautiful machine learning or data science app can be created in only a few hours or even minutes

• Compatible with the majority of Python libraries (e.g., pandas, matplotlib, seaborn, plotly, Keras, PyTorch, SymPy (LaTeX))

• Less code is needed to create amazing web apps

• Data caching simplifies and speeds up computation pipelines

On the other hand, Streamlit has some drawbacks:

• While not complicated, Streamlit does require some time to learn its own syntax

• Streamlit is not very flexible: it is based solely on Python, offers a limited set of widgets, and does not integrate with Python notebooks

• The data upload limit is only 50MB

• There is limited support for video and animation

Flask is a lightweight micro web framework developed in Python that does not include built-in tools or libraries, such as a database abstraction layer or form validation. Instead, it relies on the extensive support of third-party libraries for common functionalities. The vibrant Flask community contributes numerous extensions that enhance its capabilities, enabling the development of complex web user interfaces, including features like biometric authentication, form data handling, object-relational mapping, and file uploads.

Its advantages are:

• Scalable from small to big applications

• Flexible across a wide range of scenarios

• Easy to coordinate between front end and back end

• Documentation is well maintained on the Flask homepage

Its drawbacks are:

• Not a lot of tools are offered

• Difficult to get familiar with a larger Flask application

Proposed Model

Face recognition technology has evolved beyond simple biometric authentication to encompass privacy protection and adaptability in various contexts. Researchers have developed numerous architectures and pre-trained models, such as VGG-Face, Google FaceNet, Facebook DeepFace, and ArcFace, to meet increasing consumer demands. The Covid-19 pandemic has further highlighted the need for adaptable face recognition systems, leading to advancements like DeepMaskNet and MobileNet, which deliver impressive results even with face coverings. Apple's introduction of mask-on capabilities for Face ID underscores the significance of occlusion in contemporary authentication methods. However, challenges persist, particularly in training models on both masked and unmasked datasets, as well as the impracticality of requiring users to confirm their mask status for verification.

Ensemble learning allows for independent training of models for mask and no-mask recognition, ultimately combining their outputs for a final decision. The choice lies between using multiple distinct models for enhanced accuracy or a single model trained on diverse datasets. Among various face recognition approaches, the Siamese Neural Network emerges as a strong candidate due to its simplicity and one-shot learning capability, which is ideal for scenarios where only a few user photos are available for training. Additionally, the application of Euclidean distance is effective in accurately capturing the essential features of partially masked faces.

Once everything is set up, the application should be organized in such a way that all the necessary timekeeping information stored in PostgreSQL appears on the UI.

Experiments were initially conducted to identify the optimal parameters that would enhance the performance of the proposed model, with some parameters also provided alongside the reference model in the original paper.

Table 4.1: The originally given parameters

Original parameter    Original value
epoch                 200
batch size            128
image size            35x35
verification          400 (images)

To achieve optimal performance for the reference model, the number of epochs was set to 200 with a batch size of 128. However, because the face datasets are more complex than the MNIST datasets and facial features significantly impact model success, the batch size and epochs were adjusted to 16 and 100, respectively, to manage training time without early stopping. The image size was increased to 100x100 pixels, enhancing the processing of human faces compared to smaller character images. While the reference paper utilized 400 images for verification, my model selects only 10 images to accommodate time constraints in real-world face recognition. The training process begins with 300 selected images to optimize the model efficiently while adhering to gradient descent principles. The Adam optimizer was chosen with a learning rate of 1e-4, and additional parameters include a detection threshold of 0.9, with input images strictly cropped to 100x100 pixels for improved analysis. These configurations will be applied to both the LFW and MLFW datasets.

Parameter              Value
trainable params       38,964,545
batch size             10
epoch                  30
learning rate          1e-4
optimizer              Adam
detection threshold    0.9
image size             100x100
loss function          Binary Cross Entropy loss

This section presents a performance comparison of various baseline models in face recognition, utilizing the same LFW and MLFW datasets for training, validation, and testing to ensure a fair evaluation. The selected methods include FaceNet, the advanced Pairwise Differential Siamese Network (PDSN), and the proposed Siamese Neural Network model. The precision achieved by each model post-training serves as the key metric for comparison. A summary of the training, validation, and testing image sets utilized across these models is provided in Table 4.3.

Table 4.3: Overview of training, validation and testing image set

Siamese Neural Network + Ensemble Learning (our model)

Table 4.4 presents a comparison of model performance, focusing on post-training precision evaluated against 40 images from the LFW and MLFW testing sets. Notably, the Siamese Neural Network experiment differs from others by utilizing four distinct models, each representing a unique aspect of the analysis.

The use of ensemble learning in conjunction with a Siamese Neural Network, primarily designed for non-masked face recognition, achieves a precision of 70% for both masked and unmasked face recognition. In comparison, FaceNet, based on Inception-Resnet V1, attains a precision of 77.5% when trained with MLFW datasets. Notably, the state-of-the-art PDSN reaches an impressive 97.5% precision by utilizing pairs of images from both LFW and MLFW datasets, although this high accuracy comes at the cost of increased complexity and computational resources, as detailed in Table 4.5.
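Since precision is the metric used for this comparison, a short reminder of how it is computed may help. The sketch uses the standard definition, precision = TP / (TP + FP); the thesis does not spell out its exact evaluation procedure, so the function name `precision` and the labels below are purely hypothetical.

```python
def precision(y_true, y_pred):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        return 0.0  # no positive predictions at all
    return tp / (tp + fp)

# Hypothetical ground truth (1 = same person) and model decisions for five test pairs.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = precision(y_true, y_pred)  # two true positives, one false positive
```

For a door-access system, a low precision means that strangers are being accepted as employees, which is why this metric matters more here than raw recall.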

Table 4.4: Summary of performance outcome on different face recognition baselines. “#Models” is the number of models used in the method for evaluation.

In machine learning, computational efficiency plays a crucial role in evaluating performance, encompassing both the cost and complexity of models during training and testing. Our primary objective is to achieve high prediction accuracy while minimizing computational costs, which is vital for maintaining privacy and security in office environments. This study focuses on optimizing models for small office capacities, highlighting the cost-effectiveness and accuracy of using Convolutional Neural Networks (CNN) as the backbone of a Siamese Neural Network, particularly when trained with masked-face datasets. The results of this experiment will demonstrate the advantages of this approach over other advanced models like Inception and ResNet.

# Methods Backbones #Models Datasets Training

MLFW 600 44 20 0.700

A single model is trained specifically for a small group of employees. Table 4.5 presents the model-training time per epoch and testing time per input image for each of the 20 classes analyzed, with the fastest results highlighted in bold and the second-best outcomes underlined.

Table 4.5: Comparison of model-training and model-testing time in seconds of each epoch for different face recognition models

During the model-training phase, FaceNet required 54 seconds for the MLFW dataset, while PDSN took around 72 seconds. In contrast, the Siamese Neural Network combined with Ensemble Learning trained a small group of employees in just 12 seconds per epoch, highlighting the efficiency of re-training when adding new classes. In testing, FaceNet demonstrated the lowest cost with reasonable precision, despite being trained solely on masked face datasets. Our model further reduced the testing time to under 1 second per image, while PDSN took over 1 second, both achieving high precision in identification.

While the performance and accuracy of the model are promising, training individual models for each employee becomes impractical as businesses expand and hire more staff. Additionally, CNNs struggle with variations in human facial features due to factors like masks, glasses, and hats. Future research should focus on incorporating multisource training data and exploring ensemble learning, which enhances generalization by combining multiple models. Finally, it's crucial to identify strategies that reduce the time and computational costs associated with the PDSN model to improve efficiency.

Ablation study

The initial evaluation of the Siamese Neural Network for face recognition while wearing masks raised concerns about the practicality of training a single model for up to five classes. To address this issue, I conducted two additional experiments to explore the effects of increasing the number of classes trained per model and utilizing different datasets. The goal was to optimize the proposed model for improved precision and to enhance its capacity for accommodating more classes during individual model training.

Training and testing the Siamese Neural Network on image datasets with 10 and 15 classes revealed a notable decline in precision as the number of classes increased. Specifically, the model achieved a precision of 0.8 when trained with 5 classes, but this dropped to 0.5 for 10 classes and further decreased to 0.2 for 15 classes. This reduction in precision can be attributed to the architecture of the Siamese Neural Network, which relies on a CNN backbone, leading to signal attenuation through each layer during training and resulting in minimal changes to the loss over time.

Recent advancements in CNN models, such as Inception and ResNet, utilize skip connection techniques to link each layer with others, effectively preserving crucial information and gradients that might be lost through sequential processing. Without such connections, the Siamese Neural Network remains best suited to small groups, like departmental teams, where it achieves a commendable precision of 0.8, which is essential for ensuring security and safety within enterprises.

Training Set Testing Set Loss Precision

The MaskedFaceNet dataset [85]. In search of relevant datasets for training the Siamese Neural Network on masked human faces, the MaskedFaceNet dataset was also considered. This dataset is widely used in tasks such as face mask detection and the prediction of mask removal [86]. Nevertheless, it is mostly used for face mask detection rather than face recognition, hence MLFW was chosen for training instead.

Time-keeping application

Effective timekeeping is crucial for businesses aiming to achieve employee milestones, yet many still face challenges with traditional methods. Accurate tracking of work hours is essential for invoicing, managing overtime, and handling days off. Despite the reliance on spreadsheet applications like Microsoft Excel and Mac OS X Numbers, enterprises often struggle with these outdated systems. To address these issues, a user-friendly interface for human resources administration is essential.

8 https://www.deepdetect.com/blog/15-face-masks-gan/

I decided to use the Streamlit UI library in a simple Python web application to visualize all the time-keeping records in a table.

Figure 4.7: Timesheet in the application

Employee time records will be individually logged in the database, serving as the baseline for calculating actual working hours and wages. Common scenarios include tracking time in and out for accurate payroll management.

The face recognition device will be strategically positioned at the entrance for convenient access and verification. Upon successful identification of an individual, the current date and time will be recorded in the database. Each entry will create a new record, with the time out field initially set to "null".

To log a time out from the workplace, an employee must enter their name and press the designated button. This action will capture the current date and time, updating the most recent time-in record or creating a new entry with the time in marked as "null".
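The time-in and time-out logging rules above can be sketched as follows. To keep the example self-contained, it uses Python's built-in in-memory SQLite as a stand-in for PostgreSQL; the table name `timesheet`, its columns, and the helper functions are hypothetical, and "most recent time-in record" is interpreted here as the latest record of that employee whose time out is still null.

```python
import sqlite3

def open_db():
    """Create an in-memory database with a hypothetical timesheet table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE timesheet ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, time_in TEXT, time_out TEXT)"
    )
    return conn

def check_in(conn, name, now):
    """Successful face verification: create a new record with time out left null."""
    conn.execute(
        "INSERT INTO timesheet (name, time_in, time_out) VALUES (?, ?, NULL)",
        (name, now),
    )

def check_out(conn, name, now):
    """Fill the latest open record, or create one with a null time in if none exists."""
    row = conn.execute(
        "SELECT id FROM timesheet "
        "WHERE name = ? AND time_in IS NOT NULL AND time_out IS NULL "
        "ORDER BY id DESC LIMIT 1",
        (name,),
    ).fetchone()
    if row:
        conn.execute("UPDATE timesheet SET time_out = ? WHERE id = ?", (now, row[0]))
    else:
        conn.execute(
            "INSERT INTO timesheet (name, time_in, time_out) VALUES (?, NULL, ?)",
            (name, now),
        )

conn = open_db()
check_in(conn, "alice", "2023-01-02 08:00")
check_out(conn, "alice", "2023-01-02 17:00")
check_out(conn, "alice", "2023-01-02 18:00")  # no open record left: time in stays null
rows = conn.execute("SELECT time_in, time_out FROM timesheet ORDER BY id").fetchall()
```

The second check-out produces a record with a null time in, which is exactly the kind of entry the exception rules are meant to flag for inspection.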

• Working hours: the sum of the differences between time out and time in within a single date

• Exception: an alert may be raised for inspection in the following scenarios:

1. Either time in or time out is missing from a record

2. Time in and time out are on different dates.
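The working-hours rule and the two exception scenarios above can be sketched as a small function. This is an illustrative sketch only: the function name `summarize` and the sample records are hypothetical, and each record is assumed to be a (time in, time out) pair for one employee.

```python
from datetime import datetime

def summarize(records):
    """Return (hours_per_date, exceptions) for a list of (time_in, time_out) pairs.

    Working hours are the summed differences between time out and time in
    within a single date; incomplete or cross-date records become exceptions.
    """
    hours = {}
    exceptions = []
    for time_in, time_out in records:
        if time_in is None or time_out is None:
            exceptions.append((time_in, time_out, "missing time in or time out"))
            continue
        if time_in.date() != time_out.date():
            exceptions.append((time_in, time_out, "time in and time out on different dates"))
            continue
        worked = (time_out - time_in).total_seconds() / 3600.0
        hours[time_in.date()] = hours.get(time_in.date(), 0.0) + worked
    return hours, exceptions

# Hypothetical records: two valid shifts on one day and two exceptional records.
records = [
    (datetime(2023, 1, 2, 8, 0), datetime(2023, 1, 2, 12, 0)),
    (datetime(2023, 1, 2, 13, 0), datetime(2023, 1, 2, 17, 30)),
    (datetime(2023, 1, 3, 9, 0), None),
    (datetime(2023, 1, 4, 22, 0), datetime(2023, 1, 5, 6, 0)),
]
hours, exceptions = summarize(records)
```

Here the two shifts on 2 January add up to 8.5 hours, while the record without a time out and the overnight record are returned as exceptions rather than counted.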

Multiple models for recognizing employees

In recent years, a significant concern in machine learning has been the duration of training activities. While accuracy and performance are crucial for model evaluation, there is a pressing need to minimize training time to reduce costs. Specifically, in the context of face recognition with masks, relying on ensemble learning makes it nearly impossible to train a single model that effectively captures the identity of an average individual.

The proposed solution involves training a model for each new employee upon their official onboarding, allowing for accurate identification by processing all departmental models after verification. This approach significantly enhances face recognition capabilities, especially in scenarios where masks are worn. Notably, Apple has introduced a mask verification feature for iPhone users, but it is limited to the device's owner.

Figure 4.8: Face ID with a Mask in an iPhone

Title	Authentication via deep learning facial recognition with and without mask and timekeeping implementation at working spaces
Author	Lê Đức Huy
Supervisors	Assoc. Prof. Quản Thành Thơ, Dr. Nguyễn Tiến Thịnh
University	Ho Chi Minh City University of Technology
Major	Computer Science
Type	Master’s thesis
Year of publication	2023
City	Ho Chi Minh City

Format
Number of pages	80
File size	1.36 MB