HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
GRADUATION PROJECT - MAJOR: AUTOMATION
BUI MINH TRI - 18151041
Ho Chi Minh City, August 2022
INTRODUCTION
Overview
As the economy rapidly evolves, businesses face challenges in managing human resources, with timekeeping becoming a top priority for accurate employee salary calculations. Traditional timekeeping methods, including paper cards, magnetic cards, biometric fingerprints, and facial recognition, are widely used in offices, agencies, and factories due to their convenience. These systems are integrated with a database that analyzes and stores all relevant information, allowing timely salary reports to be generated and money transfer orders to be sent to banks during salary reviews.
Figure 1.1 An attendance machine using magnetic cards and biometric fingerprints
Neglecting the importance of honesty in the aforementioned systems can lead to significant issues. Such time-tracking methods merely record check-in and check-out times, allowing unmotivated individuals to exploit policy gaps and evade their responsibilities during work hours.
Designing an automatic student attendance system faces challenges similar to addressing student truancy. While previous research has touched on the issue, it has not fully resolved the underlying problems and often relies on complex hardware for real-time functionality.
Aims and objectives
Research, design, and construct an automatic student attendance and monitoring system using facial recognition that records check-in and check-out times
Build a database for exchanging information with the system
Design a graphical user interface (GUI)
Run the entire system in real-time on the Jetson Nano board
Limitation
The system is designed at the scale of a simulated lesson with 10 students in the dataset, under full-light conditions and with a fixed camera position
The project is built on face recognition technology, so it can only recognize correctly when facial features are not heavily occluded by items such as face masks and sunglasses
The distinction between real and fake faces is ignored
Assume that all students are moving at an average speed.
Thesis structure
The structure of this thesis is arranged as follows:
This chapter introduced the topic, the objectives, the limitations, the related works of the research, and the layout of this thesis.
LITERATURE REVIEW
Convolutional neural network
The emergence of convolutional neural networks (CNNs) has significantly transformed the machine learning landscape, particularly in tasks such as detection, classification, and recognition. CNN architecture is designed to process image inputs, enabling the encoding of specific features within the model. A typical CNN consists of three main types of layers: convolutional layers, pooling layers, and fully-connected layers. The first two layers focus on feature extraction, while the final layer is responsible for classification. Figure 2.1 illustrates a standard convolutional neural network.
Convolution, the first layer to extract features from the input image, maintains relationships between pixels by learning image features using small squares of input data.
Convolution is a process that involves two inputs: an image matrix and a filter or kernel. This technique allows for various operations, including edge detection, blurring, and sharpening, by merging an image with different filters. The size of the kernel, typically 3x3 or 5x5, significantly affects the receptive fields, feature extraction capabilities, convolutional speed, and weight sharing. An illustration of convolution in CNNs is provided in Figure 2.2.
In a convolutional block, the stride, as illustrated in figure 2.3, is a crucial parameter of the CNN filter that controls the movement across the image. An increase in stride results in a reduction of the encoded output volume, which limits the number of layers in the CNN model and hinders the construction of the desired deep networks.
Figure 2.2 Convolution in the convolutional layer
Figure 2.3 An illustration of convolution with stride equal to 1 in CNN
Sometimes, the kernel does not scan through the input image owing to dimensional conflict. We can add zeros to the borders of the image (which is known as padding).
Figure 2.4 An example using zero padding
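To make the interplay between kernel size, stride, and zero padding concrete, here is a minimal NumPy sketch of the convolution operation described above; the 28x28 input and the edge-detection kernel are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2-D convolution: slide the kernel over the padded image."""
    if padding:
        image = np.pad(image, padding)           # zero padding on all borders
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1  # output size formula
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # weight sharing: same kernel everywhere
    return out

edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])           # a classic edge-detection filter
image = np.random.rand(28, 28)
print(conv2d(image, edge_kernel, stride=1, padding=1).shape)  # (28, 28)
```

With padding = 1, the 3x3 kernel can also cover the borders, so the spatial size is preserved; raising the stride to 2 would halve each output dimension.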
The Rectified Linear Unit (ReLU) is a widely used nonlinear activation function in CNNs. Its popularity stems from its computational simplicity, which effectively mitigates the vanishing gradient problem and enhances overall performance. Among activation functions, it is located immediately after the convolution layer. In activation, ReLU will set negative values to zero and keep non-negative values.
Figure 2.5 The graph of the ReLU function
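Written out, the activation described above is simply:
\[ \text{ReLU}(x) = \max(0, x) \]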
Using the ReLU activation function can lead to issues, including the absence of a derivative at zero and the potential for the function to output positive infinity. If weights are not initialized properly or the learning rate is excessively high, neurons may enter a "dead state," consistently receiving negative inputs and therefore always outputting zero.
Figure 2.6 An example of using ReLU
The pooling layer minimizes the number of parameters in large images, as illustrated in Figure 2.7. Also referred to as subsampling or downsampling, spatial pooling decreases the size of each map while preserving essential information. Various types of pooling methods are available.
In the CNN model, it is crucial to focus on two key aspects: location invariance and compositionality. When the same object is presented at varying degrees of transformation, such as translation, rotation, and scaling, the algorithm's accuracy will be remarkably affected. The pooling layer provides invariance to translation, rotation, and scaling, and can also reduce input dimensionality, limit overfitting, and reduce training time.
Figure 2.7 An example of using max pooling
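As a sketch of the max pooling just described, the following NumPy snippet halves each spatial dimension of a feature map with a 2x2 window; the 8x8 input is an arbitrary example.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max pooling with a size x size window and a stride equal to the window."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]   # crop so the dimensions divide evenly
    hc, wc = x.shape
    return x.reshape(hc // size, size, wc // size, size).max(axis=(1, 3))

feature_map = np.random.rand(8, 8)
print(max_pool2d(feature_map).shape)      # (4, 4): each dimension halved
```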
The final layer of a CNN model used for image classification is the fully connected layer, which transforms the feature matrix from the preceding layer into a vector that represents the probabilities of the objects to be predicted.
Figure 2.8 The fully connected layer in CNN
YOLOv4
You Only Look Once (YOLO) is a convolutional neural network (CNN) model designed for object detection, including the ability to detect multiple objects within a single image. It stands out for its remarkable speed, achieving near real-time performance without compromising accuracy compared to leading models. YOLO's primary goal is to predict labels for various objects while also accurately locating them, enabling the detection of multiple objects with different labels rather than just classifying a single label for an entire image.
In YOLO terminology, bounding boxes are frames that encircle objects, while anchor boxes serve as predefined size references for predicting these bounding boxes. The feature map is an output block divided into a grid of squares, which is used to search for and detect features within each cell. Additionally, non-max suppression is a technique used to eliminate overlapping bounding boxes, retaining only the one with the highest probability.
Understanding the YOLO output is crucial for configuring the correct parameters when training models using open-source platforms like Darknet. The output depth depends on the number of classes; when users apply 3 anchors per cell, it follows equation 2.1:
\[ \text{output} = (\text{num of classes} + 5) \times 3 \quad (2.1) \]
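For instance, for the two-class face-mask detector trained later in this thesis (with mask, without mask), equation 2.1 gives an output depth of (2 + 5) x 3 = 21 filters per YOLO output layer: each of the 3 anchors predicts 4 box coordinates, 1 confidence score, and 2 class probabilities.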
In the feature map, we choose three anchor boxes of varying sizes (box 1, box 2, and box 3), ensuring that their centers align with each cell. Consequently, YOLO's output consists of a concatenated vector representing these three bounding boxes, with their attributes illustrated in Figure 2.9.
Figure 2.9 The output tensor of YOLO
Like SSD, YOLOv4 makes predictions across multiple feature maps. Specifically, the smaller initial feature maps are effective for detecting large objects, while subsequent feature maps, which are larger in size, assist in predicting smaller objects, with anchor box sizes fixed in advance for each scale.
Figure 2.10 illustrates that while each feature map cell contains three anchor boxes, the total number of anchor boxes in an image is significantly higher, contributing to the slow training speed of the YOLO model. This is due to the simultaneous prediction of labels and bounding boxes. Additionally, the YOLO training process demands substantial RAM, limiting the batch size to prevent out-of-memory errors.
Figure 2.10 Multi-scale feature maps for detection. The output is 3 feature maps with different sizes: 13x13, 26x26, and 52x52, respectively.
To accurately locate objects, YOLO relies on predefined anchor boxes that surround the objects with precision. Subsequently, the regression bounding box algorithm fine-tunes these anchor boxes to generate predicted bounding boxes for the objects. This process is visually represented in Figure 2.11.
Figure 2.11 Anchor box in object detection
In the training image, each object is assigned an anchor box, and when multiple anchor boxes surround the object, the one with the highest Intersection over Union (IoU) with the ground truth bounding box is selected.
Each object in the training image is assigned to a cell on the feature map containing the object’s midpoint
Therefore, to identify an object, one needs to determine the two components associated with it (cell, anchor box).
YOLO's loss function, akin to SSD's, consists of two components: \(L_{loc}\), which quantifies the error in bounding box localization, and \(L_{cls}\), which assesses the error in the probability distribution of classes.
\[ L_{cls} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_{ij} - \hat{C}_{ij} \right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_{ij} - \hat{C}_{ij} \right)^2 + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in C} \left( p_i(c) - \hat{p}_i(c) \right)^2 \]
\(L_{loc}\) is the loss function of the predicted bounding box compared to the actual value.
The loss function \(L_{cls}\) quantifies the accuracy of the probability distribution. It consists of two components: the first measures the loss associated with predicting the presence of objects within a cell, while the second assesses the loss of the probability distribution when an object is indeed present in the cell.
▪ \(\mathbb{1}_{i}^{obj}\) takes the value 0 or 1 to determine whether cell \(i\) contains an object: it equals 1 if it does and 0 if it does not
▪ \(\mathbb{1}_{ij}^{obj}\) indicates whether the \(j\)-th bounding box of cell \(i\) is the bounding box of the predicted object
▪ \(C_{ij}\) is the confidence score of cell \(i\): \(P(\text{contains object}) \times IoU(\text{predicted bounding box}, \text{ground truth bounding box})\); \(\hat{C}_{ij}\) is the predicted confidence score
▪ \(C\) is the set of all classes in the dataset; \(p_i(c)\) is the conditional class probability that cell \(i\) contains an object of class \(c\); \(\hat{p}_i(c)\) is the predicted conditional class probability
To optimize the loss function for bounding box predictions, an adjustment coefficient \(\lambda_{coord}\) is introduced, while the factor \(\lambda_{noobj}\) reduces the weight of the loss for cells that do not contain any objects.
We rely on a transformation between the anchor box and the cell to predict the bounding box for an object
YOLOv2 and YOLOv3 are designed to predict bounding boxes that remain close to the center position. Allowing the predicted bounding box to be placed anywhere in the image, similar to region proposal networks, can lead to instability during model training.
In the context of object detection, an anchor box of size \((p_w, p_h)\) is positioned at a cell on the feature map whose top left corner is at \((c_x, c_y)\). The model predicts offsets \(t_x\), \(t_y\) and dimensions \(t_w\), \(t_h\), where the first two values represent offsets from the cell's top left corner, and the latter two scale the anchor box. These parameters determine the bounding box's center \((b_x, b_y)\) and size \((b_w, b_h)\) through the sigmoid and exponential functions, as described in equations (2.5) to (2.8):
\[ b_x = \sigma(t_x) + c_x \quad (2.5) \]
\[ b_y = \sigma(t_y) + c_y \quad (2.6) \]
\[ b_w = p_w e^{t_w} \quad (2.7) \]
\[ b_h = p_h e^{t_h} \quad (2.8) \]
Figure 2.12 Formula to estimate the bounding box from the anchor box. The outer dashed rectangle is the anchor box of size \((p_w, p_h)\).
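A minimal Python sketch of equations (2.5)-(2.8), assuming the offsets and anchor sizes are expressed in grid-cell units; the sample values are made up for illustration.

```python
import numpy as np

def decode_box(t, cell_xy, anchor_wh):
    """Map raw network outputs t = (tx, ty, tw, th) to a bounding box."""
    tx, ty, tw, th = t
    cx, cy = cell_xy                      # top-left corner of the cell
    pw, ph = anchor_wh                    # anchor box size
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    bx = sigmoid(tx) + cx                 # (2.5): centre stays inside the cell
    by = sigmoid(ty) + cy                 # (2.6)
    bw = pw * np.exp(tw)                  # (2.7): anchor scaled exponentially
    bh = ph * np.exp(th)                  # (2.8)
    return bx, by, bw, bh

print(decode_box((0.2, -0.1, 0.5, 0.3), cell_xy=(6, 4), anchor_wh=(3.0, 4.5)))
```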
The adjusted coordinates, which are based on the image's width and height, consistently fall within the range of [0, 1]. This ensures that when the sigmoid function is applied, the coordinates remain within these defined thresholds.
In figure 2.12, the coordinates of a bounding box are defined by the anchor box and its corresponding cell, ensuring that the bounding box prediction remains within these limits. This approach enhances the stability of the training process compared to YOLO version 1.
Tracking
Object Tracking is the problem of tracking one or more moving objects over time
It is a higher-level problem than object detection when the object being processed is not simply an image but a sequence of images
Tracking not only determines the bounding boxes but also takes care of a lot of different factors or noise such as:
▪ The ID of each object needs to be kept constant across frames
▪ When the object is obscured or disappears after a few frames, the system still needs to ensure that the correct ID is recognised when the object appears
▪ Issues related to processing speed to ensure real-time and high applicability
Object Tracking is divided into two main approaches:
▪ Single Object Tracking (SOT): Focuses on tracking a single object within a scene, regardless of the presence of multiple objects. The tracked object is identified during the initialization phase in the first frame of the video.
▪ Multiple Object Tracking (MOT): All objects that appear are tracked over time, it can even track new objects that appear in the middle of the video
Figure 2.25 The description of Meanshift
Besides this division, the methods for solving these problems are also categorized in a variety of ways; the most popular are:
▪ Online Tracking: While processing video, Online Tracking only uses the current frame and the immediately previous frame for tracking
▪ Offline Tracking: Offline methods usually use all frames of the video. Tracking methods are also divided by:
▪ Detection-based Tracking: Focuses on the close relationship between object detection and object tracking, thereby relying on detection results to track objects across frames
▪ Detection Free Tracking: Treats video as a type of sequential data and applies methods specific to sequences, such as RNNs and LSTMs
Meanshift is a machine learning clustering algorithm that automatically identifies clusters in data without requiring prior knowledge of the number of clusters, unlike K-Means. One notable application of Meanshift is in object tracking.
Meanshift is a straightforward concept that involves a set of points, such as a pixel distribution from histogram back projection. The method utilizes a small window, often circular, and aims to shift this window to the location with the highest pixel density, which corresponds to the maximum number of points.
In Figure 2.25, the blue window labeled "C1" has its center at "C1_o" and its centroid at "C1_r," which do not align. To address this, we shift the window so that the center of the new window matches the centroid of the previous one. This process continues until the center and centroid of the current window coincide, albeit with some error. Ultimately, this method leads us to the window with the most extensive pixel distribution, indicated in green as "C2."
To implement meanshift in OpenCV, it is essential to define the tracking target and compute its histogram for back projection onto each frame, facilitating the meanshift calculation. Additionally, the initial position of the tracking window must be specified. For the histogram, we focus on the Hue component, as Saturation and Value can be influenced by varying ambient light conditions.
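A hedged OpenCV sketch of this procedure; the video path and the initial window coordinates are placeholders, and only the Hue channel is used for the histogram, as noted above.

```python
import cv2

cap = cv2.VideoCapture("video.mp4")               # placeholder video source
ok, frame = cap.read()
x, y, w, h = 300, 200, 100, 120                   # hypothetical initial window

roi = frame[y:y+h, x:x+w]                         # the tracking target
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])   # Hue only
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    _, (x, y, w, h) = cv2.meanShift(back_proj, (x, y, w, h), term_crit)
```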
Optical flow tracks the object using image brightness variations over space and time at the pixel level. The algorithm focuses on obtaining displacement vectors for objects across frames.
The criteria of Optical flow:
Brightness consistency: Brightness around a small area is said to be nearly constant, although the location of the region may change
Spatial coherence: Neighboring points in the same scene will often belong to the same surface so there will be similar motions
Temporal persistence: Points will usually have a gradual movement
Once the specified criteria are met, the Lucas-Kanade method is employed to derive the velocity equations for specific, easily identifiable points. By utilizing these equations along with various prediction techniques, the method enables effective tracking of an object throughout the video.
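A minimal OpenCV sketch of Lucas-Kanade tracking under the three criteria above; the corner-detection parameters are common defaults, not tuned values from this thesis.

```python
import cv2

cap = cv2.VideoCapture("video.mp4")               # placeholder video source
_, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
p0 = cv2.goodFeaturesToTrack(old_gray, maxCorners=100,
                             qualityLevel=0.3, minDistance=7)  # easy-to-track points

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    p1, status, err = cv2.calcOpticalFlowPyrLK(old_gray, gray, p0, None,
                                               winSize=(15, 15), maxLevel=2)
    good_new, good_old = p1[status == 1], p0[status == 1]
    displacement = good_new - good_old            # per-point motion vectors
    old_gray, p0 = gray, good_new.reshape(-1, 1, 2)
```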
Object tracking performance is typically assessed using datasets such as the MOT Challenge and ImageNet VID. Key metrics to consider include multiple object tracking accuracy (MOTA) and multiple object tracking precision (MOTP), which are calculated using the following quantities.
FP (False Positive) is the total number of occurrences of an object being detected even though no object exists.
FN (False Negative) is the total number of times that an existing object was not detected.
ID Switches (IDSW) is the total number of times an object was assigned a new ID during the tracking video.
GT is the number of ground truth objects. d is the Euclidean pixel distance between two matched points (x1, y1) and (x2, y2).
\(C_t\) is the total number of matches made between the ground truth and the detection output.
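Using these terms, the standard definitions (presumably the equations referred to above) are:
\[ MOTA = 1 - \frac{\sum_t \left( FN_t + FP_t + IDSW_t \right)}{\sum_t GT_t} \]
\[ MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t C_t} \]
MOTA penalizes every miss, false alarm, and identity switch relative to the number of ground truth objects, while MOTP averages the localization distance d over all matches.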
Evaluating deep learning models
Model evaluation is essential for developing an effective deep learning model. It is important to analyze metrics to enhance accuracy and ensure the model aligns with its intended purpose. Four key terms should be prioritized during this process.
▪ True positives (TP): The model predicts positive, ground-truth is positive
▪ False positives (FP): The model predicts positive, ground-truth is negative
▪ True negatives (TN): The model predicts negative, ground-truth is negative
▪ False negatives (FN): The model predicts negative, ground-truth is positive
To build an effective classification model, it is essential to understand the accuracy, which represents the proportion of correctly predicted cases relative to the total number of cases. This ratio is crucial for evaluating the predictive performance of the model on a given dataset, as indicated by equation 2.12. A higher accuracy signifies a more reliable model.
Accuracy is a widely used metric for evaluating classification models due to its clear formula and straightforward interpretation. However, it has a significant limitation: it treats all labels equally, failing to account for the varying importance of different labels in certain tasks. As a result, accuracy may not be the best choice for evaluating models where label significance differs.
Precision, as defined by equation 2.13, measures the accuracy of positive predictions by calculating the proportion of true positive cases among all predicted positives. A higher precision indicates a more effective model in identifying positive instances.
Recall, like precision, has the same numerator but differs slightly in the denominator of its calculation formula. It serves as a key indicator for assessing the predictive performance on the positive group.
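In terms of the four counts above, the standard formulas (which equations 2.12 and 2.13 presumably state) are:
\[ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN} \]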
The F1 score is the harmonic mean of precision and recall, computed by equation 2.14. It is therefore more representative, since it assesses both precision and recall.
Following the convention, when precision = 0 or recall = 0, the F1 score = 0. We can prove that the value of the F1 score always lies in the range between precision and recall:
\[ \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \ge \min(\text{precision}, \text{recall}) \]
\[ \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \le \max(\text{precision}, \text{recall}) \]
Therefore, for cases where precision and recall are too different, the F1 score will balance both these magnitudes and help us make a more objective assessment
A confusion matrix, also known as an error matrix, is a structured table that visualizes the performance of a classification algorithm. It is a widely utilized technique for measuring the effectiveness of classification models. An example of a confusion matrix is illustrated in Figure 2.26.
The ROC is a curve representing the classification ability of a model as the decision threshold varies, as shown in figure 2.27. This curve is based on two indices:
The True Positive Rate (TPR), also referred to as recall or sensitivity, measures the ratio of correctly classified positive cases to the total number of actual positive cases. This metric assesses the accuracy of a model's predictions for the positive class, as defined by equation 2.19. A higher TPR indicates better predictive performance for the positive group; for instance, a TPR of 0.9 suggests that 90% of the samples in the positive group were accurately classified by the model.
\[ TPR = \text{recall} = \text{sensitivity} = \frac{TP}{\text{total positive}} = \frac{TP}{TP + FN} \quad (2.19) \]
The false positive rate (FPR) is defined as the ratio of incorrectly predicted positive cases to the total number of actual negative cases, calculated using equation 2.20. For instance, an FPR of 0.1 indicates that the model misclassified 10% of the negative cases. A lower FPR signifies a more accurate model, as it reflects fewer errors in predicting negative cases. Additionally, the complement of FPR is specificity, which measures the proportion of correctly identified negative cases among all actual negative cases.
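Written out (presumably the content of equation 2.20):
\[ FPR = \frac{FP}{FP + TN}, \qquad \text{specificity} = \frac{TN}{TN + FP} = 1 - FPR \]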
AUC is a number calculated from the receiver operating characteristic (ROC) curve to assess how good a model's classification ability is.
The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve ranges from 0 to 1, indicating the model's classification ability. A larger AUC value suggests that the ROC curve approaches the line \(y = 1\), signifying a better model. Conversely, if the ROC curve is near the diagonal line connecting the points (0, 0) and (1, 1), it indicates that the model performs similarly to a random classifier.
A more effective probability prediction model is achieved when the probability distribution graphs of negative and positive outcomes exhibit greater separation, resulting in a smaller overlap area and minimized error rates. Additionally, increased distance between the negative and positive probability distributions leads to a more convex ROC graph, with the convexity reflected in the size of the AUC area.
HARDWARE PLATFORM
Overall system
The Jetson Nano serves as the central controller, connecting to two cameras, a keyboard, and a mouse through USB ports for input. For output, it links to a speaker via Bluetooth and an HDMI screen through an HDMI port. Additionally, the Jetson Nano uses Wi-Fi to access the Firebase Realtime Database.
Furthermore, the database can update the values shown on the website.
The practical hardware resembles an electrical cabinet. With its compact size, as shown in figure 3.2, it is easy to install in any classroom.
SOFTWARE DESIGN
Face detection
YOLOv4-tiny is a lightweight version of YOLOv4, optimized for machines with limited computing resources. With a model size of around 16 megabytes, it can process roughly 350 frames per second on a Tesla P100 GPU, corresponding to an inference time of about 3 ms and positioning it among the fastest object detection models. Unlike YOLOv4, which has three YOLO heads, YOLOv4-tiny features only two, and it utilizes 29 pre-trained convolutional layers compared to YOLOv4's 137 layers.
YOLOv4-tiny offers an impressive speed advantage, operating at roughly eight times the frames per second (FPS) of YOLOv4. However, its accuracy on the MS COCO dataset is about two-thirds that of its larger counterpart. When tested on an RTX 2080Ti, YOLOv4-tiny achieves a mean Average Precision (AP) of 22.0%, with an AP50 of 42.0%.
In a real-time object detection environment, YOLOv4-tiny outperforms YOLOv4 with an impressive 1774 FPS when utilizing TensorRT, a batch size of 4, and FP16 precision. This demonstrates that faster inference times are prioritized over precision and accuracy, making YOLOv4-tiny the superior choice for applications requiring rapid processing.
Figure 4.1 The comparison between different YOLO versions
Leaky ReLU addresses the issue of "dying ReLU" by introducing a small slope for negative inputs instead of returning zero, as illustrated in figure 4.2. The slope coefficient is predetermined prior to training and remains constant throughout the training process.
Figure 4.2 The Leaky ReLU activation function
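Written out, with \(\alpha\) the small fixed slope (commonly between 0.01 and 0.1):
\[ f(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases} \]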
Object detection methods frequently utilize the Intersection over Union (IoU) loss function to update model weights. However, challenges arise when the predicted bounding boxes do not overlap with the ground truth bounding box, making it difficult to determine which prediction is superior. This lack of overlap complicates the process of adjusting weights to improve the accuracy of the predicted bounding boxes in relation to the ground truth.
Where: \(B\) is the predicted boundary box and \(B^{gt}\) is the ground truth bounding box.
GIoU addresses the issue by introducing a representative element for the area between the bounding boxes. Initially, the predicted bounding box expands to overlap with the ground truth, and subsequently it contracts to increase the IoU.
Where: \(C\) is the smallest box covering both \(B\) and \(B^{gt}\).
The DIoU metric addresses the limitations of GIoU by incorporating the distance between the centers of the bounding boxes, normalized through the element \(c\).
Where: \(b\) is the central point of \(B\) and \(b^{gt}\) is the central point of \(B^{gt}\); \(\rho\) is the Euclidean distance; \(c\) is the diagonal length of the smallest box covering the two boxes.
In YOLOv4, the author uses CIoU, which inherits the pros of the above loss functions and adds a parameter to help maintain the proportions of the bounding boxes.
Where: \(\alpha\) is the positive trade-off parameter; \(w\) and \(h\) are the width and height of the bounding boxes.
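To ground these definitions, here is a small Python sketch computing IoU and GIoU for two axis-aligned boxes in (x1, y1, x2, y2) corner format; DIoU and CIoU add the center-distance and aspect-ratio terms on top of the same quantities. The sample boxes are arbitrary.

```python
def iou_giou(a, b):
    """a, b = (x1, y1, x2, y2). Returns (IoU, GIoU)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])      # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # C: the smallest box enclosing both a and b (the GIoU element)
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (area_c - union) / area_c

print(iou_giou((0, 0, 4, 4), (2, 2, 6, 6)))          # overlapping boxes
print(iou_giou((0, 0, 2, 2), (4, 4, 6, 6)))          # disjoint: IoU = 0, GIoU < 0
```

Note how GIoU stays informative for disjoint boxes (it goes negative as the boxes drift apart), which is exactly the non-overlap problem described above.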
Face recognition
After identifying the face location, an algorithm known as face landmark estimation extracts key features from the face. This algorithm, which will be elaborated on later, focuses on specific points that contribute to the uniqueness of our facial structure. As illustrated in figure 4.3, a total of 68 landmarks are identified.
We focus on aligning images to ensure that the eyes and lips are centered, minimizing distortion. To achieve this, we employ affine transformations, which help maintain parallel lines and enhance accuracy in subsequent steps. This process is visually represented in Figure 4.4.
Figure 4.3 These feature points, located on every face, are drawn along the eyebrows, the eyes, the mouth, and the chin
Figure 4.4 One of the parts of the facial recognition process
Deep learning surpasses human capabilities in identifying significant facial features, as a deep convolutional neural network is trained to generate 128 distinct measurements (embeddings) for each face.
In a university setting with thousands of students, predicting face labels is achieved by measuring the closest distance of embeddings between unknown individuals and those with tagged identities. To efficiently classify human faces, K-Nearest Neighbors (KNN) is employed, offering a more effective solution than template matching or Support Vector Machines (SVM).
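A hedged sklearn sketch of this classification step; the file names, the value of k, and the distance threshold for rejecting unknown faces are illustrative assumptions, not the thesis's tuned values.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# embeddings: (N, 128) FaceNet vectors; labels: (N,) student identities
embeddings = np.load("embeddings.npy")    # hypothetical file names
labels = np.load("labels.npy")

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(embeddings, labels)

query = np.random.rand(1, 128)            # embedding of an unknown face
distances, _ = knn.kneighbors(query)
# Reject the prediction when even the closest neighbour is too far away
name = knn.predict(query)[0] if distances.min() < 0.6 else "unknown"
```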
Detecting facial landmarks involves reconstructing facial structures through shape prediction techniques. This process typically includes two main steps: first, identifying the face, and second, locating the key facial features within the Region Of Interest (ROI).
There are many facial landmark detectors but all methods attempt to localize and label facial regions
The facial landmark detector in the dlib library is an implementation of the method in [10]. This method is implemented using:
▪ The training set is the facial landmarks marked on the photo
▪ Probabilities based on the distance between pairs of input pixels
A set of regression trees is utilized to estimate the positions of facial landmarks based on pixel intensities. The dlib pre-trained model was developed using the iBUG 300-W dataset, which includes images along with their corresponding 68 facial landmarks.
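A minimal dlib sketch of the two-step landmark pipeline (detect the face, then locate the 68 points in the ROI); the image path is a placeholder, and the predictor file is dlib's published iBUG 300-W model.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("student.jpg")                    # placeholder image path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):                       # step 1: find each face
    shape = predictor(gray, rect)                    # step 2: 68 points in the ROI
    points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```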
The affine transform is a linear mapping technique that maintains collinearity and distance ratios, ensuring that parallel lines remain parallel. This method is commonly employed to correct distortions from non-ideal camera angles, facilitating easier interaction and calculations without the need to consider image distortion.
Any complex affine transformation can be composed from the primitives in table 4.1, where images are aligned by multiplying a composite matrix with the original image:
Type: Shear - Transformation matrix: [1, sh, 0; 0, 1, 0] - Note: sh specifies the shear factor
Type: Rotation - Transformation matrix: [cos(q), -sin(q), 0; sin(q), cos(q), 0] - Note: q specifies the angle of rotation
Table 4.1 Multiple kinds of affine transformation
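As one concrete use of the rotation primitive from table 4.1, the sketch below levels the eyes with OpenCV; the eye coordinates would come from the landmark detector and are hypothetical here.

```python
import cv2
import numpy as np

image = cv2.imread("face.jpg")                       # placeholder image path
h, w = image.shape[:2]
left_eye, right_eye = (120, 150), (200, 148)         # hypothetical landmark coords

dy = right_eye[1] - left_eye[1]
dx = right_eye[0] - left_eye[0]
angle = np.degrees(np.arctan2(dy, dx))               # rotate so the eyes are level
centre = ((left_eye[0] + right_eye[0]) / 2.0,
          (left_eye[1] + right_eye[1]) / 2.0)

M = cv2.getRotationMatrix2D(centre, angle, 1.0)      # 2x3 affine (rotation) matrix
aligned = cv2.warpAffine(image, M, (w, h))
```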
During the learning process, the triplet loss method analyzes three images: an anchor image of a person, a positive image of the same person, and a negative image of a different individual. These images are input into a deep convolutional neural network (CNN) to produce embeddings. Subsequently, the neural network is adjusted to maximize the distance between the anchor and the negative image while minimizing the distance between the anchor and the positive image, as described by equation 4.10. The training process for triplet loss is illustrated in figure 4.5.
With such training in the model, we will have more information about the relationship between the images, which makes our model much better suited to practical conditions
Where: \(f(x)\) takes \(x\) as an input and returns a 128-dimensional vector;
subscript \(a\) indicates the anchor image, subscript \(p\) the positive image, and subscript \(n\) the negative image;
\(\alpha\) is the margin between positive and negative pairs, the minimum required deviation between the two ranges of values;
\(T\) is the set of all possible triplets in the training set.
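Written out with these symbols, the triplet loss (presumably the form of equation 4.10) is:
\[ L = \sum_{(a, p, n) \in T} \max\left( \left\lVert f(x_a) - f(x_p) \right\rVert_2^2 - \left\lVert f(x_a) - f(x_n) \right\rVert_2^2 + \alpha,\; 0 \right) \]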
K-nearest neighbours (KNN) is an algorithm that finds the output of a new data point by relying only on the information of the k nearest data points in the training set (its k neighbours), regardless of the total data quantity.
To classify or regress, KNN has the following basic steps:
▪ Calculate distance: Euclidean, Hamming, Manhattan, or Minkowski (we use the Euclidean distance of equation 4.11, written out after this list)
▪ Find closest neighbours based on k
In practice, users need to check the system performance to find a suitable k value, which is always odd.
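For reference, the Euclidean distance between two n-dimensional points (presumably the content of equation 4.11) is:
\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]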
Figure 4.5 Training the model using triplet loss
Simple online real-time object tracking
Simple Online Realtime Object Tracking (SORT) is a minimalistic implementation of a visual framework for tracking multiple objects, utilizing basic data association and state estimation methods. The algorithm estimates motion and associates data across frames by analyzing the size and position of bounding boxes.
The Hungarian algorithm, introduced in 1955, addresses the work assignment problem of optimizing economic benefit. It is also applicable in multi-object tracking, where it minimizes the association error by linking each bounding box to its corresponding track.
Adding or subtracting a constant from all entries in any row or column of a non-negative cost matrix \( c \) does not affect the optimal assignment: the optimal assignment of the resulting cost matrix remains optimal for the original matrix. This principle is fundamental to the procedure below (a library-based sketch follows the steps).
▪ 1. Row reduction: choose the smallest number in each row of the original matrix (matrix 0); then subtract this minimum value from each element of that row to form a new matrix (matrix 1).
▪ 2. Column reduction: choose the smallest number in each column of matrix 1; then subtract this minimum value from each element of that column to form a new matrix (temporarily called matrix 2).
▪ 3. Test for an optimal assignment: draw lines across rows and columns such that every zero is covered. If n lines are required, an optimal assignment exists and the process is complete.
▪ 4. Shift zeros (when the number of lines is less than n): identify the smallest number not covered by any line, subtract this minimum value from each uncovered element of matrix 2 to form a new matrix (matrix 3), and return to step 3.
▪ 5. Making the final assignment: choose the smallest number on the essential lines.
▪ Note: If the data is not a square matrix, add rows or columns of zeros to the original matrix.
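In practice, the reductions above are handled by library routines; the sketch below uses scipy's linear_sum_assignment on a small 1 - IoU cost matrix, with the matrix values and the gating threshold invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = 1 - IoU(detection i, track j): lower cost means a better match
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3]])

det_idx, trk_idx = linear_sum_assignment(cost)       # Hungarian algorithm
matches = [(d, t) for d, t in zip(det_idx, trk_idx)
           if cost[d, t] < 0.7]                      # gate out weak associations
print(matches)                                       # [(0, 0), (1, 1), (2, 2)]
```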
The Kalman filter, a Linear-Gaussian State Space Model introduced in 1960, is widely utilized across various fields. It excels in object tracking by processing consecutive data inputs to swiftly estimate the true value of a measured object, even in the presence of unpredictable errors, uncertainties, or variations in the measured values.
In figure 4.6, we provide the overall Kalman filter
To apply the Kalman filter [15], we must first identify the various matrices as well as the initial model of the process, as in equation 4.19. We have:
The initial covariance matrix for \( x \) is set to a large value to indicate significant uncertainty in the state. The coordinates \( u \) and \( v \) represent the center of the object, specifically the center of the bounding box. The variable \( s \) denotes the area of the bounding box, while \( r \) indicates its aspect ratio. Additionally, \( \dot{u} \), \( \dot{v} \), and \( \dot{s} \) represent the respective velocity values associated with these parameters.
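A sketch of this state model using the filterpy library (one possible implementation, not necessarily the code used in this thesis); the detection values fed to update() are made up.

```python
import numpy as np
from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=7, dim_z=4)   # state: [u, v, s, r, du, dv, ds]
kf.F = np.eye(7)
for i in range(3):
    kf.F[i, i + 4] = 1.0              # constant-velocity model for u, v, s
kf.H = np.eye(4, 7)                   # we measure (u, v, s, r) directly
kf.P *= 1000.0                        # large initial covariance = high uncertainty

kf.predict()                                          # project the state forward
kf.update(np.array([320.0, 240.0, 5000.0, 0.75]))     # hypothetical detection
print(kf.x[:4].ravel())                               # filtered (u, v, s, r)
```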
Design GUI
Graphical User Interface (GUI) elements, including icons, cursors, and buttons, enhance user interaction by allowing computer use without the need for command knowledge. These objects are often enhanced with sounds and visual effects like transparency and drop shadows, creating a more engaging experience.
We utilized PyQt5 to develop the system's user interface, which features four primary pages: the home page, search page, display page, and create page. The designs showcased below represent our completed product.
Figure 4.7 The home page in the GUI. This page requires the user to log in before entering other pages.
Figure 4.8 The data page in the GUI. This page allows retrieval from the database server.
Figure 4.9 The displaying page in the GUI. The user can monitor the whole system on this page.
Figure 4.10 The creating data page in the GUI. Facial data of new members can be added on this page.
The homepage, as shown in Figure 4.7, requires users to log in and log out for system access. Figure 4.8 presents a search page that enables users to retrieve and modify data in the Firebase attendance table. In Figure 4.9, the display page serves as the main interface, showcasing frames from two cameras along with corresponding results for user monitoring. Finally, Figure 4.10 illustrates the add user button, which switches the system from normal operation to data entry mode. Once sufficient data is collected, users must click the save button to store images and revert the system to normal working mode.
Website design
We use JavaScript to design the website. The website simply lets users create, insert, delete, and update values in the database by date.
Figure 4.11 The login function on the website
Figure 4.12 The main page of the website
Database structure
This thesis utilizes the Firebase Realtime Database platform for information storage, featuring three key tables: the student list, the check table, and the attendance table. The administrative scope is limited by the number of students allowed in a lesson within a specific classroom.
Figure 4.13 The structure of the student list in the database
Figure 4.14 The structure of the check table in the database
Figure 4.15 The structure of the attendance table in the database
The student list serves as an initial declaration of students, as shown in figure 4.13. The check table records check-in and check-out times, represented as paired data in figure 4.14. Daily analysis of this recorded data determines student attendance and whether the cumulative time meets the standard interval requirements. Subsequently, an attendance table is generated to document all these values, with its structure illustrated in figure 4.15.
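A hedged sketch of how such tables can be written and read from Python with the firebase_admin SDK; the service-account file, database URL, table paths, and record values are all placeholders.

```python
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")      # placeholder key file
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://<project-id>.firebaseio.com/"}) # placeholder URL

# Append a check-in/check-out pair to the check table for one day
db.reference("check/2022-08-17").push(
    {"name": "student_01", "checkin": "07:58", "checkout": "11:30"})

# Read back the generated attendance table for the same day
attendance = db.reference("attendance/2022-08-17").get()
```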
Figure 4.16 The list of users in the database
The authentication feature in Firebase is utilized to create login and logout functions within the GUI, while new member registration is exclusively conducted on the Firebase website using the administrator account.
CPU thread
The system comprises three threads, each with distinct responsibilities. Thread 1 is dedicated to the graphical user interface (GUI), allowing users to monitor the system, interact with its operations, and manage data in the database. Thread 2 serves as the main thread, focusing on critical tasks such as face detection, recognition, tracking, data recording in a CSV file, and training with new data. Meanwhile, Thread 3 is tasked with uploading, processing, and analyzing data within the server database.
Figure 4.17 The CPU is divided into multiple threads
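A skeletal sketch of this three-thread layout using Python's threading module; the worker bodies are stubs standing in for the actual pipelines.

```python
import threading
import time

def main_worker():        # thread 2: detect -> track -> recognize -> log to CSV
    while True:
        time.sleep(1)     # stub for the vision pipeline

def data_worker():        # thread 3: upload, process, analyze server data
    while True:
        time.sleep(1)     # stub for database synchronization

for worker in (main_worker, data_worker):
    threading.Thread(target=worker, daemon=True).start()

# Thread 1 (the GUI event loop) typically owns the main thread,
# e.g. QApplication(sys.argv) followed by app.exec_() in PyQt5.
```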
Flowchart
In the first branch of Figure 4.18, two camera frames are stacked for detection, allowing us to track bounding boxes around faces. If a face is recognized for the first time, it is identified; otherwise, it returns to the detection stage. The system features a button for activation and deactivation, with a condition that limits recognition input based on the bounding box area. Additionally, there are two operational modes: normal mode and data addition mode.
Figure 4.18 The flowchart of thread 2. This is considered the main thread of the system.
In Figure 4.19, two modes are illustrated: updating recorded data and generating an attendance table at midnight. The last column of the table is examined, and data is updated if it meets specific conditions, such as ensuring the time interval is sufficiently large.
Figure 4.19 The flowchart of thread 3. This is considered the data thread of the system.
Platform
Darknet is an efficient open-source framework designed for neural network implementation, utilizing C and CUDA for seamless integration with both CPUs and GPUs. It supports advanced deep learning applications, including real-time object detection with You Only Look Once (YOLO), ImageNet classification, and recurrent neural networks (RNNs), among others.
TensorRT, developed by NVIDIA, enhances inference speed and minimizes lag on NVIDIA GPUs, achieving speed improvements of 2-4 times over real-time services and up to 30 times faster than CPU performance.
Figure 4.20 Converting the original model to TensorRT
Precision calibration involves converting parameters and activations from FP32 (floating point 32) precision to FP16 or INT8 precision when building the inference engine. This optimization enhances inference speed and reduces stagnation, albeit with a slight decrease in model accuracy. In real-time recognition scenarios, a balance between accuracy and inference speed is often required.
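A minimal sketch of requesting FP16 precision with the TensorRT Python API (TensorRT 8-style calls; populating the network from a converted model is omitted here):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)   # opt in to reduced-precision kernels

# ... populate `network` (e.g. with trt.OnnxParser), then build the engine:
# engine_bytes = builder.build_serialized_network(network, config)
```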
TensorRT also combines layers and tensors to optimize GPU memory and bandwidth by merging nodes vertically, horizontally, or both; this is called layer and tensor fusion.
During model optimization, several dedicated optimization kernels are executed to select the implementations best suited to the target GPU.
Dynamic tensor memory allocates just the memory required for each tensor, and only for the duration of its usage, reducing the memory footprint and improving memory reuse.
Multiple stream execution allows processing multiple input streams in parallel
Google Colab, also known as Google Colaboratory, is a free platform provided by Google that facilitates research and learning in artificial intelligence. It offers a coding environment similar to Jupyter Notebook and allows users to access GPUs and TPUs at no cost. However, the configuration available to students on Google Colab is notably limited, as detailed in table 4.2.
Table 4.2 GPU configuration in Google Colab (up to 180 teraflops of computation)
Firebase is a cloud-based database service powered by Google's robust server infrastructure, designed to simplify database operations and empower users to develop applications more efficiently.
Firebase Realtime Database is a cloud-hosted database that enables real-time data storage and synchronization across connected clients When developing cross-platform applications using iOS, Android, and JavaScript SDKs, all clients share a single Realtime Database instance, ensuring they automatically receive the latest data updates.
Firebase Realtime Database enables rapid data synchronization, ensuring that all connected devices receive updates within milliseconds. When offline, data is cached locally and automatically synced once the user is back online. This platform is accessible via mobile devices or web browsers without the need for a server application. Additionally, data security and authentication are managed through Firebase Realtime Database Security Rules, which are enforced during data read and write operations.
Qt is a cross-platform application framework written in C++, used to develop desktop, embedded, and mobile applications. Platform support includes Linux, OS X, Windows, VxWorks, QNX, Android, iOS, BlackBerry, Sailfish OS, and several others. PyQt is the Python interface of Qt, a combination of the Python programming language and the Qt library, which includes control interface components (widgets, graphical control elements).
EXPERIMENTAL RESULTS
Experimental environment
In this thesis, we simulate our system under adequate illumination or daylight conditions, reflecting typical classroom environments. In instances of low light, teachers and students can activate the lights to facilitate teaching and learning. Our system is depicted in Figure 5.1.
Regarding the detection model, we use the face mask dataset [16], which contains 1451 images belonging to 2 classes (with mask, without mask).
When it comes to the classifier, we use 10 classes. There are 35 original images and 35 flipped images in each class, split into 60 for the training set and 10 for the validation set. In total, we prepare 700 images belonging to 10 different identities. Figure 5.3 shows one piece of our dataset.
The FaceNet model is trained using the CASIA-WebFace dataset and is evaluated against the standard Labeled Faces in the Wild (LFW) benchmark. An example from the LFW dataset is illustrated in Figure 5.4.
Figure 5.4 The Labeled Faces in the Wild dataset
Training process
To train YOLOv4, we pre-set the configuration parameters outlined in Table 5.1. The training process is conducted on Google Colab and requires approximately 4 hours to complete, with the results illustrated in Figure 5.6.
Table 5.1 Training parameters of the detection model
To train FaceNet, we pre-configured the parameters listed in Table 5.2. The model underwent training on a server, requiring approximately three days to complete. Figure 5.5 illustrates the training accuracy (solid line) and validation accuracy (dashed line), with evaluations conducted after every five epochs.
Table 5.2 Training parameters of the recognition model
Figure 5.5 Training graph of FaceNet
To optimize the KNN classifier, we could train it with values of k ranging from 1 to 1000 to determine the most suitable k based on accuracy. However, this approach significantly increases training time, taking several minutes on the Jetson Nano. Therefore, we select k according to the guidelines in Chapter 26 of [19], which is further clarified in [20] and can be simplified to formula 5.1.
Figure 5.6 Training IoU and loss graph of the detection model
Evaluation
Table 5.3 presents a comparison of the different classes, while Table 5.4 provides statistical results detailing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Building on the data from Table 5.4, Table 5.5 reports key performance metrics, including precision, recall, and F1-score.
Table 5.3 Evaluation of average precision for each class
Table 5.4 Evaluation of TP, TN, FP, FN for confidence threshold = 0.25 with average IoU =
Table 5.5 Evaluation of the model by precision, recall, F1-score, and mAP for confidence threshold =
During the training process, we also evaluate the model on the LFW dataset and obtain the result demonstrated in figure 5.7, with a final accuracy of 99.07%.
Figure 5.7 Evaluation graph of FaceNet on the LFW dataset
The evaluation below is taken from [13], which illustrates SORT's performance in comparison with other trackers. SORT appears to exceed its opponents in terms of MOTA.
To assess the classifier's performance, we generate a confusion matrix and calculate the precision, recall, and F1 score, as shown in Table 5.6. The results indicate a strong performance, achieving 91% accuracy and demonstrating a clear separation between the classes.
Figure 5.9 Confusion matrix at the selected k. There are 10 people in the dataset.
(Columns of Table 5.6: Class, Precision, Recall, F1-score, Support)
Result
Face recognition technology performs effectively at short distances of approximately 3 meters, while face detection is optimized for medium distances of around 4 meters, specifically targeting faces or face masks. However, the system's overall speed is limited, with a minimum frame rate of about 5 fps. The accuracy threshold for the face detector is 0.7, whereas that for recognition is 0.6.
During daytime testing, we encounter significant backlighting when positioning the camera in front of the door. Although there are software solutions available, they often restrict system speed. Therefore, we opted to adjust the camera's position to improve performance.
When two individuals enter the workspace simultaneously, the system performs adequately, but the frame rate fluctuates significantly, impacting the detector's performance. This results in discrete recorded data, which has been addressed through the flowchart approach. To minimize confusion between bounding boxes, a small distance parameter is utilized in the tracker. Additionally, given the limited size of the workspace, implementing line barrier poles is the most effective way to ensure that students enter and exit one at a time.
The system's GUI effectively connects to the Firebase server, ensuring all functions operate correctly. Users must log in to monitor and control the system, while the GUI enables them to modify, add, or remove database values and incorporate new face data into the local dataset. Importantly, recorded data remains intact even during internet outages. To enhance security, users are required to log out after their session.
Figure 5.11 The GUI requires login before controlling and monitoring the system
Figure 5.12 displays columns for the recorded date, student name, attendance status (false for attended, true for absent), cumulative time sum, and notes. The notes column indicates "auto" for results generated automatically and "lack info" for results produced despite insufficient information.
As the figures below show, all data is recorded correctly to the desired fields and tables.
Figure 5.13 The result after updating data into Firebase: (a) the result in the attendance table; (b) the result in the check table
Every day, the system automatically generates an attendance table that tracks the total time students spend studying in the classroom. For students who are absent or have minimal cumulative classroom time, the system sends warning emails based on the initial information provided. An example of the email format is illustrated in Figure 5.15.
Figure 5.14 The result in the student list after updating
Figure 5.15 Sending the warning email
The website changes and displays the correct parameters in the database. The results are presented in figure 5.16.
Figure 5.16 The retrieved data from the database on 17 August
CONCLUSION AND FUTURE WORK
This chapter gives the conclusion and some future work to be conducted.