MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
GRADUATION THESIS AUTOMATION AND CONTROL ENGINEERING TECHNOLOGY
SMART LOCK SYSTEM BASED ON
FACE RECOGNITION
ADVISOR: Dr. NGUYEN MINH TAM
STUDENTS: NGUYEN TAN NHAT
NGUYEN MINH NHAT
Ho Chi Minh City, July 2023
S K L 0 1 1 6 3 9
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
GRADUATION PROJECT
Ho Chi Minh City, July 2023
SMART LOCK SYSTEM BASED ON
FACE RECOGNITION
Advisor: Dr NGUYEN MINH TAM
NGUYEN TAN NHAT Student ID: 18151025
NGUYEN MINH NHAT Student ID: 18151099
Major: AUTOMATION AND CONTROL ENGINEERING TECHNOLOGY
GRADUATION PROJECT ASSIGNMENT
Student name: _ Student ID: _
Student name: _ Student ID: _
Major: _ Class: Advisor: _ Phone number: _ Date of assignment: Date of submission: _
1. Project title: _
2. Initial materials provided by the advisor: _
3. Content of the project: _
4. Final product:
CHAIR OF THE PROGRAM
(Sign with full name)
ADVISOR
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
-
Ho Chi Minh City, June 30th, 2023
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
-
Ho Chi Minh City, June 30th, 2023
ADVISOR’S EVALUATION SHEET
Student name: Student ID:
Student name: Student ID:
Major:
Project title:
Advisor:
EVALUATION
1. Content and workload of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Overall evaluation: (Excellent, Good, Fair, Poor)
6. Mark: …………. (in words: )
Ho Chi Minh City, June 29th, 2023
ADVISOR
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
-
Ho Chi Minh City, June 30th, 2023
PRE-DEFENSE EVALUATION SHEET
Student name: Student ID:
Student name: Student ID:
Major:
Project title:
Advisor:
EVALUATION
1. Content and workload of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Reviewer questions for project evaluation:
6. Mark: …………. (in words: )
Ho Chi Minh City, June 29th, 2023
REVIEWER
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
-
Ho Chi Minh City, June 30th, 2023
DEFENSE COMMITTEE MEMBER EVALUATION SHEET
Student name: Student ID:
Student name: Student ID:
Major:
Project title:
Name of Reviewer:
EVALUATION
1. Content and workload of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Overall evaluation: (Excellent, Good, Fair, Poor)
6. Mark: …………. (in words: )
Ho Chi Minh City, August 6th, 2023
COMMITTEE MEMBER
(Sign with full name)
COMMITMENT
Title: SMART LOCK SYSTEM BASED ON FACE RECOGNITION
Advisor: Doctor Nguyen Minh Tam
Name of student 1: Nguyen Tan Nhat
- Week 5 (03/04 – 07/04): Research the theory; test and evaluate own built model
- Week 6 (10/04 – 14/04): Research the theory; data adjustment for the built model
- Week 7 (17/04 – 21/04): Research the theory; applied pre-trained model
- Week 8 (24/04 – 28/04): Hardware research and selection; device selection: Jetson Nano
- (15/05 – 19/05): Programming; build dataset for the model
- Week 12 (22/05 – 26/05): Set up Jetson; transfer program to Jetson
- Week 13 (29/05 – 02/06): Working on Jetson; set up environment on Jetson
- Writing report: table of contents
Furthermore, we extend a warm appreciation to all the esteemed teachers and advisors at Ho Chi Minh City University of Technology and Education. Their comprehensive teachings and practical projects equipped us with essential knowledge, enabling us to apply it successfully in our graduation project. This project stands as a tangible testament to the achievements we have made throughout our years as students, and it would not have been possible without their unwavering dedication.
Lastly, we would like to express our profound love and gratitude to our families, who have been, currently are, and will always be our strongest pillars of support, both emotionally and financially. We assure you that we will exert our utmost efforts to make you proud through our contributions to our nation and society, striving not to let you down.
ADVISOR COMMENTS
Student name: Nguyễn Tấn Nhật Student ID: 18151025
Nguyễn Minh Nhật 18151099 Major: Automation and Control Engineering Technology
Project title: Smart Lock System based on Face Recognition
Advisor: Dr Nguyen Minh Tam
Evaluation:
1. Content of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or Denied)
TABLE OF CONTENTS
GRADUATION PROJECT ASSIGNMENT i
ADVISOR’S EVALUATION SHEET ii
PRE-DEFENSE EVALUATION SHEET iii
DEFENSE COMMITTEE MEMBER EVALUATION SHEET iv
COMMITMENT v
WORKING TIMETABLE vi
ACKNOWLEDGEMENT vii
TASK COMPLETION viii
ADVISOR COMMENTS ix
TABLE OF CONTENTS 1
LIST OF TABLES 3
LIST OF FIGURES 4
Chapter 1: INTRODUCTION 6
1.1 Abstract 6
1.2 Aim of study 6
1.3 Limitations 7
1.4 Research Method 7
Chapter 2: THEORIES 9
2.1 Image Processing 9
2.1.1 Image Obtainment 9
2.1.2 Image Enhancement 10
2.1.3 Image Restoration 11
2.1.4 Image compression 12
2.1.5 Coloring Image Processing 14
2.2 Deep Learning 16
2.2.1 Frameworks 16
2.2.2 Models 18
2.2.3 Algorithms 18
2.2.4 Networks 19
2.2.5 Model Training Process 20
2.3 Face Detection Model 22
2.3.1 Object Detection 22
2.3.2 SSD-Single Shot Multibox Detector 23
2.3.3 RFB – Receptive Field Block 24
2.3.4 Ultra Light-Fast 26
2.4 Face Recognition 28
2.4.1 FaceNet 28
2.4.2 Inception Architecture 28
2.4.3 Triplet Loss 30
2.5 Liveness Detection 32
2.5.1 Concept 32
2.5.2 Liveness Detection Methods 33
2.5.3 Eye Blink Detection 34
2.6 Fingerprint Recognition 35
2.6.1 Fingerprint Technology 35
2.6.2 Operating principle 36
Chapter 3: SYSTEM DESIGN 39
3.1 Design requirement 39
3.1.1 System Block Diagram 39
3.1.2 Block Design on Requirements 40
3.2 System Design 45
3.2.1 Embedded hardware (Jetson Nano B01) 45
3.2.2 Camera Logitech C270 47
3.2.3 Arduino Uno R3 48
3.2.4 Relay Module 50
3.2.5 LCD screen (HDMI LCD 7 inch) 51
3.2.6 Fingerprint sensor (AS608) 52
3.2.7 IC ESP8266 54
3.2.8 Hardware Block And Wiring Diagram 56
Chapter 4: EXPERIMENTAL RESULT 59
4.1 Survey methods 59
4.2 Flowcharts 61
4.3 Environment and Dataset 65
4.4 Performance Of The System 66
4.4.1 Operation result 66
4.4.2 Hardware Result 69
4.4.3 Face Datasets 70
Chapter 5: CONCLUSION 71
REFERENCES 72
LIST OF TABLES
Table 3.1: Camera Logitech C270 Specification 48
Table 3.2: Specification of Arduino Uno R3 50
Table 3.3: Specification of Relay Module 51
Table 3.4: Specification of LCD Screen 52
Table 3.5: Specification of Fingerprint Module 54
Table 3.6: Specification of IC ESP8266 55
Table 4.1: Hardware Configuration 60
Table 4.2: Performance comparison 60
Table 4.3: Advantages and disadvantages of surveyed models 61
Table 4.4: System Performance 66
Table 4.5: General Result in Good Brightness 67
Table 4.6: General Result in Low Brightness 68
Table 5.1: Strengths and weaknesses of the system 71
LIST OF FIGURES
Figure 2.1: Image obtainment in digital camera 9
Figure 2.2: Contrast Enhancement Techniques 11
Figure 2.3: Image Restoration – Reducing noises 12
Figure 2.4: Image compression – lossy and lossless 13
Figure 2.5: Color Space 14
Figure 2.6: Deep Learning Model 18
Figure 2.7: Simple Neural Network 19
Figure 2.8: Deep Learning Process 21
Figure 2.9: Relationships Between Tasks in Computer Vision 22
Figure 2.10: Architecture of SSD 24
Figure 2.11: Construction of the RFB module combining multiple branches with different kernels and dilated convolution layers 25
Figure 2.12: The architecture of RFB and RFB-s 26
Figure 2.13: Ultra light fast generic face detector architecture 27
Figure 2.14: FaceNet Architecture Diagram 28
Figure 2.15: The Inception ResNet V1 architecture 29
Figure 2.16: Triplet Loss 31
Figure 2.17: Regions of Embedding Space of Negatives 32
Figure 2.18: Triplet Loss Principle 32
Figure 2.19: Eye Blink Detection 33
Figure 2.20: Thermal Imaging Detection 33
Figure 2.21: 3D Depth Analysis 34
Figure 2.22: 68-points Facial Landmarks for Face Recognition 35
Figure 2.23: Fingerprint Image 35
Figure 2.24: Operating Principle of Fingerprint Recognition 37
Figure 2.25: Fingerprint Image processing Diagram 37
Figure 2.26: Comparing fingerprint diagram 38
Figure 3.1: System Block Diagram 39
Figure 3.2: Image Receiving Block and Recognition Block 40
Figure 3.3: Example of Input Image Block 41
Figure 3.4: Example of Aligned Face and Resize Block 41
Figure 3.5: Recognized Face 42
Figure 3.6: Liveness Face Recognition 43
Figure 3.7: First Window 43
Figure 3.8: Login Window 44
Figure 3.9: System Window 44
Figure 3.10: Register Window 44
Figure 3.11: Delete Data Window 45
Figure 3.12: Jetson Nano Module 46
Figure 3.13: Pin Diagram 47
Figure 3.14: Camera Logitech C270 48
Figure 3.15: Arduino Uno R3 49
Figure 3.16: Relay Module 50
Figure 3.17: LCD Screen 51
Figure 3.18: Fingerprint Sensor AS608 53
Figure 3.19: IC ESP8266 55
Figure 3.20: Blynk app connect Node MCU (IC ESP8266) through Internet 55
Figure 3.21: Hardware Block Diagram 57
Figure 3.22: Wiring Diagram 57
Figure 4.1: MTCNN, HOG+Linear SVM, Ultra light fast without mask 60
Figure 4.2: MTCNN, HOG+Linear SVM, Ultra light fast with mask 60
Figure 4.3: Flowchart for Face Registration 62
Figure 4.4: Flowchart for Face Recognition 63
Figure 4.5: Flowchart for Liveness Detection 64
Figure 4.6: Flowchart for General System Operation 65
Figure 4.7: Good Brightness Results of Face Recognition 66
Figure 4.8: Good Brightness Results of Face Recognition + Liveness Detection 67
Figure 4.9: Low Brightness Results of Face Recognition 68
Figure 4.10: Low Brightness Results of Face Recognition + Liveness Detection 68
Figure 4.11: Lock control through Blynk app (IC ESP8266) 69
Figure 4.12: Fingerprint Recognition 69
Figure 4.13: Hardware Result 70
Figure 4.14: SolidWorks Design 70
Figure 4.15: Face Datasets Stored 70
Chapter 1: INTRODUCTION
1.1 Abstract
In our modern society, the advancement of technology, particularly in the fields of machine learning and artificial intelligence, has bestowed upon humanity remarkable utilities across various domains such as education, economy, science, defense, and security. These technological advancements have revolutionized our lives, enabling us to achieve feats that were once deemed impossible.
From algorithms that gather user behavior data to make informed choices on e-commerce platforms, to search algorithms that deliver the most relevant results based on user-generated keywords, to programs that accurately predict planetary orbits and anticipate natural disasters like earthquakes and volcanic eruptions, these algorithms have played a pivotal role in transforming seemingly impossible tasks into tangible realities. Machine learning and automation technologies are continuously being researched and developed, striving ever closer to perfection.
As we witness the increasing frequency of digital transformation in our daily lives, one notable development is the emergence of smart lock systems that utilize face recognition technology. This innovation frees users from worrying about forgetting their house keys. With such a system in place, individuals can effortlessly gain access to their homes, marking a significant leap forward in both convenience and security.
1.2 Aim of study
The objectives of this research project encompass the design and development of a face recognition system that uses a webcam to control a door locking mechanism. Additionally, the system is required to incorporate liveness detection to ensure the authenticity of the detected faces. Furthermore, a user interface should be implemented, allowing for the addition, removal, and daily history check of individuals. Moreover, the system needs to demonstrate robust performance in low brightness conditions and operate accurately at distances of up to two meters.
1.3 Limitations
In this project, our system has the following limitations:
- Limited Dataset: The current system lacks diversity in the training dataset, as it only includes a small number of individuals. The dataset should be expanded to a broader range of faces to improve the accuracy and generalization capability of the face recognition system.
- Environmental Variations: The performance of the face recognition system can be affected by environmental conditions, such as varying levels of light. Adequate lighting should be provided to avoid issues caused by excessive or insufficient light, which can hinder proper recognition.
- Recognition Distance Standardization: To ensure consistent and accurate performance, it is crucial to establish a standardized distance between the person being recognized and the camera. Standing too far from or too close to the camera can impact the capture of essential identifying characteristics, resulting in compromised recognition accuracy.
- Power Supply Considerations: The system is entirely dependent on electricity. Therefore, it is important to ensure a reliable and consistent power source for uninterrupted operation. Adequate power backup or contingency plans should be in place to address power outages or fluctuations that may disrupt the functioning of the system.
Addressing these considerations will contribute to the improvement of the face recognition system's performance, accuracy, and reliability. It involves expanding the dataset, optimizing environmental conditions, standardizing recognition distances, and ensuring a stable power supply. These measures will enhance the system's overall effectiveness and user experience.
1.4 Research Method
The research methodology for this project includes the following steps:
- Conducting theoretical research based on published scientific articles: a comprehensive review of existing literature to gather relevant knowledge and insights related to the project topic.
- Investigating encountered problems and challenges: identifying and examining any difficulties or obstacles faced during the research process, including technical issues, limitations, or complexities associated with the implementation of the face recognition system.
- Offering solutions: based on the identified problems and challenges, proposing effective solutions or strategies to address them. This may involve applying novel approaches, modifying existing methodologies, or utilizing advanced techniques.
- Validating performance results and making comparisons: conducting experiments and evaluations to assess the performance of the face recognition system. This includes collecting data, analyzing the results, and comparing them against relevant benchmarks or existing systems. The aim is to identify areas of improvement and suggest necessary adjustments to enhance the system's performance.
By following these steps, the research project aims to contribute to existing knowledge, address challenges, and propose effective solutions in the field of face recognition systems.
Chapter 2: THEORIES
2.1 Image Processing
Image processing is often viewed as a practice that manipulates images unfairly to enhance their beauty or reinforce preconceived notions of reality. A more accurate definition, however, portrays it as a means of bridging the gap between the human visual system and digital imaging equipment. Our perception of the world differs from that of digital cameras, which possess their own distinct capabilities and limitations. Therefore, it becomes crucial to understand the differences between human and digital detectors and to employ precise processes to translate between them. By approaching image editing scientifically, we can ensure that the results achieved by individuals can be replicated and verified by others. This involves documenting and summarizing the processing operations performed and subjecting appropriate control images to the same treatment.
Image processing encompasses the use of digital computers to address various challenges within an image, such as noise removal and color correction. It involves modifying an image to produce an enhanced version or to extract relevant data from it; as such, it can be regarded as a form of signal processing applied to image data. Currently, the field of image processing is undergoing rapid expansion and is a primary focus of research within engineering and computer science.
2.1.1 Image Obtainment
The first step in digital image processing is image acquisition. This entails capturing and recording specialized images that represent real-life scenes or the internal structure of objects. This initial stage enables subsequent manipulation, compression, storage, printing, and display of these images.
Figure 2.1: Image obtainment in digital camera
The hardware setup and regular maintenance play a vital role in the acquisition and processing of images, depending on the specific industry involved. The range of hardware utilized can vary significantly, from small desktop scanners to large optical telescopes. It is crucial to correctly configure and align the hardware to prevent visual distortions that could complicate image processing. Insufficient hardware configuration can result in image quality so poor that even extensive processing cannot salvage the images. These considerations are particularly important in fields that rely on comparative image processing to identify specific variations among collections of images. [1][2]
Real-time image acquisition is a widely used approach in the image processing industry. This method involves capturing images from a source that continuously takes automatic pictures. The data stream produced by real-time image acquisition can be automatically processed, temporarily stored for later use, or consolidated into a single media format. Background image acquisition, which combines software and hardware, enables the rapid preservation of images being streamed into a system and is commonly employed in real-time image processing. [1][2]
Cutting-edge image processing techniques often make use of specialized hardware for image acquisition. One example is the acquisition of three-dimensional (3D) images. This technique entails using two or more precisely aligned cameras positioned around a target to create a 3D or stereoscopic scene or to measure distances. In certain cases, satellites employ 3D image acquisition methods to generate accurate representations of various surfaces. [1][2]
2.1.2 Image Enhancement
Figure 2.2: Contrast Enhancement Techniques
Image enhancement techniques are utilized to improve the quality, contrast, and sharpness of digital images, enabling them to be further processed and analyzed. These modifications are implemented to make the images more suitable for display or to facilitate a more detailed examination of their content. For instance, techniques like noise reduction, sharpening, and brightness adjustment are employed to simplify the identification of important details within the image. Prior to any further processing, image enhancement works to improve the overall quality and information content of the original data. It effectively expands the range of visual aspects chosen for enhancement, making them more distinguishable, while maintaining the intrinsic value of the underlying data. Through image enhancement, we can achieve greater clarity, uncover valuable insights, and ensure that the integrity of the conveyed information remains intact. [1][2]
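One of the contrast-enhancement ideas above can be sketched as a simple linear contrast stretch in NumPy. This is a generic illustration, not code from the thesis system; the function name `stretch_contrast` and the sample pixel values are illustrative.

```python
import numpy as np

def stretch_contrast(img, out_min=0, out_max=255):
    """Linearly rescale pixel intensities to span the full output range."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:  # flat image: nothing to stretch
        return np.full(img.shape, out_min, dtype=np.uint8)
    scaled = (img - lo) / (hi - lo) * (out_max - out_min) + out_min
    return scaled.astype(np.uint8)

# A dull, low-contrast patch whose intensities occupy only [100, 150]
dull = np.array([[100, 120], [130, 150]], dtype=np.uint8)
enhanced = stretch_contrast(dull)
print(enhanced.min(), enhanced.max())  # 0 255
```

After stretching, the darkest pixel maps to 0 and the brightest to 255, making subtle intensity differences easier to see.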
2.1.3 Image Restoration
Image restoration techniques aim to recover a clean and undistorted version of an image that has been degraded or distorted. The objective of image restoration is to restore lost details and reduce the effects of noise, ultimately improving the overall quality of the image.
By utilizing advanced algorithms and mathematical models, image restoration techniques analyze the degraded image and try to estimate the original content. These methods employ approaches such as deconvolution, denoising, and inpainting to enhance the image and restore its visual accuracy. The goal is to minimize the impact of degradation and maximize the recovery of important information, ultimately resulting in a clearer and more visually appealing image. [1][2]
Figure 2.3: Image Restoration – Reducing noises
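The denoising idea can be sketched with a classical median filter, which is effective against impulse ("salt-and-pepper") noise. This is a minimal NumPy sketch for illustration, not the restoration method used in the thesis.

```python
import numpy as np

def median_denoise(img, k=3):
    """Replace each pixel by the median of its k-by-k neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

clean = np.full((5, 5), 100, dtype=np.uint8)
noisy = clean.copy()
noisy[2, 2] = 255  # a single "salt" pixel of impulse noise
restored = median_denoise(noisy)
print(restored[2, 2])  # 100 -- the outlier is removed
```

Because the median ignores extreme outliers, the isolated bright pixel is replaced by the surrounding value while the rest of the image is left intact.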
2.1.4 Image Compression
- Enhanced Visual Benefits: Despite reducing file size, image compression strives to preserve image quality to a satisfactory level. This ensures that visual details and fidelity are maintained, allowing photographers and content creators to share and distribute their work efficiently without compromising the intended visual impact.
- Efficient Data Transmission: Compressed images require less bandwidth when being downloaded from websites or transmitted over the internet. This leads to faster content delivery and a smoother user experience. Reduced file sizes alleviate network congestion, facilitating efficient data transfer, especially in bandwidth-limited environments.
- Diverse Compression Techniques: Image compression employs a range of techniques to achieve optimal results. These vary from standard compression algorithms to more sophisticated methods tailored to factors such as image complexity and desired compression ratios. By employing diverse techniques, image compression ensures efficient data representation and storage.
In conclusion, image compression is an essential tool in digital photography, offering multiple advantages. It enables cost savings, enhances visual experiences, and expedites content delivery by reducing file sizes while maintaining image quality. [1][2]
Figure 2.4: Image compression – lossy and lossless
Image file compression can be broadly categorized into two main types: lossy compression and lossless compression. Each type has its own characteristics and trade-offs.
Lossy compression is a technique that reduces the size of an image file by permanently discarding redundant or less essential information. This process allows for a significant reduction in file size, making it advantageous for efficient storage and transmission. However, there is a trade-off in terms of image quality: if an image is excessively compressed, it can exhibit noticeable distortions and a significant loss of visual fidelity. When used judiciously and with appropriate settings, lossy compression can effectively preserve image quality while achieving significant file size reduction.
On the other hand, lossless compression is a method that reduces the size of an image file without discarding any visual information. It achieves this by employing algorithms that store and reproduce the original image exactly, pixel by pixel. Lossless compression is desirable when preserving the exact integrity of the image is crucial, such as in professional photography or for archival purposes. However, lossless compression typically yields smaller file size reductions than lossy compression.
In summary, while lossy compression can achieve substantial file size reduction, it must be used carefully to avoid excessive degradation of image quality. Lossless compression, on the other hand, maintains image fidelity at the cost of smaller file size reductions. The choice between these techniques depends on the specific requirements of the application, the importance of image quality, and the desired level of file size reduction.
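The lossy/lossless distinction can be demonstrated with Python's standard-library `zlib` codec. This is a conceptual sketch, not JPEG or PNG: `zlib` is a general-purpose lossless codec, and the quantization step below is an illustrative stand-in for the information-discarding stage of a real lossy image codec.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# An 8-bit "image": a smooth gradient plus mild sensor-like noise
img = (np.linspace(0, 255, 64 * 64).reshape(64, 64)
       + rng.normal(0, 2, (64, 64))).clip(0, 255).astype(np.uint8)

# Lossless: compress the raw bytes; decompression restores them exactly,
# pixel by pixel.
lossless = zlib.compress(img.tobytes())
assert zlib.decompress(lossless) == img.tobytes()

# "Lossy" (illustrative): quantize to 16 gray levels first, permanently
# discarding fine detail, then compress. The stream is smaller, but the
# original image can no longer be recovered exactly.
quantized = (img // 16 * 16).astype(np.uint8)
lossy = zlib.compress(quantized.tobytes())
print(len(lossless), len(lossy))  # the lossy stream is noticeably smaller
```

Discarding low-order detail lowers the entropy of the data, which is exactly why lossy codecs achieve higher compression ratios than lossless ones.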
2.1.5 Coloring Image Processing
A deep understanding of how light and color are perceived is vital in the field of color image processing. Human color perception is influenced by various factors, such as the unique properties of objects, including their material composition, the presence of different substances, lighting conditions, and the time of day.
Color image processing involves specific procedures that focus on analyzing and manipulating the color information within an image. Through the application of diverse algorithms and methods, color separation techniques can isolate and extract distinct color components from an image, enabling further analysis and processing. This separation process plays a crucial role in tasks like recognizing objects, classifying materials, and understanding scenes.
By exploring the mechanics of light and color perception, color image processing techniques aim to accurately capture and reproduce the visual aspects of the real world. This understanding enhances the ability to manipulate and interpret color information, opening up possibilities for applications in fields such as computer vision, digital imaging, and visual communication.
Figure 2.5: Color Space
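The color separation described above can be sketched in NumPy by slicing the channel planes of an RGB array. The luminance weights shown are the standard ITU-R BT.601 coefficients, a common color-space conversion; the tiny sample image is illustrative and not from the thesis.

```python
import numpy as np

# A tiny 2x2 RGB image: pure red, pure green, pure blue, and white
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Color separation: isolate each channel as its own grayscale plane
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

# Luminance conversion using BT.601 weights -- one color-space transform
gray = (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)
print(gray)  # each pure-color pixel maps to its weight times 255
```

Separating channels like this is the first step for tasks such as detecting skin tones for face detection or isolating a color of interest in a scene.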
Color image processing plays a vital role in various applications, presenting numerous opportunities for enhancing and analyzing images. It is crucial in several areas: image acquisition and interpretation, correction and enhancement, analysis and scientific discoveries, and the challenges and techniques involved.
- Image Acquisition and Interpretation:
Color image processing is essential during image acquisition, whether it involves capturing images with digital devices or recording them on film
Interpretation of acquired images is often necessary to extract useful data. For instance, in magnetic resonance imaging (MRI), computer algorithms interpret the output and present it visually to aid in diagnosis.
Color coding specific regions in scans enhances contrast and clarity, enabling medical professionals to identify abnormalities more effectively
- Correction and Enhancement:
Color photos often require correction and enhancement to ensure their quality and aesthetic appeal
Image processing techniques, including manual color correction and cropping, help restore corrupted or damaged images and produce visually pleasing results
Converting photographs to specific color schemes, such as the RGB color scheme for offset printing, prepares images for publication and dissemination
- Analysis and Scientific Discoveries:
Color image processing facilitates analysis and scientific exploration across various fields, including astronomy
Astronomers utilize images captured by telescopes, balloons, and satellites to gain insights into the cosmos. Automated color processing tools assist in highlighting phenomena and identifying targets of interest that might be overlooked by manual observation.
Advanced applications enable tasks like object counting in images and identification
of spectral bands present, contributing to data analysis and research
- Challenges and Techniques:
Handling color photos poses greater challenges compared to black and white images
Noise, which can degrade color, clarity, or functionality, needs to be addressed using techniques like filtering and stacking
Color image processing finds application in processing test findings with an imaging component and restoring old photographs, utilizing these technologies for optimal results
Overall, color image processing offers a wide range of applications, from image interpretation in medical imaging to enhancing photographs for publication.
2.2 Deep Learning
Deep Learning is a type of computer software that replicates the intricate network of neurons found in the human brain. It belongs to the broader field of machine learning and focuses specifically on artificial neural networks that have the capability to learn and represent information. The name "deep learning" comes from its use of deep neural networks, which consist of multiple layers.
Deep learning encompasses different learning modes, namely supervised, unsupervised, and semi-supervised learning. In supervised learning, the training data includes predefined category labels, allowing the model to learn and make predictions based on known classifications. Algorithms such as linear regression, logistic regression, and decision trees are commonly employed in supervised learning.
On the other hand, unsupervised learning deals with training data that lacks explicit category labels. In this mode, the model learns patterns and structures within the data without prior knowledge of specific classifications. Algorithms like cluster analysis, K-means clustering, and anomaly detection are often used in unsupervised learning.
Semi-supervised learning occurs when the dataset contains both labeled and unlabeled data. In this approach, the model leverages the limited labeled data in conjunction with the unlabeled data to improve learning and prediction accuracy. Semi-supervised learning techniques encompass graph-based models, generative models, and assumptions based on clustering and continuity.
By comprehending the principles underlying deep learning and its different learning modes, practitioners can select suitable algorithms and methodologies to train models tailored to specific tasks and datasets. [3][4]
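The unsupervised clustering idea mentioned above can be illustrated with a bare-bones K-means implementation in NumPy. This is a sketch for intuition, not an algorithm used in the thesis; the deterministic initialization from evenly spaced sample points is a simplification of the usual random initialization.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain K-means: repeatedly assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    # Simplified deterministic initialization: evenly spaced sample points
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centroids = points[idx].astype(float)
    for _ in range(iters):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels, centroids

# Two well-separated blobs of 2-D points -- K-means recovers the grouping
# without ever seeing a category label.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
                 rng.normal([5, 5], 0.3, (50, 2))])
labels, cents = kmeans(pts, k=2)
```

No labels were provided, yet the algorithm discovers the two groups purely from the structure of the data, which is the defining property of unsupervised learning.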
2.2.1 Frameworks
A wide array of deep learning frameworks are available at no cost, offering various features and functionalities. These frameworks include TensorFlow, Keras, PyTorch, Theano, MXNet, Caffe, and Deeplearning4j. Usage statistics from a survey conducted in 2019 indicate that TensorFlow, Keras, and PyTorch are among the most frequently utilized of these frameworks.
a. Keras
This is a Python-based open-source neural network framework that operates at a high level and employs TensorFlow, CNTK, and Theano as backend tools. It comes with comprehensive documentation and offers user-friendly functionality. As a result, it is favored in dynamic settings, particularly in research scenarios where swift experimentation outcomes are essential. The framework is designed to be modular and adaptable, and it functions seamlessly across various platforms, including CPUs, GPUs, and TPUs. It prioritizes easy comprehensibility and promotes modularity, allowing for the effortless addition of new layers or components to existing models.
b. TensorFlow
Developed by Google Brain, this is another well-known deep learning framework that was initially utilized for proprietary research purposes. It is implemented in C++ and Python and has significantly improved the efficiency of intricate numerical computations. At its core, the framework employs dataflow graphs as a data structure, where the nodes of the graph represent a series of mathematical operations to be executed, and the edges represent multidimensional arrays, or tensors.
By utilizing C++ for low-level numerical computations, this framework achieves impressive computational speed, surpassing other frameworks. It also provides a high-level Python API that abstracts the underlying C++ functionality. Similar to Keras, it is platform-independent and can seamlessly operate on CPUs, GPUs, and TPUs. Furthermore, being an open-source framework, it can be easily installed using a Python installer or by cloning the corresponding GitHub repository.
c. PyTorch
Considered one of the most user-friendly frameworks, it serves as a replacement for NumPy arrays to expedite numerical computations in GPU environments. By utilizing tensors, it significantly accelerates computation. Unlike the aforementioned frameworks, which construct a neural network structure to be reused repeatedly, PyTorch employs a technique called reverse-mode auto-differentiation. This dynamic approach enables seamless modification of the neural network without any delay or additional overhead; the dataflow graph is generated in real time, resulting in ease of debugging and efficient memory usage. Implemented in Python and C++, PyTorch offers excellent documentation and boasts easy extensibility. It is platform-independent and compatible with CPUs, GPUs, and TPUs. PyTorch can be installed via a Python installer or by cloning the open-source repository from GitHub. [3]
2.2.2 Models
A neural network is employed to create a deep learning model, consisting of an input layer, hidden layers, and an output layer. The input layer receives the input data, which is processed in the hidden layers using adjustable weights that are fine-tuned during training. The model then generates predictions, which are adjusted iteratively to minimize the error.
Figure 2.6: Deep Learning Model
To incorporate non-linear relationships, an activation function is utilized. In the initial stage, the structure of the input layer can be defined, where the number "2" represents the input column count, and the desired number of rows can be specified after a comma. The output layer contains a single node for prediction. Activation functions assist in extracting complex patterns from the provided data, enabling the network to optimize the error function and reduce loss during back-propagation, provided the function is differentiable. The input is multiplied by the weights, and a bias is added to the computation [3][4].
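The weighted-sum-plus-bias computation described above can be sketched as a single perceptron in plain Python. The input values, weights, and bias below are illustrative stand-ins, not figures taken from any model in this thesis:

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# One sample with 2 input columns (matching the "2" column count above)
inputs = [0.5, -1.2]
weights = [0.8, 0.3]   # adjustable weights, fine-tuned during training
bias = 0.1

# Input multiplied by weights, bias added, then the activation applied
z = sum(x * w for x, w in zip(inputs, weights)) + bias
output = sigmoid(z)
print(round(output, 3))  # 0.535
```

Because the sigmoid is differentiable, the same computation can be run backwards during back-propagation to adjust the weights and bias.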
2.2.3 Algorithms
Creating a deep learning model entails combining multiple algorithms to construct a network of interconnected neurons. Deep learning is known for its computational intensity, but platforms such as TensorFlow, PyTorch, Chainer, and Keras assist in developing these models. The objective of deep learning is to emulate the structure of the human neural network, with perceptrons serving as the fundamental units of the deep learning model [11][27].
A perceptron comprises input nodes (similar to dendrites in the human brain), an activation function for decision-making, and output nodes (similar to axons in the human brain). Understanding the functioning of a single perceptron is crucial, as connecting multiple perceptrons forms the basis of a deep learning model. Input information, with associated weights, is passed through the activation function, producing an output that serves as input for other neurons. After processing a batch, the back-propagation error is computed at each neuron using a cost function such as cross-entropy.
Different activation functions, such as sigmoid, hyperbolic tangent, and Rectified Linear Unit (ReLU), are employed to make decisions within the deep learning model. Models with more than three hidden layers are typically considered deep neural networks. Essentially, deep learning involves a collection of neurons, with each layer having specific parameters. Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) are popular architectural choices for constructing deep learning models [11][27].
2.2.4 Networks
Deep learning methods utilize neural networks, hence they are commonly known as deep neural networks. These networks consist of multiple hidden layers, which is what makes them "deep". The objective of deep learning is to train artificial intelligence systems to make predictions based on given inputs using the hidden layers within the network. Training deep neural networks involves using extensive labeled datasets, allowing the networks to learn features directly from the data. Both supervised and unsupervised learning techniques are employed to train on the data and extract meaningful features.
Figure 2.7: Simple Neural Network
The deep learning process begins with the input layer receiving the input data, which is then passed to the first hidden layer. Mathematical calculations are performed on the input data, and ultimately, the output layer produces the results.
Convolutional Neural Networks (CNN), a widely used type of neural network, apply feature convolutions to input data, leveraging 2D convolutional layers for processing 2D data such as images. CNNs eliminate the need for manual feature extraction, as they directly extract relevant features from images for classification. This automation makes CNN a highly accurate and reliable algorithm in machine learning. Each layer in a CNN learns specific features from the hidden layers, allowing increasingly complex image features to be learned [11][27].
Training artificial intelligence or neural networks is a crucial aspect. During training, input data is provided from a dataset, and the outputs are compared to the expected outputs from the dataset. If the AI or neural network is untrained, the outputs may be incorrect.
To measure the disparity between the AI's output and the actual output, a cost function is employed. The cost function calculates the difference between the two outputs; a value of zero indicates that both outputs are the same. The goal is to minimize the cost function value, which involves adjusting the weights between the neurons. Gradient Descent (GD) is a commonly used technique for this purpose. GD systematically adjusts the weights of the neurons after each iteration, automating the process [11][27].
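The cost-minimization loop described above can be illustrated with a one-weight example. The target value, input, and learning rate below are arbitrary choices for the sketch, not parameters from the thesis system:

```python
# A single weight w is adjusted after each iteration so that the squared-error
# cost between the output and the expected output approaches zero.
target = 3.0   # expected output from the dataset
x = 1.0        # fixed input
w = 0.0        # untrained weight: the initial output is wrong
lr = 0.1       # learning rate

for _ in range(100):
    y = w * x                      # the network's output
    cost = (y - target) ** 2       # cost function: squared difference
    grad = 2.0 * (y - target) * x  # derivative of the cost w.r.t. w
    w -= lr * grad                 # gradient-descent weight update

print(round(w, 3))  # 3.0: the cost has been driven to (almost) zero
```

Each iteration moves the weight in the direction that reduces the cost, which is exactly the automated adjustment GD performs across all neurons of a real network.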
2.2.5 Model Training Process
A deep neural network provides state-of-the-art accuracy in many tasks, from object detection to speech recognition. Such networks can learn automatically, without predefined knowledge explicitly coded by the programmers.
Each layer in a neural network represents a deeper level of knowledge, forming a hierarchy of knowledge. As the number of layers increases, the neural network learns more complex features compared to networks with fewer layers.
Refer to the following figure for more information:
Figure 2.8: Deep Learning Process
The learning process in a neural network consists of two phases:
- First Phase: In the initial phase, a nonlinear transformation is applied to the input data, resulting in the creation of a statistical model as the output
- Second Phase: The second phase focuses on improving the model using a mathematical method known as the derivative
These two phases are repeated hundreds to thousands of times, in what are known as iterations. Neural networks continue iterating until they achieve the desired level of output and accuracy.
- Training of Networks: To train a neural network with data, a large amount of data is collected, and a model is designed to learn the underlying features. However, training with a vast amount of data can be time-consuming.
- Transfer Learning: Transfer learning involves fine-tuning a pre-trained model for a new task. This approach reduces computation time by leveraging the knowledge learned from previous tasks.
- Feature Extraction: Once all the layers of the neural network are trained to recognize the features of an object, these learned features can be extracted, and accurate predictions can
be made based on them
By utilizing these techniques, neural networks can progressively learn and extract meaningful features from data, leading to improved accuracy in predicting outputs.
2.3 Face Detection Model
2.3.1 Object Detection
Object detection is a computer vision task that involves identifying and localizing objects within digital images or videos. It encompasses three related tasks: image classification, object localization, and object detection.
Image classification focuses on predicting the class or category of a single object in an image. The input is an image containing an object, and the output is a class label (or multiple class labels) that represents the object's category.
Object localization determines the presence of objects in an image and provides their positions using bounding boxes. The input is an image containing one or more objects, and the output is one or more bounding boxes defined by their coordinates, including the center point, width, and height.
Object detection combines image classification and object localization to identify and locate multiple objects within an image. It takes an input image, detects the objects present, and provides both the bounding box coordinates and the corresponding class labels for each detected object.
In summary, image classification predicts the label of an object, object localization determines the position of objects using bounding boxes, and object detection combines both tasks to detect and locate multiple objects with their corresponding labels in an image. These tasks play a crucial role in various computer vision applications, enabling machines to understand and interact with visual data effectively.
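The bounding-box encoding above (center point, width, height) can be made concrete with a short sketch. The box values, and the intersection-over-union (IoU) overlap measure used to compare boxes, are illustrative additions, not details taken from the thesis pipeline:

```python
def center_to_corners(cx, cy, w, h):
    """Convert a (center_x, center_y, width, height) box to corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(box_a, box_b):
    """Intersection over union of two corner-format boxes: 0 = disjoint, 1 = identical."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)   # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

a = center_to_corners(50, 50, 20, 20)   # (40.0, 40.0, 60.0, 60.0)
b = center_to_corners(55, 55, 20, 20)   # a nearby, overlapping box
print(round(iou(a, b), 3))  # 0.391
```

Overlap measures like this are what detectors use to decide whether a predicted box matches a ground-truth object.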
Figure 2.9: Relationships Between Tasks in Computer Vision
There are various models used for object detection. Older architectures include R-CNN and Fast R-CNN. These models have slower processing speeds and are not suitable for real-time object detection. More advanced networks such as SSD, YOLOv2, and YOLOv3 offer faster processing speeds while maintaining accuracy, by incorporating changes in network architecture that streamline detection and classification into a single pass and eliminate unnecessary computations. The specific deep learning algorithm used here for object detection is the Single Shot MultiBox Detector (SSD) [34].
2.3.2 SSD-Single Shot Multibox Detector
SSD, which stands for Single Shot MultiBox Detector, is a deep learning method designed to address the problem of object detection. Similar to other object detection architectures, SSD predicts the coordinates of the bounding box (referred to as offsets) and the label of the object contained within the box. One key feature that makes SSD fast is its use of a single neural network.
The approach of SSD is based on object recognition in feature maps, which are three-dimensional outputs of a convolutional neural network (CNN) after removing the last fully connected layers. These feature maps have different resolutions. SSD creates a grid of squares, called grid cells, on these feature maps. Each cell defines a set of default boxes that are used to predict objects centered in that cell; these boxes act as frames to enclose the objects. During the prediction phase, the neural network outputs two values: the probability distribution of the object labels within the bounding box and the offsets of the bounding box.
Unlike the Fast R-CNN model, SSD does not require a separate region proposal network to suggest object regions. Instead, all the object detection and classification processes are performed within the same network. The name "Single Shot MultiBox Detector" reflects the use of multiple box frames with different scales to detect and classify object regions. By eliminating the need for a region proposal network, SSD achieves significantly faster processing speeds while still maintaining high accuracy.
Furthermore, SSD combines feature maps with different resolutions to effectively detect objects of various sizes and shapes, in contrast to the Fast R-CNN model. The use of multiple feature maps allows SSD to handle objects at different scales, and removing the region proposal step results in a significant speed improvement without compromising accuracy.
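The grid-cell idea can be sketched as follows: for an f×f feature map, one default box is centered in each cell. The feature-map size below is an assumed toy value, and a real SSD layer attaches several boxes of different scales and aspect ratios per cell rather than one:

```python
def default_box_centers(f):
    """Centers of the default boxes for an f x f grid of cells,
    normalized to [0, 1] image coordinates."""
    return [((j + 0.5) / f, (i + 0.5) / f)
            for i in range(f) for j in range(f)]

centers = default_box_centers(4)
print(len(centers))  # 16: one center per grid cell
print(centers[0])    # (0.125, 0.125): center of the top-left cell
```

Coarser feature maps (smaller f) yield fewer, larger cells suited to big objects; finer maps yield many small cells suited to small objects, which is how multi-resolution detection works.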
Figure 2.10: Architecture of SSD
The SSD model is divided into two stages:
- Feature Map Extraction: In this stage, a base network, typically VGG16, is used to extract feature maps from the input image. These feature maps capture high-level semantic information about the image. The use of a base network enhances the effectiveness of object detection by providing rich and discriminative features.
- Convolutional Filter Application: In this stage, a set of convolutional filters is applied to the feature maps to detect objects. These filters are responsible for analyzing different aspects of the feature maps and identifying potential object locations. By convolving these filters with the feature maps, the SSD model can effectively detect objects of various sizes and aspect ratios.
By combining these two stages, SSD is able to achieve accurate and efficient object detection. The feature maps extracted from the base network serve as a basis for detecting objects, while the convolutional filters enable the model to identify and localize objects within the feature maps. This two-stage approach allows SSD to achieve state-of-the-art performance in object detection tasks.
2.3.3 RFB – Receptive Field Block
The proposed RFB (Receptive Field Block) is a multi-branch convolutional block designed
to enhance the effectiveness of object detection. It consists of two key components: a multi-branch convolution layer with distinct kernels, and trailing dilated pooling or convolution layers.
The first component, referred to as Inception, aims to replicate the population Receptive Field (pRF) size of the human visual system. It achieves this by utilizing different kernels
in the convolutional layer, allowing the network to capture features at multiple scales
The second component focuses on reproducing the relationship between pRF size and eccentricity observed in the human visual system. This is accomplished through the integration of dilated pooling or convolution layers, which gather information from different spatial regions.
Figure 2.11 provides a visual representation of the RFB architecture, along with spatial pooling region maps that illustrate how the various components of the RFB capture and process information from different parts of the input.
By incorporating the RFB module into the object detection framework, the model can benefit from enhanced feature discriminability and robustness, ultimately leading to improved performance in object detection tasks
Figure 2.11: Construction of the RFB module combining multiple branches with different kernels and dilated convolution layers
The multi-branch convolution layer utilizes different kernels to capture Receptive Fields (RFs) of different sizes, leveraging the concept of RFs in Convolutional Neural Networks (CNNs) This approach allows the network to capture information at multiple scales, which
is often more effective than using fixed-size RFs
The RFB architecture incorporates the latest versions of Inception, specifically Inception V4 and Inception-ResNet V2 [31], from the Inception family. In each branch, a bottleneck structure is applied, consisting of a 1×1 convolutional layer to reduce the number of channels in the feature map, followed by an n×n convolutional layer. To reduce parameters and increase depth in the non-linear layers, the original 5×5 convolutional layer is replaced by two stacked 3×3 convolutional layers. Similarly, the original n×n convolutional layer is substituted with a 1×n convolutional layer followed by an n×1 convolutional layer. Additionally, the shortcut design from ResNet [32] and Inception-ResNet V2 [31] is incorporated into the architecture.
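A quick parameter count (ignoring bias terms, and assuming an illustrative channel width C that is not specified in the thesis) shows why these factorizations reduce parameters:

```python
C = 256  # assumed number of input and output channels

p_5x5 = 5 * 5 * C * C            # one 5x5 convolutional layer
p_two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 layers: same receptive field
print(round(p_two_3x3 / p_5x5, 3))  # 0.72, i.e. 28% fewer parameters

n = 7
p_nxn = n * n * C * C                # one n x n layer
p_factored = (n + n) * C * C         # 1 x n followed by n x 1
print(round(p_factored / p_nxn, 3))  # 0.286: the ratio is 2/n
```

The stacked replacement also inserts an extra non-linearity between the two 3×3 layers, which is the "increase depth in the non-linear layers" benefit mentioned above.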
The dilated pooling or convolution layer is designed to create feature maps with higher resolution, enabling the capture of more information over a larger context area while maintaining a manageable number of parameters. This design has proven effective in tasks such as semantic segmentation [33] and has gained popularity in widely recognized object detectors such as SSD [34].
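The effect of dilation on context size can be checked with the standard effective-kernel formula k_eff = k + (k − 1)(d − 1); the kernel size and dilation rates below are illustrative choices, not the RFB module's exact settings:

```python
def effective_kernel(k, d):
    """Receptive field of a k-tap kernel with dilation rate d:
    the taps are spread d pixels apart, widening the context covered."""
    return k + (k - 1) * (d - 1)

for d in (1, 2, 4):
    # A 3-tap kernel covers 3, 5, then 9 positions as dilation grows,
    # while the number of learned weights stays fixed at 3.
    print(d, effective_kernel(3, d))
```

This is exactly the trade-off described above: a larger context area at a constant parameter count.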
Figure 2.12: The architecture of RFB and RFB-s
The RFB-s parameters, such as kernel size, branch dilation, and the number of branches, undergo slight modifications at each position in the detector
2.3.4 Ultra Light-Fast
The RFB Net detector incorporates the multi-scale, one-stage framework of SSD [34], with the addition of the RFB module to enhance the feature extraction capabilities of the lightweight backbone, ensuring improved accuracy while maintaining speed. The key modification involves replacing the top convolution layers with the RFB module.
Figure 2.13: Ultra light-fast generic face detector architecture
The lightweight backbone used in the RFB Net detector is identical to the one used in SSD [34]. It is based on the VGG16 [37] architecture, pre-trained on the ILSVRC CLS-LOC dataset [38]. The conv6 and conv7 layers are converted into convolutional layers with sub-sampled parameters, while the pool5 layer is changed from 2×2-s2 to 3×3-s1. Additionally, the dilated convolution layer fills the place of the dropout layers, and the fc8 layer is removed.
In the original SSD [34], a cascade of convolution layers generates a series of feature maps with decreasing spatial resolutions and increasing fields of view. In the RFB Net detector, we retain the cascade structure of SSD but replace the front convolution layers, which have high-resolution feature maps, with the RFB module. While the original RFB module imitates the impact of eccentricity using a single structure setting, we modify the RFB parameters to create an RFB-s module that simulates the smaller pRFs found in shallow human retinotopic maps. This RFB-s module is placed behind the conv4_3 features. The input layer of the RFB Net detector consists of images with a size of 300×300×3 (width × height × channels).
The VGG16 layer serves as the base network, reusing the architecture of VGG16 but removing some fully connected layers. The output of this layer is Conv4_3, which is a 38×38×512 feature map.
The Conv4_3 layer undergoes two types of conversions:
First conversion: A convolutional layer, similar to a standard CNN, is applied to obtain the next output layer. Specifically, a convolutional kernel with a size of 3×3×1024 is used to generate Conv7, which has a size of 19×19×1024.
Second conversion: The 38×38×512 feature map from Conv4_3 passes through an RFB-s layer, replacing the classifier of the SSD framework, for object identification.
Similarly, RFB layers are also applied to Conv7, Conv8, Conv9, Conv10, and Conv11. The shape of each subsequent layer depends on the convolutional process applied to the previous layers. Conv8 and Conv9 are replaced by RFB modules with a stride of 2 to extract additional features from the preceding layers.
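The spatial sizes quoted above (38×38 shrinking toward Conv11) follow the standard convolution output-size formula. The padding values below are the usual assumptions for a stride-2 3×3 layer, not figures read from the thesis:

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

print(conv_out(38, 3, 2, 1))  # 19: a stride-2 3x3 layer halves 38 -> 19
print(conv_out(19, 3, 2, 1))  # 10: the maps keep shrinking down the cascade
```

Each halving of the feature map doubles the area of the image that one cell covers, which is why the deeper, coarser maps handle the larger faces.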
In essence, FaceNet utilizes a deep neural network to capture and extract diverse facial attributes. These attributes are subsequently projected onto a 128-dimensional space, where images of the same individual are clustered closely together and separated from images of different individuals.
The key elements of this architecture are briefly outlined below
Figure 2.14: FaceNet Architecture Diagram
2.4.2 Inception Architecture
The Inception architecture plays a crucial role in FaceNet, a deep learning-based face recognition system. Inception is used to extract and represent facial features from input images.
In FaceNet, the Inception architecture is employed to build a deep neural network that learns complex features from facial images. The Inception modules in this architecture enable the model to automatically learn and create convolutional filters that are suited to the input information.
The main role of the Inception architecture in FaceNet is to create a 128-dimensional Euclidean feature space, where points that are close to each other correspond to similar faces, and points that are far from each other correspond to different faces. This enables the model to compare and recognize faces based on the distances between points in the feature space.
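Comparison in the 128-dimensional feature space can be sketched with synthetic embeddings. The vectors and the distance threshold below are made-up stand-ins for real FaceNet outputs, chosen only to illustrate the near/far contrast:

```python
import numpy as np

rng = np.random.default_rng(0)

anchor = rng.normal(size=128)                 # embedding of a known face
same = anchor + 0.05 * rng.normal(size=128)   # same person: a nearby point
other = rng.normal(size=128)                  # different person: a far point

def is_same_person(a, b, threshold=1.0):
    """Two faces match when their embeddings are close in Euclidean distance."""
    return bool(np.linalg.norm(a - b) < threshold)

print(is_same_person(anchor, same))   # True
print(is_same_person(anchor, other))  # False
```

This distance-threshold test is the whole recognition step once the network has produced the embeddings, which is what makes the 128-dimensional space so convenient.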
The neural network using the Inception architecture in FaceNet helps generate complex and discriminative facial features. It has the ability to learn and represent facial features at various scales and levels of detail, thereby improving the accuracy and efficiency of the FaceNet face recognition system.
Inception-ResNet V1, introduced in 2016, is an extension of the Inception module that incorporates residual connections. Residual connections enable the network to learn residual mappings, which helps alleviate the degradation problem that can occur in very deep networks. By integrating residual connections into the Inception architecture, Inception-ResNet V1 achieves improved performance and better gradient flow during training.
Figure 2.15: The Inception ResNet V1 architecture