MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
GRADUATION THESIS AUTOMATION AND CONTROL ENGINEERING TECHNOLOGY
SMART LOCK SYSTEM BASED ON
FACE RECOGNITION
ADVISOR: Dr. NGUYEN MINH TAM
STUDENTS: NGUYEN TAN NHAT
NGUYEN MINH NHAT
Ho Chi Minh City, July 2023
S K L 0 1 1 6 3 9
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
GRADUATION PROJECT
Ho Chi Minh City, July 2023
SMART LOCK SYSTEM BASED ON
FACE RECOGNITION
Advisor: Dr NGUYEN MINH TAM
NGUYEN TAN NHAT Student ID: 18151025
NGUYEN MINH NHAT Student ID: 18151099
Major: AUTOMATION AND CONTROL ENGINEERING TECHNOLOGY
GRADUATION PROJECT ASSIGNMENT
Student name: _ Student ID: _
Student name: _ Student ID: _
Major: _ Class: Advisor: _ Phone number: _ Date of assignment: Date of submission: _
1. Project title: _
2. Initial materials provided by the advisor: _
3. Content of the project: _
4. Final product:
CHAIR OF THE PROGRAM
(Sign with full name)
ADVISOR
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
-
Ho Chi Minh City, June 30th, 2023
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
-
Ho Chi Minh City, June 30th, 2023
ADVISOR’S EVALUATION SHEET
Student name: Student ID:
Student name: Student ID:
Major:
Project title:
Advisor:
EVALUATION
1. Content and workload of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Overall evaluation: (Excellent, Good, Fair, Poor)
6. Mark: …………. (in words: )
Ho Chi Minh City, June 29th, 2023
ADVISOR
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
-
Ho Chi Minh City, June 30th, 2023
PRE-DEFENSE EVALUATION SHEET
Student name: Student ID:
Student name: Student ID:
Major:
Project title:
Advisor:
EVALUATION
1. Content and workload of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Reviewer questions for project evaluation:
6. Mark: …………. (in words: )
Ho Chi Minh City, June 29th, 2023
REVIEWER
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
-
Ho Chi Minh City, June 30th, 2023
DEFENSE COMMITTEE MEMBER EVALUATION SHEET
Student name: Student ID:
Student name: Student ID:
Major:
Project title:
Name of Reviewer:
EVALUATION
1. Content and workload of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Overall evaluation: (Excellent, Good, Fair, Poor)
6. Mark: …………. (in words: )
Ho Chi Minh City, August 6th, 2023
COMMITTEE MEMBER
(Sign with full name)
COMMITMENT
Title: SMART LOCK SYSTEM BASED ON FACE RECOGNITION
Advisor: Doctor Nguyen Minh Tam
Name of student 1: Nguyen Tan Nhat
- Week 5 (03/04 – 07/04): Research the theory; test and evaluate own built model
- Week 6 (10/04 – 14/04): Research the theory; data adjustment for the built model
- Week 7 (17/04 – 21/04): Research the theory; applied pre-trained model
- Week 8 (24/04 – 28/04): Hardware research and selection; device selection: Jetson Nano
- (15/05 – 19/05): Programming; build dataset for the model
- Week 12 (22/05 – 26/05): Set up Jetson; transfer program to Jetson
- Week 13 (29/05 – 02/06): Working on Jetson; set up environment on Jetson
- Writing report: table of contents
Furthermore, we extend a warm appreciation to all the esteemed teachers and advisors at Ho Chi Minh City University of Technology and Education. Their comprehensive teachings and practical projects equipped us with essential knowledge, enabling us to apply it successfully in our graduation project. This project stands as a tangible testament to the achievements we have made throughout our years as students, and it would not have been possible without their unwavering dedication.
Lastly, we would like to express our profound love and gratitude to our families, who have been, currently are, and will always be our strongest pillars of support, both emotionally and financially. We assure you that we will exert our utmost efforts to make you proud through our contributions to our nation and society, striving not to let you down.
ADVISOR COMMENTS
Student name: Nguyễn Tấn Nhật Student ID: 18151025
Nguyễn Minh Nhật 18151099 Major: Automation and Control Engineering Technology
Project title: Smart Lock System based on Face Recognition
Advisor: Dr Nguyen Minh Tam
Evaluation:
1. Content of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or Denied)
TABLE OF CONTENTS
GRADUATION PROJECT ASSIGNMENT i
ADVISOR’S EVALUATION SHEET ii
PRE-DEFENSE EVALUATION SHEET iii
DEFENSE COMMITTEE MEMBER EVALUATION SHEET iv
COMMITMENT v
WORKING TIMETABLE vi
ACKNOWLEDGEMENT vii
TASK COMPLETION viii
ADVISOR COMMENTS ix
TABLE OF CONTENTS 1
LIST OF TABLES 3
LIST OF FIGURES 4
Chapter 1: INTRODUCTION 6
1.1 Abstract 6
1.2 Aim of study 6
1.3 Limitations 7
1.4 Research Method 7
Chapter 2: THEORIES 9
2.1 Image Processing 9
2.1.1 Image Obtainment 9
2.1.2 Image Enhancement 10
2.1.3 Image Restoration 11
2.1.4 Image compression 12
2.1.5 Coloring Image Processing 14
2.2 Deep Learning 16
2.2.1 Frameworks 16
2.2.2 Models 18
2.2.3 Algorithms 18
2.2.4 Networks 19
2.2.5 Model Training Process 20
2.3 Face Detection Model 22
2.3.1 Object Detection 22
2.3.2 SSD-Single Shot Multibox Detector 23
2.3.3 RFB – Receptive Field Block 24
2.3.4 Ultra Light-Fast 26
2.4 Face Recognition 28
2.4.1 FaceNet 28
2.4.2 Inception Architecture 28
2.4.3 Triplet Loss 30
2.5 Liveness Detection 32
2.5.1 Concept 32
2.5.2 Liveness Detection Methods 33
2.5.3 Eye Blink Detection 34
2.6 Fingerprint Recognition 35
2.6.1 Fingerprint Technology 35
2.6.2 Operating principle 36
Chapter 3: SYSTEM DESIGN 39
3.1 Design requirement 39
3.1.1 System Block Diagram 39
3.1.2 Block Design on Requirements 40
3.2 System Design 45
3.2.1 Embedded hardware (Jetson Nano B01) 45
3.2.2 Camera Logitech C270 47
3.2.3 Arduino Uno R3 48
3.2.4 Relay Module 50
3.2.5 LCD screen (HDMI LCD 7 inch) 51
3.2.6 Fingerprint sensor (AS608) 52
3.2.7 IC ESP8266 54
3.2.8 Hardware Block And Wiring Diagram 56
Chapter 4: EXPERIMENTAL RESULT 59
4.1 Survey methods 59
4.2 Flowcharts 61
4.3 Environment and Dataset 65
4.4 Performance Of The System 66
4.4.1 Operation result 66
4.4.2 Hardware Result 69
4.4.3 Face Datasets 70
Chapter 5: CONCLUSION 71
REFERENCES 72
LIST OF TABLES
Table 3.1: Camera Logitech C270 Specification 48
Table 3.2: Specification of Arduino Uno R3 50
Table 3.3: Specification of Relay Module 51
Table 3.4: Specification of LCD Screen 52
Table 3.5: Specification of Fingerprint Module 54
Table 3.6: Specification of IC ESP8266 55
Table 4.1: Hardware Configuration 60
Table 4.2: Performance comparison 60
Table 4.3: Advantages and disadvantages of surveyed models 61
Table 4.4: System Performance 66
Table 4.5: General Result in Good Brightness 67
Table 4.6: General Result in Low Brightness 68
Table 5.1: Strengths and weaknesses of the system 71
LIST OF FIGURES
Figure 2.1: Image obtainment in digital camera 9
Figure 2.2: Contrast Enhancement Techniques 11
Figure 2.3: Image Restoration – Reducing noises 12
Figure 2.4: Image compression – lossy and lossless 13
Figure 2.5: Color Space 14
Figure 2.6: Deep Learning Model 18
Figure 2.7: Simple Neural Network 19
Figure 2.8: Deep Learning Process 21
Figure 2.9: Relationships Between Tasks in Computer Vision 22
Figure 2.10: Architecture of SSD 24
Figure 2.11: Construction of the RFB module combining multiple branches with different kernels and dilated convolution layers 25
Figure 2.12: The architecture of RFB and RFB-s 26
Figure 2.13: Ultra light fast generic face detector architecture 27
Figure 2.14: FaceNet Architecture Diagram 28
Figure 2.15: The Inception ResNet V1 architecture 29
Figure 2.16: Triplet Loss 31
Figure 2.17: Regions of Embedding Space of Negatives 32
Figure 2.18: Triplet Loss Principle 32
Figure 2.19: Eye Blink Detection 33
Figure 2.20: Thermal Imaging Detection 33
Figure 2.21: 3D Depth Analysis 34
Figure 2.22: 68-points Facial Landmarks for Face Recognition 35
Figure 2.23: Fingerprint Image 35
Figure 2.24: Operating Principle of Fingerprint Recognition 37
Figure 2.25: Fingerprint Image processing Diagram 37
Figure 2.26: Comparing fingerprint diagram 38
Figure 3.1: System Block Diagram 39
Figure 3.2: Image Receiving Block and Recognition Block 40
Figure 3.3: Example of Input Image Block 41
Figure 3.4: Example of Aligned Face and Resize Block 41
Figure 3.5: Recognized Face 42
Figure 3.6: Liveness Face Recognition 43
Figure 3.7: First Window 43
Figure 3.8: Login Window 44
Figure 3.9: System Window 44
Figure 3.10: Register Window 44
Figure 3.11: Delete Data Window 45
Figure 3.12: Jetson Nano Module 46
Figure 3.13: Pin Diagram 47
Figure 3.14: Camera Logitech C270 48
Figure 3.15: Arduino Uno R3 49
Figure 3.16: Relay Module 50
Figure 3.17: LCD Screen 51
Figure 3.18: Fingerprint Sensor AS608 53
Figure 3.19: IC ESP8266 55
Figure 3.20: Blynk app connect Node MCU (IC ESP8266) through Internet 55
Figure 3.21: Hardware Block Diagram 57
Figure 3.22: Wiring Diagram 57
Figure 4.1: MTCNN, HOG+Linear SVM, Ultra light fast without mask 60
Figure 4.2: MTCNN, HOG+Linear SVM, Ultra light fast with mask 60
Figure 4.3: Flowchart for Face Registration 62
Figure 4.4: Flowchart for Face Recognition 63
Figure 4.5: Flowchart for Liveness Detection 64
Figure 4.6: Flowchart for General System Operation 65
Figure 4.7: Good Brightness Results of Face Recognition 66
Figure 4.8: Good Brightness Results of Face Recognition + Liveness Detection 67
Figure 4.9: Low Brightness Results of Face Recognition 68
Figure 4.10: Low Brightness Results of Face Recognition + Liveness Detection 68
Figure 4.11: Lock control through Blynk app (IC ESP8266) 69
Figure 4.12: Fingerprint Recognition 69
Figure 4.13: Hardware Result 70
Figure 4.14: SolidWorks Design 70
Figure 4.15: Face Datasets Stored 70
Chapter 1: INTRODUCTION
1.1 Abstract
In our modern society, the advancement of technology, particularly in the fields of machine learning and artificial intelligence, has bestowed upon humanity remarkable utilities across various domains such as education, economy, science, defense, and security. These technological advancements have revolutionized our lives, enabling us to achieve feats that were once deemed impossible.
From algorithms that gather user behavior data to make informed choices on e-commerce platforms, to search algorithms that deliver the most relevant results based on user-generated keywords, to programs that accurately predict planetary orbits and anticipate natural disasters like earthquakes and volcanic eruptions, these algorithms have played a pivotal role in transforming seemingly impossible tasks into tangible realities. Machine learning and automation technologies are continuously being researched and developed, striving ever closer to perfection.
As we witness the increasing frequency of digital transformation in our daily lives, one notable development is the emergence of smart lock systems that utilize face recognition technology. This innovation frees users from worrying about forgetting their house keys. With such a system in place, individuals can effortlessly gain access to their homes, marking a significant leap forward in both convenience and security.
1.2 Aim of study
The objectives of this research project encompass the design and development of a face recognition system that uses a webcam to control a door locking mechanism. Additionally, the system is required to incorporate liveness detection to ensure the authenticity of the detected faces. Furthermore, a user interface should be implemented, allowing for the addition, removal, and daily history check of individuals. Moreover, the system needs to demonstrate robust performance in low brightness conditions and operate accurately at distances of up to two meters.
1.3 Limitations
In this project, our system has the following limitations:
- Limited Dataset: The current system lacks diversity in the training dataset, as it only includes a small number of individuals. The dataset should be expanded to a broader range of faces to improve the accuracy and generalization capability of the face recognition system.
- Environmental Variations: The performance of the face recognition system can be affected by environmental conditions, such as varying levels of light. Adequate lighting should be provided to avoid issues caused by excessive or insufficient light, which can hinder proper recognition.
- Recognition Distance Standardization: To ensure consistent and accurate performance, it is crucial to establish a standardized distance between the person being recognized and the camera. Standing too far from or too close to the camera can impact the capture of essential identifying characteristics, resulting in compromised recognition accuracy.
- Power Supply Considerations: The system is entirely dependent on electricity. Therefore, it is important to ensure a reliable and consistent power source for uninterrupted operation. Adequate power backup or contingency plans should be in place to address power outages or fluctuations that may disrupt the functioning of the system.
Addressing these considerations will contribute to the improvement of the face recognition system's performance, accuracy, and reliability. It involves expanding the dataset, optimizing environmental conditions, standardizing recognition distances, and ensuring a stable power supply. These measures will enhance the system's overall effectiveness and user experience.
1.4 Research Method
The research methodology for this project includes the following steps:
- Conducting theoretical research based on published scientific articles: a comprehensive review of existing literature to gather relevant knowledge and insights related to the project topic.
- Investigating encountered problems and challenges: identifying and examining any difficulties or obstacles faced during the research process, including technical issues, limitations, or complexities associated with the implementation of the face recognition system.
- Offering solutions: based on the identified problems and challenges, proposing effective solutions or strategies to address them. This may involve applying novel approaches, modifying existing methodologies, or utilizing advanced techniques.
- Validating performance results and making comparisons: conducting experiments and evaluations to assess the performance of the face recognition system. This includes collecting data, analyzing the results, and comparing them against relevant benchmarks or existing systems. The aim is to identify areas of improvement and suggest necessary adjustments to enhance the system's performance.
By following these steps, the research project aims to contribute to existing knowledge, address challenges, and propose effective solutions in the field of face recognition systems.
Chapter 2: THEORIES
2.1 Image Processing
Image processing is often viewed as a practice that manipulates images unfairly to enhance their beauty or reinforce preconceived notions of reality. A more accurate definition, however, portrays it as a means of bridging the gap between the human visual system and digital imaging equipment. Our perception of the world differs from that of digital cameras, which possess their own distinct capabilities and limitations. Therefore, it becomes crucial to understand the differences between human and digital detectors and to employ precise processes to translate between them. By approaching image editing scientifically, we can ensure that the results achieved by individuals can be replicated and verified by others. This involves documenting and summarizing the processing operations performed and subjecting appropriate control images to the same treatment.
Image processing encompasses the use of digital computers to address various challenges within an image, such as noise removal and color correction. It involves modifying an image to produce an enhanced version or to extract relevant data from it; as such, it can be regarded as a form of signal processing applied to image data. Currently, the field of image processing is undergoing rapid expansion and is a primary focus of research within engineering and computer science.
2.1.1 Image Obtainment
The first step in digital image processing is image acquisition. This entails capturing and recording specialized images that represent real-life scenes or the internal structure of objects. This initial stage enables subsequent manipulation, compression, storage, printing, and display of these images.
Figure 2.1: Image obtainment in digital camera
The hardware setup and regular maintenance play a vital role in the acquisition and processing of images, depending on the specific industry involved. The range of hardware utilized can vary significantly, from small desktop scanners to large optical telescopes. It is crucial to correctly configure and align the hardware to prevent visual distortions that could complicate image processing. Insufficient hardware configuration can result in image quality so poor that even extensive processing cannot salvage the images. These considerations are particularly important in fields that rely on comparative image processing to identify specific variations among collections of images. [1][2]
Real-time image acquisition is a widely used approach in the image processing industry. This method involves capturing images from a source that continuously takes automatic pictures. The data stream produced by real-time image acquisition can be automatically processed, temporarily stored for later use, or consolidated into a single media format. Background image acquisition, which combines software and hardware, enables the rapid preservation of images being streamed into a system and is commonly employed in real-time image processing. [1][2]
Cutting-edge image processing techniques often make use of specialized hardware for image acquisition. One example is the acquisition of three-dimensional (3D) images. This technique entails using two or more precisely aligned cameras positioned around a target to create a 3D or stereoscopic scene or to measure distances. In certain cases, satellites employ 3D image acquisition methods to generate accurate representations of various surfaces. [1][2]
2.1.2 Image Enhancement
Figure 2.2: Contrast Enhancement Techniques
Image enhancement techniques are utilized to improve the quality, contrast, and sharpness of digital images, enabling them to be further processed and analyzed. These modifications are implemented to make the images more suitable for display or to facilitate a more detailed examination of their content. For instance, techniques like noise reduction, sharpening, and brightness adjustment are employed to simplify the identification of important details within the image. Prior to any further processing, image enhancement works to improve the overall quality and information content of the original data. It effectively expands the range of visual aspects chosen for enhancement, making them more distinguishable, while maintaining the intrinsic value of the underlying data. Through image enhancement, we can achieve greater clarity, uncover valuable insights, and ensure that the integrity of the conveyed information remains intact. [1][2]
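One of the contrast-enhancement ideas above can be sketched as a simple linear contrast stretch in NumPy. This is a generic illustration, not code from the thesis system; the function name `stretch_contrast` and the sample pixel values are illustrative.

```python
import numpy as np

def stretch_contrast(img, out_min=0, out_max=255):
    """Linearly rescale pixel intensities to span the full output range."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:  # flat image: nothing to stretch
        return np.full(img.shape, out_min, dtype=np.uint8)
    scaled = (img - lo) / (hi - lo) * (out_max - out_min) + out_min
    return scaled.astype(np.uint8)

# A dull, low-contrast patch whose intensities occupy only [100, 150]
dull = np.array([[100, 120], [130, 150]], dtype=np.uint8)
enhanced = stretch_contrast(dull)
print(enhanced.min(), enhanced.max())  # 0 255
```

After stretching, the darkest pixel maps to 0 and the brightest to 255, making subtle intensity differences easier to see.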
2.1.3 Image Restoration
Image restoration techniques aim to recover a clean and undistorted version of an image that has been degraded or distorted. The objective of image restoration is to restore lost details and reduce the effects of noise, ultimately improving the overall quality of the image.
By utilizing advanced algorithms and mathematical models, image restoration techniques analyze the degraded image and try to estimate the original content. These methods employ approaches such as deconvolution, denoising, and inpainting to enhance the image and restore its visual accuracy. The goal is to minimize the impact of degradation and maximize the recovery of important information, ultimately resulting in a clearer and more visually appealing image. [1][2]
Figure 2.3: Image Restoration – Reducing noises
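The denoising idea can be sketched with a classical median filter, which is effective against impulse ("salt-and-pepper") noise. This is a minimal NumPy sketch for illustration, not the restoration method used in the thesis.

```python
import numpy as np

def median_denoise(img, k=3):
    """Replace each pixel by the median of its k-by-k neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

clean = np.full((5, 5), 100, dtype=np.uint8)
noisy = clean.copy()
noisy[2, 2] = 255  # a single "salt" pixel of impulse noise
restored = median_denoise(noisy)
print(restored[2, 2])  # 100 -- the outlier is removed
```

Because the median ignores extreme outliers, the isolated bright pixel is replaced by the surrounding value while the rest of the image is left intact.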
2.1.4 Image Compression
- Enhanced Visual Benefits: Despite reducing file size, image compression strives to preserve image quality to a satisfactory level. This ensures that visual details and fidelity are maintained, allowing photographers and content creators to share and distribute their work efficiently without compromising the intended visual impact.
- Efficient Data Transmission: Compressed images require less bandwidth when being downloaded from websites or transmitted over the internet. This leads to faster content delivery and a smoother user experience. Reduced file sizes alleviate network congestion, facilitating efficient data transfer, especially in bandwidth-limited environments.
- Diverse Compression Techniques: Image compression employs a range of techniques to achieve optimal results. These vary from standard compression algorithms to more sophisticated methods tailored to factors such as image complexity and desired compression ratios. By employing diverse techniques, image compression ensures efficient data representation and storage.
In conclusion, image compression is an essential tool in digital photography, offering multiple advantages. It enables cost savings, enhances visual experiences, and expedites content delivery by reducing file sizes while maintaining image quality. [1][2]
Figure 2.4: Image compression – lossy and lossless
Image file compression can be broadly categorized into two main types: lossy compression and lossless compression. Each type has its own characteristics and trade-offs.
Lossy compression is a technique that reduces the size of an image file by permanently discarding redundant or less essential information. This process allows for a significant reduction in file size, making it advantageous for efficient storage and transmission. However, there is a trade-off in terms of image quality: if an image is excessively compressed, it can exhibit noticeable distortions and a significant loss of visual fidelity. When used judiciously and with appropriate settings, lossy compression can effectively preserve image quality while achieving significant file size reduction.
On the other hand, lossless compression is a method that reduces the size of an image file without discarding any visual information. It achieves this by employing algorithms that store and reproduce the original image exactly, pixel by pixel. Lossless compression is desirable when preserving the exact integrity of the image is crucial, such as in professional photography or for archival purposes. However, lossless compression typically yields smaller file size reductions than lossy compression.
In summary, while lossy compression can achieve substantial file size reduction, it must be used carefully to avoid excessive degradation of image quality. Lossless compression, on the other hand, maintains image fidelity at the cost of smaller file size reductions. The choice between these techniques depends on the specific requirements of the application, the importance of image quality, and the desired level of file size reduction.
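The lossy/lossless distinction can be demonstrated with Python's standard-library `zlib` codec. This is a conceptual sketch, not JPEG or PNG: `zlib` is a general-purpose lossless codec, and the quantization step below is an illustrative stand-in for the information-discarding stage of a real lossy image codec.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# An 8-bit "image": a smooth gradient plus mild sensor-like noise
img = (np.linspace(0, 255, 64 * 64).reshape(64, 64)
       + rng.normal(0, 2, (64, 64))).clip(0, 255).astype(np.uint8)

# Lossless: compress the raw bytes; decompression restores them exactly,
# pixel by pixel.
lossless = zlib.compress(img.tobytes())
assert zlib.decompress(lossless) == img.tobytes()

# "Lossy" (illustrative): quantize to 16 gray levels first, permanently
# discarding fine detail, then compress. The stream is smaller, but the
# original image can no longer be recovered exactly.
quantized = (img // 16 * 16).astype(np.uint8)
lossy = zlib.compress(quantized.tobytes())
print(len(lossless), len(lossy))  # the lossy stream is noticeably smaller
```

Discarding low-order detail lowers the entropy of the data, which is exactly why lossy codecs achieve higher compression ratios than lossless ones.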
2.1.5 Coloring Image Processing
A deep understanding of how light and color are perceived is vital in the field of color image processing. Human color perception is influenced by various factors, such as the unique properties of objects, including their material composition, the presence of different substances, lighting conditions, and the time of day.
Color image processing involves specific procedures that focus on analyzing and manipulating the color information within an image. Through the application of diverse algorithms and methods, color separation techniques can isolate and extract distinct color components from an image, enabling further analysis and processing. This separation process plays a crucial role in tasks like recognizing objects, classifying materials, and understanding scenes.
By exploring the mechanics of light and color perception, color image processing techniques aim to accurately capture and reproduce the visual aspects of the real world. This understanding enhances the ability to manipulate and interpret color information, opening up possibilities for applications in fields such as computer vision, digital imaging, and visual communication.
Figure 2.5: Color Space
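The color separation described above can be sketched in NumPy by slicing the channel planes of an RGB array. The luminance weights shown are the standard ITU-R BT.601 coefficients, a common color-space conversion; the tiny sample image is illustrative and not from the thesis.

```python
import numpy as np

# A tiny 2x2 RGB image: pure red, pure green, pure blue, and white
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Color separation: isolate each channel as its own grayscale plane
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

# Luminance conversion using BT.601 weights -- one color-space transform
gray = (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)
print(gray)  # each pure-color pixel maps to its weight times 255
```

Separating channels like this is the first step for tasks such as detecting skin tones for face detection or isolating a color of interest in a scene.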
Color image processing plays a vital role in various applications, presenting numerous opportunities for enhancing and analyzing images. It is crucial in several areas: image acquisition and interpretation, correction and enhancement, analysis and scientific discoveries, and the challenges and techniques involved.
- Image Acquisition and Interpretation:
Color image processing is essential during image acquisition, whether it involves capturing images with digital devices or recording them on film
Interpretation of acquired images is often necessary to extract useful data. For instance, in magnetic resonance imaging (MRI), computer algorithms interpret the output and present it visually to aid in diagnosis.
Color coding specific regions in scans enhances contrast and clarity, enabling medical professionals to identify abnormalities more effectively
- Correction and Enhancement:
Color photos often require correction and enhancement to ensure their quality and aesthetic appeal
Image processing techniques, including manual color correction and cropping, help restore corrupted or damaged images and produce visually pleasing results
Converting photographs to specific color schemes, such as the RGB color scheme for offset printing, prepares images for publication and dissemination
- Analysis and Scientific Discoveries:
Color image processing facilitates analysis and scientific exploration across various fields, including astronomy
Astronomers utilize images captured by telescopes, balloons, and satellites to gain insights into the cosmos. Automated color processing tools assist in highlighting phenomena and identifying targets of interest that might be overlooked by manual observation.
Advanced applications enable tasks like object counting in images and identification
of spectral bands present, contributing to data analysis and research
- Challenges and Techniques:
Handling color photos poses greater challenges compared to black and white images
Noise, which can degrade color, clarity, or functionality, needs to be addressed using techniques like filtering and stacking
Color image processing finds application in processing test findings with an imaging component and restoring old photographs, utilizing these technologies for optimal results
Overall, color image processing offers a wide range of applications, from image interpretation in medical imaging to enhancing photographs for publication.
2.2 Deep Learning
Deep Learning is a type of computer software that replicates the intricate network of neurons found in the human brain. It belongs to the broader field of machine learning and focuses specifically on artificial neural networks that have the capability to learn and represent information. The name "deep learning" comes from its use of deep neural networks, which consist of multiple layers.
Deep learning encompasses different learning modes, namely supervised, unsupervised, and semi-supervised learning. In supervised learning, the training data includes predefined category labels, allowing the model to learn and make predictions based on known classifications. Algorithms such as linear regression, logistic regression, and decision trees are commonly employed in supervised learning.
On the other hand, unsupervised learning deals with training data that lacks explicit category labels. In this mode, the model learns patterns and structures within the data without prior knowledge of specific classifications. Algorithms like cluster analysis, K-means clustering, and anomaly detection are often used in unsupervised learning.
Semi-supervised learning occurs when the dataset contains both labeled and unlabeled data. In this approach, the model leverages the limited labeled data in conjunction with the unlabeled data to improve learning and prediction accuracy. Semi-supervised learning techniques encompass graph-based models, generative models, and assumptions based on clustering and continuity.
By comprehending the principles underlying deep learning and its different learning modes, practitioners can select suitable algorithms and methodologies to train models tailored to specific tasks and datasets. [3][4]
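The unsupervised clustering idea mentioned above can be illustrated with a bare-bones K-means implementation in NumPy. This is a sketch for intuition, not an algorithm used in the thesis; the deterministic initialization from evenly spaced sample points is a simplification of the usual random initialization.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain K-means: repeatedly assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    # Simplified deterministic initialization: evenly spaced sample points
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centroids = points[idx].astype(float)
    for _ in range(iters):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels, centroids

# Two well-separated blobs of 2-D points -- K-means recovers the grouping
# without ever seeing a category label.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
                 rng.normal([5, 5], 0.3, (50, 2))])
labels, cents = kmeans(pts, k=2)
```

No labels were provided, yet the algorithm discovers the two groups purely from the structure of the data, which is the defining property of unsupervised learning.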
2.2.1 Frameworks
A wide array of deep learning frameworks are available at no cost, offering various features and functionalities. These frameworks include TensorFlow, Keras, PyTorch, Theano, MXNet, Caffe, and Deeplearning4j. Usage statistics from a survey conducted in 2019 indicate that TensorFlow, Keras, and PyTorch are among the most frequently utilized of these frameworks.
a. Keras
This is a Python-based open-source neural network framework that operates at a high level and employs TensorFlow, CNTK, and Theano as backend tools. It comes with comprehensive documentation and offers user-friendly functionality. As a result, it is favored in dynamic settings, particularly in research scenarios where swift experimentation outcomes are essential. The framework is designed to be modular and adaptable, and it functions seamlessly across various platforms, including CPUs, GPUs, and TPUs. It prioritizes easy comprehensibility and promotes modularity, allowing for the effortless addition of new layers or components to existing models.
b. TensorFlow
Developed by Google Brain, this is another well-known deep learning framework that was initially utilized for proprietary research purposes. It is implemented in C++ and Python and has significantly improved the efficiency of intricate numerical computations. At its core, the framework employs dataflow graphs as a data structure, where the nodes of the graph represent a series of mathematical operations to be executed, and the edges represent multidimensional arrays, or tensors.
By utilizing C++ for low-level numerical computations, this framework achieves impressive computational speed, surpassing other frameworks. It also provides a high-level Python API that abstracts the underlying C++ functionality. Similar to Keras, it is platform-independent and can seamlessly operate on CPUs, GPUs, and TPUs. Furthermore, being an open-source framework, it can be easily installed using a Python installer or by cloning the corresponding GitHub repository.
c. PyTorch
Considered one of the most user-friendly frameworks, it serves as a replacement for NumPy arrays to expedite numerical computations in GPU environments. By utilizing tensors, it significantly accelerates computation. Unlike the aforementioned frameworks, which construct a neural network structure to be reused repeatedly, PyTorch employs a technique called reverse-mode auto-differentiation. This dynamic approach enables seamless modification of the neural network without any delay or additional overhead; the dataflow graph is generated in real time, resulting in ease of debugging and efficient memory usage. Implemented in Python and C++, PyTorch offers excellent documentation and boasts easy extensibility. It is platform-independent and compatible with CPUs, GPUs, and TPUs. PyTorch can be installed via a Python installer or by cloning the open-source repository from GitHub. [3]
2.2.2 Models
A neural network is employed to create a deep learning model, consisting of an input layer, hidden layers, and an output layer. The input layer receives the input data, which is processed in the hidden layers using adjustable weights that are fine-tuned during training. The model then generates predictions, which are adjusted iteratively to minimize the error.
Figure 2.6: Deep Learning Model
To incorporate non-linear relationships, an activation function is utilized. In the initial stage, the structure of the input layer can be defined, where the number "2" represents the input column count, and the desired number of rows can be specified after a comma. The output layer contains a single node for prediction. Activation functions assist in extracting complex patterns from the provided data, enabling the network to optimize the error function and reduce loss during back-propagation, provided the function is differentiable. The input is multiplied by the weights, and a bias is added to the computation [3][4].
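The weighted-sum-plus-bias computation described above can be sketched as a single perceptron in plain Python. The input values, weights, and bias below are illustrative stand-ins, not figures taken from any model in this thesis:

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# One sample with 2 input columns (matching the "2" column count above)
inputs = [0.5, -1.2]
weights = [0.8, 0.3]   # adjustable weights, fine-tuned during training
bias = 0.1

# Input multiplied by weights, bias added, then the activation applied
z = sum(x * w for x, w in zip(inputs, weights)) + bias
output = sigmoid(z)
print(round(output, 3))  # 0.535
```

Because the sigmoid is differentiable, the same computation can be run backwards during back-propagation to adjust the weights and bias.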
2.2.3 Algorithms
Creating a deep learning model entails combining multiple algorithms to construct a network of interconnected neurons. Deep learning is known for its computational intensity, but platforms such as TensorFlow, PyTorch, Chainer, and Keras assist in developing these models. The objective of deep learning is to emulate the structure of the human neural network, with perceptrons serving as the fundamental units of the deep learning model [11][27].
A perceptron comprises input nodes (similar to dendrites in the human brain), an activation function for decision-making, and output nodes (similar to axons in the human brain). Understanding the functioning of a single perceptron is crucial, as connecting multiple perceptrons forms the basis of a deep learning model. Input information, with associated weights, is passed through the activation function, producing an output that serves as input for other neurons. After processing a batch, the back-propagation error is computed at each neuron using a cost function such as cross-entropy.
Different activation functions, such as sigmoid, hyperbolic tangent, and Rectified Linear Unit (ReLU), are employed to make decisions within the deep learning model. Models with more than three hidden layers are typically considered deep neural networks. Essentially, deep learning involves a collection of neurons, with each layer having specific parameters. Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) are popular architectural choices for constructing deep learning models [11][27].
2.2.4 Networks
Deep learning methods utilize neural networks, hence they are commonly known as deep neural networks. These networks consist of multiple hidden layers, which is what makes them "deep". The objective of deep learning is to train artificial intelligence systems to make predictions based on given inputs using the hidden layers within the network. Training deep neural networks involves using extensive labeled datasets, allowing the networks to learn features directly from the data. Both supervised and unsupervised learning techniques are employed to train on the data and extract meaningful features.
Figure 2.7: Simple Neural Network
The deep learning process begins with the input layer receiving the input data, which is then passed to the first hidden layer. Mathematical calculations are performed on the input data, and ultimately, the output layer produces the results.
Convolutional Neural Networks (CNN), a widely used type of neural network, apply feature convolutions to input data, leveraging 2D convolutional layers for processing 2D data such as images. CNNs eliminate the need for manual feature extraction, as they directly extract relevant features from images for classification. This automation makes CNN a highly accurate and reliable algorithm in machine learning. Each layer in a CNN learns specific features from the hidden layers, allowing increasingly complex image features to be learned [11][27].
Training artificial intelligence or neural networks is a crucial aspect. During training, input data is provided from a dataset, and the outputs are compared to the expected outputs from the dataset. If the AI or neural network is untrained, the outputs may be incorrect.
To measure the disparity between the AI's output and the actual output, a cost function is employed. The cost function calculates the difference between the two outputs; a value of zero indicates that both outputs are the same. The goal is to minimize the cost function value, which involves adjusting the weights between the neurons. Gradient Descent (GD) is a commonly used technique for this purpose. GD systematically adjusts the weights of the neurons after each iteration, automating the process [11][27].
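The cost-minimization loop described above can be illustrated with a one-weight example. The target value, input, and learning rate below are arbitrary choices for the sketch, not parameters from the thesis system:

```python
# A single weight w is adjusted after each iteration so that the squared-error
# cost between the output and the expected output approaches zero.
target = 3.0   # expected output from the dataset
x = 1.0        # fixed input
w = 0.0        # untrained weight: the initial output is wrong
lr = 0.1       # learning rate

for _ in range(100):
    y = w * x                      # the network's output
    cost = (y - target) ** 2       # cost function: squared difference
    grad = 2.0 * (y - target) * x  # derivative of the cost w.r.t. w
    w -= lr * grad                 # gradient-descent weight update

print(round(w, 3))  # 3.0: the cost has been driven to (almost) zero
```

Each iteration moves the weight in the direction that reduces the cost, which is exactly the automated adjustment GD performs across all neurons of a real network.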
2.2.5 Model Training Process
A deep neural network provides state-of-the-art accuracy in many tasks, from object detection to speech recognition. Such networks can learn automatically, without predefined knowledge explicitly coded by the programmers.
Each layer in a neural network represents a deeper level of knowledge, forming a hierarchy of knowledge. As the number of layers increases, the neural network learns more complex features compared to networks with fewer layers.
Refer to the following figure for more information:
Figure 2.8: Deep Learning Process
The learning process in a neural network consists of two phases:
- First Phase: In the initial phase, a nonlinear transformation is applied to the input data, resulting in the creation of a statistical model as the output
- Second Phase: The second phase focuses on improving the model using a mathematical method known as the derivative
These two phases are repeated hundreds to thousands of times, in what are known as iterations. Neural networks continue iterating until they achieve the desired level of output and accuracy.
- Training of Networks: To train a neural network with data, a large amount of data is collected, and a model is designed to learn the underlying features. However, training with a vast amount of data can be time-consuming.
- Transfer Learning: Transfer learning involves fine-tuning a pre-trained model for a new task. This approach reduces computation time by leveraging the knowledge learned from previous tasks.
- Feature Extraction: Once all the layers of the neural network are trained to recognize the features of an object, these learned features can be extracted, and accurate predictions can
be made based on them
By utilizing these techniques, neural networks can progressively learn and extract meaningful features from data, leading to improved accuracy in predicting outputs.
2.3 Face Detection Model
2.3.1 Object Detection
Object detection is a computer vision task that involves identifying and localizing objects within digital images or videos. It encompasses three related tasks: image classification, object localization, and object detection.
Image classification focuses on predicting the class or category of a single object in an image. The input is an image containing an object, and the output is a class label (or multiple class labels) that represents the object's category.
Object localization determines the presence of objects in an image and provides their positions using bounding boxes. The input is an image containing one or more objects, and the output is one or more bounding boxes defined by their coordinates, including the center point, width, and height.
Object detection combines image classification and object localization to identify and locate multiple objects within an image. It takes an input image, detects the objects present, and provides both the bounding box coordinates and the corresponding class labels for each detected object.
In summary, image classification predicts the label of an object, object localization determines the position of objects using bounding boxes, and object detection combines both tasks to detect and locate multiple objects with their corresponding labels in an image. These tasks play a crucial role in various computer vision applications, enabling machines to understand and interact with visual data effectively.
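The bounding-box encoding above (center point, width, height) can be made concrete with a short sketch. The box values, and the intersection-over-union (IoU) overlap measure used to compare boxes, are illustrative additions, not details taken from the thesis pipeline:

```python
def center_to_corners(cx, cy, w, h):
    """Convert a (center_x, center_y, width, height) box to corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(box_a, box_b):
    """Intersection over union of two corner-format boxes: 0 = disjoint, 1 = identical."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)   # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

a = center_to_corners(50, 50, 20, 20)   # (40.0, 40.0, 60.0, 60.0)
b = center_to_corners(55, 55, 20, 20)   # a nearby, overlapping box
print(round(iou(a, b), 3))  # 0.391
```

Overlap measures like this are what detectors use to decide whether a predicted box matches a ground-truth object.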
Figure 2.9: Relationships Between Tasks in Computer Vision
There are various models used for object detection. Older architectures include R-CNN and Fast R-CNN. These models have slower processing speeds and are not suitable for real-time object detection. More advanced networks such as SSD, YOLOv2, and YOLOv3 offer faster processing speeds while maintaining accuracy, by incorporating changes in network architecture that streamline detection and classification into a single pass and eliminate unnecessary computations. The specific deep learning algorithm used here for object detection is the Single Shot MultiBox Detector (SSD) [34].
2.3.2 SSD-Single Shot Multibox Detector
SSD, which stands for Single Shot MultiBox Detector, is a deep learning method designed to address the problem of object detection. Similar to other object detection architectures, SSD predicts the coordinates of the bounding box (referred to as offsets) and the label of the object contained within the box. One key feature that makes SSD fast is its use of a single neural network.
The approach of SSD is based on object recognition in feature maps, which are three-dimensional outputs of a convolutional neural network (CNN) after removing the last fully connected layers. These feature maps have different resolutions. SSD creates a grid of squares, called grid cells, on these feature maps. Each cell defines a set of default boxes that are used to predict objects centered in that cell; these boxes act as frames to enclose the objects. During the prediction phase, the neural network outputs two values: the probability distribution of the object labels within the bounding box and the offsets of the bounding box.
Unlike the Fast R-CNN model, SSD does not require a separate region proposal network to suggest object regions. Instead, all the object detection and classification processes are performed within the same network. The name "Single Shot MultiBox Detector" reflects the use of multiple box frames with different scales to detect and classify object regions. By eliminating the need for a region proposal network, SSD achieves significantly faster processing speeds while still maintaining high accuracy.
Furthermore, SSD combines feature maps with different resolutions to effectively detect objects of various sizes and shapes, in contrast to the Fast R-CNN model. The use of multiple feature maps allows SSD to handle objects at different scales, and removing the region proposal step results in a significant speed improvement without compromising accuracy.
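The grid-cell idea can be sketched as follows: for an f×f feature map, one default box is centered in each cell. The feature-map size below is an assumed toy value, and a real SSD layer attaches several boxes of different scales and aspect ratios per cell rather than one:

```python
def default_box_centers(f):
    """Centers of the default boxes for an f x f grid of cells,
    normalized to [0, 1] image coordinates."""
    return [((j + 0.5) / f, (i + 0.5) / f)
            for i in range(f) for j in range(f)]

centers = default_box_centers(4)
print(len(centers))  # 16: one center per grid cell
print(centers[0])    # (0.125, 0.125): center of the top-left cell
```

Coarser feature maps (smaller f) yield fewer, larger cells suited to big objects; finer maps yield many small cells suited to small objects, which is how multi-resolution detection works.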
Figure 2.10: Architecture of SSD
The SSD model is divided into two stages:
- Feature Map Extraction: In this stage, a base network, typically VGG16, is used to extract feature maps from the input image. These feature maps capture high-level semantic information about the image. The use of a base network enhances the effectiveness of object detection by providing rich and discriminative features.
- Convolutional Filter Application: In this stage, a set of convolutional filters is applied to the feature maps to detect objects. These filters are responsible for analyzing different aspects of the feature maps and identifying potential object locations. By convolving these filters with the feature maps, the SSD model can effectively detect objects of various sizes and aspect ratios.
By combining these two stages, SSD is able to achieve accurate and efficient object detection. The feature maps extracted from the base network serve as a basis for detecting objects, while the convolutional filters enable the model to identify and localize objects within the feature maps. This two-stage approach allows SSD to achieve state-of-the-art performance in object detection tasks.
2.3.3 RFB – Receptive Field Block
The proposed RFB (Receptive Field Block) is a multi-branch convolutional block designed
to enhance the effectiveness of object detection. It consists of two key components: a multi-branch convolution layer with distinct kernels, and trailing dilated pooling or convolution layers.
The first component, referred to as Inception, aims to replicate the population Receptive Field (pRF) size of the human visual system. It achieves this by utilizing different kernels
in the convolutional layer, allowing the network to capture features at multiple scales
The second component focuses on reproducing the relationship between pRF size and eccentricity observed in the human visual system. This is accomplished through the integration of dilated pooling or convolution layers, which gather information from different spatial regions.
Figure 2.11 provides a visual representation of the RFB architecture, along with spatial pooling region maps that illustrate how the various components of the RFB capture and process information from different parts of the input.
By incorporating the RFB module into the object detection framework, the model can benefit from enhanced feature discriminability and robustness, ultimately leading to improved performance in object detection tasks
Figure 2.11: Construction of the RFB module combining multiple branches with different kernels and dilated convolution layers
The multi-branch convolution layer utilizes different kernels to capture Receptive Fields (RFs) of different sizes, leveraging the concept of RFs in Convolutional Neural Networks (CNNs) This approach allows the network to capture information at multiple scales, which
is often more effective than using fixed-size RFs
The RFB architecture incorporates the latest versions of Inception, specifically Inception V4 and Inception-ResNet V2 [31], from the Inception family. In each branch, a bottleneck structure is applied, consisting of a 1×1 convolutional layer to reduce the number of channels in the feature map, followed by an n×n convolutional layer. To reduce parameters and increase depth in the non-linear layers, the original 5×5 convolutional layer is replaced by two stacked 3×3 convolutional layers. Similarly, the original n×n convolutional layer is substituted with a 1×n convolutional layer followed by an n×1 convolutional layer. Additionally, the shortcut design from ResNet [32] and Inception-ResNet V2 [31] is incorporated into the architecture.
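A quick parameter count (ignoring bias terms, and assuming an illustrative channel width C that is not specified in the thesis) shows why these factorizations reduce parameters:

```python
C = 256  # assumed number of input and output channels

p_5x5 = 5 * 5 * C * C            # one 5x5 convolutional layer
p_two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 layers: same receptive field
print(round(p_two_3x3 / p_5x5, 3))  # 0.72, i.e. 28% fewer parameters

n = 7
p_nxn = n * n * C * C                # one n x n layer
p_factored = (n + n) * C * C         # 1 x n followed by n x 1
print(round(p_factored / p_nxn, 3))  # 0.286: the ratio is 2/n
```

The stacked replacement also inserts an extra non-linearity between the two 3×3 layers, which is the "increase depth in the non-linear layers" benefit mentioned above.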
The dilated pooling or convolution layer is designed to create feature maps with higher resolution, enabling the capture of more information over a larger context area while maintaining a manageable number of parameters. This design has proven effective in tasks such as semantic segmentation [33] and has gained popularity in widely recognized object detectors such as SSD [34].
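The effect of dilation on context size can be checked with the standard effective-kernel formula k_eff = k + (k − 1)(d − 1); the kernel size and dilation rates below are illustrative choices, not the RFB module's exact settings:

```python
def effective_kernel(k, d):
    """Receptive field of a k-tap kernel with dilation rate d:
    the taps are spread d pixels apart, widening the context covered."""
    return k + (k - 1) * (d - 1)

for d in (1, 2, 4):
    # A 3-tap kernel covers 3, 5, then 9 positions as dilation grows,
    # while the number of learned weights stays fixed at 3.
    print(d, effective_kernel(3, d))
```

This is exactly the trade-off described above: a larger context area at a constant parameter count.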
Figure 2.12: The architecture of RFB and RFB-s
The RFB-s parameters, such as kernel size, branch dilation, and the number of branches, undergo slight modifications at each position in the detector
2.3.4 Ultra Light-Fast
The RFB Net detector incorporates the multi-scale, one-stage framework of SSD [34], with the addition of the RFB module to enhance the feature extraction capabilities of the lightweight backbone, ensuring improved accuracy while maintaining speed. The key modification involves replacing the top convolution layers with the RFB module.
Figure 2.13: Ultra light-fast generic face detector architecture
The lightweight backbone used in the RFB Net detector is identical to the one used in SSD [34]. It is based on the VGG16 [37] architecture, pre-trained on the ILSVRC CLS-LOC dataset [38]. The conv6 and conv7 layers are converted into convolutional layers with sub-sampled parameters, while the pool5 layer is changed from 2×2-s2 to 3×3-s1. Additionally, the dilated convolution layer fills the place of the dropout layers, and the fc8 layer is removed.
In the original SSD [34], a cascade of convolution layers generates a series of feature maps with decreasing spatial resolutions and increasing fields of view. In the RFB Net detector, we retain the cascade structure of SSD but replace the front convolution layers, which have high-resolution feature maps, with the RFB module. While the original RFB module imitates the impact of eccentricity using a single structure setting, we modify the RFB parameters to create an RFB-s module that simulates the smaller pRFs found in shallow human retinotopic maps. This RFB-s module is placed behind the conv4_3 features. The input layer of the RFB Net detector consists of images with a size of 300×300×3 (width × height × channels).
The VGG16 layer serves as the base network, reusing the architecture of VGG16 but removing some fully connected layers. The output of this layer is Conv4_3, which is a 38×38×512 feature map.
The Conv4_3 layer undergoes two types of conversions:
First conversion: A convolutional layer, similar to a standard CNN, is applied to obtain the next output layer. Specifically, a convolutional kernel with a size of 3×3×1024 is used to generate Conv7, which has a size of 19×19×1024.
Second conversion: The 38×38×512 feature map from Conv4_3 passes through an RFB-s layer, replacing the classifier of the SSD framework, for object identification.
Similarly, RFB layers are also applied to Conv7, Conv8, Conv9, Conv10, and Conv11. The shape of each subsequent layer depends on the convolutional process applied to the previous layers. Conv8 and Conv9 are replaced by RFB modules with a stride of 2 to extract additional features from the preceding layers.
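The spatial sizes quoted above (38×38 shrinking toward Conv11) follow the standard convolution output-size formula. The padding values below are the usual assumptions for a stride-2 3×3 layer, not figures read from the thesis:

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

print(conv_out(38, 3, 2, 1))  # 19: a stride-2 3x3 layer halves 38 -> 19
print(conv_out(19, 3, 2, 1))  # 10: the maps keep shrinking down the cascade
```

Each halving of the feature map doubles the area of the image that one cell covers, which is why the deeper, coarser maps handle the larger faces.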
In essence, FaceNet utilizes a deep neural network to capture and extract diverse facial attributes. These attributes are subsequently projected onto a 128-dimensional space, where images of the same individual are clustered closely together and separated from images of different individuals.
The key elements of this architecture are briefly outlined below
Figure 2.14: FaceNet Architecture Diagram
2.4.2 Inception Architecture
The Inception architecture plays a crucial role in FaceNet, a deep learning-based face recognition system. Inception is used to extract and represent facial features from input images.
In FaceNet, the Inception architecture is employed to build a deep neural network that learns complex features from facial images. The Inception modules in this architecture enable the model to automatically learn and create convolutional filters that are suited to the input information.
The main role of the Inception architecture in FaceNet is to create a 128-dimensional Euclidean feature space, where points that are close to each other correspond to similar faces, and points that are far from each other correspond to different faces. This enables the model to compare and recognize faces based on the distances between points in the feature space.
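Comparison in the 128-dimensional feature space can be sketched with synthetic embeddings. The vectors and the distance threshold below are made-up stand-ins for real FaceNet outputs, chosen only to illustrate the near/far contrast:

```python
import numpy as np

rng = np.random.default_rng(0)

anchor = rng.normal(size=128)                 # embedding of a known face
same = anchor + 0.05 * rng.normal(size=128)   # same person: a nearby point
other = rng.normal(size=128)                  # different person: a far point

def is_same_person(a, b, threshold=1.0):
    """Two faces match when their embeddings are close in Euclidean distance."""
    return bool(np.linalg.norm(a - b) < threshold)

print(is_same_person(anchor, same))   # True
print(is_same_person(anchor, other))  # False
```

This distance-threshold test is the whole recognition step once the network has produced the embeddings, which is what makes the 128-dimensional space so convenient.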
The neural network using the Inception architecture in FaceNet helps generate complex and discriminative facial features. It has the ability to learn and represent facial features at various scales and levels of detail, thereby improving the accuracy and efficiency of the FaceNet face recognition system.
Inception-ResNet V1, introduced in 2016, is an extension of the Inception module that incorporates residual connections. Residual connections enable the network to learn residual mappings, which helps alleviate the degradation problem that can occur in very deep networks. By integrating residual connections into the Inception architecture, Inception-ResNet V1 achieves improved performance and better gradient flow during training.
Figure 2.15: The Inception ResNet V1 architecture