
A multifunctional embedded system based on deep learning for assisting the cognition of visually impaired people


DOCUMENT INFORMATION

Basic information

Title: A multifunctional embedded system based on deep learning for assisting the cognition of visually impaired people
Author: Wu You-Hui
Advisors: Prof. Chyi-Ren Dow, Prof. Feng-Cheng Lin
University: Feng Chia University
Field: Engineering
Document type: Dissertation
Year of publication: 2021
City: Taichung
Format
Number of pages: 106
File size: 4.05 MB


Structure

  • Chapter 1 Introduction (15)
    • 1.1 Motivation (16)
    • 1.2 Overview of Research (20)
    • 1.3 Dissertation Organization (22)
  • Chapter 2 Related Work (23)
    • 2.1 Face Recognition (23)
    • 2.2 Gender, Age and Emotion Classification (25)
    • 2.3 Object Detection (28)
    • 2.4 Smart Healthcare (30)
  • Chapter 3 System Overview (33)
    • 3.1 System Architecture (33)
    • 3.2 Function Selection (35)
      • 3.2.1 Remote Controller (35)
      • 3.2.2 Function Selection Process (37)
    • 3.3 NVIDIA Jetson AGX Xavier (39)
      • 3.3.1 NVIDIA Jetson Family Introduction (39)
      • 3.3.2 Technical Specification of NVIDIA Jetson AGX Xavier (40)
  • Chapter 4 Face Recognition Function (43)
    • 4.1 Overview of Face Recognition Function (43)
    • 4.2 Dataset Collection (44)
    • 4.3 Model Architectures (47)
    • 4.4 Enrolling a New Person (50)
  • Chapter 5 Gender, Age and Emotion Classification Function (52)
    • 5.1 Overview of Gender, Age and Emotion Classification Function (52)
    • 5.2 Gender Classification Schemes (53)
    • 5.3 Age Classification Schemes (55)
    • 5.4 Emotion Classification Schemes (56)
  • Chapter 6 Object Detection Function (61)
    • 6.1 Overview of Object Detection Function (61)
    • 6.2 Object Detection Schemes (62)
      • 6.2.1 Two-Stage Detectors (62)
      • 6.2.2 One-Stage Detectors (63)
    • 6.3 Arrangement of Result Description (66)
  • Chapter 7 System Prototype and Implementation (67)
    • 7.1 Devices in System Implementation (67)
    • 7.2 Initialization Program in Embedded System (69)
    • 7.3 Dataset Collection (72)
  • Chapter 8 Experimental Results (74)
    • 8.1 Evaluation of Face Recognition Results (74)
      • 8.1.1 Results Evaluation in Terms of Precision and Recall (74)
      • 8.1.2 Analysis Results of Face Recognition (77)
      • 8.1.3 Results Comparison in Multiple Standard Datasets (78)
    • 8.2 Examination Results of Gender, Age and Emotion Classification (79)
      • 8.2.1 Evaluation Results of Gender Classification (79)
      • 8.2.2 Analysis Results of Age Classification (80)
      • 8.2.3 Examination Results of Emotion Classification (82)
    • 8.3 Analysis Results of Object Detection (83)
    • 8.4 The Processing Time of System Functions (88)
  • Chapter 9 Conclusions (89)
    • 9.1 Summary of Achieved Results (89)
    • 9.2 Future Work (91)

Contents

Introduction

Motivation

Visually impaired individuals face daily challenges in navigation, prompting the development of various adaptive devices to enhance their independence. These innovations fall into three main categories: sensor-based, computer vision-based, and smartphone-based solutions. Sensor-based methods utilize signals from various sensors, such as ultrasonic, infrared, laser, and distance sensors, to detect obstacles. In contrast, computer vision-based approaches employ cameras to capture the surrounding environment, using algorithms to identify barriers. Lastly, smartphone-based methods leverage smartphone cameras and sensors to gather environmental data, which is then processed by the device to identify potential obstacles.

However, most of the existing studies focus on navigation and obstacle avoidance, with less attention paid to context awareness and recognition of surrounding objects [3, 7, ...].

Recent studies have primarily focused on experiments conducted on servers or laptops, which restricts the experimental areas and navigation capabilities. Object detection schemes typically employed for identifying objects in current scenes often yield simplistic descriptions of the surrounding environment. To address these limitations, this study develops a multifunctional embedded system utilizing deep learning technology aimed at enhancing the cognitive abilities of visually impaired individuals. The motivations behind this research are elaborated upon in detail below.

(1) Face recognition issues

Face recognition has emerged as a significant area of research in computer vision due to its critical role in applications like security systems, video surveillance, and human-computer interaction. Recent advancements in deep learning, particularly through the use of convolutional neural network (CNN) architectures, have greatly enhanced face recognition techniques. This progress has led to the development of cutting-edge methods such as VGGFace, FaceNet, and ArcFace, which are setting new standards in the field.

Face recognition technology is primarily utilized in monitoring and security systems, particularly in video surveillance. The integration of IoT technology across various sectors has enhanced support for daily human needs, leading to significant research in IoT-based healthcare systems. A key focus of this research is the application of facial recognition for assisting visually impaired individuals. Two main approaches are prevalent in these studies: cloud-based and computer-based methods. Cloud-based systems typically consist of two modules: a local unit that collects images and communicates with a server, which processes the images and provides feedback. However, these methods often require more processing time and depend on internet connectivity, limiting their accessibility.

In the computer-based methods [60, 64], the processor unit is a laptop computer.

Laptop computers serve as the central processing unit in these systems, handling tasks like gathering and processing input images and exporting results. However, their weight poses a challenge for visually impaired individuals during movement and navigation, making them cumbersome to carry for extended periods. Consequently, there is significant potential for enhancing computer-based methods to better accommodate these users.

(2) Gender, age and emotion classification issues

Gender classification plays a crucial role in various domains, including Human-Computer Interaction (HCI), surveillance systems, commercial development, demographic research, and entertainment. In HCI, it enables robots to identify users' genders and tailor services accordingly. In surveillance, gender classification enhances the effectiveness of intelligent systems. In commercial settings, it aids market research and informs business decisions. For demographic research, it facilitates the collection of vital demographic data and statistics. In entertainment, it allows for the design of gender-specific game content and app customization. Although less prevalent in the medical field, particularly for applications benefiting visually impaired individuals, gender classification is gaining attention as a promising area for future research.

Age classification plays a crucial role in various fields, including intelligent surveillance, human-computer interaction, commercial development, social media analysis, and demographic research. It is particularly valuable for security applications, such as restricting minors from accessing adult content, purchasing age-restricted items, and consuming alcohol. In retail, age classification systems enhance the shopping experience by personalizing offerings based on customers' age and gender, allowing store managers to effectively cater to diverse preferences, track market trends, and tailor products and services to meet consumer demands.

However, in order to further assist visually impaired people, age classification is still expected to have wider applications.

Facial expressions are powerful and universal indicators of emotional states and intentions. Due to their significant role in various applications such as human-computer interaction (HCI), robotics, monitoring driver fatigue, and psychological assessments, extensive research has been dedicated to facial emotion classification.

Facial emotion recognition systems in computer vision and machine learning aim to encode expression information from facial representations, focusing on six basic human emotions: anger, disgust, fear, happiness, sadness, and surprise. These six emotions form the foundation of most studies on classifying human emotion. Recent research has explored emotion classification to assist visually impaired individuals; however, these studies often overlook practical applicability and flexibility. Thus, there is a need for an efficient system that considers gender, age, and emotion classifications to enhance usability.

(3) Object detection issues

Object detection has emerged as a vital technique in computer vision, with applications spanning human-computer interaction, security systems, video surveillance, and autonomous vehicles. The growing emphasis on accuracy in object detection has led researchers to develop various deep-learning methods, including Faster R-CNN, SSD, YOLOv3, RetinaNet, and Mask R-CNN.

In the medical field, various studies have applied object detection to help visually impaired people navigate independently and perceive the surroundings and objects.

Recent studies have demonstrated significant advancements in assistive systems for visually impaired individuals by integrating multiple sensors and cameras to enhance performance. For instance, Joshi et al. proposed a system that combines a camera with a distance sensor, enabling functionalities such as object detection, text recognition, and obstacle distance measurement. However, the reliance on laptop computers limits user accessibility. While Raspberry Pi has been utilized to address this issue, it often results in reduced system performance. This highlights the urgent need for a more flexible and efficient system that effectively applies object detection techniques to better assist visually impaired users.

Overview of Research

This research focuses on developing a multifunctional embedded system to assist visually impaired individuals, utilizing advanced deep learning techniques such as face recognition, gender classification, age estimation, emotion recognition, and object detection. The system is designed to perform multiple tasks, categorized into three primary functions: face recognition and emotion classification; gender, age, and emotion classification; and object detection. However, controlling the system poses challenges for visually impaired users due to its reliance on various deep learning models.

This study proposes an efficient function selection process for visually impaired individuals, utilizing a remote controller as an input method. It emphasizes the importance of dataset collection for face recognition and emotion classification, detailing the use of images from videos, photos, and standard datasets, supported by two algorithms for image collection and pre-processing. Notably, an innovative algorithm allows for the addition of new individuals without retraining the entire model. The system also integrates gender and age classification to enhance social interactions for visually impaired users, offering critical information about detected strangers. Furthermore, an object detection feature is introduced to assist users in recognizing their surroundings, employing pre-trained models to deliver detailed results, organized by an object order table for clarity. Finally, a prototype built on the Jetson AGX Xavier demonstrates the system's feasibility, incorporating face recognition, gender, age, and emotion classification, tested with both collected datasets and real-time camera images.

Dissertation Organization

This research develops a multifunctional embedded system utilizing deep learning for face recognition; gender, age, and emotion classification; and object detection. The dissertation is organized as follows. Chapter 2 reviews relevant literature and technological applications, while Chapter 3 outlines the system's design and architecture. Chapter 4 focuses on the face recognition feature, and Chapter 5 covers gender, age, and emotion classification. Object detection is detailed in Chapter 6, with Chapter 7 describing the prototype implementation. Chapter 8 presents the experimental results, and Chapter 9 concludes the study and explores future research directions.

Related Work

Face Recognition

Recent advancements in face recognition have gained significant attention due to their vital role in computer vision applications, driven by deep learning and CNN architectures that enhance methods like VGGFace, FaceNet, and ArcFace. Additionally, the rise of IoT technology, particularly in healthcare systems, has spurred the development of facial recognition applications aimed at assisting visually impaired individuals. For instance, Aza et al. proposed a real-time face recognition system utilizing the Local Binary Pattern Histogram (LBPH) algorithm to identify faces. This system, which employs a smartphone to capture videos of the environment, is limited to recognizing one face per frame and requires input images to be converted to binary or grayscale for effective processing.

Cloud-based technologies have greatly enhanced system performance, particularly in applications for visually impaired individuals. Chaudhry and Chandra introduced a mobile face recognition system that operates on a smartphone, leveraging a server for enrollment and identification tasks. This system utilizes the Cascade classifier and LBPH algorithm for effective face detection and recognition. Similarly, Chen et al. developed a smart wearable system that recognizes faces, objects, and texts, facilitating better environmental awareness for visually impaired users. This system comprises a local unit for data collection and communication with the cloud server, which handles image processing. However, a significant limitation is its reliance on internet connectivity, restricting its use to areas with reliable access.

Computer-based techniques significantly improve the functionality and adaptability of face recognition technologies. A notable example is the DEEP-SEE FACE system developed by Mocanu et al., which operates in real time to aid visually impaired individuals in their cognitive, interactive, and communicative tasks. This innovative system leverages deep convolutional neural networks (CNNs) alongside advanced computer vision algorithms, enabling users to effectively detect, track, and recognize multiple individuals in their vicinity.

The laptop-based system initially used for assisting visually impaired individuals in face recognition faced challenges due to its weight, making mobility difficult for users. To address this issue, Neto et al. proposed a wearable solution that utilizes a Kinect sensor to capture RGB-Depth images. This innovative system combines techniques like the histogram of oriented gradients (HOG), principal component analysis (PCA), and k-nearest neighbor (kNN) algorithms for effective face recognition. However, the reliance on Kinect's infrared sensor limits its usability in outdoor settings.

Gender, Age and Emotion Classification

Gender classification can be approached through various methods, including ear, fingerprint, iris, voice, and face-based techniques. With advancements in CNN architectures, face-based gender classification has gained significant attention. Arriaga et al. introduced a real-time CNN model, named mini-Xception, which utilizes primary layers like convolution, ReLU activation, and residual depth-wise separable convolution for emotion and gender classification. Liew et al. developed a simpler CNN model with three convolutional layers and one output layer, employing cross-correlation to minimize computational demands. Yang et al. proposed a compact soft stagewise regression network (SSR-Net) that initially classified age but was later enhanced to include gender classification through a two-stream model with heterogeneous streams. Dhomne et al. used the VGGNet architecture to create a CNN model that recognizes gender by automatically extracting features from face images without relying on HOG or SVM techniques. Khan et al. implemented a framework that segments a face image into six parts using a CRF-based model, followed by a probabilistic classification strategy to generate probability maps for gender classification.

Age classification, alongside gender classification, has garnered significant interest, leading to numerous studies that combine both age and gender or age and emotion classifications based on facial features. Levi and Hassner proposed a deep CNN model for these classifications, utilizing an architecture inspired by AlexNet, which consists of five main layers: three convolutional and two fully-connected layers, along with components like max-pooling, normalization, dropout layers, and ReLU activation functions. Similarly, Agbo-Ajala and Viriri developed a face-based classification model for age and gender, also based on AlexNet, featuring convolutional, ReLU activation, batch normalization, max-pooling, dropout, and fully-connected layers. Their model, pre-trained and fine-tuned on the extensive IMDb-WIKI dataset, incorporates a robust face detection and alignment technique to enhance classification accuracy.

Recent advancements in age estimation utilize deep convolutional neural networks (CNNs) and generative adversarial networks (GANs) for enhanced accuracy. A notable approach involves reconstructing high-resolution facial images from low-resolution inputs, with VGGNet employed for evaluating age estimation. Additionally, Liao et al. introduced a CNN model leveraging a divide-and-rule strategy for robust feature extraction, built upon the GoogLeNet architecture. Zhang et al. proposed a novel residual network of residual networks (RoR) to classify age groups and gender, enhancing performance through two innovative mechanisms that consider age group characteristics, based on the ResNet architecture.

Face emotion classification has gained significant attention in various applications, with numerous authors contributing to the literature. Hu et al. introduced a deep CNN model called supervised scoring ensemble (SSE), which enhances accuracy by incorporating auxiliary blocks and three supervised layers across shallow, intermediate, and deep levels. Cai et al. developed a novel island loss for CNN models, aimed at increasing pairwise distances between class centers to improve classification accuracy. Bargal et al. presented a network ensemble model utilizing VGG13, VGG16, and ResNet, which combines learned features into a single vector for effective emotion classification. Zhang et al. proposed an evolutional spatial-temporal network that employs multitask networks, utilizing a multi-signal convolutional neural network (MSCNN) for spatial features and a part-based hierarchical bidirectional recurrent neural network (PHRNN) for temporal analysis, significantly boosting performance. Liu et al. designed an AU-aware deep network (AUDN) with cascaded modules for facial expression identification, featuring convolutional layers for representation, an AU-aware receptive field layer for targeted feature extraction, and multilayer restricted Boltzmann machines for hierarchical learning.

Object Detection

Object detection is a key area in computer vision, particularly in smart healthcare systems aimed at aiding visually impaired individuals. Various systems have been developed to enhance communication and social inclusion, such as the one proposed by Tian et al., which utilizes object detection and text recognition to help users navigate by identifying objects like doors and elevators. Another innovative system by Ko and Kim employs QR code detection for wayfinding in unfamiliar indoor spaces, though its application is limited to environments with QR codes. Mekhalfi et al. introduced a prototype featuring lightweight components for navigation and object recognition, while Long et al. designed a framework using millimeter-wave radar and RGB-Depth sensors for obstacle detection. Khade and Dandawate developed a compact, wearable system on Raspberry Pi to track obstacles, and Joshi et al. implemented an AI-based assistive system using YOLOv3 for object detection, providing audio feedback to help avoid obstacles. Tapu et al. presented an automatic cognition system based on deep CNN and computer vision algorithms for navigation support.

In their system, a detection model [70] is applied to detect objects, and the system sends a warning to the user through a headphone when obstacles are detected.

Smart Healthcare

IoT-based techniques are increasingly recognized as effective solutions in healthcare, particularly for supporting visually impaired individuals. Research has focused on three main approaches: sensor-based, computer vision-based, and smartphone-based methods. Sensor-based techniques utilize various sensors, such as ultrasonic, infrared, laser, and distance sensors, to detect obstacles. For instance, Katzschmann et al. developed the ALVU device, which consists of a sensor belt and a haptic strap, enabling safe navigation for visually impaired users through distance measurement and haptic feedback. Additionally, Nada et al. created a smart stick equipped with infrared sensors that identifies obstacles within two meters, offering a cost-effective, lightweight, and user-friendly solution that provides audio alerts. Furthermore, Capi designed an intelligent robot system that assists visually impaired individuals in unfamiliar indoor settings, featuring obstacle detection and guided navigation modes through a combination of a laptop, camera, speaker, and laser range finder.

Computer vision methods aid visually impaired individuals by capturing their surroundings with a camera and employing algorithms to identify obstacles. Kang et al. introduced a deformable grid (DG) obstacle detection technique that adapts its shape based on the motion of objects in the environment, enhancing collision risk recognition and system accuracy. Additionally, Yang et al. proposed a deep learning framework to further advance obstacle detection capabilities.

Their framework enhances the perception of the surrounding environment for visually impaired individuals through semantic segmentation; it efficiently aids in recognizing traversable terrains, sidewalks, stairs, and water hazards, while also ensuring the safe avoidance of obstacles, pedestrians, and vehicles.

Smartphone-based systems serve as a comprehensive solution for assisting visually impaired individuals by utilizing the smartphone as the central processing unit for data collection, processing, and decision-making. For instance, Tanveer et al. developed an embedded system that employs voice commands to help users detect obstacles and make voice calls, while also leveraging GPS technology to track their location, with data managed on a server. Additionally, Cheraghi et al. introduced GuideBeacon, a wayfinding system that utilizes Bluetooth beacons within a designated area to facilitate navigation for visually impaired users, enabling them to navigate more efficiently and effectively.

System Overview

System Architecture

The system overview, as depicted in Figure 3.1, features the NVIDIA Jetson AGX Xavier as its central module, connected to peripherals like a webcam, speaker, and Bluetooth audio transmitter. This configuration facilitates essential functions, including image collection, processing, and system control. The webcam captures the user's current scene, which is processed based on commands from a remote controller. The system performs three primary functions: face recognition and emotion classification, age and gender classification, and object detection. In the first function, it identifies faces and emotions, providing names and emotional states, and can offer details about strangers, such as gender and age. The second function delivers results that include gender, age, and emotions, while the third function identifies various objects within the image, detailing their types and quantities. Finally, the system converts the collected information into voice format, delivering the results to the user through the speaker.

Function Selection

This section presents the function selection. Subsection 3.2.1 discusses the remote controller technique, and Subsection 3.2.2 describes the function selection process.

The proposed multifunctional system is designed to enhance cognition for visually impaired individuals, featuring three easily selectable functions. Users can choose their preferred function through various methods, including voice commands, computer vision, and remote control. The system utilizes the remote controller technique, which is favored for its popularity and user-friendly operation. This remote controller acts as a keyboard, connecting to and transmitting control signals to the central processing module, Jetson AGX Xavier.

Table 3.1 Pseudocode of the Key Code Testing

1 print("Please press any key ")

2 image = numpy.zeros([512,512,1],dtype=numpy.uint8)

4 cv2.imshow("The input key code testing (press Esc key to exit)",image)

5 key_input = cv2.waitKey(1) & 0xFF

6 if (key_input != 0xFF): #press key

8 if (key_input == 27): #press Esc key

To effectively utilize the remote controller within the system, it is crucial to recognize the key codes on the keyboard. While standard computer keyboards allow for easy identification of key codes through the American Standard Code for Information Interchange (ASCII), determining key codes on various keyboards can be quite challenging.

To address this issue, a program has been developed to recognize the input key code, as detailed in Table 3.1. The algorithm used in this program is straightforward, allowing the key code to be captured each time a key on the keyboard is pressed.

The proposed system aims to enhance cognitive assistance for visually impaired individuals by ensuring the remote controller is user-friendly and allows for quick recognition of function keys. Utilizing a Logitech remote controller, the study details the key codes for function keys, which can be customized according to system needs. Specifically, the function key codes are identified as 85, 86, and 46 for keys 1, 2, and 3, respectively. When a user activates a function by pressing a key, the corresponding key code is transmitted to the central processing module (Jetson AGX Xavier), which processes the function based on the received control signal. Additionally, users can easily switch functions by pressing the appropriate function key, resulting in the immediate presentation of the selected function.

Figure 3.2 Function Key of Logitech Remote Controller

The system operates with three primary functions: face recognition and emotion classification, age and gender classification, and object detection. Effective function selection is crucial as it directly influences the system's efficiency. Upon logging into the operating system, the Jetson AGX Xavier automatically initiates the function selection process, setting the initial parameters for the program.

The system provides a voice notification to alert the user that it is ready for function selection. It then tests the input key code against predefined function key codes to identify the chosen function.

When the user activates the first function with the input key code 85, the system performs face recognition and emotion classification, labeling any detected stranger as "Unknown." The results include the names and emotions of identified individuals. In the second function, triggered by the input key code 86, the process occurs in two stages: initially, face recognition and emotion classification are conducted, followed by an additional stage that provides details about the stranger's gender and age. Consequently, the results encompass gender, age, and emotion. Lastly, selecting the third function with the input key code 46 enables object detection, allowing the system to identify and quantify various objects in the image, with results detailing the types and quantities of detected objects.

Users can easily toggle between the three functions by pressing the designated function keys. The system responds by executing the corresponding function based on the input key code, continuing this process until the system is powered off.
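
As an illustration of this selection loop, a minimal sketch is given below. It assumes the key codes 85, 86, and 46 reported above and reuses the key polling idea from Table 3.1; the dispatch step is a hypothetical placeholder rather than the dissertation's actual code.

    import numpy
    import cv2

    # Key codes reported for the Logitech remote controller (function keys 1, 2, 3).
    FUNCTIONS = {
        85: "face recognition and emotion classification",
        86: "gender, age and emotion classification",
        46: "object detection",
    }

    image = numpy.zeros([512, 512, 1], dtype=numpy.uint8)
    while True:
        cv2.imshow("Function selection", image)
        key_input = cv2.waitKey(1) & 0xFF
        if key_input in FUNCTIONS:
            # A real system would dispatch to the selected deep learning pipeline here.
            print("Running function:", FUNCTIONS[key_input])
        elif key_input == 27:  # Esc key shuts the selection loop down
            break
    cv2.destroyAllWindows()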


NVIDIA Jetson AGX Xavier

This section presents NVIDIA Jetson AGX Xavier. Subsection 3.3.1 introduces an overview of the NVIDIA Jetson family, and Subsection 3.3.2 provides the technical specification of NVIDIA Jetson AGX Xavier.

NVIDIA Jetson is the premier embedded artificial intelligence computing platform, featuring a complete System-on-Module (SOM) that integrates a CPU, GPU, PMIC, DRAM, and flash storage. The Jetson platform offers compact modules equipped with GPU-accelerated parallel processing, alongside the Jetpack software development kit (SDK) that includes essential developer tools and extensive libraries for AI application development. With high-performance and low-power computing capabilities, NVIDIA Jetson systems are ideal for deep learning and computer vision, enabling the creation of software for autonomous machines.

NVIDIA's Jetson products deliver advanced AI edge computing solutions tailored for embedded applications across diverse sectors, including medical, transportation, factory automation, retail, surveillance, and gaming. The Jetson family features a range of modules, such as Jetson Nano, Jetson TX1, Jetson TX2 series, Jetson Xavier NX, and Jetson AGX Xavier. With these innovations, NVIDIA has established itself as the gold standard in AI edge computing technology.

The Jetson Nano module is a compact AI computer (70 mm x 45 mm) that enables a variety of embedded IoT applications, including surveillance and home robotics. The Jetson TX1, recognized as the first supercomputer on a module, excels in performance and power efficiency for advanced visual computing tasks. The TX2 series, comprising Jetson TX2, TX2i, and TX2 4GB, offers exceptional speed and efficiency in a small form factor (50 mm x 87 mm), ideal for deep learning applications. The latest Jetson Xavier NX delivers high performance with low power consumption for deep learning and computer vision, supporting cloud-native technologies for software-defined features on edge devices. Lastly, the Jetson AGX Xavier is specifically designed for autonomous machines, featuring six engines for accelerated sensor data processing and enhanced performance for fully autonomous operations.

3.3.2 Technical Specification of NVIDIA Jetson AGX Xavier

The NVIDIA Jetson family comprises a range of embedded computing boards, including the Jetson Nano, TX1, TX2 series, Xavier NX, and AGX Xavier modules. This study focuses on the Jetson AGX Xavier module due to its superior performance, achieving up to twice the performance of the Xavier NX, twenty times that of the TX2, and forty times that of the Nano. The Jetson AGX Xavier developer kit facilitates the development and deployment of comprehensive AI robotics applications across various sectors such as manufacturing, delivery, retail, and smart cities. It is compatible with NVIDIA Jetpack, DeepStream SDKs, and software libraries like CUDA, cuDNN, and TensorRT, providing essential tools for AI edge computing.

Figure 3.4 Jetson AGX Xavier Developer Kit

Figure 3.5 Block Diagram of Jetson AGX Xavier Modules [97]

Jetson AGX Xavier includes more than 750Gbps of high-speed input/output (I/O).

This device offers exceptional bandwidth for streaming sensors and high-speed peripherals, making it a standout in its category. It is among the first embedded devices to support PCIe Gen 4, featuring 16 lanes across five PCIe Gen 4 connections.

The Jetson AGX Xavier modules feature four controllers and support simultaneous camera connections through stream aggregation, utilizing up to 36 virtual channels. Additionally, they offer high-speed I/O options, including three USB 3.1 ports, SLVS-EC, UFS, and RGMII for Gigabit Ethernet connectivity. Detailed technical specifications can be found in Table 3.2.

Table 3.2 Technical Specification of Jetson AGX Xavier Modules [97]

1 CPU 8-core NVIDIA Carmel 64-bit ARMv8.2 @ 2265MHz

2 GPU 512-core NVIDIA Volta @ 1377MHz with 64 TensorCores

3 DL Dual NVIDIA Deep Learning Accelerators (DLAs)

4 Memory 16GB 256-bit LPDDR4x @ 2133MHz | 137GB/s

6 Vision (2x) 7-way VLIW Vision Accelerator

7 Video Encode Maximum throughput up to (2x) 1000MP/s – H.265 Main

8 Video Decode Maximum throughput up to (2x) 1500MP/s – H.265 Main

9 Camera (16x) MIPI CSI-2 lanes, (8x) SLVS-EC lanes; up to 6 active sensor streams and 36 virtual channels

10 Display (3x) eDP 1.4/ DP 1.2/ HDMI 2.0 @ 4Kp60

11 Ethernet 10/100/1000 BASE-T Ethernet + MAC + RGMII interface

14 CAN Dual CAN bus controller

15 Misc I/Os UART, SPI, I2C, I2S, GPIOs

16 Socket 699-pin board-to-board connector, 100 mm x 87 mm with 16 mm Z-height

Face Recognition Function

Overview of Face Recognition Function

The face recognition function begins with the collection and pre-processing of datasets from three different sources, which involves removing blurry images and aligning faces using a multi-task cascaded convolutional neural network (MTCNN). These refined datasets serve as the training data for our convolutional neural network (CNN) model. In selecting the most effective CNN model for our study, we evaluate three options: VGGFace, FaceNet, and ArcFace, focusing on their efficiency. The chosen CNN model is then utilized for face recognition, displaying and storing the name of the recognized individual, while any newly detected person is labeled as "Unknown."

Figure 4.1 Face Recognition Function Overview

Dataset Collection

Dataset collection plays a crucial role in determining the quality of a model, with this study utilizing videos, images, and standard datasets. Initially, videos featuring specific individuals are gathered, allowing the system to perform face detection and extract high-quality facial images from each frame. This approach offers exceptional image quality, a vast array of image resources, and minimal noise. In contrast, images sourced from Google require extensive pre-processing due to the presence of numerous noisy images. Lastly, standard datasets collected online facilitate straightforward comparisons with results from other studies that utilize the same datasets. Additionally, employing MTCNN enhances the clarity of the facial images produced.

After collecting videos, we utilize three methods for frame extraction: period time per frame, number of frames per video, and key-frame extraction. The period time per frame method extracts images at regular intervals, while the number of frames per video method calculates the extraction interval by dividing the total frames by the desired output frames. Although these first two methods, detailed in Table 4.1, are straightforward and computationally efficient, they may overlook critical frames due to a lack of content consideration. In contrast, key-frame extraction addresses this issue by capturing significant frames, albeit at the cost of higher computational requirements.

Table 4.1 Pseudocode of the Splitting Video

Input: The input video (inputVideo), period time per frame (n), the number of frames per video (m)

2 sumFrames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
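
Only a fragment of Table 4.1 is preserved above; a minimal sketch of the two simple strategies it describes (period time per frame and number of frames per video), assuming OpenCV's VideoCapture API, could look as follows. The helper name and its argument handling are illustrative, not the dissertation's exact code.

    import cv2

    def split_video(inputVideo, n=None, m=None):
        # Provide exactly one of: n (extract one frame every n frames)
        # or m (extract m evenly spaced frames across the whole video).
        cap = cv2.VideoCapture(inputVideo)
        sumFrames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        step = n if n is not None else max(sumFrames // m, 1)
        frames, index = [], 0
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            if index % step == 0:
                frames.append(frame)   # candidate image for the face dataset
            index += 1
        cap.release()
        return frames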

After collecting the dataset, pre-processing is essential to create the final datasets, involving two main steps. The first step is identifying and removing blurry images, using the variance of Laplacian method proposed by Pech-Pacheco et al. This technique measures blurriness by analyzing pixel intensity and local features. An image is classified as blurry if its variance of Laplacian does not exceed a specified threshold, set at 100 in this study. For instance, images with variances of 1,812 and 8 are evaluated, where the former is deemed clear and retained for further processing, while the latter is identified as blurry and excluded. The second step involves face alignment using MTCNN.
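
In practice, the variance of Laplacian can be computed with OpenCV as in the short sketch below; the threshold of 100 follows the value stated above, and the function name is an illustrative assumption.

    import cv2

    def is_blurry(image_path, threshold_blur=100):
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        var_laplacian = cv2.Laplacian(gray, cv2.CV_64F).var()
        return var_laplacian <= threshold_blur  # blurry images are discarded from the dataset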

Table 4.2 Pseudocode of Dataset Collection Scheme

Input: The consecutive video frames, total images per class (number_face), threshold of blurry image (threshold_blur)

Output: Face dataset (face_dataset)

1 define number_face and threshold_blur
2 current_frame = 0
3 initialize parameters of MTCNN scheme
4 face_dataset = empty set
5 for each frame in the consecutive video frames:
6     read the frame as image
7     calculate the variance of laplacian (var_laplacian) for image
8     if var_laplacian > threshold_blur:
9         face_image = MTCNN scheme detects face
10        save face_image in face_dataset
11        current_frame = current_frame + 1
12    if current_frame > number_face:
13        break
14 return face_dataset

MTCNN is utilized for face alignment, which effectively focuses and crops faces in images, thereby improving model accuracy, as detailed in Table 4.2. This approach integrates the elimination of blurry images with face alignment for optimal results.
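
As one possible realization, the open-source mtcnn Python package can perform this detection and cropping step; the sketch below is an assumption about tooling (the dissertation only states that MTCNN is used), and the file name is illustrative.

    from mtcnn import MTCNN
    import cv2

    detector = MTCNN()
    image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
    # Each detection is a dict with 'box', 'confidence' and facial 'keypoints'.
    for detection in detector.detect_faces(image):
        x, y, w, h = detection["box"]
        face_image = image[y:y + h, x:x + w]  # cropped face to be saved into face_dataset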

Table 4.2 outlines the pseudocode for the dataset collection scheme, beginning with the definition of all parameters in Lines 1-4. The parameter number_face indicates the total images per class, while threshold_blur serves as a pre-defined limit to determine if an image is considered blurry. The current_frame parameter tracks the number of collected images, and the dataset collection process concludes when current_frame equals number_face. Additionally, the parameters for the MTCNN scheme are initialized.

For Lines 5-7, each input image is read, and the corresponding variance of Laplacian is calculated.

Lines 8-11 indicate that MTCNN is used to conduct face alignment if the image is not blurry.

Lines 12-14 show that the program stops and returns the face dataset once enough images have been collected.

Model Architectures

Recent advancements in face recognition architectures have led to the evaluation of three different models to determine the most efficient option for this study. The VGGFace network, developed by Parkhi et al., is built upon the foundational designs of Simonyan and Zisserman, featuring a deep architecture that emphasizes model simplification. This network consists of essential components, including convolutional layers, max-pooling layers, fully-connected layers, and a softmax layer, with five max-pooling layers implemented to reduce input size. The architecture includes three fully-connected layers, with the first two layers outputting 4,096 dimensions each and the final layer producing 2,622 dimensions. The softmax layer serves as the concluding component, while the triplet loss function is utilized to optimize the model, focusing on minimizing the distance between anchor and positive samples and maximizing it between anchor and negative samples.
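
The excerpt does not reproduce the loss itself; for reference, the triplet loss as commonly stated in the VGGFace and FaceNet papers is

$$L = \sum_{i=1}^{N} \left[ \, \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \, \right]_{+}$$

where $x_i^a$, $x_i^p$, and $x_i^n$ are the anchor, positive, and negative samples, $f(\cdot)$ is the embedding network, and $\alpha$ is the enforced margin between positive and negative pairs.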

The ArcFace model, introduced by Deng et al., emphasizes the importance of discriminative power in feature learning for deep CNN-based face recognition. To enhance feature discrimination, they developed an additive angular margin loss (ArcFace), which is mathematically represented in Equation 2, where m is the margin parameter, s represents the feature scale, and θ_j indicates the angle between the weight of class j and the feature.
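
Equation 2 itself is not preserved in this excerpt; as published by Deng et al., the additive angular margin loss has the form

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\,s\cos(\theta_{y_i} + m)}}{e^{\,s\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{\,s\cos\theta_j}}$$

where $N$ is the batch size, $y_i$ is the ground-truth class of sample $i$, and $\theta_{y_i}$ is the angle between that sample's feature and the weight vector of its class.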

The study explores the implementation of the ArcFace model using various network architectures, specifically focusing on the ResNet-50 architecture introduced by He et al. This architecture is notable for its deeper bottleneck design, which consists of three convolution layers: two 1×1 layers for dimension reduction and restoration, and a 3×3 layer serving as the bottleneck. The ArcFace model has demonstrated state-of-the-art performance in its applications, reinforcing the effectiveness of the ResNet-50 architecture in deep learning tasks.

The FaceNet model, introduced by Schroff et al., utilizes an end-to-end learning architecture that incorporates the Inception-ResNet-v1 framework, designed by Szegedy et al., to extract crucial features from facial images and produce a vector known as an embedding. This architecture includes Inception-A, Inception-B, and Inception-C modules, with shortcut connections enhancing the depth of the ResNet. Images are efficiently encoded into feature vectors, which are then processed through a triplet loss function for face recognition. This triplet loss function enables the FaceNet model to effectively learn both the similarities within the same class and the dissimilarities between different classes.

Figure 4.3 The Inception-ResNet-v1 Network [77]

Enrolling a New Person

The face recognition system identifies users and provides their names, but if it encounters an unfamiliar individual, it labels them as "Unknown." This system allows for the registration of new individuals without the need to retrain the entire model. It generates a unique embedding for each new person, as shown in Figure 4.4, ensuring that only the database of known faces needs to be updated.

Figure 4.4 Illustration for New Person Registration

In deep learning inference, a face embedding is generated from the testing face image using a trained model, which is then compared with all other face embeddings in the database. To optimize inference time, various algorithms, such as support vector machines (SVM) and k-nearest neighbor (k-NN), are employed. Our model leverages machine learning techniques, with different face embeddings as input. The SVM is designed to calculate a hyperplane in N-dimensional space for effective classification of data points.

In Figure 4.5, a multi-class classifier is trained using Support Vector Machines (SVM) with diverse face embeddings as input, where each class represents a distinct individual. This classifier efficiently categorizes new face embeddings by comparing them to support vectors, allowing for accurate identification of the corresponding class. This approach significantly lowers computational costs while enhancing classification accuracy.
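
A minimal sketch of this enrollment and identification step is given below, assuming scikit-learn's SVC; the stored-embedding file names and the get_embedding helper that wraps the trained embedding network are hypothetical placeholders, not the dissertation's actual code.

    import numpy as np
    from sklearn.svm import SVC

    # Stored face embeddings (N x D) and their person labels (illustrative files).
    embeddings = np.load("face_embeddings.npy")
    labels = np.load("face_labels.npy", allow_pickle=True)

    classifier = SVC(kernel="linear", probability=True)
    classifier.fit(embeddings, labels)

    def identify(face_image, threshold=0.5):
        emb = get_embedding(face_image).reshape(1, -1)  # hypothetical helper: image -> D-dim embedding
        probabilities = classifier.predict_proba(emb)[0]
        best = int(np.argmax(probabilities))
        return classifier.classes_[best] if probabilities[best] >= threshold else "Unknown"

    # Enrolling a new person only adds embeddings and refits this classifier;
    # the CNN that produces the embeddings does not need to be retrained.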

Figure 4.5 The Process of New Person Registration


Gender, Age and Emotion Classification Function

Overview of Gender, Age and Emotion Classification Function

Figure 5.1 Overview of Gender, Age and Emotion Classification Function

The second function of the system enhances detection capabilities by providing additional information such as gender and age when a stranger is identified. While it operates similarly to the first function, it takes longer to process each image due to the added classification tasks. An overview of the gender, age, and emotion classification function is illustrated in Figure 5.1. The process begins with capturing the input image from the user's current scene, followed by utilizing pre-trained models to classify gender, age, and emotion. The results are then aggregated to form a comprehensive output, which is documented in the function's result description.

Gender Classification Schemes

Recent advancements in gender classification using CNNs have been introduced by various researchers, including Liew et al. [48], who developed a compact CNN model that classifies gender from facial images. This model features a simple architecture that integrates convolutional and subsampling layers, consisting of three convolutional layers (C1, C2, and C3) and one output layer (F4), with input images sized at 32 × 32 pixels. To optimize performance, the model employs cross-correlation in place of traditional convolution operations, and the training is facilitated by a second-order backpropagation algorithm coupled with the stochastic diagonal Levenberg–Marquardt (SDLM) algorithm.

Duan et al. developed a hybrid model that combines Convolutional Neural Networks (CNN) and Extreme Learning Machine (ELM) to classify age and gender. This model consists of two main components: feature extraction and classification, where CNN is utilized to extract features from input images, followed by ELM for classifying the results. The feature extraction process involves three convolutional layers, two contrast normalization layers, and two max-pooling layers, arranged alternately. A fully connected layer then transforms the feature maps into vectors, which serve as input for ELM in the age and gender classification task. Additionally, the forward-propagation and back-propagation processes are essential for enhancing the performance of this hybrid architecture.

The compact soft stagewise regression network (SSR-Net) developed by Yang et al. introduces an enhanced model for age and gender classification. This 2-stream architecture features heterogeneous streams, each built with basic blocks that include 3x3 convolution, batch normalization, non-linear activation, and 2x2 pooling layers. Notably, stream 1 employs ReLU activation and average pooling, while stream 2 utilizes Tanh activation and maximum pooling. This strategic variation between the two streams significantly boosts the model's performance.

Figure 5.2 SSR-Net Structure with Three Stages (K=3) [86]

Age Classification Schemes

Deep learning plays a pivotal role in computer vision tasks, such as age classification. Shang and Ai introduced a novel deep neural network called Cluster-CNN for classifying age from facial images. The process begins with face normalization using a landmark detector, which crops the face to a standard scale based on the distance between the eyes. The normalized face is then processed by the Cluster-CNN to extract facial features, which are subsequently grouped using a k-means++ algorithm. The network is retrained on each group to select a branch with a learnable cluster module, ultimately leading to age prediction.

Hu et al. [30] introduced an advanced deep CNN model aimed at enhancing age estimation accuracy. This model processes age-labeled images alongside year-labeled image pairs, where each pair features two images of the same individual. To assess age differences, the Kullback-Leibler divergence is employed. Furthermore, the model incorporates adaptive entropy loss and cross-entropy loss for each image, ensuring the distribution achieves a single peak value. Three distinct loss functions are strategically implemented atop the softmax layer to effectively capture and represent age differences.

Levi and Hassner introduced a straightforward convolutional neural network (CNN) for age classification, featuring three convolutional layers and two fully-connected layers. Each convolutional layer is equipped with a ReLU activation function and a max-pooling layer, with the first two layers also incorporating local response normalization. The architecture includes 96 filters of 7×7 pixels in the first layer, 256 filters of 5×5 pixels in the second, and 384 filters of 3×3 pixels in the third. The network concludes with two fully-connected layers, each containing 512 neurons, followed by a ReLU activation and a dropout layer.
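
For illustration only, the architecture described above could be sketched in Keras roughly as follows; the input size, the number of age classes, the pooling and padding details, and the omission of local response normalization (which Keras has no built-in layer for) are assumptions on my part rather than the original configuration.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    num_age_classes = 8  # illustrative number of age groups

    model = models.Sequential([
        layers.Conv2D(96, (7, 7), activation="relu", input_shape=(227, 227, 3)),
        layers.MaxPooling2D(pool_size=(3, 3), strides=2),
        layers.Conv2D(256, (5, 5), activation="relu"),
        layers.MaxPooling2D(pool_size=(3, 3), strides=2),
        layers.Conv2D(384, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(3, 3), strides=2),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_age_classes, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])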

Figure 5.3 Illustration of the CNN Architecture for Age Classification [44]

Emotion Classification Schemes

Facial expressions serve as a powerful and universal means for humans to communicate their emotions and intentions. In the realm of computer vision, face emotion classification has gained significant attention. Jain et al. introduced a deep CNN model designed for this purpose, featuring six convolution layers complemented by three max-pooling layers. The architecture also incorporates two blocks of deep residual learning, each containing four convolution layers of varying sizes. The model concludes with two fully-connected layers, followed by a ReLU activation function and a dropout layer. Detailed specifications of this innovative network can be found in Table 5.1.

Figure 5.4 The CNN Model for Emotion Classification [33]

Table 5.1 Details of the Network for Emotion Classification [33]

Type  Filter Size / Stride  Output Size

Jaiswal et al. [34] introduced a CNN architecture designed for emotion classification, featuring two parallel sub-models with identical kernel sizes. Each sub-model incorporates four types of layers: convolutional, local contrast normalization, max-pooling, and flatten layers, all processing the same input image to extract high-quality features. These features are then flattened into vectors and concatenated into a single extended vector matrix. The architecture concludes with a softmax layer for effective emotion classification, resulting in enhanced model accuracy due to the dual sub-model structure.

Jalal et al. introduced an end-to-end convolutional self-attention framework for facial emotion classification, comprising four CNN blocks (C1-C4), a self-attention layer (A1), and a dense block (D1). The first CNN block (C1) features a convolutional layer with a 3x3 kernel, batch normalization, and ReLU activation, producing 32 output feature channels. The second block (C2) includes two convolutional layers with 3x3 and 5x5 kernels, each followed by max-pooling and batch normalization, increasing the feature channels from 32 to 192. The third block (C3) consists of three convolutional layers, with the first and third layers followed by max-pooling and batch normalization, resulting in 192 input and 128 output feature channels. Following C3, the self-attention layer (A1) models inter-region relationships in the feature maps. Finally, the dense block (D1) contains two fully-connected layers, the first with a ReLU activation and dropout, and the second culminating in a softmax layer.

Figure 5.5 The Model for Real-Time Emotion Classification [4]

Arriaga et al. proposed a real-time CNN model for emotion classification, known as mini-Xception, as illustrated in Figure 5.5. This fully-convolutional network incorporates batch normalization and ReLU activation functions after each convolution. The architecture features four residual depth-wise separable convolution modules, enhancing its performance in emotion recognition tasks.


Each depth-wise separable convolution consists of two distinct layers: depth-wise convolutions and point-wise convolutions. To generate predictions, the final layer employs global average pooling followed by a softmax activation function.

Object Detection Function

Overview of Object Detection Function

The object detection function allows the system to identify and count various objects in the current scene. It begins by capturing an input image, followed by the use of pre-trained models to select the most efficient one based on accuracy and processing time. The output includes a detailed description of the detected objects, specifying their names and quantities. Additionally, an adaptable object order table is created to organize the results based on specific scenarios.

Figure 6.1 Object Detection Function Overview

Object Detection Schemes

Since the launch of R-CNN, the pioneering CNN-based object detector, there has been significant progress in the field of general object detection. This section highlights several key architectures that represent advancements in object detection technology.

Regions with CNN features (R-CNN) has emerged as a leading object detection method, initially introduced by Girshick et al. The R-CNN model utilizes a selective search algorithm to generate 2000 region proposals from a single image, which are then processed through a CNN for feature extraction. Subsequently, class-specific linear support vector machines classify each region. While this powerful approach enables effective object localization and segmentation, it faces challenges with real-time implementation.

Figure 6.2 R-CNN Based Object Detection Model [24]

Girshick later introduced Fast R-CNN to address issues in the original R-CNN algorithm. In this approach, features are extracted from the entire input image and subsequently processed through a region of interest (RoI) pooling layer, which reshapes region proposals from the feature map to a uniform size. Each resulting feature vector is then input into a series of fully connected layers.

Fast R-CNN uses a multi-task loss to achieve end-to-end learning, where the network is jointly trained with a multi-task loss on each labeled RoI. It has been shown that Fast R-CNN achieves significant improvements in training and testing speed as well as in detection accuracy.

Ren et al. introduced the Faster R-CNN method for real-time object detection, utilizing Region Proposal Networks (RPNs) that enhance the efficiency and accuracy of region proposal generation. This learned RPN significantly improves the quality of region proposals, leading to higher detection accuracy. Building on this foundation, He et al. developed the Mask R-CNN method, which not only detects objects in an image but also accurately segments a mask for each object instance. Mask R-CNN extends Faster R-CNN by adding a parallel branch for object mask prediction alongside the existing bounding box recognition, offering easy implementation and the flexibility to adopt various architectural designs during training.

You Only Look Once (YOLO), introduced by Redmon et al., is a one-stage object detection system known as YOLOv1, capable of real-time object detection. The model divides the input image into an S × S grid, where each grid cell predicts an object if its center falls within that cell. Each cell generates B bounding boxes and corresponding confidence scores, defined as Pr(Object) multiplied by the Intersection over Union (IOU) of the predicted box. Additionally, each grid cell predicts C conditional class probabilities, which are dependent on the presence of an object. The predictions are represented as an S × S × (5B + C) tensor. The YOLO architecture comprises 24 convolutional layers and 2 fully connected layers, with 1×1 convolutional layers utilized to reduce the feature space. During pre-training on the ImageNet dataset, the first 20 convolutional layers are employed, followed by an average pooling layer and a fully connected layer.
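
As a concrete check of the tensor size, with the configuration used in the original YOLO paper, namely S = 7 grid cells per side, B = 2 boxes per cell, and C = 20 Pascal VOC classes, the prediction tensor is

$$S \times S \times (5B + C) = 7 \times 7 \times (5 \cdot 2 + 20) = 7 \times 7 \times 30.$$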

As an improved version, YOLOv2 was later proposed by Redmon and Farhadi.

YOLOv2 employs several effective strategies that enhance its speed and precision compared to YOLOv1. The introduction of Darknet-19, featuring 19 convolutional layers and 5 max-pooling layers, allows for rapid network resizing and supports multi-scale training. Additionally, the incorporation of batch normalization in every convolution layer significantly improves the mean average precision (mAP) and helps to regularize the model.

Furthermore, all fully connected layers are removed in YOLOv2, and anchor boxes are used to predict bounding boxes. YOLOv2 also achieved state-of-the-art results on standard detection tasks.

YOLOv3, an enhanced version of YOLOv2 introduced by Redmon and Farhadi, employs multi-label classification for effective object detection in images. By utilizing independent logistic classifiers, it predicts multiple labels for each object, which significantly improves accuracy in complex datasets with overlapping labels. Additionally, YOLOv3 makes box predictions at three different scales, with its final convolutional layer outputting a 3D tensor that encodes bounding boxes, objectness, and class predictions. The model also incorporates a new feature extraction network called Darknet-53.

Darknet-53 contains 53 convolutional layers and achieves the highest measured floating point operations per second (FLOPS); therefore, this network structure makes better use of the GPU.

The Single Shot Detector (SSD), introduced by Liu et al., is a one-stage object detection framework that efficiently identifies multiple categories of objects. Utilizing the VGG16 backbone architecture, SSD enhances the truncated base network by incorporating additional convolutional feature layers designed to detect objects at various scales. This approach effectively addresses the challenge of varying object sizes by merging predictions from feature maps with differing resolutions. Each feature map contains default bounding boxes with unique aspect ratios and scales, eliminating the need for traditional object proposal methods. Consequently, SSD simplifies the detection process by performing all computations within a single network, facilitating easier training and rapid integration into detection systems. Notably, SSD has achieved state-of-the-art performance in both accuracy and speed for object detection tasks.

Arrangement of Result Description

To enhance user experience with complex result descriptions, an object order table is established to identify object positions, allowing for adjustments based on specific situations. This method ensures that users can access information swiftly and efficiently without impacting the final detection results.

Figure 6.1 illustrates the object detection function, which utilizes a pre-trained model based on the Common Objects in Context (COCO) dataset. This dataset comprises natural images that depict everyday scenes and offer contextual information, with a majority of the objects labeled and segmented within the images.

The COCO dataset defines 91 object categories, but 11 of them (street sign, hat, shoe, eyeglasses, plate, mirror, window, desk, door, blender, and hair brush) are not labeled or segmented in the released images. As a result, only 80 object categories are actually labeled and segmented and can be recognized by the pre-trained model.

The COCO dataset categorizes objects into super categories such as person and accessory, animal, vehicle, outdoor objects, sports, kitchenware, food, furniture, appliance, electronics, and indoor objects To establish object positioning in result descriptions, we created an object order table based on these super categories For instance, when considering outdoor scenarios, the top 10 objects are organized as follows: person, bicycle, motorcycle, car, bus, truck, train, traffic light, stop sign, and bench.
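
A minimal sketch of how such an order table can be applied before the result description is read to the user is given below. The priority list mirrors the outdoor ordering above, and any detected label not in the table is appended after the prioritized ones; the function and variable names are illustrative and do not correspond to the actual implementation.

OUTDOOR_ORDER = ["person", "bicycle", "motorcycle", "car", "bus",
                 "truck", "train", "traffic light", "stop sign", "bench"]

def arrange_description(detected_labels, order=OUTDOOR_ORDER):
    # Sort detected COCO labels by the order table; labels outside the table go last.
    rank = {label: i for i, label in enumerate(order)}
    return sorted(detected_labels, key=lambda label: rank.get(label, len(order)))

print(arrange_description(["bench", "dog", "car", "person"]))
# -> ['person', 'car', 'bench', 'dog']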

System Prototype and Implementation

Experimental Results

Conclusions


