RESEARCH AND DEVELOPMENT OF FACIAL EMOTION RECOGNITION APPLICATION BASED ON DEEP LEARNING

Facial emotion recognition (FER) has garnered significant attention from researchers due to its potential applications. The main goal of FER is to map different facial expressions to their corresponding emotional states. Traditional FER methods typically involve two primary stages: feature extraction and emotion recognition. Recently, deep neural networks (DNNs), particularly convolutional neural networks (CNNs), have been employed extensively in FER because of their effective image feature extraction capabilities. Several studies have used CNNs with a limited number of layers to address FER; however, standard shallow CNNs with simple architectures cannot adequately extract high-level features from images. A common limitation of many existing methods is their focus on frontal images, neglecting profile views, which are crucial for a practical facial emotion recognition system. This report proposes a very deep FER model that uses transfer learning to build a highly accurate recognition system. The approach adopts a pre-trained deep convolutional neural network (DCNN) model, replaces its upper dense layers with layers compatible with FER, and then fine-tunes the model with facial emotion data. A novel training strategy is introduced in which the dense layers are trained first, followed by successive fine-tuning of each pre-trained DCNN block, gradually improving FER accuracy. The proposed FER system is evaluated with eight different pre-trained DCNN models on the well-known KDEF, JAFFE, and FER2013 facial image datasets, and a real-time web application for emotion identification from facial expressions is built on top of it.
INTRODUCTION IN FACIAL EMOTION RECOGNITION
Emotion
Emotion is a complicated behavioral phenomenon with several neurological and physiological components [1]. Emotions are natural qualities of humans and play an important part in social communication [2], [3]. Humans display emotion in several ways, including facial expressions [4], [5], voice [6], and body language [7].
➢ Joy (Happiness) - shown by lifting the mouth corners (an evident grin) and tightening the eyelids
➢ Surprise is represented by the brows arching, the eyes expanding wide and showing more white, and the mouth slightly dropping
➢ Sadness - shown by lowering the mouth corners, the brows sinking to the inner corners, and the eyes sagging
➢ Anger is expressed by lowering brows, pressing forcefully on lips, and bulging eyes
➢ Disgust is expressed by elevating the upper lip, wrinkling the nasal bridge, and raising the cheeks
➢ Fear is expressed by elevating the top eyelids, opening the eyes, and stretching the lips horizontally
➢ Contempt is expressed by half of the upper lip contracting and the head tilting slightly back
Emotions are conditions of mind and body brought on by neurophysiological changes. They are associated with thoughts, feelings, behavioral responses, and a degree of pleasure or displeasure. There is no agreement among scientists over the meaning of the word; mood, temperament, personality, disposition, and creativity are frequently associated with emotion [8].
Facial Emotion Recognition in Deep CNN
Ekman and Friesen [9] investigated human facial expressions and established seven universal emotions: happiness, sadness, anger, fear, surprise, disgust, and neutrality. Recognizing emotions from facial appearance has recently become a major research area in mental health, psychiatry, and neuroscience. In addition, automatic detection of facial emotion expressions is required for smart living, health-care systems, human-robot interaction (HRI) and human-computer interaction (HCI), diagnosis of emotion disorders in schizophrenia and autism spectrum disorders, and HRI-based social welfare initiatives. Because of its potential for multiple uses, the scientific community has concentrated on facial emotion recognition (FER).
The primary goal of FER is to map different facial movements to the corresponding feelings. Conventional FER consists of two main phases: feature extraction and emotion recognition. In addition, image pre-processing is required, including face detection, scaling, normalization, and cropping. Face detection removes non-facial and background components before the face is cropped. Feature extraction from the processed picture is the primary task of a typical FER system; modern approaches make use of techniques such as linear discriminant analysis and the discrete wavelet transform (DWT). Finally, the extracted features are used to classify emotions, a process frequently carried out by neural networks (NNs) and other machine learning techniques.
Deep neural networks (DNNs), and especially convolutional neural networks (CNNs), have become increasingly prominent in FER because of their integrated feature extraction strategy for images [10,11]. CNNs have reportedly been used in several studies to solve FER problems. However, even though more complex models have been shown to perform better in other image-processing applications, existing FER methodologies have mostly examined CNNs with only a few layers. The reasons for this lie in FER-specific issues. First, emotion detection needs an image with a sufficiently high resolution, which means handling high-dimensional data. Second, the classification task is made more difficult by the very small differences between faces that result from different emotional states. On the other hand, a very deep CNN contains a large number of hidden convolutional layers; naively training a CNN with many hidden layers is ineffective and does not offer good generalization. Furthermore, accuracy cannot be improved beyond a certain threshold merely by adding layers, due to the vanishing gradient problem. Deep CNN architectures and training methods therefore incorporate several enhancements and strategies to boost accuracy [12]. VGG-16 [13], ResNet-50, ResNet-152 [14], Inception-v3 [15], and DenseNet-161 [16] are the most widely used pre-trained DCNN models. However, enormous quantities of data and substantial processing power are required to train such deep models.
RELATED WORKS IN THIS REPORT
Literature Review In This Report
2.1.1: LA-Net: Landmark-Aware Facial Expression Recognition in the Presence of Label Noise
This approach addresses noisy labels, which significantly impair performance in real-world situations. LA-Net exploits facial landmarks to mitigate the impact of noisy labels from two perspectives. Furthermore, the method can be integrated into any deep neural network to improve training supervision without adding inference cost.
✓ First, LA-Net improves training supervision by using landmark information to resolve expression ambiguity and constructs a label distribution for each sample via neighborhood aggregation
✓ Second, LA-Net incorporates landmark information into expression representations via an expression-landmark contrastive loss
The model includes three fundamental components: the backbone framework, the expression-landmark contrastive loss, and label distribution estimation (LDE)
Figure 2: Label Distribution Estimate (LDE) Structure
Label distribution estimation computes the label distribution of each sample and utilizes it as an additional supervision signal, based on the premise that an expression should share its emotion with its neighbors in the feature space. Landmark information is used to correct errors in the expression space. To produce the target label distribution, this module first learns pairwise contribution scores and then aggregates them over the neighborhood. To reduce the effect of batching on online aggregation, target label distributions are averaged over past epochs using an exponential moving average (EMA)
Figure 3: Expression-Landmark Contrastive Loss (EL Loss)
The expression-landmark contrastive loss injects landmark information into expression representations, providing a solid foundation that is less sensitive to label noise. The method treats landmarks and expressions as two views of a face image and builds their interaction using supervised contrastive learning
2.1.2: POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition
To address all three issues comprehensively, the paper proposes a two-stream Pyramid Cross-Fusion Transformer network (POSTER)
• Inter-class similarity: Images that are similar yet have small differences can fall into different expression categories. As seen in Figure 4, a small alteration in one area of a picture, such as the mouth, can determine the expression category, even if the overall appearance stays essentially the same. Because of the subtlety of these variations, present approaches may not be sufficient to distinguish between such pictures
Figure 4: Inter-class similarities and intra-class differences
• Intra-class discrepancy: Pictures in the same expression category may differ substantially from sample to sample in terms of a person's age, gender, and skin tone, as well as the appearance of the background. As seen in Figure 4, the two images both represent happiness but look very different from one another
• Scale sensitivity: When used carelessly, deep-learning networks usually exhibit sensitivity to picture quality and resolution. The image sizes in FER datasets and real-world testing images are very different. Therefore, in FER, consistent performance across scales is essential
Facial landmarks are a collection of significant points on the human face that indicate key regions associated with facial expression. At the same time, globally distributed information outside the landmarks (such as wrinkled brows and cheeks) is crucial for recognizing facial expressions. Motivated by this, the authors construct a two-stream network with an image stream and a landmark stream. The key matrices of the two streams are exchanged, resulting in a cross-fusion process that enables feature collaboration. Specifically, this technique allows prior knowledge of salient regions from the landmarks to steer the image features; similarly, the image features provide the landmark stream with representations of the global context through the block operations
Figure 5: Architectures of POSTER for Facial Expression Recognition (FER)
A face landmark detector is used to extract landmark characteristics Xlm. The image backbone is used to extract image characteristics (Ximg). "+" denotes a patch-wise concatenation operation
Figure 6: Cross-fusion transformer encoder
(a) The standard transformer encoder, sourced from ViT
(b) An encoder known as the cross-attention transformer (CrossViT)
This approach promotes better contextual understanding, reducing intra-class discrepancies and inter-class similarities. To achieve this, cross-fusion MSA blocks are created from the extracted image features Ximg and landmark features Xlm, as illustrated in Figure 7
Figure 7: Cross-fusion MSA block
(a) The transformer encoder's standard MSA block
(b) Proposed cross-fusion MSA block for the transformer encoder
• P signifies the number of patches
Machine Learning-Based FER Approaches
In the field of artificial intelligence (AI), especially in the machine learning subdomain, automatic FER is a challenging task. As the FER problem has advanced, several traditional machine learning techniques (such as neural networks and K-nearest neighbor) have been employed. An early FER technique is attributed to Xiao-Xu and Wei [17]: they first enhanced the facial image with a wavelet energy feature (WEF), extracted features using Fisher's linear discriminant (FLD), and then classified the emotion using the K-nearest neighbor (KNN) method. In addition to KNN for FER classification, Zhao et al. [18] used principal component analysis (PCA) and non-negative matrix factorization (NMF) to extract features. Using regional data, Feng et al. [19] computed local binary pattern (LBP) histograms from many small image segments, merged them into a single feature histogram, and then applied linear programming (LP) to categorize emotions. Zhi and Ruan [20] used 2D discriminant locality-preserving projections to generate facial feature vectors. In a recent study, Joseph and Geetha used a variety of classification methods, such as logistic regression, LDA, KNN, classification and regression trees, naive Bayes, and SVM, to examine their proposed face geometry-based feature extraction technique. The primary flaw in the previously discussed methods is that they only assessed frontal views, whereas a practical FER system must extract characteristics from both frontal and profile views.
Deep Learning-Based FER Approaches
Deep learning for FER is a relatively new machine learning approach, with some earlier CNN-based experiments existing in the scientific literature. Zhao and Zhang [21] paired a deep belief network (DBN) with a neural network (NN) to conduct FER on self-collected facial emotion photos: the DBN was used for unsupervised feature learning, while the NN was used for emotion classification. A more extensive approach with two convolutional-pooling stages and four inception layers was also examined. Pons and Masip [22] built an ensemble of 72 CNNs and trained each CNN with a distinct convolutional filter size or number of neurons in the fully connected layer. Other work initialized a predefined number of CNNs with stacked convolutional auto-encoder weights before training on face images, trained an ensemble of one hundred CNNs, reported cases where random CNN initialization outperforms pre-trained initialization, and provided FaceNet2ExpNet, a deep facial recognition architecture that can be extended by transfer learning. A hybrid deep learning architecture for FER that combines a CNN and a recurrent neural network (RNN) was proposed by Jain et al. [56]. Other studies examined a hybrid architecture with TL in which an SVM classifies pre-trained AlexNet features, and employed a fairly deep CNN architecture with 18 convolutional layers and four subsampling layers for FER on DWT-extracted features. More recently, researchers have explored a graph-based CNN for FER using facial landmark characteristics, suggested a CNN-based ensemble method for FER, used a CNN-based approach that takes both labeled and unlabeled data into account, and assessed several data augmentation strategies, such as using computer-generated images to train the deep CNN, finding that a combination of synthetic images and other methods performs better for FER. Most of the existing learning-based methods, however, have only handled frontal images.
Overview of Convolutional Neural Networks, Deep Convolutional Neural Networks, and Transfer Learning
Convolutional Neural Networks (CNN)
As a result of their innate structure, convolutional neural networks provide the best way to represent the image domain [23]. A CNN is composed of three kinds of layers: an input layer for data, multiple hidden layers that employ convolution and pooling, and an output layer. Convolution is a mathematical operation that merges two functions to produce a third function expressing how the shape of one is modified by the other; it is the technique by which a CNN's small kernel (3 × 3, 5 × 5) identifies patterns in images. Pooling is a type of nonlinear down-sampling: a pooling layer combines non-overlapping regions of the previous layer into single values. Figure 8 depicts the general design of a conventional CNN with two convolution-pooling layers. The first convolution layer applies convolution operations to the input picture, producing the initial convolved feature maps (CFMs), which serve as the input for the subsequent pooling operation. The first pooling operation yields the initial subsampled feature maps (SFMs). Following the initial pooling, the second convolution-pooling layer operations are performed. After the values of the second SFMs are flattened, the fully connected layer, also known as the dense layer, performs the final reasoning step by connecting its neurons to every activation in the preceding layer. The last layer, often known as the loss layer, specifies how training penalizes deviations between expected and actual outputs. A detailed explanation of CNNs may be found in the published literature. Generally, such a CNN architecture is used for pattern recognition from small (e.g., 48 × 48) input pictures, such as handwritten numeral identification
Figure 8: The general structure of a convolutional neural network is a structure with two convolution-pooling layers
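The conventional two-stage convolution-pooling structure described above can be sketched in a few lines of Keras. This is a minimal illustration only; the 48 × 48 grayscale input, the filter counts, and the seven-class output are assumed values, not taken from the figure.

```python
# Minimal sketch of a conventional CNN with two convolution-pooling layers.
# Input size, filter counts, and the 7-class output are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),                               # small grayscale input
    layers.Conv2D(32, (3, 3), activation='relu', padding='same'),  # first convolution -> CFMs
    layers.MaxPooling2D((2, 2)),                                   # first pooling -> SFMs
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),  # second convolution
    layers.MaxPooling2D((2, 2)),                                   # second pooling
    layers.Flatten(),                                              # flatten for the dense layer
    layers.Dense(128, activation='relu'),                          # fully connected (dense) layer
    layers.Dense(7, activation='softmax'),                         # output layer over 7 classes
])
model.summary()
```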
Deep Convolutional Neural Network Models and Transfer Learning Motivation
Deep convolutional neural networks are challenging to train because they operate on high-dimensional images and include several hidden convolutional layers. The critical convolutional layouts and connections vary across deep convolutional neural network (DCNN) models. AlexNet, which employed five convolutional layers, was the first model to attain exceptional accuracy on ImageNet. Similar in idea, but with fewer parameters, ZFNet achieves a comparable accuracy level by replacing the larger kernels with smaller ones; ZFNet used only 1.3 million photographs to reach the same outcome as AlexNet, which was trained on 15 million photos. A deeper model of depth 16 with 13 convolutional layers and smaller kernels was later introduced as VGG-16. Another model in this family, VGG-19, has 16 convolutional layers
The skip connection, first presented in the residual neural network (ResNet), is an essential feature used by most later models. A skip connection works by routing a layer's input forward and adding it to the output a few layers later; this gives the layer additional information and helps to solve the vanishing gradient problem. Several ResNet models with varying depths are available, including ResNet-18, -34, -50, and -152; the number of convolutional layers in a model is one less than the depth specified in the model's name
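The skip connection can be illustrated with the Keras functional API. The sketch below is only a schematic of the idea (the filter count and block layout are assumptions, not an exact ResNet block definition): the block's input is added back to the output of two convolutions.

```python
# Sketch of a basic residual (skip-connection) block using the Keras functional API.
# Assumes the input tensor already has `filters` channels; otherwise the shortcut
# would need a 1x1 convolution to match shapes.
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                           # keep the input for the skip connection
    y = layers.Conv2D(filters, (3, 3), padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                        # add the input to the transformed output
    return layers.Activation('relu')(y)
```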
DenseNet introduced dense skip connections across layers, rather than a single skip link. Every layer receives signals from all layers above it, and every layer below it uses that layer's output: one layer's input is the channel-wise concatenation of the outputs of all preceding layers. DenseNet therefore features L(L+1)/2 direct connections, while a traditional CNN with L layers has L direct connections. Because every layer has direct access to the layers above it, the system has a smaller data bottleneck; consequently, the model achieves exceptional computational efficiency and becomes significantly more compact. Since DenseNet blocks are built by concatenating feature maps, deeper layers would require a large amount of processing due to their enormous input, so relatively inexpensive 1 × 1 convolutions are used to decrease the channel counts and increase parameter efficiency. Furthermore, the non-linearity within the k-th layer is approximated by combining the feature maps of layers 0 to (k − 1) and applying a non-linear function to this combined feature map. This model has several versions; the DenseNet-161 model, for example, consists of just four dense blocks and 157 convolutional layers
Another advanced convolutional neural network model composed of multiple modules is called Inception. The main concept is to introduce non-linear behavior, experiment with various filters, and stack the modules; this prevents the network from committing to a fixed filter and enables it to learn any combination of filters that it needs. To reduce computational expense, the module uses 1 × 1 convolutions to reduce the number of channels. In addition to stacking these Inception modules, the network also has several auxiliary branch layers that predict the result and indicate whether the model is underfitting or overfitting. The latest version of the Inception model is Inception-v3; it has multiple Inception modules and around forty convolutional layers
Any large deep convolutional neural network model requires a great deal of effort to train, since there are many parameters to adjust. A large network frequently needs a considerable quantity of training data, and overfitting may arise from training with too little data. It can be challenging to obtain enough data for proper DCNN training in some tasks, and there are situations where collecting a large amount of data is simply impractical. However, studies have shown that transfer learning can be a very useful tool for addressing this problem. The idea behind transfer learning (TL) is to apply knowledge representations learned from a different but similar task; it has been observed that the technique performs better when the two tasks are similar
FACIAL EMOTION RECOGNITION USING TRANSFER LEARNING
Facial Emotion Recognition (FER) Using Transfer Learning (TL)
The primary contribution of this research is the use of pre-trained deep convolutional neural network models with appropriate transfer learning (TL) for facial expression recognition. The first layers of a convolutional neural network gather basic visual features such as corners and edges; intermediate layers recognize more intricate features, such as textures or shapes, and the upper layers apply the same principle to pick up more complex patterns. The tasks performed by the lower levels of deep convolutional neural networks in FER are therefore similar to other image-based operations such as classification, since the fundamental features of all images are similar. Because creating a deep convolutional neural network model from scratch takes a great deal of work, the TL approach can be used to adapt a DCNN model that has already been trained for another task to emotion recognition. For image classification, a large dataset (ImageNet) was used to create and train a DCNN model (VGG-16) that is suitable as a starting point for FER. The transfer learning concepts for facial emotion recognition (FER) and the proposed deep FER approach are described in the following subsections, along with relevant examples
Figure 9: General architecture of transfer learning with a deep CNN model for emotion identification
The overall structure of transfer learning with a deep convolutional neural network for recognizing facial emotions is shown in Figure 9. The convolutional base is a pre-trained DCNN with its original classifier removed, and the newly added FER layers form the new classifier. Repurposing a pre-trained DCNN generally entails two key steps: replacing the original classifier with a new one and fine-tuning the model. Often, the new classifier component is a set of fully connected dense layers. Practically speaking, selecting a pre-trained model and establishing size-similarity matrices for adjustment are essential components of TL. When fine-tuning a model, three approaches are generally used: train the entire model, train certain layers and freeze the others, or train only the classifier (i.e., freeze the entire convolutional base). Training only the classifier and/or a few layers is sufficient for fine-tuning on a similar task; on the other hand, comprehensive model training is necessary for some tasks, in which case both the convolutional base and the added classifier are fine-tuned. A pipeline approach is used in this study to manage the time-consuming processes of selecting which parts to train and of choosing appropriate training procedures for fine-tuning, in order to improve FER
Figure 10: Illustration of the proposed FER system based on transfer learning in deep CNNs
The diagram in Figure 10 illustrates a Convolutional Neural Network (CNN) architecture of the kind commonly used for image classification tasks. This CNN comprises several types of layers: Convolution + ReLU layers, Max Pooling layers, and Fully Connected + ReLU layers. The Convolution + ReLU layers (indicated in gray) are responsible for feature extraction from the input image. The network begins with Conv-1, which applies 32 filters of size 3x3, followed by 64 filters of size 3x3. Conv-2 applies 128 filters of size 5x5, Conv-3 applies 512 filters of size 3x3, and Conv-4 applies 256 filters of size 3x3. Each convolutional operation is followed by a ReLU activation function, introducing non-linearity into the model and allowing it to learn complex patterns
Max Pooling layers (indicated in orange) follow the convolutional layers, reducing the spatial dimensions (height and width) of the feature maps to decrease computational cost and control overfitting. Max pooling uses a 2x2 window across the feature maps, taking the maximum value in each window. Specifically, pooling layers follow the 64-filter Conv-1, 128-filter Conv-2, 512-filter Conv-3, and 256-filter Conv-4
Finally, the Fully Connected + ReLU layers (indicated in green) handle classification based on the features extracted by the previous layers. The network includes FC-5 with 512 neurons and FC-6 with 256 neurons, each followed by a ReLU activation function to maintain non-linearity. This hierarchical structure of convolutional and pooling layers for feature extraction, followed by fully connected layers for classification, enables the CNN to effectively process and classify facial expressions
Figure 11: Illustration of the proposed Emotion Recognition model using the VGG-16 model plus dense layers
The provided model summary details the layer types, their output shapes, and parameter counts. The architecture begins with a Conv2D layer, conv2d_60, which has 32 filters of size 3x3, resulting in an output shape of (None, 48, 48, 32) and containing 320 parameters. This is followed by another Conv2D layer, conv2d_61, with 64 filters of size 3x3, yielding an output shape of (None, 48, 48, 64) and 18,496 parameters. Next is a BatchNormalization layer, batch_normalization_72, which maintains the output shape of (None, 48, 48, 64) and has 256 parameters. A MaxPooling2D layer, max_pooling2d_48, follows, reducing the spatial dimensions to (None, 24, 24, 64) without adding parameters. Then a Dropout layer, dropout_72, is applied, keeping the output shape at (None, 24, 24, 64). The next Conv2D layer, conv2d_62, consists of 128 filters of size 5x5, producing an output shape of (None, 24, 24, 128) and containing 204,928 parameters. This is followed by another BatchNormalization layer, batch_normalization_73, which maintains the output shape at (None, 24, 24, 128) with 512 parameters. The model then includes another MaxPooling2D layer, max_pooling2d_49, which reduces the dimensions to (None, 12, 12, 128) and does not add parameters. Finally, another Dropout layer, dropout_73, retains the output shape of (None, 12, 12, 128). The total number of parameters in the model is 2,726,151, of which 2,722,695 are trainable and 3,456 are non-trainable
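For illustration, the convolutional front end summarized above can be reconstructed approximately in Keras from the reported output shapes and parameter counts (the 5 × 5 kernel of conv2d_62 follows from its 204,928 parameters). The dropout rates and anything beyond dropout_73 are assumptions.

```python
# Approximate reconstruction of the layers summarized above, inferred from the
# reported output shapes and parameter counts; dropout rates are assumptions.
from tensorflow.keras import layers, models

front_end = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),   # (48, 48, 32),  320 params
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),   # (48, 48, 64),  18,496 params
    layers.BatchNormalization(),                                     # (48, 48, 64),  256 params
    layers.MaxPooling2D((2, 2)),                                     # (24, 24, 64)
    layers.Dropout(0.25),
    layers.Conv2D(128, (5, 5), padding='same', activation='relu'),  # (24, 24, 128), 204,928 params
    layers.BatchNormalization(),                                     # (24, 24, 128), 512 params
    layers.MaxPooling2D((2, 2)),                                     # (12, 12, 128)
    layers.Dropout(0.25),
])
front_end.summary()
```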
Figure 11 shows the proposed Emotion Identification model, combining the VGG-16 model with dense (i.e., fully connected) layers. The VGG-16 model was pre-trained on the ImageNet dataset. A block bearing the 'Fine-tune' mark must be fine-tuned, whereas the 'Frozen/Fine-tune' mark indicates that fine-tuning is optional
The full graphical architecture of the proposed model is shown in Figure 11, which contains a complete pre-trained VGG-16 model plus dense layers for FER. The additional component, with three fully connected layers arranged in a cascade, is represented by the green part of the figure. First, the 'Flatten' layer converts its input into a one-dimensional vector; its only purpose is to provide a single representation compatible with the subsequent emotion-recognition layers, and no other processing is applied to the input. The next layer is connected to the previous one: the first hidden dense layer translates the flattened high-dimensional vector into an intermediate-length vector that serves as input to the layer that follows. The final dense layer produces a vector whose entries represent the scores of the different emotional states
Since the whole model, i.e., the pre-trained DCNN plus the additional dense layers, sits in a single pipeline, emotion data can be used to fine-tune the dense layers and the required few layers of the pre-trained model. Each of the five convolutional blocks that make up the pre-trained VGG-16 model shown in Figure 11 contains two or three convolutional layers followed by a pooling layer. The convolution and pooling operations are 2D, matching the 2D input images. Conv Block 1, the first block, consists of two convolutional layers and a cascaded MaxPooling layer; its output serves as Conv Block 2's input, while the inputs are accepted by Block 1's first convolutional layer. Through convolution and pooling in its successive blocks, the VGG-16 model transforms an input color image of size 224 × 224 × 3 into an output of size 7 × 7 × 512. The flatten layer generates a linear vector of size 25,088 (= 7 × 7 × 512), which is passed to the first dense layer; this performs a linear operation and generates a vector of length 1000, which is the input for the next dense layer of length 128. The final dense layer outputs scores for seven distinct emotions
The most important stage of the transfer-learning-based FER methodology is fine-tuning, in which the proposed model is adjusted using a well-considered method to obtain a better result. Since the weights of the added dense layer(s) are initialized randomly, these layers must always be fine-tuned; during the fine-tuning phase, some or all blocks of the pre-trained model may also be included. The VGG-16 model's Conv Block 5 is labeled 'Fine-tune' in Figure 11, signifying that adjusting this block is necessary, whereas fine-tuning the other four blocks is optional, as shown by the label 'Frozen/Fine-tune'. If the extra dense layers and the pre-trained model were trained concurrently, the random initial weights of the dense layers would produce a poor gradient; this poor gradient would propagate through the pre-trained portion of the network and push the model away from the intended behavior. A pipeline approach is therefore employed to fine-tune the model with the facial emotion dataset: the new dense layers are trained first, and then the selected blocks of the pre-trained VGG-16 model are progressively added to the training. To train the individual layers of the VGG-16 model's Conv Block 5, fine-tuning is extended gradually rather than all at once; this keeps track of accuracy and lessens the effect of the initial random weights. It is worth noting that the optimal result with a given deep CNN model may require different fine-tuning choices for different datasets, depending on their size and other characteristics.
Facial Emotion Recognition (FER) Using Fine-tuning
Fine-tuning is a powerful technique in deep learning where a pre-trained model is adapted to a new, related task with a smaller, task-specific dataset. In the context of Facial Emotion Recognition (FER), fine-tuning involves using a pre-trained Convolutional Neural Network (CNN) model, such as VGG16, ResNet, or InceptionV3, and adapting it to recognize emotions from facial images. Below is a step-by-step guide to performing FER using fine-tuning:
The first step is to load a pre-trained CNN model. These models have been trained on large datasets, such as ImageNet, and have learned to extract powerful features from images. For this example, we will use the VGG16 model, a popular choice for image-related tasks due to its depth and architecture. The pre-trained model is loaded without its top layers, which are specific to the original classification task (e.g., ImageNet classes). Instead, we will add custom layers for the FER task
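A minimal sketch of this step with Keras is shown below; the 48 × 48 × 3 input shape is an assumption and should match the dataset actually used.

```python
# Load VGG16 pre-trained on ImageNet, without its original 1000-class classifier.
from tensorflow.keras.applications import VGG16

base_model = VGG16(
    weights='imagenet',       # reuse the features learned on ImageNet
    include_top=False,        # drop the top (ImageNet-specific) layers
    input_shape=(48, 48, 3),  # assumed input size; adjust to the FER dataset used
)
```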
Once the pre-trained base model is loaded, we proceed by freezing its layers to retain the learned features and prevent them from being updated during training. This step ensures that the model's pre-learned features remain intact. On top of this base, we add custom layers tailored for the FER task. Typically, this includes a flatten layer to convert the 3D output of the base model into 1D, followed by dense (fully connected) layers with ReLU activation functions to introduce non-linearity and enable the model to learn complex patterns. The final layer is a softmax layer with a number of units equal to the number of emotion classes (e.g., 7 classes for emotions like happy, sad, angry, etc.)
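Continuing the sketch, the base is frozen and a small classification head is stacked on top; the dense layer size and dropout rate are illustrative choices, not values prescribed by the report.

```python
# Freeze the pre-trained base and add a task-specific head for 7 emotion classes.
from tensorflow.keras import layers, models

base_model.trainable = False                    # keep the pre-learned features intact

model = models.Sequential([
    base_model,
    layers.Flatten(),                           # 3D feature maps -> 1D vector
    layers.Dense(256, activation='relu'),       # illustrative size
    layers.Dropout(0.5),
    layers.Dense(7, activation='softmax'),      # one unit per emotion class
])
```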
After modifying the model, it is essential to compile it with an appropriate optimizer, loss function, and evaluation metrics. The Adam optimizer is chosen for its efficiency and adaptive learning rate capabilities. The categorical cross-entropy loss function is used because it is suitable for multi-class classification problems, where each input belongs to one of several classes. The accuracy metric is used to monitor the performance of the model
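The corresponding compile call might look like this; the learning rate is an assumption.

```python
# Compile with Adam and categorical cross-entropy for multi-class emotion labels.
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=1e-4),   # a small rate is common when fine-tuning
    loss='categorical_crossentropy',      # labels are one-hot encoded
    metrics=['accuracy'],
)
```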
The next crucial step is to load and preprocess the dataset. This involves ensuring that the images are resized to the input shape expected by the model and that the labels are one-hot encoded for multi-class classification. Data augmentation techniques can also be applied to artificially increase the size of the training dataset and improve the model's generalization. We use ImageDataGenerator from Keras to rescale pixel values and split the dataset into training and validation subsets
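A sketch of this data pipeline is given below; the directory name data/train is hypothetical and assumes one sub-folder per emotion class.

```python
# Rescale pixel values and split the images into training and validation subsets.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    'data/train',                # hypothetical path: one sub-folder per emotion class
    target_size=(48, 48),
    class_mode='categorical',    # one-hot encoded labels
    subset='training',
)
val_gen = datagen.flow_from_directory(
    'data/train',
    target_size=(48, 48),
    class_mode='categorical',
    subset='validation',
)
```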
The final step involves training the model on the FER dataset using the data generators created earlier. During training, the model learns to map input images to the correct emotion labels by updating the weights of the new layers we added. The training process involves iterating over the dataset for a specified number of epochs, and the performance on both the training and validation sets is monitored to ensure the model is learning effectively and not overfitting
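The training call itself is then a single statement; the epoch count is an assumption to be tuned per dataset.

```python
# Train the new head on the FER data while monitoring validation performance.
history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=30,                  # assumed value; adjust to the dataset and model
)
```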
Fine-tuning a pre-trained model for FER is an efficient and effective approach to leveraging existing deep learning architectures for new tasks. By loading a pre-trained model, adding task-specific layers, and carefully training the model on a new dataset, we can achieve high accuracy in recognizing emotions from facial images. This method not only saves computational resources but also often leads to better performance due to the powerful features learned by the pre-trained model.
Real-Time Facial Expression-Based Emotional Identification System
The seven basic feelings described earlier are referred to as universal emotions. Developers and academics have been working on artificial intelligence to construct systems that not only think and behave like people, but also sense and respond to human emotions. Humans demonstrate universal consistency in emotion recognition, although individual abilities vary greatly. Enabling the electronics around us to identify our emotions can only improve our interactions with machines and with the rest of mankind. The goal of this research is to provide individualized user experiences that can enhance lives
Following face detection, an efficient CNN-based architecture is employed to train and test the facial expression classifier. There are numerous existing architectures for this purpose, as detailed earlier in the report. Here, a bespoke architecture is used to determine the expression of the detected face, based on facial feature analysis
The suggested lightweight convolutional neural network (CNN) model comprises four convolutional layers, one fully connected layer, and one output layer, each followed by a non-linear ReLU activation function for thresholding. The ReLU activation function removes neurons with values less than or equal to zero from the network, leaving neurons with positive values in place. Two max pooling layers are used in the model to decrease the dimensionality, which shortens the training time of the network
Because of the simplicity of the model and of the data, overfitting may occur during training. Batch normalization and dropout techniques are used to reduce this overfitting. The number of output neurons in the final fully connected layer equals the number of classes recognized by the network. The softmax classifier is then used to assign the input to the correct class
Haar cascades are a prominent technique in the field of computer vision, especially well known for object and face detection. Developed by Paul Viola and Michael Jones in 2001, this method is based on the use of Haar features to identify regions of an image that are likely to contain a target object, such as a human face. One of the biggest advantages of Haar cascades is their ability to detect objects quickly and efficiently, which is important in applications that require high speed and performance
The process of facial emotion recognition begins with the detection of faces in an image. The Haar cascade model uses Haar features (simple black-and-white patterns) to identify regions of the image that resemble faces. The Haar cascade system consists of several stages (cascades), each of which is a simple classifier; if an image region passes all of these stages, it is determined to contain a face. Once a face is detected, a bounding box is drawn around it, marking the location and size of the face in the image
Once the face is detected and marked with a bounding box, the next step is to crop the face from the original image. This cropping helps to focus on important regions, remove irrelevant parts, and reduce noise. The cropped face is then preprocessed, including converting to grayscale and adjusting the size. Grayscale helps to reduce unnecessary color information, focusing on the shape and texture features important for emotion recognition. The image size is adjusted to suit the input requirements of later feature extraction and classification models
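A sketch of the detection and preprocessing steps with OpenCV is shown below; the input file name is hypothetical, and the 48 × 48 target size is assumed to match the emotion model.

```python
# Detect faces with a Haar cascade, then crop, grayscale, and resize each face.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

image = cv2.imread('input.jpg')                            # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)

for (x, y, w, h) in faces:
    face = gray[y:y + h, x:x + w]                          # crop the detected face
    face = cv2.resize(face, (48, 48))                      # match the model's input size
    face = face.astype('float32') / 255.0                  # normalize pixel values
```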
To recognize emotions from faces, important features need to be extracted from the preprocessed facial images. These features typically include elements such as the positions of the eyes, the mouth, and the facial contours. An effective method for doing this is to use a convolutional neural network (CNN). CNNs are capable of automatically learning complex features from input data, which can help recognize different emotions such as happiness, sadness, anger, and surprise. After extracting the features, the model classifies the face into one of the emotion groups based on the learned features
The emotion recognition results after classification are displayed by drawing the corresponding emotion labels on the original image. Bounding boxes are used to mark the face position, and the emotion labels are written on or near the face. This not only makes it easy for users to observe and understand the recognition results, but can also be stored or used in other applications such as user behavior analysis, security monitoring, or improving the user experience in interactive systems
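Continuing the sketch above, classification and annotation could look as follows; the label order is an assumption, and `model`, `face`, `x`, `y`, `w`, `h`, and `image` come from the previous snippets (assuming a grayscale 48 × 48 × 1 model input).

```python
# Classify the preprocessed face and draw the bounding box and emotion label.
import cv2
import numpy as np

emotion_labels = ['Angry', 'Disgust', 'Fear', 'Happy',
                  'Sad', 'Surprise', 'Neutral']             # assumed label order

probs = model.predict(face.reshape(1, 48, 48, 1))           # `model` and `face` from above
label = emotion_labels[int(np.argmax(probs))]

cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(image, label, (x, y - 10),
            cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
```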
Haar cascades provide a fast and efficient method for detecting faces in images, which is an important foundational step in facial emotion recognition. The use of bounding boxes to mark detected faces makes the cropping, preprocessing, and classification processes clearer and more intuitive. Although there are more modern techniques in this field, Haar cascades still play an important role in many practical applications due to their simplicity and efficiency
DATASET RELATED RESEARCH
The Karolinska Directed Emotional Faces (KDEF) Dataset
The Karolinska Directed Emotional Faces (KDEF) dataset was published by Lundqvist, D., Flykt, A., & Öhman, A. (1998): The Karolinska Directed Emotional Faces - KDEF, CD-ROM from the Department of Clinical Neuroscience, Psychology Section, Karolinska Institutet, ISBN 91-630-7164-9
The Karolinska Directed Emotional Faces (KDEF) dataset includes 4,900 images of human facial expressions. The collection contains photographs of 70 people displaying seven distinct emotional expressions, with each expression viewed from five distinct angles:
Full left profile (90 degrees to the left of center)
Full right profile (90 degrees to the right of center)
Left semi-profile (45 degrees to the left of center)
Right semi-profile (45 degrees to the right of center)
Straight-on (frontal view)
Figure 13: Example image of KDEF Dataset
The population consists of 70 amateur actors, 35 female and 35 male. Selection criteria: age range 20-30 years; no beards, mustaches, earrings, or eyeglasses; and preferably no apparent make-up during the photo shoot
All subjects were given written instructions in advance. These instructions included a description of the seven distinct emotions they were to display during the photo shoot. Before the session, each subject was asked to rehearse the various emotions for an hour. It was emphasized that the subject should try to evoke the feeling to be communicated and, while expressing the emotion in a way that felt natural, aim to make the expression strong and unambiguous. All participants wore identical gray T-shirts. They were seated about three meters from the camera; the absolute distance was adjusted for each person by altering the camera position until the individual's eyes and mouth were aligned with predefined vertical and horizontal positions on the camera's grid screen. The lights were adjusted to produce a soft, indirect light spread evenly over both sides of the face. Following a rehearsal session, the individuals were photographed in one expression at a time until all seven expressions were captured (series one). The people were then photographed again with the various emotions and perspectives (series two).
The Japanese Female Facial Expression (JAFFE) Dataset
The Japanese female facial expression (JAFFE) dataset is a collection of images used for research in the fields of facial expression recognition and affective computing. The dataset consists of facial expressions from 10 Japanese female subjects. Each image is annotated with the expression it represents. Additionally, the images were rated by 60 Japanese subjects on a scale of 1-5 for each emotion, providing a subjective measure of the perceived intensity of each expression
The dataset includes 7 facial expressions: happiness, sadness, surprise, anger, disgust, fear, and neutral
Figure 14: Example image of JAFFE Dataset
The JAFFE dataset is widely used in various applications, including:
• Facial expression recognition systems: Developing algorithms that can automatically recognize human emotions from facial expressions
• Human-computer interaction: Enhancing the interaction between humans and computers by enabling systems to respond appropriately to the user's emotional state
• Affective computing: Studying the affective aspects of computing, including the development of systems that can detect and respond to human emotions
• Psychological studies: Understanding how different facial expressions are perceived and recognized across cultures
The JAFFE dataset is especially useful for facial expression recognition research because of its extensive annotations and the cultural distinctiveness of its participants, which provide insights into how expressions are interpreted within a given cultural context
Facial Emotion Recognition 2013 (FER2013) Dataset
The FER2013 (Facial Expression Recognition 2013) dataset is a commonly used dataset in the field of facial expression recognition. The FER2013 dataset contains a vast and diverse collection of facial expressions, making it an excellent resource for creating and testing facial expression recognition algorithms. Its consistent format and diverse set of expressions aid in the development of strong models for a variety of applications in affective computing and human-computer interaction
The dataset was introduced during the ICML (International Conference on Machine Learning) 2013 Challenges in Representation Learning. The dataset contains 35,887 grayscale images; each image is 48x48 pixels. The dataset includes 7 different facial expressions: angry, disgust, fear, happy, sad, surprise, and neutral
The dataset is divided into three sets:
• Training Set with 28,709 images
• Public Test Set with 3,589 images
• Private Test Set with 3,589 images
Breakdown of Images by Expression: The distribution of images across different facial expressions in the training, public test, and private test sets is as follows:
Expression Training Public test Private test Total
Figure 15: Example image of FER2013 Dataset
The FER2013 dataset is extensively used in:
• Training machine learning models: To recognize and classify facial expressions
• Emotion recognition systems: Improving human-computer interaction by enabling devices to understand and respond to human emotions
• Affective computing: Studying the emotional aspects of human-computer interaction
DESIGN WEB APPLICATION
Streamlit Library
Streamlit is an innovative Python library specifically designed to enable developers to create highly interactive and visually appealing web applications using straightforward and minimal code. This powerful tool simplifies the web application development process by providing an intuitive API and a wide range of pre-built components and widgets, allowing users to build complex data-driven applications with ease and efficiency. By leveraging Streamlit, developers can focus more on the logic and functionality of their applications, rather than getting bogged down with intricate web development details
Streamlit is a framework for building and developing Python-based online apps that may be used to communicate analytics findings, create sophisticated interactive experiences, and demonstrate new machine learning models. Furthermore, designing and deploying Streamlit apps is extremely rapid and versatile, typically reducing application development time from days to hours
The best thing about Streamlit is that you can create your own online application, or start using it, without having to understand the basics of web programming. Streamlit is a great option if you have a strong interest in data science and want to quickly and effectively deploy your models with a minimal amount of code. Providing an application with an efficient and user-friendly interface is one of the most important factors in determining its success; the problem with many modern data-heavy apps is that they need a good user interface that can be added rapidly and without complicated work
A promising open-source Python tool called Streamlit aids developers in producing eye-catching user interfaces rapidly. If you don't know how to write front-end code, Streamlit is the easiest way to get your code onto a website:
• There is no need for any prior experience or expertise of front-end programming (HTML, JS, or CSS)
• Don't need to spend days or months developing a web app; instead, you can construct a stunning machine learning or data science app in a matter of hours or even minutes
• It is compatible with the majority of Python libraries, including pandas, matplotlib, seaborn, plotly, Keras, PyTorch, and SymPy (latex)
• Creating outstanding web apps requires less code
• Data caching simplifies and accelerates calculation pipelines.
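As a quick illustration of the points above, the following is a minimal Streamlit sketch; the file name app.py and the displayed values are hypothetical, and it is run from the command line with streamlit run app.py.

```python
# Minimal Streamlit sketch: an upload widget plus a placeholder result chart.
import pandas as pd
import streamlit as st

st.title('Facial Emotion Recognition Demo')

uploaded = st.file_uploader('Upload a face image', type=['jpg', 'jpeg', 'png'])
if uploaded is not None:
    st.image(uploaded, caption='Uploaded image')
    # ... run the emotion model here and display its output ...
    st.bar_chart(pd.DataFrame({'probability': [0.7, 0.2, 0.1]},      # placeholder values
                              index=['Happy', 'Neutral', 'Sad']))
```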
Pyngrok
Pyngrok is a Python wrapper for the popular ngrok utility, which enables developers to connect their local development servers to the internet safely. This is very handy for web development, testing webhooks, and sharing your work with others without deploying it to a production server. Ngrok is a tool that establishes a secure tunnel between a public URL and a local server on your PC. It is a simple approach to safely expose your local servers to the internet, granting outsiders access to your development environment. Pyngrok is a Python package that facilitates ngrok usage by providing a Pythonic interface for tunnel creation and management. It enables you to include ngrok's functionality straight into your Python programs
• Secure Tunnels: Create secure tunnels to your local host, making your local development server accessible from the internet
• Easy Integration: Integrate seamlessly with Python applications, supporting frameworks like Flask, Django, and Streamlit
• Configuration Options: Customize tunnels with options for subdomains, authentication, region specification, and more
• Monitoring and Logs: Access detailed logs and monitor traffic through the ngrok dashboard
Pyngrok is a handy tool for developers who need to quickly and securely connect to the internet from their local development environments. It is best suited for creating and testing webhooks, sharing web apps, and interacting with CI/CD processes. Pyngrok enables developers to streamline their development workflow, making it simpler to collaborate and test in real-world circumstances
Design Web
NumPy is a Python library that provides support for multidimensional arrays and matrices as well as a variety of mathematical functions for working with them
• Supports mathematical and logical operations on multidimensional arrays
• Supports matrix and vector calculations
• Provides array constructors, enriching data creation
Streamlit is a Python library that allows you to easily create interactive web applications using just simple lines of Python code
• Create interactive web applications using only Python
• Supports interactive components such as widgets, tables, and charts
• Supports creating charts and displaying data in many different formats
TensorFlow is an open-source library from Google for Machine Learning and Deep Learning
• Support building, training, and deploying machine learning models
• Provides APIs for many levels of abstraction, from beginners to researchers and professional developers
OpenCV is a popular open-source library for image processing and computer vision. Main feature:
• Process photos and videos from different sources
• Detect and recognize objects and faces
• Process and recognize objects in real-time
Streamlit-WebRTC is a Streamlit extension that allows integrating video and audio streams into your web application
• Supports real-time video and audio transmission
• Allows building highly interactive web applications with streaming video
Protocol Buffers (Protobuf) is a means developed by Google to serialize structured data
• Define data structures in .proto files
• Automatically generate code for many different programming languages to process the data defined in .proto files
Pandas is a Python library that provides powerful data structures and data analysis tools. Main feature:
• Supports reading, writing, and processing data from many different sources (CSV files, Excel, SQL database, etc.)
• Provides methods and functions to perform common data analysis operations such as filtering, grouping, and calculating
Seaborn is a Python library based on Matplotlib, used to draw attractive and readable statistical graphs
• Integrates easily with Pandas DataFrames
• Provides functions for drawing distribution, relationship, and heatmap plots
Pyngrok is a Python library that provides an interface to ngrok, allowing you to create secure tunnels from your local host server to the internet
• Create secure tunnels from localhost to the internet
• Supports creating public URLs for your web application, API, or server
Altair is a Python library for creating interactive statistical charts
• Use simple and easy-to-understand syntax
• Create charts that are easy to read and interact with
Each library in the list above has unique features and applications that help you build web applications, process data, and perform data analysis tasks efficiently and flexibly
This part focuses on the process of deploying a Streamlit application using Pyngrok. The goal is to create an interactive web application from the Streamlit application and provide a public URL to access the application remotely
The purpose of this implementation is to:
• Create an interactive web application that is easy to use and user-friendly
• Allows access to remote applications via the internet safely and conveniently
• Facilitate application sharing and development in a collaborative and development environment
The source code is written in Python and uses the following libraries:
• “os”: To perform system operations such as running shell commands
• “threading”: To run the Streamlit application in a separate thread
• “pyngrok.ngrok”: To create and manage tunnels from localhost to the internet
• The first step is to set up ngrok's authentication token using the “ngrok.set_auth_token()” function with the token provided by ngrok
• Next, a “run_streamlit()” function is defined to run the Streamlit application. In this example, the application is run on port 8501
• A new thread is created to run the Streamlit application by calling the “run_streamlit()” function. This helps the application run concurrently with other activities without affecting the deployment process
• Then a tunnel is created from the Streamlit port (8501) to the internet using Pyngrok with the parameters “addr='8501'”, “proto='http'”, “bind_tls=True” The tunnel's public URL is printed on the screen so users can access the application remotely
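Putting the steps above together, a deployment sketch could look like the following; the ngrok token and the app.py file name are placeholders.

```python
# Run the Streamlit app in a background thread and expose port 8501 via ngrok.
import os
import threading
from pyngrok import ngrok

ngrok.set_auth_token('YOUR_NGROK_TOKEN')            # placeholder authentication token

def run_streamlit():
    # launch the Streamlit application (hypothetical app.py) on port 8501
    os.system('streamlit run app.py --server.port 8501')

threading.Thread(target=run_streamlit, daemon=True).start()

public_url = ngrok.connect(addr='8501', proto='http', bind_tls=True)
print('Public URL:', public_url)
```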
Streamlit application deployment using Pyngrok is a simple, effective, and flexible solution for sharing your application with remote users. Pyngrok helps create secure and easily accessible tunnels, providing a convenient means of deploying and sharing your web applications during development and collaboration
This part presents a facial emotion recognition application built with Python and popular libraries such as OpenCV, Keras, Streamlit, and Pytube. This application allows users to perform emotion recognition directly from the webcam, from uploaded photo or video files, or from YouTube links
Source code and libraries used: The source code is written in Python and uses the following libraries:
• “numpy”: To process numerical data
• “cv2” (OpenCV): For image and video processing
• “keras”: To load and use deep learning models
• “streamlit”: For building user interfaces and deploying web applications
• “pytube”: To download videos from YouTube
• The deep learning model is loaded and used to predict emotions from faces. The necessary folders containing the model and weights are preloaded
• The “VideoTransformer” class inherits from “VideoTransformerBase” in “streamlit_webrtc” and performs emotion recognition on live video frames from the webcam
• The “process_video()” and “process_youtube_video()” functions are used to process video files uploaded by the user or downloaded from YouTube
• The user interface is built using “streamlit” and divided into sections that allow users to choose between different functions such as recognizing directly from the webcam, uploading photos or videos, or importing YouTube links
• Users can select functions from the app's sidebar, including:
➢ Identify directly from the webcam
➢ Upload photos to detect emotions
➢ Upload videos for emotion recognition
➢ Enter the YouTube link to detect emotions in videos from YouTube
• The application provides clear and easy-to-understand instructions for users to use every function
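A condensed sketch of how these pieces fit together is shown below. It is based on the older streamlit_webrtc VideoTransformer interface mentioned above; the model file name, the label order, and the widget labels are assumptions.

```python
# Sketch of the webcam branch of the app: Haar-cascade face detection plus a
# Keras emotion classifier, streamed through streamlit_webrtc.
import cv2
import numpy as np
import streamlit as st
from keras.models import load_model
from streamlit_webrtc import VideoTransformerBase, webrtc_streamer

model = load_model('model.h5')                                   # hypothetical model file
emotion_labels = ['Angry', 'Disgust', 'Fear', 'Happy',
                  'Sad', 'Surprise', 'Neutral']                  # assumed label order
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

class VideoTransformer(VideoTransformerBase):
    def transform(self, frame):
        img = frame.to_ndarray(format='bgr24')
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
            face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
            probs = model.predict(face.reshape(1, 48, 48, 1))
            label = emotion_labels[int(np.argmax(probs))]
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(img, label, (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
        return img

choice = st.sidebar.selectbox(
    'Function', ['Webcam', 'Upload image', 'Upload video', 'YouTube link'])
if choice == 'Webcam':
    webrtc_streamer(key='emotion', video_transformer_factory=VideoTransformer)
```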
This facial emotion recognition application provides a convenient and intuitive way for users to identify and understand emotions conveyed through facial expressions. The combination of emotion recognition technologies and a simple, easy-to-use user interface makes the app useful and convenient for many purposes, from education to entertainment
This part introduces the idea behind the facial emotion recognition application: to develop an application capable of recognizing and analyzing emotions from users' facial expressions.
The purpose of this idea is:
• To create a useful tool for analyzing emotions, helping users better understand their mental state and that of others
• Support in areas such as education, psychology, counseling, and entertainment
• Facilitate the development of customizable and scalable applications
• Emotion Recognition Directly from the Webcam: The application will allow users to use the computer's webcam to recognize and display emotions directly from facial expressions
• Analyzing Existing Photos and Videos: Users can upload photo or video files for the app to analyze emotions from these files
• Detect Emotions in Online Videos from YouTube: The application can import video links from YouTube and automatically analyze emotions in those videos
• Results Display and Detailed Analysis: The results of the emotion recognition process will be displayed to the user, along with detailed information about the analyzed emotion and further analysis of the emotion's context
Source code and libraries: This idea will use libraries such as OpenCV, Keras, Pytube, and Streamlit to implement these functions and features. Using deep learning models will help identify emotions more accurately and across a wider range of expressions.
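As a hedged illustration of the planned YouTube workflow, the snippet below downloads a video with Pytube and iterates over its frames with OpenCV; the progressive-stream filter and output file name are assumptions:

```python
import cv2
from pytube import YouTube

def download_youtube_video(url, filename="youtube_video.mp4"):
    # Choose a progressive MP4 stream (video and audio in one file) and download it.
    stream = YouTube(url).streams.filter(
        progressive=True, file_extension="mp4").first()
    return stream.download(filename=filename)

def iterate_frames(video_path):
    # Yield frames one by one so face detection and emotion prediction
    # can be applied to each of them.
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield frame
    cap.release()

# Example usage with a placeholder URL:
# path = download_youtube_video("https://www.youtube.com/watch?v=VIDEO_ID")
# for frame in iterate_frames(path):
#     ...  # run face detection and emotion prediction here
```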
Potential and Applications: The application has potential for use in many areas, including education, mental health, counseling, entertainment, and gaming. It can be expanded to integrate additional features such as body language and voice recognition for more comprehensive emotion analysis. The ability to customize and extend the application creates a flexible platform for developers to build custom applications for specific purposes.
The idea of a facial emotion recognition app promises to bring real value to users by helping them better understand their emotions and mental states. The combination of emotion recognition technology and a friendly user interface will create a useful and convenient tool for everyone.
RESULTS AND DEMO
Results on the KDEF, JAFFE, and FER2013 Datasets
This section evaluates the proposed model's efficiency using benchmark datasets. Because CNN is the proposed model's foundation, a series of extensive tests using a standard CNN is first carried out to determine baseline performance. Following that, the impacts of various fine-tuning modes are examined using VGG-16. Finally, the performance of the proposed model is compared to several pre-trained DCNN models.
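As a hedged illustration of the fine-tuning setup referred to above, the sketch below builds a VGG-16 base with a new dense head in Keras; the dense-layer width, dropout rate, and ImageNet weights are assumptions, and after this first stage the convolutional blocks can be unfrozen successively for further fine-tuning:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_vgg16_fer(num_classes=7, input_size=128):
    # Pre-trained VGG-16 convolutional base; the new dense head replaces the
    # original classifier so the model can be fine-tuned on facial emotion data.
    base = VGG16(weights="imagenet", include_top=False,
                 input_shape=(input_size, input_size, 3))
    base.trainable = False  # first stage: train only the new dense layers
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),   # assumed head width
        layers.Dropout(0.5),                    # assumed dropout rate
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```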
The table below shows the test set accuracies for a conventional CNN with two layers, 3 × 3 kernels, and 2 × 2 MaxPooling for input sizes ranging from 360 × 360 down to 48 × 48 on the KDEF and JAFFE datasets. The test set was drawn at random as 10% of the available data, and the reported result for each setting is the best test set accuracy obtained across 50 iterations. The table shows that larger input picture sizes tend to provide better accuracy up to a point for both datasets. For example, the accuracy on KDEF is 73.87% for an input size of 360 × 360, whereas it is 61.63% for a 48 × 48 input. A larger image carries more information, so a system should do well when classifying larger images; however, the highest accuracy was not reached with the largest input size (360 × 360). The best accuracies were achieved for both datasets with an image size of 128 × 128. The reason is that the model is well matched to input data of this size, while larger input sizes require more data and a larger model to achieve better performance. The goal of the proposed strategy is therefore to apply deeper CNN models with TL on a pre-trained model to reduce overfitting while training with a limited dataset.
Table: Test set accuracy of the baseline CNN for each input image size on the KDEF and JAFFE datasets.
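A minimal Keras sketch of this baseline configuration is shown below; only the two-layer structure, 3 × 3 kernels, and 2 × 2 pooling come from the text, while the filter counts, dense-layer width, and grayscale input are assumptions:

```python
from tensorflow.keras import layers, models

def build_baseline_cnn(input_size=128, num_classes=7):
    # Two convolutional blocks with 3 x 3 kernels and 2 x 2 max pooling, as stated
    # above; filter counts and dense-layer width are illustrative assumptions.
    model = models.Sequential([
        layers.Input(shape=(input_size, input_size, 1)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```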
Web Demo
Figure 19: Demo live face emotion detection
Reports on facial emotion recognition systems have shown that the correct recognition rate reaches up to 80%. However, there are several causes of misidentification. One of them is the distance between the face and the camera: when the face is too far from or too close to the camera, the system has difficulty recognizing specific facial features, leading to false recognition. Non-ideal lighting conditions are also an important factor; low light, overly strong light, or uneven light can blur facial features and cause inaccurate recognition. In addition, the viewing angle of the face plays an important role: if the viewing angle is not properly aligned, the ability to recognize emotions correctly decreases. Finally, natural facial variation, such as changes in expression, can also cause inaccurate recognition. To improve recognition performance, measures such as optimizing camera and lighting conditions, enhancing the recognition algorithms, and enriching the training data can be applied. With these improvements, the system's correct recognition rate can be raised above its current level.
7.2.2: Upload Image for Emotion Detection
7.2.2.1: Image With Full Frontal View (0 degrees)
Figure 20: Demo image with full frontal view
When only one person is in the frame, the system achieves very high accuracy, about 98%. This may be because the model only has to focus on a single face, does not divide its resources, and does not need to process many different objects. Frontal facial images, with clear and unobstructed features, help the model detect and classify emotions easily.
The accuracy gradually decreases as the number of people increases to two, three, four, or more. The model must process many faces at the same time, which increases complexity and reduces accuracy because processing resources are distributed. The positioning of each face may also not be optimal, making it difficult to detect and classify emotions accurately. Furthermore, factors such as uneven lighting and slight overlap between faces contribute to the reduced accuracy.
7.2.2.2: Left and right profile (45 degrees to the left and right)
Figure 21: Left and right profile (45 degrees)
When the facial emotion recognition system has only one person in the frame at a diagonal angle of 45 degrees, the accuracy reaches 80% (40/50). This shows that the system recognizes well when there is only one face to analyze, even though the viewing angle is not frontal. However, 20% of cases are still misclassified, possibly because facial features are distorted when viewed from a diagonal angle. CNN-based systems may have difficulty recognizing facial features at this angle, so the handling of different viewpoints needs to be improved to increase accuracy.
When there are two people in the frame at a diagonal angle of 45 degrees, accuracy drops to 64% (32/50). Analyzing two faces simultaneously increases complexity, and the diagonal viewing angle makes recognition even more difficult. The system may have trouble distinguishing faces when they are not in a frontal position, leading to confusion or missed features. Furthermore, light and shadow from one face can affect the other, causing recognition errors.
With three or more people in the frame at a 45-degree diagonal angle, accuracy drops sharply to 50% (25/50). The increased number of faces means the system has to process more information, and the diagonal viewing angle makes recognition even more complicated. Faces may occlude each other, and facial features may be distorted by the non-frontal viewing angle. These factors increase the possibility of errors and lower the correct recognition rate, suggesting a need for more advanced processing techniques to improve accuracy in complex situations.
Overall, the facial emotion recognition system based on deep learning and CNNs recognizes single faces well, but its performance gradually decreases as the number of faces increases and as faces turn away from a frontal view. The main sources of error include distortion of facial features at a 45-degree diagonal angle, uneven lighting, and occlusion between faces. To improve the system's performance, the algorithm should be optimized to better handle different viewing angles, and the ability to distinguish faces in crowded frames should be strengthened. Furthermore, image preprocessing methods that adjust lighting and reduce noise can help increase the system's accuracy.
7.2.2.3: Left and right profile (90 degrees to the left and right)
When the system was asked to recognize the emotions of a face at a 90-degree angle, accuracy dropped sharply compared to the frontal and 45-degree diagonal views, and in many cases the system could not detect the face at all. The main reason is that a 90-degree angle hides most of the important facial features the system relies on to recognize emotions, such as the shape of the eyes, nose, and mouth and the facial bone structure. Because the CNN is trained mainly on frontal or diagonal face images, its recognition ability is seriously impaired at such angles. Recognizing facial emotions at a 90-degree angle remains a major challenge for current deep learning systems: the results show very low accuracy, and the system is often unable to detect faces in these situations.
Causes of Misidentification and Non-Identification:
• Loss of Important Features: A 90-degree angle hides most of the key facial features that the system relies on to recognize emotions. The CNN is trained on features such as the shape of the eyes, nose, and mouth and the facial bone structure; when these characteristics are no longer visible, the system cannot operate effectively.
• Incomplete Training: Models are usually trained on frontal or diagonal face images, and 90-degree profile images are rare in the training dataset. This lack of training data for such viewpoints reduces the system's recognition ability.
• Occlusion and Noise: When there are multiple faces in the frame, the likelihood of occlusion between faces increases. This is especially true when faces are not at frontal angles, making it harder for the system to distinguish and recognize individual faces.
7.2.3: Upload Video For Emotion Detection
Figure 22: Demo upload video for emotion detection
Recognizing faces and emotions in downloaded videos still faces many challenges. The jitter and low accuracy observed show that improvements are needed in both the algorithms and the hardware resources. Recognition on downloaded videos also often stutters, for the following main reasons:
• Hardware Resources: Processing video and running deep learning models for emotion recognition requires substantial processing resources. If the hardware is not powerful enough, the system cannot process frames in time, which leads to stuttering.
• Fast Movement: A video with a lot of fast or unstable movement makes tracking and recognizing faces more difficult. The system may miss or misidentify expressions in blurry or shaky frames.
• Lighting Conditions and Camera Angles: Videos from YouTube often have lighting conditions and camera angles that are not ideal for facial recognition. Insufficient lighting or unfavorable rotation angles can reduce the system's accuracy.
7.2.4: YouTube Link Video Emotion Detection
Figure 23: Demo YouTube video emotion detection
Recognizing emotions from YouTube videos still faces many challenges, especially regarding accuracy and smoothness during recognition. Frequent jerking and misidentification show that further improvements in both software and hardware are needed for the system to operate effectively. When performing face and emotion recognition on YouTube videos, jerking often occurs, mainly because video streams from YouTube can be unstable, especially when the network speed is insufficient or fluctuating. This reduces the user experience and affects the emotion recognition process because frames are not processed continuously. Furthermore, live video processing requires substantial processing resources, especially with complex deep learning models, which can lead to stuttering if the hardware is not powerful enough.