INTERNATIONAL SCHOOL, NATIONAL UNIVERSITY, HANOI
FACULTY OF APPLIED SCIENCES
STUDENT RESEARCH REPORT
Deep Learning Based Graph Convolutional Network Using Hand Skeletal Points For
Vietnamese Sign Language Classification
CN.NC.SV.23_07
Team Leader: Le Quang Nhat
ID: 21070005
Class: MIS2021B
TEAM LEADER INFORMATION
I Student profile
- Full name: Le Quang Nhat
- Date of birth: 07/07/2003
- Place of birth: Quang Ninh
- Class: MIS2021B
- Program: Management Of Information System
- Address: Dong Da, Ha Noi
- Phone no /Email: 0969526500
II Academic Results (from the first year to now)
Academic year Overall score Academic rating
III Other achievements:
Hanoi, April 17, 2024
Advisor Team Leader (Sign and write fullname) (Sign and write fullname)
TABLE OF CONTENTS

4 Experimental Results and Discussion
4.1 Experimental Results
4.2 Visual Performance Insights
4.3 Application potential of the GCN model
4.5 Web Application Development
5 Conclusion and Future Work
Abbreviation
References
LIST OF FIGURES AND TABLES

Figure 1 - Sample Sign Language
Figure 2 - Official numerics
Figure 3 - Official letters
Figure 4 - Data collection example
Figure 5 - Advanced hand-tracking model capable of real-time identification of 21 hand landmarks
Figure 6 - VGG16 Model Architecture
Figure 7 - Standard ViT architecture (with N=9 patches as an example) (a) Data Flow of ViT (b) Transformer Encoder Block
Figure 8 - Architecture of a Graph Convolutional Network for Gesture Classification
Figure 9 - Multi-Stream Graph Convolutional Network for Gesture Classification
Figure 10 - MobileNetV3-Inspired Bottleneck Structure in a Graph Convolutional Network for Gesture Classification
Figure 11 - Performance Metrics Comparison between VGG16 V2 and ViT Models
Figure 12 - Performance Metrics of GCN Models Across Key Evaluation Criteria
Figure 13 - VGG16 Model Confusion Matrix Analysis
Figure 14 - ViT Model Confusion Matrix Analysis
Figure 15 - Training and Validation Performance
Figure 16 - Potential applications of a Graph Convolutional Network (GCN)
Table 1 - Data properties
Table 2 - Number of frames
Table 3 - Performance Metrics of VGG16 V2 and ViT Models
Table 4 - Comparative Performance Metrics of Three GCN Models
INTRODUCTION

Deep Learning Based Graph Convolutional Network Using Hand Skeletal Points For
Vietnamese Sign Language Classification
1 Project Code: CN.NC.SV.23_07
2 Member List:
Full Name Class ID
Phung Phuong Uyén MIS2021B 21070713
Tran Thi Kim Ngoc MIS2021B 21070652
Nguyễn Thị Huyén Trang IB2021C 21070768
Đỗ Thị Trung Anh BDA2021C 21070686
3 Advisor(s): Dr Ha Manh Hung, Ph.D
4 Abstract
In the context of enabling more inclusive communication for the Vietnamese hearing impaired community, Vietnamese Sign Language (VSL) classification emerges as a crucial technological challenge. This paper introduces a groundbreaking approach to the automatic classification of VSL, employing a Deep Learning framework that integrates a Graph Convolutional Network (GCN) with hand skeletal points analysis for enhanced gesture recognition. Our dataset specifically comprises 33 characters and numbers represented in VSL, captured in image format, aiming to cover a comprehensive spectrum of basic communication gestures. The core of our methodology lies in the novel application of GCN, which, by leveraging the structural and dynamic nuances of hand gestures, demonstrates superior performance in recognizing VSL gestures. This is compared with the performances of two additional models: VGGV2 and Vision Transformer (ViT), each selected for their unique strengths in image recognition tasks. VGGV2 is renowned for its deep architecture that excels in capturing texture and detail, while ViT leverages self-attention mechanisms to understand global context. Our comparative analysis reveals that while VGGV2 and ViT each offer significant merits in terms of feature extraction and recognition capabilities, the GCN model exhibits a unique advantage in handling the spatial relationships and intricate patterns inherent in sign language gestures. Through a detailed experimental setup, we demonstrate the GCN model's enhanced ability to classify VSL gestures accurately, highlighting its potential in bridging the communication gap for the hearing impaired. This study opens up new avenues for research into the application of graph-based deep learning models for gesture recognition. The comparative evaluation with VGGV2 and ViT further enriches our understanding of deep learning architectures' capabilities and limitations in the context of sign language recognition.
Index Terms—Vietnamese Sign Language (VSL), VSL Classification, Graph Convolutional Network, Hand Skeletal Points, Gesture Recognition, VGGV2, Vision Transformer (ViT), Deep Learning.
5 Keywords: Deep Learning; Graph Convolutional Network; Vietnamese Sign Language
SUMMARY REPORT IN STUDENT RESEARCH
2023-2024 ACADEMIC YEAR
1 Introduction
Sign language, with its rich diversity, presents a unique challenge in the field of automatic recognition due to the high demand for accuracy and contextual understanding. This becomes even more crucial when considering Vietnamese Sign Language (VSL), a complex system with unique characteristics that are not completely aligned with international sign languages like ASL. Efforts to enhance the communication and interaction capabilities of the deaf community necessitate the development of a system capable of accurately recognizing VSL from visual data. To address the recognition task, we explore the synergy between Deep Learning and Graph Convolutional Networks (GCN) to fully capture the complexities of sign language communication. Deep Learning, known for its ability to learn from large and complex data without explicit programming, has proven to be a powerful tool in solving this issue, particularly due to its proficiency in processing time series and spatial data. This makes it an ideal choice for sign language recognition tasks that require accurate capture of continuous motion and positional changes. Meanwhile, GCN is chosen for its ability to handle graph-structured data, particularly useful when working with hand skeletal data represented as graphs. GCN understands and leverages spatial relationships between nodes (here, hand skeletal points), enabling it to capture hand structures and movements naturally and accurately. The combination of Deep Learning and GCN in this project is based not only on their ability to process image and time series data but also on their deep analysis of complex spatial relationships, making them particularly suited to the challenge of recognizing sign language from images. Through these models, we aim to achieve significant progress in accurately and automatically recognizing VSL, opening new avenues in sign language recognition technology research and development. This study focuses on developing a new VSL dataset comprising 33 letters and numbers, collected from 6 project participants. Each sign was recorded from various angles and under consistent lighting conditions to ensure diversity and minimize bias. The data collection process included cross-checking among members to maximize the accuracy of each sign.
The structure of this paper is organized as follows: Section II provides a comprehensive literature review, exploring prior methodologies for sign language recognition and the application of Deep Learning and GCN in processing skeletal data. Section III introduces our proposed model, detailing the dataset preparation, the deployment of various deep learning models for analysis, and the specific use of GCN for VSL recognition. Experimental results and a thorough discussion on the findings, challenges encountered, and the evaluation of the GCN model's effectiveness are presented in Section IV. The development and integration of a web application to utilize the GCN model for sign language recognition services are also outlined. Section V concludes the paper with a summary of our contributions and insights for future research directions.
2 Literature Review
Appearance-based methods, with their utilization of classifiers and regressors to map image features onto hand poses, have historically been the bulwark of hand gesture language recognition. Such methodologies have played a pivotal role in the field's advancement, yet they are invariably tethered to voluminous datasets and exhibit inherent limitations in their capacity to generalize across diverse scenarios.
In the advent of deep learning, Convolutional Neural Networks (CNNs) have augmented the capabilities of appearance-based methods, bringing to bear a formidable discriminative power that allows for nuanced feature extraction. However, their efficacy is contingent upon the availability of large, well-annotated datasets, a dependency that presents significant challenges, particularly when such datasets are scarce or imbalanced (Jiang, Xia, and Guo, 2019).
Complementing these developments, traditional algorithms such as Hidden Markov Models (HMMs) and Decision Trees have been instrumental in the recognition of symbolic gestures. HMMs, lauded for their temporal sequence modeling, unfortunately stumble when tasked with capturing the spatial relationships that are critical to understanding the full gamut of hand gestures. Decision Trees, in concert with HMMs, offer a means for gesture classification and feature selection. Nevertheless, they are not impervious to the pitfalls of insufficient datasets, which can severely constrain their performance.
Distinguishing further between the sensor-based and vision-based approaches reveals a dichotomy in gesture recognition methodology. Sensor-based methods have the distinct advantage of precision, with instruments like recognition gloves providing exacting measurements of hand movements. Vision-based methods, on the other hand, present a more organic form of interaction by leveraging visual data, thus fostering a naturalistic and pliable user experience. This modality, while rich in potential, is particularly susceptible to the vicissitudes of environmental factors such as lighting and necessitates intricate computations to ensure reliable recognition.
The zeitgeist of sign language recognition has been markedly reshaped by the emergence of deep learning paradigms. Notable CNN architectures, including AlexNet, VGG, and ResNet, have been meticulously adapted and trained on expansive sign language datasets, culminating in significant strides in gesture recognition accuracy (Li et al., 2019).
Within the deep learning milieu, Graph Convolutional Networks (GCNs) have emerged as a formidable architecture, distinct in their capacity to map the spatial relationships inherent in skeletal data. GCNs are adept at delineating the complex, hierarchical structures within and between frames of skeletal data, granting them a pronounced edge in the domain of dynamic sign language movement recognition. The theoretical underpinnings of GCNs are grounded in their ability to operate over graph-structured data, harnessing the power of topological data analysis to perceive and process the intricate patterns of connectivity that define human gestures. By integrating the principles of graph theory and convolutional methodologies, GCNs can identify and extrapolate the relational features intrinsic to sign language, enabling a profound comprehension of its spatial-temporal dynamics.
The ramifications of GCNs' advancements in sign language recognition are multifold. Not only do they hold transformative potential for enhancing accessibility and facilitating real-time communication for the deaf and hard-of-hearing communities, but they also portend a breadth of applications across diverse sectors. These sectors range from augmented reality and human-computer interaction to assistive robotics and beyond, each standing to gain from the intricate gesture recognition capabilities afforded by GCNs.

The evolution of GCNs epitomizes the synergy between deep learning innovation and the quest for universal communication. As the nexus between computational architecture and linguistic expression, GCNs not only pave the way for a new era of inclusivity but also signify a watershed moment in the confluence of artificial intelligence and human language.
3 Data & Methodology
In this study, we approach the recognition of Vietnamese Sign Language (VSL) through two primary methods. First, we repurpose and assess the performance of two prominent deep learning models—VGG16 and Vision Transformer (ViT)—that have been previously tested on the American Sign Language (ASL) dataset. These models are then retrained on our VSL dataset to determine their adaptability to the unique characteristics of Vietnamese sign language. The second method focuses on the application of Graph Convolutional Network (GCN) techniques, a novel approach for analyzing hand skeletal data, proposing an advanced solution for more accurate sign gesture recognition.
3.1 Preparation of the Vietnamese Sign Language Dataset
To start building an application model in sign usage, it is essential to create a dataset with hand sign language. In general, there have been many studies on this issue, with fairly complete accompanying datasets such as American Sign Language [1], or the use of alphabet symbols in English. Our problem is to expand beyond American Sign Language to Vietnamese sign language with standard Vietnamese signs, with the aim of being used by Vietnamese people. Unlike the English alphabet with 26 letters and 10 numerals, the Vietnamese character set comprises 33 characters, including 23 letters and 10 numerals. The dataset process is operated by recording video and extracting it into images to ensure standards of dataset characteristics and accurate, detailed data collection.
Dataset Property
This research introduces a novel dataset comprised of images meticulously extracted from high-quality videos. In order to obtain a sufficient label size and a balanced data distribution for optimal machine learning performance, this dataset encompasses a total of 118,800 images across 33 distinct labels: 23 static letters (from A to Y) and 10 numerals (from 0 to 9). Each label contains 3,600 frames.
Intending to assemble a comprehensive and representative dataset, data collection involved seven individuals, including both male and female participants from the research team. The captured images adhere to stringent quality standards, guaranteeing the accurate representation of the data without streaks, blurring, or noise artifacts. Furthermore, hand images were captured from diverse angles, positions, and hand sizes (considering factors like the length, width, and thickness of fingers and palm). Images are presented in "jpg, jpeg" format, varying in size (from 720 x 1280 to 1920 x 3413 pixels) due to differences in image quality.
Figure 1 - Sample Sign Language (panels: Letter C, Letter D, Letter N)

Figure 2 - Official numerics
Figure 3 - Official letters
Data Property | Detail
Total images | 92,800
Labels | 33 in total (Figures 2, 3): 23 static letters (A - Y) and 10 numerals (0 - 9)

Table 1 - Data properties

Character | Count
0 | 2827
1 | 2806
2 | 2808
3 | 2780
G | 2786
H | 233
I | 2763
K | 2839
L | 2777

Table 2 - Number of frames per character (excerpt)
Thus, through the recording process, the hand gesture dataset in the Vietnamese language was presented with 33 characters. It differs from datasets using the English alphabet in that Vietnamese omits the characters "w", "f", "j", and "z"; instead, it incorporates the Vietnamese letter "d", which is crucial for VSL representation.
Data collection methods
The data collection process includes three stages: Recording, Triangulation, and Classification & Labeling.
Effective data collection is paramount for robust hand gesture recognition models. To achieve this, some data collection requirements have been imposed. First, slight hand rotations during data capture help broaden the dataset's scope and improve the model's ability to generalize across diverse gesture orientations. Second, maintaining well-lit environments with uncluttered backgrounds minimizes noise and interference, allowing the machine learning algorithm to focus on the hand gestures themselves. Standardizing data collection using the left hand with the palm facing the camera avoids potential gesture misinterpretations (Figure 4). Finally, maintaining uniformity among data collectors minimizes variations, ensures consistency, and enhances the overall quality of the dataset. By adhering to these comprehensive data collection guidelines, we can cultivate a robust dataset that effectively trains hand gesture recognition models.
Figure 4 — Data collection example
Following individual data collection, cross-checking by each group is implemented. This collaborative approach ensures the accuracy and completeness of each video before proceeding to the subsequent stage of frame extraction. Each extracted character has been arranged in a folder structure to facilitate frame feature extraction using MediaPipe, to serve the following steps. During the extraction process, frames are brought to a maximum size of 1920 px while maintaining the aspect ratio.
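As a concrete illustration of this extraction step, the sketch below caps a frame's longer side at 1920 px while preserving the aspect ratio. The function name and the OpenCV-based approach are illustrative assumptions rather than the project's exact pipeline.

```python
import cv2

def resize_max_side(frame, max_side=1920):
    """Downscale a frame so its longer side is at most max_side, keeping the aspect ratio."""
    h, w = frame.shape[:2]
    scale = max_side / max(h, w)
    if scale >= 1.0:                                   # frame is already small enough
        return frame
    new_size = (int(w * scale), int(h * scale))        # cv2.resize expects (width, height)
    return cv2.resize(frame, new_size, interpolation=cv2.INTER_AREA)
```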
Our dataset distinguishes itself from existing sign language datasets in several key aspects. First, it has Vietnamese-specific characters, such as "d," which are not found in the standard English alphabet. Secondly, the dataset removes characters that are not part of the Vietnamese alphabet, ensuring a streamlined representation of Vietnamese sign language. These unique characteristics make our dataset a valuable resource for developing robust Vietnamese sign language recognition models. The bright spot of the dataset is that it is reusable: because this is the first full dataset using Vietnamese sign language to be published, it can be expanded both in detail and applied to specific problems and solutions such as deep learning models, machine learning, etc. The current dataset acknowledges a limitation: the absence of dynamic characters. This omission stems from the model presently recognizing symbols within individual frames, not yet analyzing sequences of frames that constitute a complete sign animation. Our future efforts will target the incorporation of dynamic characters to enhance the model's capabilities.
3.2 Using MediaPipe for feature extraction
The process begins with palm detection. The model has been trained to recognize the unique features of the palm, which is critical for establishing a reference system for the rest of the hand. Once the palm is detected, each landmark's position is represented by a normalized triplet of coordinates—x, y and z—that reflects its location in a normalized three-dimensional space. These triplets are recorded in a CSV file, where each line corresponds to a unique gesture, presenting the x, y and z coordinates in a structured manner that preserves their spatial integrity across various image proportions and perspectives. These landmarks are carefully placed at strategic points across the hand, including the fingertips, joints, and the base of the palm, offering a comprehensive map of the hand's pose.
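A minimal sketch of this extraction step is shown below, using the MediaPipe Hands solution API; the image path and the label "A" are hypothetical placeholders. Each detected hand yields 21 landmarks whose normalized x, y, z values are written as one CSV row.

```python
import csv
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

with mp_hands.Hands(static_image_mode=True, max_num_hands=1,
                    min_detection_confidence=0.5) as hands, \
     open("landmarks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    image = cv2.imread("sign_A_sample.jpg")                   # hypothetical frame path
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        landmarks = results.multi_hand_landmarks[0].landmark   # 21 hand landmarks
        row = ["A"]                                            # gesture label (assumed)
        for point in landmarks:
            row.extend([point.x, point.y, point.z])            # normalized coordinates
        writer.writerow(row)                                   # one CSV line per gesture sample
```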
What sets MediaPipe apart is its real-time processing capability. It is designed to function seamlessly, regardless of common obstacles such as variable lighting conditions or partial occlusion of the hand. This robustness is particularly important for applications in real-world settings, where conditions are rarely ideal. Furthermore, MediaPipe Hand provides these landmarks in three dimensions, offering a depth of data that goes beyond simple 2D representations. This 3D model allows for a much richer understanding of hand gestures, which is crucial for nuanced applications like sign language recognition.
For the specific task of classifying Vietnamese Sign Language (VSL) using a deep learning approach, MediaPipe Hand's features offer several compelling advantages. One significant benefit is the precision of its landmark detection. For any graph-based machine learning model, such as a GCN, the detected landmarks can serve as vertices in a graph structure. The GCN can then analyze the spatial relationships between these vertices to identify and classify complex hand gestures.
The use of GCNs for sign language classification is particularly apt because these networks excel at processing data that inherently possesses a graph structure, like the skeletal points of a hand. MediaPipe Hand's 3D data input allows GCNs to exploit spatial relationships between landmarks more effectively, which is instrumental in distinguishing between the various signs in VSL. The richness of 3D data adds a layer of depth to the gesture recognition process, enhancing the model's ability to discern subtle differences between signs.
The efficiency of MediaPipe Hand is another reason for its selection in this research endeavor. The ability to process video streams in real time significantly streamlines the data collection and preprocessing stages, facilitating the construction of a comprehensive dataset necessary for training deep learning models. The effectiveness of a model is often only as good as the data it is trained on, and MediaPipe ensures that high-quality, detailed data is available in abundance.
Moreover, the cross-platform nature of MediaPipe aligns perfectly with the demands of contemporary research. Compatibility with multiple operating systems and hardware means that data can be collected from a variety of sources, ensuring the diversity and representativeness of the training set. This flexibility is critical for developing a robust model capable of operating in different environments and across various devices. Additionally, the open-source aspect of MediaPipe fosters a collaborative environment. It allows for the constant improvement of the tool through community engagement, ensuring that any enhancements or solutions can be shared, thus propelling the project forward with the collective effort of developers and researchers worldwide.

3.3 Pretrained Models for Comparative Performance Evaluation
In the methodology of our study, the exploration of deep learning architectures for Vietnamese Sign Language (VSL) recognition was conducted through a meticulous assessment of VGG16v2 and Vision Transformer (ViT). These models were not arbitrarily selected; instead, their inclusion was justified by their architectural sophistication and algorithmic prowess, which we believed could be leveraged to decode the nuanced gestural lexicon of VSL.
The VGG-16 model is a robust convolutional neural network that has set a precedent in image classification tasks. Developed by the Visual Geometry Group at the University of Oxford, the VGG-16 architecture, with its 16-layer design, including 13 convolutional and 3 fully connected layers, has been pivotal in computer vision. This deep network is distinguished by its sequential arrangement of 3x3 convolutional filters and max-pooling layers, culminating in a softmax layer for classification among 1000 categories (Figure 6). Prior to its application to our Vietnamese Sign Language (VSL) dataset, VGG-16 was pre-trained on the extensive ImageNet dataset, achieving a remarkable top-5 test accuracy of 92.7%. This pre-training has equipped the model with a rich feature representation for various visual objects, providing an extensive foundation for further learning.
The VGG16 model stands out due to its uniform architecture, employing a series of convolutional layers followed by max-pooling layers, and finally fully connected layers. At the beginning of the network, the input layer accepts an image of size 224x224 pixels with three channels, corresponding to the red, green, and blue components of the color image. This input passes through a stack of convolutional layers, where filters are applied to extract various features from the image. Each convolutional operation is followed by a non-linear activation function, ReLU (Rectified Linear Unit), which introduces non-linearity into the system, allowing the model to learn more complex patterns.
The convolutional layers in VGG16 are designed with a very small receptive field (3x3 filter size with a stride of 1), which is repeated throughout the entire network. This small size is beneficial as it allows the network to capture the finer details in the image. The number of filters in these layers starts at 64 and increases by a factor of two after each max-pooling layer, going up to 128, 256, and then 512 filters. Max-pooling layers are employed after several convolutional layers to reduce the spatial dimension of the feature maps, thus reducing the number of parameters and computational complexity. This downsampling is critical to ensure that the network can abstract higher-level features from the data without becoming overwhelmed by the volume of raw input data.
Towards the end of the network, after a series of convolutional and max-pooling layers, the architecture flattens the three-dimensional feature maps into a one-dimensional vector. This vector feeds into a sequence of fully connected layers. VGG16 has three fully connected layers, with the first two having 4096 channels each, and the third performing the final classification, which typically has 1000 channels for the ImageNet challenge, corresponding to 1000 different classes.
The final layer uses a softmax activation function to convert the output of the network into a probability distribution over the 1000 class labels. This softmax layer ensures that the outputs sum up to one and each value lies between 0 and 1, interpreted as the model's confidence in each of the 1000 classes that the input image could potentially belong to.
VGG16's uniform architecture and its success in image recognition tasks have made it a popular choice as a feature extractor for many other types of tasks beyond classification, such as object detection and even style transfer in images. Its straightforward design also makes it an excellent educational tool for those looking to understand the inner workings of convolutional neural networks. Despite its simplicity, VGG16's depth allows it to build an intricate hierarchy of features, making it powerful enough for a wide array of image processing tasks.
Figure 6 - VGG16 Model Architecture
As we integrate VGG-16 into our work, we adapt the model's layers to the specific contours of Vietnamese Sign Language (VSL), fine-tuning its parameters on our curated VSL dataset.
The initial step in this adaptation process is the meticulous preprocessing of the dataset. Images must be standardized to fit the input layer of the VGG-16 model, which entails resizing them to 224x224 pixels. Beyond resizing, it is essential to normalize the images' color values to the range the model was originally trained on. This process ensures that the lighting and color variations in the images do not unduly influence the feature extraction process. Given that the signs are static, it is also crucial that the images be clear, with the hands prominently displayed against the background to facilitate the model's learning of the relevant features.
The adaptation of VGG-16 for VSL requires customization of its architecture. The original model is designed for 1000 classes, which corresponds to the ImageNet challenge it was trained on. However, in the context of VSL, the model must be reconfigured to distinguish between only 33 classes. This involves modifying the final fully connected layers to output predictions across these 33 classes, effectively reshaping the output to match the dataset. Fine-tuning the VGG-16 model is the next crucial phase. The process involves selectively retraining the network's layers, typically beginning with the higher-level layers that are responsible for detecting more abstract features. By adjusting the learning rate to a smaller value, the existing weights—derived from the model's initial training on the ImageNet dataset—are refined rather than overridden, thus preserving the general feature detection capabilities while aligning the model's focus towards the specific patterns of VSL.
The training strategy should be approached cautiously, employing a low learning rate to fine-tune the model parameters. This approach is particularly important in preserving the integrity of the pre-learned features from the ImageNet dataset, which provide a valuable foundation for the recognition of visual patterns.
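The sketch below illustrates this adaptation with a torchvision VGG16: the 1000-way ImageNet head is swapped for a 33-class layer and a small learning rate is used so the pre-trained weights are refined rather than overwritten. The framework choice, the freezing strategy, and the learning-rate value are assumptions made for illustration, not the project's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 33  # 23 static letters + 10 numerals in the VSL dataset

# Load VGG16 with ImageNet weights as the starting point.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Optionally freeze the convolutional feature extractor so only higher layers adapt at first.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final 1000-way classifier with a 33-way layer for VSL.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)

# A small learning rate refines, rather than overrides, the pre-trained weights.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
criterion = nn.CrossEntropyLoss()
```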
Evaluation metrics are paramount in gauging the model's performance accurately. While accuracy is an important metric, in the context of a multi-class classification problem such as sign language recognition, precision and recall also become critical. These metrics help to understand not only how often the model is correct but also how often it misses relevant signs or misclassifies them, which is particularly important if the dataset has an imbalance in the representation of different signs.
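In practice these metrics can be reported per class, for example with scikit-learn as sketched below; the toy label lists are placeholders standing in for the model's true and predicted class indices.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels standing in for the 33 VSL class indices (0-32).
y_true = [0, 1, 2, 2, 5, 7, 7, 9]
y_pred = [0, 1, 2, 3, 5, 7, 7, 9]

# Per-class precision, recall, and F1 expose signs the model misses or confuses.
print(classification_report(y_true, y_pred, digits=4, zero_division=0))

# The confusion matrix shows exactly which sign is mistaken for which.
print(confusion_matrix(y_true, y_pred))
```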
Interpretability is a vital component in understanding and refining the model. Techniques like Grad-CAM allow us to visualize the regions of the image that are most influential in the model's decision-making process. This can be critical for ensuring that the model focuses on the hands and the shape of the signs rather than irrelevant background features.
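A hedged Grad-CAM sketch for the fine-tuned VGG16 is given below. The choice of the last convolutional layer (features[28]) and the hook-based recipe follow a common PyTorch pattern and are illustrative, not the project's exact implementation; the input image is a random placeholder.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Build a VGG16 with a 33-class head; in practice the fine-tuned VSL weights would be loaded here.
model = models.vgg16(weights=None)
model.classifier[6] = torch.nn.Linear(model.classifier[6].in_features, 33)
model.eval()

activations, gradients = {}, {}
last_conv = model.features[28]  # last convolutional layer of VGG16
last_conv.register_forward_hook(lambda m, inp, out: activations.update(a=out))
last_conv.register_full_backward_hook(lambda m, gin, gout: gradients.update(g=gout[0]))

image = torch.randn(1, 3, 224, 224)                # placeholder for a preprocessed sign image
scores = model(image)
scores[0, scores[0].argmax()].backward()           # gradient of the top-class score

weights = gradients["g"].mean(dim=(2, 3), keepdim=True)              # pooled gradients per channel
cam = F.relu((weights * activations["a"]).sum(dim=1, keepdim=True))  # weighted feature-map sum
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
heatmap = (cam / (cam.max() + 1e-8)).squeeze().detach()              # normalized, overlay-ready map
```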
Through careful preprocessing, data augmentation, model customization, strategic fine-tuning, rigorous evaluation, and implementation of interpretability and regularization techniques, the adapted VGG-16 model is poised to become a highly specialized tool for the recognition of VSL. This fine-tuned model has the potential to become an instrumental asset for the VSL community, providing an accessible technology that could bridge communication gaps and support the learning and teaching of Vietnamese Sign Language.
The Vision Transformer (ViT) is a paradigm-shifting architecture that applies the principles of attention mechanisms, originally conceived for natural language processing, to the domain of image classification. Illustrated in Figure 7, the ViT architecture deviates from conventional convolutional approaches by partitioning an image into a sequence of flattened patches, which are then linearly projected and enriched with positional information to preserve the spatial hierarchy of the pixels.
At its core, as depicted in Figure 7(a), the ViT architecture comprises the following components:

Patch & Position Embedding: The ViT begins by dissecting the input image into a series of fixed-size patches. These patches are then flattened and transformed into a one-dimensional sequence through a linear projection, akin to words in a sentence for a language model. To these projected patch embeddings, positional embeddings are added to ensure that the model retains the order and location of each patch, which is essential for maintaining the spatial relationships crucial for understanding the compositional semantics of an image.

Transformer Encoder: An array of Transformer encoder layers, as shown in Figure 7(b), processes the sequence of embedded patches. The encoder layers consist of multi-head self-attention mechanisms and multi-layer perceptrons (MLPs), with normalization applied after each operation.

Classification Head: ViT introduces a learnable 'class' token to the sequence of embedded patches. After processing through the Transformer encoder stack, the representation of this class token serves as the aggregated image representation for classification tasks. The output corresponding to the class token is passed through an MLP head to output the final predictions.
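To make the patch-and-position embedding stage concrete, the following minimal PyTorch sketch splits an image into 16x16 patches via a strided convolution, prepends a learnable class token, and adds positional embeddings. The dimensions follow the standard ViT-Base configuration and are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project them (ViT input stage)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution performs patchify + linear projection in one step.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                     # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # positional information

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)         # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # prepend the class token
        return torch.cat([cls, x], dim=1) + self.pos_embed  # add positional embeddings

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))      # -> shape (1, 197, 768)
```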
Figure 7 - Standard ViT architecture (with N=9 patches as an example). (a) Data Flow of ViT. (b) Transformer Encoder Block.
As we proceed to fine-tune the model for VSL, we construct a robust data augmentation pipeline. By integrating techniques such as `RandomRotation` and `RandomResizedCrop`, we teach the ViT model about the variability inherent in human communication through sign language. These techniques are not random choices; they are specifically designed to mirror the real-world conditions under which sign language is both produced and perceived. For instance, `RandomRotation` allows the model to understand that the orientation of signs is not fixed and that the spatial arrangement of hand gestures is vital for semantic interpretation. Meanwhile, `RandomResizedCrop` ensures that the model remains flexible and robust, capable of interpreting signs regardless of variations in the signer's position or camera framing.

The fine-tuning phase is managed with precision using the `TrainingArguments` and `Trainer` modules. Here, we meticulously calibrate the learning rate, batch size, and the number of training epochs. This calibration is not arbitrary; it is a strategic choice that reflects the unique learning dynamics of VSL. The learning rate, for example, is fine-tuned to ensure the model weights are adjusted in a manner that is neither too abrupt nor too gradual, allowing the ViT to assimilate the new domain knowledge without forgetting its pre-trained capabilities.
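A condensed sketch of this setup is shown below, combining torchvision augmentations with the Hugging Face `TrainingArguments`/`Trainer` workflow. The checkpoint name, the hyperparameter values, and the `train_ds`/`eval_ds` dataset objects are illustrative assumptions rather than the exact configuration used in this study.

```python
from torchvision import transforms
from transformers import ViTForImageClassification, TrainingArguments, Trainer

# Augmentations that mirror real signing conditions: varying orientation and framing.
train_transform = transforms.Compose([
    transforms.RandomRotation(15),                        # signs are not always perfectly upright
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # tolerate signer position / camera framing
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

# Pre-trained ViT with a fresh 33-class head for VSL.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=33,
    ignore_mismatched_sizes=True,
)

args = TrainingArguments(
    output_dir="vit-vsl",              # hypothetical output directory
    learning_rate=2e-5,                # small steps: refine, not overwrite, pre-trained weights
    per_device_train_batch_size=32,
    num_train_epochs=5,
)

# train_ds and eval_ds are assumed to yield {"pixel_values": tensor, "labels": int} items.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```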
Furthermore, the fine-tuning of positional embeddings plays a pivotal role in the adaptation process. In ViT, these embeddings are not just ancillary features; they are crucial for informing the model about the relative positions of the image patches it processes. For the domain-specific task of VSL, where the configuration of hand gestures can entirely alter the meaning of a sign, these embeddings need to be finely attuned. During fine-tuning, these positional embeddings are adjusted, aligning them closely with the specific hand gestures unique to VSL.
The Transformer encoders within ViT are adept at discerning the subtle interplay between different image segments. This capability is especially beneficial for VSL, where the meaning of a sign can hinge on the nuanced relationship between the hands and other parts of the body. Through the fine-tuning process, these encoders become skilled at detecting these nuances, allowing them to differentiate between signs with high precision. This refinement process does not happen overnight; it is the result of iterative training and continuous evaluation.
Speaking of evaluation, throughout the fine-tuning process it is imperative to maintain a rigorous evaluation protocol. Using validation datasets, we consistently assess the performance of the model, making use of techniques like confusion matrix analysis to pinpoint areas where the model may falter. This analysis is not just diagnostic; it informs the ongoing training process, highlighting specific signs or gestures where the model needs additional data or more focused training. The real-world application of the fine-tuned ViT model comes with its own set of challenges and constraints. When integrating the model into an application environment, further adjustments may be necessary to account for the real-time processing needs of VSL interpretation. These may include additional post-processing steps or the fusion of the ViT model's output with other system components to ensure a seamless user experience.
The end goal of this intricate process is a fine-tuned ViT model that can seamlessly interpret the complexities of VSL. This endeavor is not just a technical challenge; it represents a meaningful stride toward inclusivity, bridging communication gaps for the VSL community and facilitating smoother interactions in a variety of settings. Through the careful application of advanced machine learning techniques and thoughtful consideration of the end-user experience, the model promises to be a valuable asset both technologically and socially.
3.4 Graph Convolutional Network for VSL Recognition
To deploy an efficient system capable of recognizing hand gestures, allowing flexible and accessible interaction for a broad user base in everyday life, we have chosen an image-based approach. Our initial idea was to develop a classification task based on graph convolutions, employing techniques for feature extraction, processing, and classification based on graphs. Broadly, our model consists of three main parts:
- Feature extraction of the hand's skeletal framework
- Preprocessing, feature augmentation, and graph structure construction
- Classification model: implementation of models for graph classification tasks
With the initial idea to deploy a classification task based on graph convolution, techniques for extraction, processing, and classification were used. In this study, with an orientation to build a system based on graph classification methods, we start with a simple classification module that includes 2 hidden blocks of the graph convolutional network. Each block in the architecture consists of a normalization layer, a graph convolutional layer, and an activation function. As preprocessed in the feature extraction and preprocessing stages, the input is an undirected graph with 21 nodes interconnected through an adjacency matrix E (2, 42), with each node having 4 features. Through the use of graph convolution, the number of features at each layer is incrementally expanded to 16 features at the first hidden block and 64 features at the second hidden block. After the feature aggregation and extraction process, the graph is synthesized through a global mean pooling layer before passing through two dense layers for classification.
Figure 8 - Architecture of a Graph Convolutional Network for Gesture Classification
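Under the assumption that the graph layers follow the standard GCN formulation, the module described above can be sketched with PyTorch Geometric as follows. The 21 nodes, 4 input features, 16/64 hidden features, global mean pooling, and two dense layers mirror the text; the dense-layer width and the placeholder edge connectivity are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class HandGestureGCN(nn.Module):
    """Baseline: two blocks of (normalization -> graph convolution -> activation),
    global mean pooling, then two dense layers for 33-way classification."""
    def __init__(self, in_feats=4, num_classes=33):
        super().__init__()
        self.norm1 = nn.BatchNorm1d(in_feats)
        self.conv1 = GCNConv(in_feats, 16)     # 4 -> 16 features per node
        self.norm2 = nn.BatchNorm1d(16)
        self.conv2 = GCNConv(16, 64)           # 16 -> 64 features per node
        self.fc1 = nn.Linear(64, 32)           # dense-layer width is an assumption
        self.fc2 = nn.Linear(32, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(self.norm1(x), edge_index))
        x = F.relu(self.conv2(self.norm2(x), edge_index))
        x = global_mean_pool(x, batch)          # one feature vector per hand graph
        return self.fc2(F.relu(self.fc1(x)))

# One hand graph: 21 landmark nodes with 4 features each and an edge_index of shape (2, 42).
x = torch.randn(21, 4)
edge_index = torch.randint(0, 21, (2, 42))      # placeholder for the hand-skeleton connectivity
batch = torch.zeros(21, dtype=torch.long)       # all nodes belong to graph 0
logits = HandGestureGCN()(x, edge_index, batch) # -> shape (1, 33)
```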
Recognizing the exemplary performance of the model achieved with minimal computational resources, particularly its efficiency in parameter usage at just 2,500 parameters, our enthusiasm is buoyed towards further enhancements. We aim to bolster the model's performance while preserving its inherent compactness, an attribute critical for deployment in resource-constrained environments.

Embracing a multi-stream paradigm, the refined model we propose delineates input data into three distinct streams. This triadic configuration is engineered to scrutinize various dimensions of the graph data in parallel, akin to an ensemble of specialized analytical processes concurrently dissecting a complex dataset.

Each stream inherits the proven two-layer graph convolutional topology, ensuring that the robust feature extraction methodologies established by preceding models are perpetuated. This structural recursion is strategic, facilitating the isolation of salient features within the high-dimensional space.

Post-extraction, an integration of these diversified feature matrices ensues, synthesizing disparate data interpretations into a comprehensive feature representation. The ensuing matrix exhibits a rich tapestry of features, subsequently presented to the classification layers.

The architecture's classification mechanism is a composite of multilayer perceptrons, each neuron serving as an integrator of the extracted features. This multi-layered approach is not merely a nod to traditional neural networks but an articulate choice, enhancing the translational capacity of complex features into precise predictive outputs. Moreover, the integration of global mean pooling within each convolutional stream underscores the model's commitment to dimensional reduction. This pooling modality distills voluminous feature sets into their most expressive constituents, thereby elevating the most significant features for subsequent classification processes.

An emphasis on input normalization prior to convolutional operations ensures a homogenized scale for input data. Such standardization is imperative for gradient stability across training epochs, especially when engaging with heterogeneous datasets that may otherwise introduce scale variance and compromise learning efficacy.
Figure 9 - Multi-Stream Graph Convolutional Network for Gesture Classification
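A compact sketch of the multi-stream idea, again using PyTorch Geometric, is given below. Feeding the same node features to all three streams and the width of the MLP head are simplifying assumptions, since the text does not specify how the streams' inputs differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class Stream(nn.Module):
    """One stream: the same two-layer GCN topology as the baseline, ending in mean pooling."""
    def __init__(self, in_feats=4, hidden=16, out=64):
        super().__init__()
        self.conv1, self.conv2 = GCNConv(in_feats, hidden), GCNConv(hidden, out)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        return global_mean_pool(x, batch)                    # (num_graphs, out)

class MultiStreamGCN(nn.Module):
    def __init__(self, in_feats=4, num_classes=33, num_streams=3):
        super().__init__()
        self.streams = nn.ModuleList(Stream(in_feats) for _ in range(num_streams))
        self.classifier = nn.Sequential(                     # multilayer perceptron head
            nn.Linear(64 * num_streams, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )

    def forward(self, x, edge_index, batch):
        feats = [s(x, edge_index, batch) for s in self.streams]  # parallel views of the graph
        return self.classifier(torch.cat(feats, dim=1))          # fuse, then classify
```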
Building upon the foundational design, the proposed GCN architecture incorporates a novel module aiming to enhance the extraction of higher-level features from input data through advanced graph convolution processes. The strategic augmentation is premised on the model's imperative to remain dimensionally economical while striving for minimal impact on performance efficacy.
The module draws inspiration from the architectural ingenuity of MobileNetV3's bottleneck design, a paradigm of classification models recognized for their potent performance and judicious use of computational resources. The crux of MobileNetV3's efficiency lies within its bottleneck structure, an elegantly orchestrated constriction of data channels.

Adopting this principle, our model introduces a bottleneck block in the aftermath of the feature combination layer—a pivotal juncture where the feature maps from multiple graph convolution streams coalesce. This integration embodies a harmonization of distinct feature perspectives, each stream having parsed the graph data through its specialized convolutional lens. The bottleneck block's role is thus to distill this confluence of features, pruning the informational expanse into a dense, potent representation.
In operational terms, the bottleneck block applies a series of convolutions that initially compress the feature channels, deliberately reducing dimensionality. This compression is immediately succeeded by an expansion convolution, a maneuver that broadens the information channels once more, albeit with an acute focus on the most salient features as dictated by the training regimen. The expansion is tactical, reinforcing the model's capability to accentuate relevant patterns while precluding extraneous data.
The bottleneck's depthwise convolutions impart another layer of refinement, deploying a fine-grained filter across each channel, reinforcing the model's interpretative precision. A subsequent global pooling layer stands sentinel at the terminus of this block, its charge to aggregate the spatial information into a singular vector that encapsulates the essence of the input's topological characteristics.

Such a vector then traverses to a projection layer, whose purpose is to project the condensed feature array onto the output space. Here, the classification process culminates, with each feature vector undergoing a transformation into a probabilistic interpretation of class membership. The application of the softmax function at this final layer ensures a probabilistic distribution over potential classes, crystallizing the network's deductions into definitive predictions.
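The following sketch illustrates such a bottleneck head in isolation: a 1x1 compression convolution, a depthwise convolution, a 1x1 expansion, global pooling, and a softmax projection. Treating the fused stream features as a (batch, channels, nodes) tensor and the specific channel widths are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class BottleneckHead(nn.Module):
    """MobileNetV3-inspired compress -> depthwise -> expand block over fused stream features,
    followed by global pooling, a projection layer, and softmax (illustrative sketch)."""
    def __init__(self, in_ch=192, squeeze=48, expand=96, num_classes=33):
        super().__init__()
        self.compress = nn.Conv1d(in_ch, squeeze, kernel_size=1)        # shrink the channel count
        self.depthwise = nn.Conv1d(squeeze, squeeze, kernel_size=3,
                                   padding=1, groups=squeeze)           # per-channel filtering
        self.expand = nn.Conv1d(squeeze, expand, kernel_size=1)         # widen the channels back
        self.pool = nn.AdaptiveAvgPool1d(1)                             # global pooling to one vector
        self.project = nn.Linear(expand, num_classes)                   # projection to the output space

    def forward(self, fused):                   # fused: (batch, channels, nodes)
        z = torch.relu(self.compress(fused))
        z = torch.relu(self.depthwise(z))
        z = torch.relu(self.expand(z))
        z = self.pool(z).squeeze(-1)            # (batch, expand)
        return torch.softmax(self.project(z), dim=1)

probs = BottleneckHead()(torch.randn(2, 192, 21))   # e.g. three 64-channel streams fused over 21 nodes
```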
In synthesis, the described enhancements to the GCN architecture encapsulate a meticulous balance between feature richness and computational frugality. Through the judicious application of a bottleneck module, the architecture achieves the feat of maintaining a compact footprint while exhibiting formidable analytical prowess—a quintessential exemplification of efficiency in deep learning architectures.
This advanced architectural blueprint holds promise for a wide array of applications, particularly in domains where resource constraints are paramount, yet the demand for sophisticated data analysis and pattern recognition is non-negotiable. The ongoing development and refinement of such models will undoubtedly propel the frontiers of graph-based machine learning, paving the way for innovations that are as resource-efficient as they are analytically profound.
Figure 10 - MobileNetV3-Inspired Bottleneck Structure in a Graph Convolutional Network for Gesture Classification
4 Experimental Results and Discussion
4.1 Experimental Results
In our experimental section, we present a detailed comparative analysis that evaluates the performance of two distinct pre-trained neural network models alongside three variations of our primary Graph Convolutional Network (GCN) models. The pre-trained models under scrutiny are VGG16 and Vision Transformer (ViT), each fine-tuned on our Vietnamese Sign Language (VSL) dataset to ensure relevance and applicability to the task at hand.
The GCN models, denoted as Model 1, Model 2, and Model 3, have been meticulously designed and iterated upon to optimize sign language classification. Model 1 serves as our baseline GCN, providing a foundation for comparison. Model 2 incorporates advanced strategies such as multi-stream graph convolution blocks and a MobileNetV3-inspired bottleneck feature to enhance feature extraction while maintaining model efficiency. Model 3 represents our most sophisticated architecture, potentially featuring further layers or complex configurations, as evidenced by its increased parameter count relative to the baseline.
Through rigorous training and testing protocols, each model has been evaluated on metrics that include training accuracy, test accuracy, precision, F1 score, and test loss. The parameter count is also recorded to weigh the models' computational demands against their performance. The findings of this study are crucial in guiding future research directions and potential real-world implementations of sign language recognition systems.
Models | Train Accuracy | Test Accuracy | Precision | F1 score | Recall | Test Loss
VGG16 V2 | 0.9875 | 0.9875 | 0.9877 | 0.9875 | 0.9875 | 0.0477
ViT | 0.9998 | 0.9999 | 0.9998 | 0.9998 | 0.9998 | 0.6903
Table 3 - Performance Metrics of VGG16 V2 and ViT Models
Figure 11 - Performance Metrics Comparison between VGG16 V2 and ViT Models
The table presents performance metrics for the two pre-trained models, VGG16 V2 and ViT, after training and testing on our VSL dataset for the sign language recognition task.
Looking at the VGG16 V2 model, it exhibits high train and test accuracies at 98.75%. This close correlation between training and testing performance indicates that the model generalizes well to unseen data. The precision, F1 score, and recall are consistent and high, which is excellent for practical applications; its test loss of 0.0477 is low, and notably lower than that of the ViT model.
The ViT model, on the other hand, showcases extremely high training and testing accuracies, both rounding up to 99.99%. This remarkable consistency demonstrates the model's ability to learn and generalize from the data exceptionally well. The precision, F1 score, and recall metrics all match the accuracy, indicating an almost perfect classification system with a balance between the sensitivity and precision of the model. Yet, the test loss for the ViT model is significantly higher than that of VGG16 V2 at 0.6903, which is counterintuitive given the other metrics.
The discrepancy between the ViT model's test loss and its other performance metrics could indicate an issue with how the loss was calculated or reported, or perhaps an anomaly in the test dataset that did not impact accuracy. Typically, a high test loss would correspond to lower accuracy metrics, so further investigation is needed to understand the reason behind this inconsistency.
In comparison, while the ViT shows superior accuracy, the VGG16 V2 model may be more reliable in terms of loss metric consistency. This could suggest that in scenarios where interpretability of loss is crucial, VGG16 V2 might offer a more comprehensible performance profile.
Models | Train Accuracy | Test Accuracy | Precision | F1 score | Test Loss | Params
Model 1 | 98.14 | 99.15 | 99.15 | 99.15 | 0.002500 | 4441
Model 2 | 99.40 | 99.60 | 99.60 | 99.60 | 0.0009089 | 8665
Model 3 | 99.94 | 99.82 | 99.82 | 99.82 | 0.9890 | 7673
Table 4 - Comparative Performance Metrics of Three GCN Models
Figure 12 - Performance Metrics of GCN Models Across Key Evaluation Criteria
The table presents the performance metrics for three distinct Graph Convolutional Network (GCN) models, which have been fine-tuned for the task of Vietnamese Sign Language recognition. Each model is evaluated based on its accuracy in both training and testing phases, its precision, F1 score, and test loss. Additionally, the table includes the number of parameters for each model, providing insight into the model's complexity.
Starting with Model 1, the transition from training to testing accuracy indicates a model that generalizes well, reflected in its significant precision and F1 score of 99.15%. The low test loss further substantiates the model's ability to maintain its performance on unseen data, an essential quality for practical applications where predictability is crucial.
Model 2 elevates these metrics, exhibiting a remarkable synergy between training and test accuracies, both exceeding 99.4%. The precision and F1 score alignment suggests a high true positive rate and a balanced sensitivity and specificity—attributes that are critical in systems where misclassification could significantly impact the user experience. The reduced test loss implies an enhancement in the model's ability to not just fit but truly understand the nuances of the VSL data.
Model 3 presents a conundrum: while it shows the highest training accuracy and an impressive test accuracy that nearly matches, its test loss is unexpectedly higher than what one would anticipate. This incongruity hints at potential overfitting or an error in data reporting or processing that requires attention, as such a loss value contradicts the other performance indicators.
When contextualizing these models against the backdrop of pre-trained models such as VGG16 V2 and ViT, an intriguing narrative unfolds. The pre-trained models, drawing on extensive and diverse visual datasets, demonstrate exceptional test accuracies, with ViT standing out with almost perfect metrics across the board. These models, particularly ViT, have leveraged their vast exposure to various visual patterns to deliver exemplary performance on VSL recognition—a task that requires discerning subtle differences in hand gestures.
The performance metrics for the three GCN models tailored for Vietnamese Sign Language recognition provide a layered perspective on the strengths and potential areas of refinement within each architectural design.
The conclusion drawn from juxtaposing the GCN models against the pre-trained architectures is multidimensional. Firstly, the performance of the pre-trained models affirms the power of transfer learning, especially in domains rich with visual nuances such as sign language. Among these, ViT's performance is notably stellar, likely due to its ability to capture the sequential and relational aspects inherent in sign language through its self-attention mechanisms.
Secondly, the GCN models, with their specialized architecture designed to capture relational data, illustrate the potential for bespoke solutions in sign language recognition. In particular, Model 2's performance suggests that there is a sweet spot where a model can be both efficient and highly accurate without an overbearing parameter count.
This exploration reveals a compelling argument for a hybrid approach: combining the broad learning capabilities of pre-trained models with the specialized, graph-based nuance understanding of GCNs. Such a hybrid could potentially capture the best of both worlds, achieving high performance with computational efficiency.
Further investigation and experimentation could focus on reconciling the disparities in Model 3's test loss, integrating lessons learned from both the pre-trained and GCN models, and potentially exploring ensemble methods or novel architectures that marry the conceptual strengths of both approaches. The ultimate goal remains to develop an interpretable, reliable, and efficient model that can democratize communication for the VSL community in diverse, real-world environments.

4.2 Visual Performance Insights
Upon reviewing the overall performance of the VGG16 model, two sets of results have been provided that shed light on its effectiveness for the classification of Vietnamese Sign Language. The first is a confusion matrix that illustrates the model's predictive accuracy across all classes, while the second comprises training and validation curves over a series of epochs. Together, these results offer a comprehensive view of the model's strengths and areas where further improvements could be made.
The first confusion matrix, attributed to the VGG16 model, indicates that while there is a strong