
Developing Vietnamese Sign Language to Text Translation System


Document information

Title: Developing Vietnamese Sign Language to Text Translation System
Authors: Do Thi Thanh Nha, Le Thanh Luan
Advisor: Dr. Nguyen Trinh Dong
University: Vietnam National University, Ho Chi Minh City - University of Information Technology
Major: Software Engineering
Document type: Graduation thesis
Year: 2023
City: Ho Chi Minh City
Pages: 114
File size: 15.48 MB



Do Thi Thanh Nha - Le Thanh Luan

FINAL THESIS

Developing Vietnamese Sign Language to Text Translation System

SOFTWARE ENGINEERING MAJOR

Instructor: Dr. Nguyen Trinh Dong

HO CHI MINH CITY, JULY 2023

The Graduation & Thesis Evaluation Committee was established according to Decision No. ___ dated ___ by the Rector of the University of ___.


First and foremost, we are profoundly grateful to the esteemed members of the thesis committee, particularly those from the Faculty of Software Engineering at the University of Information Technology - VNUHCM. Their valuable insights, constructive criticism, and scholarly contributions have significantly enhanced the academic rigor of this thesis. We are indebted to their expertise and rigorous evaluation.

We would also like to express our heartfelt appreciation to our family and friends for their unconditional love, encouragement, and belief in our abilities. Their unwavering support, understanding, and patience have been the driving force behind our academic journey.

Furthermore, we would like to express our deep appreciation to Professor Nguyen Trinh Dong for his invaluable guidance, expertise, and mentorship throughout this thesis' journey. His unwavering support, constructive feedback, and scholarly insights have played a crucial role in shaping the academic rigor and overall quality of this work.

To everyone who has contributed to this thesis in various ways, whether directly or indirectly, we extend our heartfelt appreciation. Your support has been invaluable in the successful completion of this academic endeavor.

Thank you very much, we wish you all the best.

Ho Chi Minh City, July 2023

Students

Le Thanh Luan

Do Thi Thanh Nha


1. Le Thanh Luan (student ID 19520702) - 50%

Contents

1 Introduction
1.1 Problem statement
1.2 Approach
1.3 Results

2 Foundational knowledge
2.1 Vietnamese Sign Language (VSL)
2.2 Survey of Existing Sign Language Translation Technologies
2.2.1 Sign language recognition using sensor gloves
2.2.2 A Cost Effective Design and Implementation of Arduino Based Sign Language Interpreter
2.2.3 Neural Sign Language Translation
2.2.4 Deep Learning for Vietnamese sign language recognition in video sequence
2.3 Machine Learning Techniques and Algorithms
2.3.1 Machine learning
2.3.2 Deep Learning
2.3.3 Model Architectures for Sign Language Recognition
2.3.4 Deep Neural Network (DNN)
2.3.5 Convolutional Neural Networks (CNN)
2.3.6 Recurrent Neural Network (RNN)
2.3.7 Long Short-Term Memory (LSTM)
2.4 Software Background
2.4.1 Python
2.4.2 Javascript
2.4.3 Tensorflow
2.4.4 Keras
2.4.5 scikit-learn
2.4.6 Jupyter Notebook
2.4.7 Mediapipe
2.4.8 Expo-React Native

3 Data Collection and Preprocessing
3.1 Reason for Dataset Creation: Scarcity of Existing Datasets and Unresponsive Researchers
3.2 Gathering and Preparing the Dataset for VSL Recognition Model
3.3 Data Preprocessing Steps
3.4 Matrix Formation - Input for Machine Learning Model
3.5 File Organization

4 Model Training
4.1 Model Architecture
4.2 Training Process
4.3 Model Evaluation

5 Mobile Application Development
5.1 Designing the Mobile Application
5.1.1 Use cases
5.1.2 Database Diagram
5.1.3 The streaming system for processing videos
5.2 Implementation and Results
5.2.1 Integrating the Model into the Mobile App with Python
5.2.2 Results
5.3 User Testing and Feedback: Improving the User Experience

6 Feedback and Discussion
6.1 User Testing Results and Feedback on the Mobile Application
6.2 Comparison of our Mobile Application with Existing Sign Language E-learning Apps
6.2.1 ASL Bloom
6.2.2 Lingvano
6.3 Other Assistive Technologies

7 Conclusion and Future Work
7.1 Summary of Contributions and Accomplishments
7.2 Future Directions for the Project: Expanding the dictionary

List of Figures

2.1 VSL alphabet (Source: Circular 17/2020/TT-BGDĐT)
2.2 VSL shares similarities with global sign language dictionaries
2.3 An example of sensor glove used for detecting movement sequences (Source: Cornell University ECE 4760 sign language glove prototype)
2.4 LSTM architecture
2.5 Structure of a simple neural network with an input layer, an output layer, and two hidden layers
2.6 Starting the convolutional operation
2.7 Step two in the convolutional operation
2.8 Finish the convolution operation when the kernel goes through the 5×5 matrix
2.9 Example of the convolutional matrix
2.10 X matrix when adding outer padding of zeros
2.11 Convolutional operation when stride=1, padding=1
2.12 Convolutional operation when stride=2, padding=1
2.13 Illustration of convolutional operation on a color image with k=3
2.14 Tensor X, W in 3 dimensions written as 3 matrices
2.15 Max pooling layer with size=(3,3), stride=1, padding=0
2.16 Example of pooling layer
2.17 The structure of a recurrent neural network
2.18 The flowchart of the RNN-T algorithm used in Reliable Multi-Object Tracking Model Using Deep Learning and Energy Efficient Wireless Multimedia Sensor Networks [1]
2.19 Python language syntax
2.20 Javascript language syntax
2.21 Sklearn metrics - confusion matrix
2.22 Jupyter Notebook IDE
2.23 Mediapipe Hand Landmarks
2.24 Mediapipe Holistic Landmarks
2.25 Mediapipe Face Mesh Landmarks
2.26 Expo can run cross-platform


3.1 … Deep Learning for Vietnamese Sign Language Recognition in Video Sequence (Source: [11])
3.2 Some actual footage of the dataset used in Deep Learning for Vietnamese Sign Language Recognition in Video Sequence (Source: [11])
3.3 Letter D
3.4 Letter L
3.5 Letter V
3.6 Letter Y
3.7 Folder tree of the sign 'a'
4.1 Mediapipe Face Mesh Landmarks
5.1 The database diagram designed for SignItOut
5.2 The diagram illustrates the relationships among the elements within the API layer
5.3 The diagram illustrates the relationships among the elements within the Engine layer
5.4 Run Strapi with command prompt
5.5 Strapi UI run on localhost
5.6 Login Screen
5.7 Home screen
5.8 Course details Screen
5.9 Plain Text Lesson
5.10 Quiz Lesson
5.11 Video Lesson
6.1 The logo of ASL Bloom, one of the most popular sign language E-Learning applications with more than 100,000 users
6.2 Lingvano, the ASL learning application which uses artificial intelligence for giving feedback about users' signing accuracy

List of Tables

3.1 Description of VSL Dataset with Alphabet signs
4.1 Model Architecture
4.2 Model Parameters
4.3 Confusion Matrix Results
5.1 Use Case Table
5.2 Use Case UC001: User Login
5.3 Use Case UC002: Browse Courses
5.4 Use Case UC003: Enroll in Course
5.5 Use Case UC004: View Course Details
5.6 Use Case UC005: Learn course
5.7 Use Case UC006: Take Quiz
5.8 Use Case UC007: Track Lesson Progress
5.9 Use Case UC008: User Logout
5.10 Use Case UC009: Use continuous signing detection

1 Introduction

This chapter provides a comprehensive introduction to the problem addressed in this thesis, presenting its significance and potential implications. Through a thorough review of existing literature, the chapter identifies research gaps and establishes a clear research objective. The chosen methodology, including research design, data collection procedures, and analysis techniques, is outlined to provide a logical framework for the study. Additionally, an overview of the research results is presented, summarizing the data collected, analysis conducted, and key findings derived from the study.

1.1 Problem statement

Vietnamese Sign Language (VSL) is the primary means of communication for the deaf community in Vietnam. It is a unique visual language with its own grammar, vocabulary, and syntax. VSL is not only a means of communication but also a critical part of Deaf culture and identity. However, VSL has been marginalized for a long time, and the deaf community faces significant challenges in their daily lives due to the lack of recognition and understanding of VSL by the wider society.

Within the realm of the Fourth Industrial Revolution, scientists worldwide are actively working towards resolving the pervasive issue of machine translation. However, the field of sign language translation remains largely underserved, leaving minimal resources available to comprehend the language used by deaf and mute individuals. For instance, in online meetings, individuals who are unable to speak solely rely on textual communication to convey their thoughts. Regrettably, this reliance on chat messages poses obstacles to maintaining focus during discussions of serious matters, resulting in delays that hinder the flow of the meeting.


When contemplating artificial intelligence-integrated products in Vietnam, the predominant associations typically revolve around chatbots, recommendation systems, and autonomous vehicles - all of which find utility in business settings, aiding individuals in their professional endeavors. Conversely, in advanced nations such as Germany or the United States, artificial intelligence extends beyond work-related applications to encompass aiding disabled individuals in their daily activities. This concept, known as science for humanity, has evolved in parallel with science for business since the inception of the artificial intelligence industry. While Vietnam is witnessing a progressive surge in science for business, with new research being published daily, achieving technological parity with developed countries necessitates a heightened emphasis on science for humanity. This thesis addresses this need by delving into the domain of sign language translation.

This thesis provides a valuable opportunity to apply the knowledge and skills acquired during our four-year period at the University of Information Technology - VNUHCM. Throughout our academic journey, we have cultivated a strong foundation in diverse disciplines, encompassing programming, computer science, machine learning, computer vision, and natural language processing. This wealth of expertise can now be harnessed to address the challenge of sign language translation. By leveraging our acquired knowledge, we can design a sophisticated system that adeptly captures the intricacies of sign language through computer vision techniques, employs natural language processing methodologies, and harnesses advanced machine learning algorithms. Furthermore, our understanding of software engineering allows us to extend our efforts beyond the sign language translation system alone by incorporating the development of educational software for sign language. This integration of academic learning and practical implementation holds tremendous potential for developing innovative solutions in the realm of sign language translation.

This project delves into a profoundly relevant and widely-discussed topic of our time: artificial intelligence. By undertaking this endeavor, we not only contribute to the growing body of knowledge and advancements in this field but also position ourselves for future opportunities to engage with state-of-the-art systems. The process of designing and developing a sign language translation system presents us with a unique advantage - the firsthand experience of working with large-scale systems. This invaluable experience equips us with the skills and insights necessary to thrive in the dynamic and fast-paced environments of prominent companies. Additionally, the use of English throughout the entire research and development process augments our ability to collaborate on international projects. This exposure to a broader scope of work further expands our horizons and enhances our prospects for engaging in diverse and impactful initiatives on a global scale.

Our project holds a crucial position in the dynamic world of technology, where constant progress and innovation are the norm. It serves as both a reference point and an inspiring source for students and researchers seeking to explore the frontiers of possibility. Our primary objective is to push the boundaries of what can be achieved, and we envision our project as a guiding force, providing invaluable insights and lessons that will drive future endeavors in the development of state-of-the-art systems. By openly sharing our methodologies, discoveries, and breakthroughs, particularly in relation to the comprehensive data set we have meticulously constructed for sign language translation, we foster a collaborative environment that nurtures the growth of technological excellence and propels the field forward.

1.2 Approach

This thesis represents a meticulous and comprehensive endeavor, drawing upon a strong foundational knowledge base that will be thoroughly expounded upon in Chapter 2. The development process unfolds through several distinct stages, each of which assumes a pivotal role in shaping the trajectory of our research.

To commence, we embark on a thorough survey of existing American Sign Language (ASL) translation technologies and mobile applications. This investigation serves to identify best practices, potential challenges, and valuable insights to inform our own approach. Subsequently, we delve into a comprehensive study of Vietnamese Sign Language (VSL), delving deeper into its unique features and intricacies. This exploration provides us with a profound understanding of the language and forms a critical basis for our subsequent endeavors.

The next phase of our development process centers around the collection and preprocessing of a robust dataset consisting of VSL videos. This dataset serves as the foundation for training and testing our sign language recognition models, allowing us to refine their accuracy and effectiveness. Through extensive research and experimentation, we endeavor to develop a sophisticated deep-learning model that optimizes the learning process and enhances the recognition capabilities of the system.


With the core models in place, we proceed to develop the sign language translation system, integrating it with a streamlined and efficient streaming architecture. This integration ensures seamless real-time translation capabilities, bolstering the system's practicality and usability. To further enhance the user experience and validate the effectiveness of the developed system, extensive user testing is conducted. Through this process, we gain valuable insights into user feedback, enabling iterative improvements and refining the system's performance.

In the final stages of our development process, we focus on integrating the sign language translation system into a mobile application. This integration facilitates wider accessibility and usage, enabling individuals to easily access and benefit from the translation capabilities on their mobile devices. Through rigorous user testing and evaluation, we aim to optimize the user experience, ensure the system's effectiveness, and demonstrate the practical application of the translation model within a software context.

In summary, this thesis encompasses a multifaceted development journey, encompassing a comprehensive survey, an in-depth study of VSL, dataset collection, preprocessing, model development, system integration, and user testing. Through these stages, we aim to contribute to the advancement of sign language translation technology, improve user experiences, and exemplify the practical application of our research findings.

1.3 Results

The outcomes of this thesis encompass a multitude of valuable contributions to the field of Vietnamese sign language. These achievements include the construction of a meticulously curated Vietnamese sign language dataset, which serves as a valuable resource for further research and development. Additionally, we have successfully developed a robust and accurate model for sign language recognition and classification, enabling precise interpretation and understanding of sign language gestures.

Building upon these foundations, we have also created a real-time sign language translation system, incorporating cutting-edge technologies and algorithms. This system facilitates seamless and instantaneous translation between sign language and natural language, fostering effective communication and bridging the gap between the deaf and hearing communities.


Furthermore, we have developed a mobile application dedicated to sign language education. This application serves as a comprehensive learning platform, providing resources, tutorials, and interactive exercises to empower individuals in acquiring the foundation of Vietnamese sign language. Through this application, we strive to promote inclusivity, accessibility, and equal opportunities for individuals with hearing impairments. Moreover, beyond its primary focus on sign language education for the deaf community, this application also serves a vital secondary purpose. It functions as a platform that extends the opportunity for individuals who are not deaf to delve deeper into the world of sign language, fostering a greater understanding and appreciation of the deaf community.

Collectively, these achievements signify a significant advancement in the domain of Vietnamese sign language. The dataset, sign language recognition model, real-time translation system, and educational mobile application collectively contribute to the enrichment of communication, education, and inclusivity within the deaf community.

2 Foundational knowledge

This chapter serves as a comprehensive introduction to the key concepts and foundations that underpin our thesis. We delve into essential areas such as the core principles of Vietnamese Sign Language, the fundamentals of machine learning, algorithms, and pertinent technologies. By establishing this groundwork, we lay a solid foundation for the subsequent exploration and development of our research. Furthermore, we acknowledge the significant contributions of related research that we have consulted and drawn upon to enrich and inform our thesis, highlighting the valuable insights and knowledge from the wider academic community that have shaped our work.

2.1 Vietnamese Sign Language (VSL)

Vietnamese Sign Language (VSL) holds a significant place as the primary mode of communication for the deaf community in Vietnam. It is a visual and gestural language that utilizes hand movements, facial expressions, and body postures to convey meaning and express thoughts and emotions. Over the course of history, sign language in Vietnam has evolved and diversified, catering not only to everyday activities but also to the demands of professional settings. This dynamic evolution has led to the development of a comprehensive sign language system that encompasses a wide range of expressions and vocabulary, enabling effective communication in various contexts.


Fig 2.1: VSL alphabet (Source: Circular 17/2020/TT-BGDĐT)

Common misconceptions regarding sign language often arise among individuals who are not familiar with its intricacies. One prevalent misunderstanding pertains to sign language being perceived as a universal or international language, wherein deaf communities worldwide utilize a shared sign language system. However, this assumption is inaccurate since sign language vocabularies differ across cultures. Even within a single country, variations in vocabulary can be observed between different regions. Research conducted by the National Center for Special Education in Vietnam has revealed that sign language in the Southern, Central, and Northern regions of the country exhibits a similarity of only 50%. Moreover, the study has identified approximately 200 words that are commonly understood within the deaf community in Vietnam. These findings shed light on the rich diversity and unique linguistic characteristics of sign language, reinforcing the importance of recognizing and appreciating the cultural and regional nuances that shape its vocabulary and expression.


Though it shares some common similarities with global sign language dictionaries regarding vocabularies, Vietnamese Sign Language possesses a distinct grammar, vocabulary, and syntax that set it apart from spoken Vietnamese. Notably, while the structure of the Vietnamese spoken language typically follows a subject-verb-complement order, Vietnamese Sign Language adopts a different pattern. In sign language, the order is subject-complement-verb, offering a unique linguistic framework for conveying meaning. Additionally, the placement of numbers diverges between the two languages. In spoken Vietnamese, numbers tend to precede the subject, while in Vietnamese Sign Language, numbers are typically positioned after the subject. These variations highlight the intricacies and idiosyncrasies that exist within the linguistic systems, emphasizing the need for a comprehensive understanding of Vietnamese Sign Language as a distinct and separate mode of communication.

Fig 2.2: VSL shares similarities with global sign language dictionaries

Despite its importance, Vietnamese Sign Language has faced challenges and marginalization in society. Limited recognition and understanding of sign language by the wider community have created barriers for the deaf community in various aspects of life, including education, employment, and social interactions. Efforts are being made to promote awareness and inclusivity, advocating for the recognition of Vietnamese Sign Language as an official language and ensuring accessibility for the deaf community.

Technological advancements, including the development of sign language recognition and translation systems, hold promise for improving communication and accessibility for the deaf community. These innovations aim to bridge the communication gap between deaf and hearing individuals, facilitating effective interactions and fostering equal opportunities.

2.2 Survey of Existing Sign Language Translation Technologies

Sign language plays a crucial role in facilitating communication for individuals in the deaf and hard-of-hearing community. In Vietnam, Vietnamese Sign Language (VSL) is the predominant sign language, while other countries have their own unique sign languages such as American Sign Language (ASL) for the United States and British Sign Language (BSL) for the United Kingdom. Over the years, there has been significant progress in the development of technologies for ASL translation to facilitate better communication between the deaf community and the hearing world. In this section, we will provide a detailed survey of existing sign language translation technologies and mobile apps.

2.2.1 Sign language recognition using sensor gloves

One of the early methodologies employed for sign language recognition involved the use of sensor gloves, which are equipped with specific sensors to capture hand movements and utilize machine learning algorithms for recognition purposes. Notably, a significant contribution in this domain is the research paper titled "Sign language recognition using sensor gloves" authored by Syed Atif Mehdi et al., published in "Proceedings of the 9th International Conference on Neural Information Processing, 2002 (ICONIP'02)".

The paper explores the feasibility of recognizing sign language gestures by utilizing sensor gloves, leveraging the idea of employing these gloves in gaming or applications involving custom gestures. The outcome of this research effort is the development of a project named "Talking Hands". This project features a sensor glove capable of capturing American Sign Language signs performed by a user and subsequently translating them into English sentences. Artificial neural networks are employed to recognize the sensor values obtained from the sensor glove, which are then categorized into 24 alphabetic characters in English along with two punctuation symbols.

The system achieves an accuracy rate of up to 88%. The authors acknowledged that this accuracy could potentially be even higher if the dataset used were sourced from individuals with a proficient understanding of sign language, rather than relying on samples from individuals who were not well-versed in sign language.

Fig 2.3: An example of a sensor glove used for detecting movement sequences. Source: Cornell University ECE 4760 sign language glove prototype

However, it is worth noting that this approach faces certain limitations. One such challenge encountered by the project is its inability to effectively handle characters associated with dynamic gestures or those that require the use of both hands. Although this approach is practical and offers considerable applicability, it is not without critical limitations. These include the need for a specialized hardware setup, limited accuracy rates, high cost, and difficulties in detecting facial expressions and body language, which are integral components of sign language communication.

2.2.2 A Cost Effective Design and Implementation of Arduino Based Sign Language Interpreter

In response to the high cost associated with developing sign language recognition systems using sensor gloves, researchers have been focused on finding ways to reduce the overall cost of the device. One such study is the research paper titled "A Cost Effective Design and Implementation of Arduino Based Sign Language Interpreter" authored by Anirbit Sengupta et al., published in "2019 Devices for Integrated Circuit (DevIC)".

This research explores the use of cloth-driven gloves with Bluetooth connectivity. The gloves are equipped with one accelerometer and five flexible sensors, strategically placed along the length of each finger, including the thumb. These flexible sensors play a crucial role in recognizing intricate hand gestures, while the resistance value changes, generated by the extent of curvature in the sensors, in combination with the accelerometer values measuring the slant position of the hand in relation to the land surface, are also taken into account. The collected data is then processed by a microcontroller module and can be transmitted to any smartphone user through Bluetooth connectivity.

The research findings indicate an accuracy rate of approximately 86.67%, with some biases observed in recognizing letters such as A, B, F, H, I, J, U, W, Y, and Z. The authors also mentioned that certain letters like M, N, O, R, S, T, V, and X cannot be effectively demonstrated as they share gestural similarities with other letters. The glove can recognize the user's hand gestures and convert them into text and voice with the assistance of a smartphone application.

It is important to note that efforts to reduce the cost of implementing glove-based sign language translation systems come with trade-offs, as the reduced cost may affect the overall performance or effectiveness of the device. Despite these limitations, such research endeavors pave the way for exploring more accessible and affordable alternatives for sign language recognition, contributing to the ongoing development and evolution of technology for the deaf community.

2.2.3 Neural Sign Language Translation

Another approach is the use of computer vision techniques that use cameras to capture the signer's hand and body movements and recognize them as specific signs. This approach involves three main steps: hand segmentation, feature extraction, and recognition. Hand segmentation involves separating the signer's hand from the background. Feature extraction involves extracting relevant features from the segmented hand image. Recognition involves classifying the extracted features to recognize the sign. One of the prominent research works using this approach is the paper "Neural Sign Language Translation" [3].

This paper introduces the problem of Sign Language Translation (SLT), which aims to generate spoken language translations from sign language videos, taking into account the grammatical and linguistic structures of sign language. The authors propose a framework based on Neural Machine Translation (NMT) to learn the spatial representations, language model, and mapping between sign and spoken language. They utilize 2D convolutional neural networks (CNNs) to learn spatial embeddings for sign videos and word embeddings for spoken language words. The sign videos are tokenized using either frame-level or gloss-level tokenization, while the spoken language sentences are tokenized at the word level.

In addition, an attention-based encoder-decoder network is employed to generate the target spoken language sentences, with attention mechanisms capturing the alignment between sign videos and spoken language sentences. The authors also introduce the RWTH-PHOENIX-Weather 2014T dataset, which provides continuous sign language videos with gloss annotations and spoken language translations. Experimental results on this dataset demonstrate the effectiveness of their approach. The paper concludes by discussing the findings and the future directions for SLT research.

2.2.4 Deep Learning for Vietnamese sign language recognition in video sequence

The paper "Deep Learning for Vietnamese sign language recognition in video sequence" [11] by Nguyen Thien Bao and other researchers is one of the more similar approaches towards sign language recognition and translation we could find for our thesis. This paper presents a detailed investigation into the automatic recognition of Vietnamese Sign Language (VSL) using various feature extraction approaches and deep learning techniques. The authors address the specific challenges associated with VSL recognition in video sequences, including camera orientation, hand position, inter-hand relation, and other factors that make the task complex.

The proposed approach comprises two main types of feature extraction: spatial features and scene-based features. Spatial features involve the utilization of local descriptors, namely Local Binary Pattern (LBP), Local Phase Quantization (LPQ), and Histogram of Oriented Gradients (HOG). These techniques aim to capture essential information about hand gestures by analyzing texture, intensity, and gradient patterns within specific regions of interest. On the other hand, scene-based features employ the GIST descriptor, which focuses on the dominant spatial structure of a scene, taking into account perceptual dimensions such as naturalness, openness, roughness, expansion, and ruggedness.

For the recognition stage, the authors explore traditional classification methods, with Support Vector Machine (SVM) being the chosen approach. SVM classifiers are trained using the extracted spatial and scene-based features, and the recognition performance is evaluated. Additionally, a deep learning-based approach called Deep Vietnamese Sign Language (DVSL) is introduced. In this approach, Convolutional Neural Network (CNN) features are extracted from a pre-trained VGG16 model. These features are then fed into Long Short-Term Memory (LSTM) models, which learn to predict sign language based on image sequences. A sketch of this kind of pipeline is shown below.
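The paper's exact implementation is not available to us, but a pipeline of this shape can be sketched in Keras. The frame count, input resolution, and class count below are hypothetical placeholder values, and the ImageNet-pretrained VGG16 stands in for the feature extractor described above.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_FRAMES, NUM_CLASSES = 30, 10  # hypothetical values, not from the paper

# Frozen VGG16 backbone produces one 512-d feature vector per frame.
backbone = VGG16(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))
backbone.trainable = False

model = models.Sequential([
    # Apply the CNN to every frame of the clip independently.
    layers.TimeDistributed(backbone, input_shape=(NUM_FRAMES, 224, 224, 3)),
    # The LSTM aggregates the per-frame CNN features over time.
    layers.LSTM(256),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```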

Two VSL datasets are collected and utilized for experimentation. The first dataset focuses on relative family topics and contains words with minimal changes between frames. The second dataset involves more complex gestures, including relative positions and orientations of body parts. To augment the datasets and enhance the training process, data augmentation techniques are applied. These techniques include rotation transformations and the addition of salt-and-pepper noise, which provide variations in hand movement and position.

Experimental results demonstrate the effectiveness of the proposed approaches. The SVM-based models achieve an accuracy of 88.5% on the VSL-WRF-01-EXT dataset, while the DVSL model achieves an even higher accuracy of 95.83% on the same dataset. These results indicate the promising performance of both the traditional SVM-based approach and the deep learning-based DVSL approach in recognizing VSL.

However, it should be noted that the performance of the DVSL model is relatively lower on the VSL-WRF-02-EXT dataset, which contains more complex gestures involving the relationship between body parts. This highlights a potential limitation of the deep learning-based approach and suggests the need for further investigation to improve its performance in handling such complex gestures.

2.3 Machine Learning Techniques and Algorithms

Machine learning techniques and algorithms are at the core of modern artificial intelligence systems, enabling computers to learn from data and enhance their performance autonomously. These methods encompass various approaches that seek to extract valuable patterns and insights from intricate datasets. In recent years, machine learning has made significant strides in the field of sign language translation, particularly through computer vision algorithms and deep learning architectures. These advancements have paved the way for the creation of advanced systems that facilitate communication between sign-language users and non-sign-language speakers. In this section, we will delve into an exploration of machine learning algorithms, concepts, architectures, and techniques employed in this thesis, focusing specifically on their application to sign language translation.

2.3.1 Machine learning

Machine learning is a subset of artificial intelligence (AI) that focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. It empowers machines to automatically improve their performance through experience and exposure to relevant information. Machine learning algorithms can process and analyze vast amounts of data to identify patterns, relationships, and insights that humans may not easily discern.

At its core, machine learning is driven by the principle of training models on labeled data to recognize patterns and generalize from examples. These models can then be applied to new, unseen data to make predictions or classify information accurately. The training process involves adjusting the model's parameters to minimize the difference between predicted outputs and the known correct outputs in the training data, effectively learning from the data patterns.

Machine learning techniques can be broadly categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, models are trained on labeled data where the desired outputs are known, enabling them to make accurate predictions or classifications. Unsupervised learning, on the other hand, involves training models on unlabeled data to discover patterns or groupings within the data. Reinforcement learning involves training models to interact with an environment and learn optimal actions through trial and error, guided by a reward system. The sketch below illustrates the supervised case.
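As a minimal illustration of the supervised setting, the script below fits a support vector machine on scikit-learn's bundled handwritten-digits dataset; the dataset, kernel, and split ratio are arbitrary demonstration choices, not part of this thesis's pipeline.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Supervised learning: fit a model on labeled examples, then
# measure how well it generalizes to unseen data.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)         # learn from labeled data
print(clf.score(X_test, y_test))  # accuracy on held-out data
```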


Machine learning has applications in various domains, including healthcare, finance, marketing, computer vision, natural language processing, robotics, and more. It enables tasks such as image and speech recognition, language translation, recommendation systems, fraud detection, and autonomous vehicles. Machine learning algorithms and models have demonstrated remarkable success in solving complex problems and are continuously advancing as more data becomes available and computational power increases.

Key techniques used in machine learning include decision trees, support vector machines, random forests, neural networks, deep learning, clustering, and dimensionality reduction. These techniques provide a rich set of tools for data scientists and researchers to explore, analyze, and extract valuable insights from diverse datasets.

As machine learning continues to evolve, challenges such as interpretability, fairness, and robustness are actively researched. Ethical considerations around bias, privacy, and transparency also arise in the development and deployment of machine learning systems. These considerations highlight the importance of responsible and ethical use of machine learning technologies.

In conclusion, machine learning is a powerful field that empowers computers to learn from data and make accurate predictions or decisions. Its wide range of applications and continuous advancements hold great potential for driving innovation, solving complex problems, and improving decision-making processes across various industries and domains.

2.3.2 Deep Learning

Deep learning is a subfield of machine learning that focuses on training artificial neural networks to learn and make intelligent decisions. It is inspired by the structure and function of the human brain, where neural networks consist of interconnected layers of artificial neurons that process and transform data.

Deep learning excels at handling complex and large-scale problems by leveraging deep neural networks with multiple hidden layers. These deep neural networks are capable of automatically learning hierarchical representations of data, allowing them to extract intricate patterns and features.


One of the key strengths of deep learning is its ability to perform end-to-end learning, where the network learns directly from raw data without the need for manual feature extraction. By training deep neural networks on large labeled datasets, deep learning models can autonomously learn representations and make predictions or classifications with remarkable accuracy.

Deep learning has achieved groundbreaking results in various domains. In computer vision, deep learning has revolutionized tasks such as image classification, object detection, and image segmentation. In natural language processing, deep learning models have demonstrated significant advancements in tasks like machine translation, sentiment analysis, and text generation. Deep learning has also made significant contributions to areas such as speech recognition, recommendation systems, autonomous driving, and drug discovery.

The success of deep learning can be attributed to the development of convolutional neural networks (CNN) for image data, recurrent neural networks (RNN) for sequential data, and transformers for natural language processing tasks. These specialized architectures, along with advancements in optimization techniques, activation functions, and regularization methods, have propelled deep learning to new heights. This thesis delves into more details about CNN and RNN in the following sections.

convolu-However, deep learning also comes with challenges Training deep neural works requires substantial computational resources and large amounts of labeledtraining data Additionally, overfitting, interpretability and the need for extensivehyperparameter tuning remain ongoing research areas

net-To train deep learning models, frameworks such as TensorFlow, Keras, or Torch provide powerful tools and abstractions that simplify the implementationand training process These frameworks enable researchers and practitioners tobuild and train deep neural networks efficiently The details about these frame-works are provided in this thesis in the following chapter

Py-Deep learning continues to evolve rapidly, with ongoing research focusing onimproving model performance, interpretability, and training efficiency As moredata becomes available and computational power increases, deep learning is poised

to continue driving advancements in AI, impacting a wide range of industries andshaping the future of technology


2.3.3 Model Architectures for Sign Language Recognition

Here we provide an overview of the existing computer vision and machine learning techniques that have been applied to the recognition of American Sign Language (ASL), which helps provide a wider look at the available options for our model selection. We explore various models and papers in the field, highlighting their contributions and approaches. We also give a comparative analysis that explains why we chose the LSTM-DNN architecture for our VSL model.

1. Convolutional Neural Networks (CNNs): CNNs are widely used for image and video recognition tasks. They can be utilized for extracting visual features from individual frames of a video, which can then be used for ASL recognition. CNN architectures like VGGNet, ResNet, or InceptionNet can be employed as feature extractors [6].

2. Recurrent Neural Networks (RNNs): RNNs are well-suited for processing sequential data, making them useful for capturing temporal information in videos. Models such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU) can be used to process the extracted visual features from the CNNs over time.

3. 3D Convolutional Neural Networks (3D CNNs): Unlike traditional 2D CNNs, 3D CNNs can capture both spatial and temporal information simultaneously. These models take a sequence of video frames as input and learn spatio-temporal features directly. Architectures like C3D or I3D (Inflated 3D CNN) can be employed for video-based ASL recognition [9].

4. Transformer-based Models: Transformer models have shown excellent performance in various natural language processing tasks. By combining both visual and temporal information, a video-to-text ASL recognition model can be formulated as a transformer-based architecture. This involves encoding the video frames and generating corresponding ASL text. Transformer models like BERT or GPT can be adapted for this purpose [2].

5. Two-stream Networks: Two-stream networks employ separate streams for processing spatial and temporal information in videos. One stream focuses on visual appearance (RGB frames), while the other focuses on motion information (optical flow). The extracted features from both streams are then combined for ASL recognition. This approach can leverage both CNNs and RNNs or 3D CNNs.

Upon inspecting the existing models and papers in ASL recognition, we have examined various approaches and techniques utilized in this field. These models have demonstrated the effectiveness of different methodologies, including the combination of CNNs and RNNs, hand-crafted features with traditional classifiers, depth information with graph convolutional networks, and multimodal fusion techniques.

After careful consideration and analysis, it is evident that the LSTM-based model has exhibited several advantages for ASL recognition. The utilization of LSTM networks allows the model to capture temporal dependencies within sign language sequences effectively. This is crucial as ASL relies heavily on the dynamic movement and sequencing of hand gestures.

LSTM networks have proven to be successful in capturing long-term dependencies and temporal patterns, making them well-suited for ASL recognition tasks. By learning from the sequential nature of the input data, LSTM-based models can effectively model the temporal dynamics of sign language, resulting in improved recognition performance.

Additionally, LSTM networks offer the ability to handle variable-length input sequences, which is advantageous in ASL recognition where sign language sequences may have different durations. The ability to handle variable-length input allows for more flexibility and generalization in recognizing signs of varying lengths.

Considering these factors, we conclude that utilizing LSTM networks for ASL recognition holds great potential. The temporal modeling capabilities and flexibility of LSTM make it a suitable choice for capturing the sequential nature of sign language gestures.
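A stacked-LSTM classifier of the kind argued for here can be sketched in Keras as follows. The frame count, feature size (e.g. flattened Mediapipe Holistic landmarks), layer widths, and sign vocabulary size are illustrative assumptions, not the final architecture, which is presented in Chapter 4.

```python
from tensorflow.keras import layers, models

NUM_FRAMES = 30       # hypothetical frames per clip
NUM_FEATURES = 1662   # e.g. flattened Mediapipe Holistic landmarks
NUM_SIGNS = 26        # hypothetical vocabulary size

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
    layers.LSTM(64, return_sequences=True),  # keep per-step outputs for stacking
    layers.LSTM(128),                        # summarize the whole sequence
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_SIGNS, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```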


Fig 2.4: LSTM architecture

2.3.4 Deep Neural Network (DNN)

A deep neural network, also known as an artificial neural network, is a computational model inspired by the structure of human neural networks. Each neuron in a deep neural network is a mathematical function that takes input data, applies transformations to the data to make it more adjustable, and produces an output. This output is then passed on to the next layer as input or computed to generate the final output.

All deep neural networks have an input layer where the initial data is fed and a final output layer for prediction. However, what makes deep neural networks powerful are the hidden layers that exist between the input and output layers. These layers are interconnected, with the output of one layer serving as the input for the next layer. Hence, the term "deep learning" is associated with deep neural networks that have a significant number of hidden layers between the input and output.


Fig 2.5: Structure of a simple neural network with an input layer, an output layer, and two hidden layers

The simplified diagram above provides an understanding of the structure of a basic deep neural network. Each circular node represents a neuron in the network, organized into vertical layers. As shown, each neuron is connected to every neuron in the preceding layer, indicating that each neuron generates an input value for each neuron in the next layer. The arrows connecting the neurons are crucial, as they carry different values representing the importance of the connections between neurons. Stronger connections amplify the value of the neuron as it passes through the layers, contributing to the activation of the receiving neuron. It can be understood that each connection between two nodes in the deep neural network represents a coefficient in the equation used to calculate the value of the next node.

A neuron is activated when the sum of the input values surpasses a certain threshold. This threshold is typically determined by activation functions used in each layer. The activation process varies across the layers of the deep neural network. For example, in a deep neural network designed to recognize handwritten digits, the input layer consists of pixels representing handwritten images. The first hidden layer may detect various lines and diagonals in different directions. The second hidden layer may detect more complex patterns such as angles, curves, and overlaps. The activated neurons in the output layer of the deep neural network correspond to the number of labels to be classified, such as 10 labels from 0 to 9.

The value of node j in layer l is computed from the activations of the previous layer:

z_j^(l) = Σ_i w_ji^(l) ∗ a_i^(l−1) + b_j^(l)    (2.1)

a_j^(l) = σ(z_j^(l))    (2.2)

where:

• w_ji^(l) represents the weight coefficient of the connection between the i-th node in the previous layer and the j-th node in the current layer.

• b_i^(l) represents the bias term in the equation that calculates the value of the i-th node in the l-th layer of the deep neural network. It is often associated with a node with a value of 1 added in the input layer and hidden layers to limit the bias towards the origin.

• σ represents the activation function.

When converted into vector and matrix form, the above equations become:

z^(l) = (W^(l))^T ∗ a^(l−1) + b^(l),  a^(l) = σ(z^(l))    (2.3)
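The vectorized form can be written directly in NumPy; the layer sizes and random values below are arbitrary, and the sigmoid stands in for a generic activation σ.

```python
import numpy as np

def sigma(z):
    # Sigmoid activation function
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = rng.random(4)    # activations of the previous layer (4 nodes)
W = rng.random((4, 3))    # W[i, j]: weight from node i to node j of this layer
b = rng.random(3)         # one bias per node in the current layer

z = W.T @ a_prev + b      # z^(l) = (W^(l))^T * a^(l-1) + b^(l)
a = sigma(z)              # a^(l) = sigma(z^(l))
print(a)
```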

The model learns the connections between neurons, which is crucial for successful predictions during the training process. At each step of training, the network uses a mathematical function to determine the accuracy of its latest prediction compared to the expected output, known as the loss function. This function produces a series of error values, typically computed by measuring the difference between the model's prediction and the true value of the output, and seeks to minimize them. As a result, the system can compute how the model should update the values of the weights attached to each connection, with the ultimate goal of improving the accuracy of the network's predictions. This improvement is achieved through the calculation of gradients to find the "drop points" where the loss function is minimized. The magnitude of these changes is determined by an optimization algorithm, such as gradient descent, and these updates are propagated throughout the network at the end of each training cycle in a process known as backpropagation. Through many adjustments, the network continues to compute and make increasingly accurate predictions until it reaches a state of high accuracy.
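The loop below shows this idea in miniature for a single linear neuron trained with a mean-squared-error loss; the data, learning rate, and iteration count are arbitrary demonstration values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 3))               # 8 training examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])  # "true" weights that generate the targets
y = x @ w_true

w = np.zeros(3)                      # start from zero weights
lr = 0.1
for _ in range(2000):
    pred = x @ w
    err = pred - y                   # prediction minus expected output
    grad = 2 * x.T @ err / len(y)    # gradient of the mean squared error
    w -= lr * grad                   # step against the gradient
print(w)                             # converges towards w_true
```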

Each hidden layer is also known as a fully connected layer, and deep neural networks with only input and output layers along with hidden layers are often referred to as fully connected deep neural networks (FCN). However, the limitation of FCNs arises when dealing with complex input data, such as large-sized images, videos, long texts, speech, etc. This led to the development of alternative layer types and specialized deep neural network architectures tailored to address such problems.

2.3.5 Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNNs) are specialized deep-learning neural networks crafted to handle structured grid-like data such as images, videos, and time series data. CNNs have brought about a revolution in computer vision, unlocking extraordinary accomplishments in tasks such as image recognition, object detection, and image segmentation.

A typical CNN architecture comprises multiple layers, including convolutional layers, pooling layers, and fully connected layers. Convolutional layers play a crucial role in filtering and extracting relevant features from the input data. Pooling layers are responsible for downsampling the spatial dimensions, reducing computational complexity, and ensuring spatial invariance. Lastly, fully connected layers, also known as dense layers, connect the extracted features to the output layer, enabling the network to perform classification or regression tasks. This section provides an in-depth analysis of the convolutional layer and pooling layer; the previous section has already covered the fully connected layer.

Convolutional Layer

In the context of image processing, let's consider a training dataset containing color images of size 64 × 64 with three color channels (R, G, and B). When passing these images through a neural network, the input layer needs to have 64 × 64 × 3 = 12288 nodes, with each node corresponding to a pixel in the image. Assuming the first hidden layer consists of 1000 nodes, there will be a total of 12288 × 1000 = 12288000 weights (W) connecting the input layer to the first hidden layer, along with 1000 free coefficients. Consequently, the total number of parameters in the model amounts to 12289000. This count only considers the parameters between the input layer and the first hidden layer, disregarding the additional layers within the model. Moreover, as the image size grows, for instance, to 512 × 512, the number of parameters increases exponentially. To address problems involving image input data more effectively, we require an alternative solution, which is where the convolutional layer comes into play.

num-The convolutional layer is constructed based on the convolution operation,which enables the reduction of matrix dimensions while preserving key charac-teristics To illustrate this, consider the figure below:

Fig 2.6: Starting the convolutional operation

In the given example, we conduct the convolution operation between a 5 × 5 square matrix and a kernel of size 3 × 3. The kernel, being a square matrix, always has an odd dimension. Acting as a sliding window, the kernel moves across the entire 5 × 5 matrix. During each slide, it performs an operation by multiplying each element of the kernel matrix with the corresponding element in the target frame of the 5 × 5 matrix. The resulting value is then recorded in an output matrix. As depicted in the image above, the calculation for the first element of the output matrix is as follows:

1 ∗ 1 + 1 ∗ 0 + 1 ∗ 1 + 0 ∗ 0 + 1 ∗ 1 + 1 ∗ 0 + 0 ∗ 1 + 0 ∗ 0 + 1 ∗ 1 = 4    (2.5)

This process is repeated until the kernel has traversed the entire 5 × 5 matrix, continuing the convolution operation.


Fig 2.7: Step two in the convolutional operation

Fig 2.8: Finish the convolution operation when the kernel goes through the 5×5 matrix

To perform the convolution operation between a matrix X of size m × n and a kernel matrix W of size k × k (where k is odd), denoted as Y = X ⊗ W, follow the steps outlined below. First, for each element x_ij in the matrix X, create a matrix A of the same size as the kernel matrix W with x_ij as the center (the odd kernel size facilitates this). Next, calculate the element-wise product of matrix A and matrix W, summing the resulting values. Write this sum in the corresponding position of the matrix Y. Notably, the resulting matrix Y will have a smaller size than the matrix X. Specifically, the size of matrix Y is (m − k + 1) × (n − k + 1). A direct implementation of this definition is sketched below.
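A naive NumPy implementation of this definition follows; the input values are assumed to match the worked example above, so the first output element is again 4, as in equation (2.5).

```python
import numpy as np

def conv2d(X, W):
    # "Valid" convolution: slide the k x k kernel over X and record
    # the sum of element-wise products at every position.
    m, n = X.shape
    k = W.shape[0]
    Y = np.zeros((m - k + 1, n - k + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + k, j:j + k] * W)
    return Y

# Values assumed to match the 5 x 5 example above.
X = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]], dtype=float)
W = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]], dtype=float)

print(conv2d(X, W))  # 3 x 3 output; Y[0, 0] == 4, as in equation (2.5)
```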


Fig 2.9: Example of the convolutional matrix

When dealing with elements on the outer border, such as x_11, in the convolution operation, a common approach is to handle the missing elements. Typically, the elements on the outer edge are ignored, since matrix A cannot be fully extracted from X. To address this, a common solution is to augment matrix X by adding zero values along its outer border. By padding the matrix with zeros, the convolution operation can be performed, including the multiplier for the edge elements of X.

Fig 2.10: X matrix when adding outer padding of zeros

It is important to note that when referring to padding = p, it indicates the addition of a vector of p zeros to each side of the matrix.

As mentioned earlier, we perform the convolution operation by sequentially multiplying the convolution with the elements in matrix X, resulting in the matrix Y of the same size as X. In this context, we refer to the convolution as having a stride of 1.


Fig 2.11: Convolutional operation when stride=1, padding=1

However, when the stride is set to a value of s (where s > 1), the convolution operation is only performed on elements in the matrix X that follow the pattern x_(1+i×s),(1+j×s). For instance, if the stride is set to 2, the convolution operation is applied as follows:

Fig 2.12: Convolutional operation stride=2, padding=1

To develop a simple understanding, we can start at position x_11 and then move s steps vertically and horizontally until reaching the end of the matrix X. In the example mentioned, when using stride = 2, the resulting Y matrix has a size of 3 × 3, which is significantly smaller than the original matrix X. Therefore, stride is often employed to reduce the size of the matrix after convolution. In general, the convolutional multiplication of a matrix X with dimensions m × n and a kernel of size k × k (where k is odd), with stride = s and padding = p, yields a Y matrix with dimensions ((m − k + 2p)/s + 1) × ((n − k + 2p)/s + 1). A small helper that evaluates this formula is sketched below.
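The formula can be checked with a small helper; the three calls reproduce the cases discussed in this section (no padding, padding = 1, and stride = 2 with padding = 1).

```python
def conv_output_size(m, n, k, s=1, p=0):
    # ((m - k + 2p)/s + 1, (n - k + 2p)/s + 1) for an m x n input,
    # a k x k kernel, stride s, and padding p.
    return (m - k + 2 * p) // s + 1, (n - k + 2 * p) // s + 1

print(conv_output_size(5, 5, 3))            # (3, 3): plain convolution
print(conv_output_size(5, 5, 3, p=1))       # (5, 5): padding preserves the size
print(conv_output_size(5, 5, 3, s=2, p=1))  # (3, 3): stride 2 shrinks the output
```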

The convolution layer, as its name suggests, conducts the convolution operation on the pixel matrix of an image. However, since color images consist of three channels - R, G, and B - the image is represented as a 3-dimensional tensor. Consequently, in this scenario, the kernel is also defined as a 3-dimensional tensor with dimensions k × k × 3.

Fig 2.13: Illustration of convolutional operation on a color image with k=3

In order to align the kernel with the image representation, we set the depth of the kernel to match the depth of the image. This allows for the application of consistent kernel block movement, similar to that of a two-dimensional matrix. Consequently, the convolution operation operates uniformly across all image channels. Each distinct kernel enables the learning of different image features. Therefore, multiple kernels are employed in each convolution layer to capture various attributes of the image. As each kernel produces a single matrix, k kernels will generate k output matrices. These k output matrices are then combined to form a 3-dimensional tensor with a depth of k, facilitating the comprehensive analysis of the image's attributes.


Fig 2.14: Tensor X, W in 3 dimensions written as 3 matrices

In general, considering the input of a convolutional layer as a tensor with dimensions H × W × D, where H represents the height, W the width, and D the depth, the kernel is of size F × F × D (with the kernel's depth always equal to the input's depth) and F being an odd number. The convolutional layer applies K kernels, with stride S and padding P. As a result, the output of the convolutional layer is a 3-dimensional tensor with dimensions ((H − F + 2P)/S + 1) × ((W − F + 2P)/S + 1) × K.

Pooling Layer

Pooling layers are commonly utilized in conjunction with convolutional layers to decrease data size while retaining crucial properties. This reduction in data size helps reduce computational requirements within the model. Assuming a pooling size of K × K, when applied to an input of size H × W × D, the input is split into D matrices of size H × W. For each matrix, within a K by K region, the maximum or average value of the data is determined and recorded in the resulting matrix. Similar to convolution on images, stride and padding rules are applicable in pooling layers.


Fig 2.15: Max pooling layer with size=(3,3), stride=1, padding=0

When employing pooling layers, a common configuration is to use a size of 2 × 2, stride of 2, and padding of 0. This setup effectively reduces the length and width of the output data by half while maintaining the same depth. Two widely used types of pooling layers are max pooling and average pooling. A minimal max-pooling sketch is shown below.
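A minimal max-pooling routine in NumPy, applied to one channel; the 4 × 4 input is an arbitrary example showing how the common 2 × 2 window with stride 2 halves each spatial dimension.

```python
import numpy as np

def max_pool(X, size=2, stride=2):
    # Record the maximum value inside every size x size window.
    h, w = X.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Y[i, j] = X[i * stride:i * stride + size,
                        j * stride:j * stride + size].max()
    return Y

X = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(X))  # 4 x 4 input -> 2 x 2 output: [[5, 7], [13, 15]]
```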

Fig 2.16: Example of pooling layer

2.3.6 Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs) are a class of neural networks specifically designed to handle sequential data. While traditional neural networks process data inputs independently, RNNs introduce the concept of memory to capture temporal dependencies and context within sequential data.

The key feature of RNNs is the presence of recurrent connections within the network, which allows information to persist and flow through time. This enables RNNs to effectively process and generate sequential data, making them suitable for tasks such as natural language processing, speech recognition, time series analysis, and machine translation.
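A single recurrent step can be written as h_t = tanh(W_x ∗ x_t + W_h ∗ h_(t−1) + b). The minimal NumPy sketch below, with arbitrary dimensions and random weights, shows the hidden state being carried forward across a short sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W_x = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current input with the previous state.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)
for x_t in rng.standard_normal((5, input_dim)):  # a 5-step input sequence
    h = rnn_step(x_t, h)  # the hidden state carries context forward
print(h)
```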

At each time step, an RNN takes an input and produces an output, while also considering the previous hidden state as additional input. This recurrent
