GRADUATION PROJECT
SERVICE ROBOT FOR STUDENTS BASED ON
COMPUTER VISION AND NATURAL LANGUAGE
PROCESSING
Major: AUTOMATION AND CONTROL ENGINEERING TECHNOLOGY
Advisor: Assoc. Prof. Dr. LE MY HA
NGUYỄN TUẤN THANH Student ID: 17151028
Ho Chi Minh City, August 2022
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, August 6th, 2022
GRADUATION PROJECT ASSIGNMENT
Student name: Nguyen Tuan Thanh Student ID: 17151028
Major: Automation and Control Engineering Technology    Class: 17151CLA1
Advisor: Assoc. Prof. Dr. Le My Ha    Phone number: 0938811201
Date of assignment: February 21st, 2022    Date of submission: August 6th, 2022
1. Project title: Service robot for students based on computer vision and natural language processing
2. Initial materials provided by the advisor: references, reference programs, data sets, and expected parameters of the robot
3. Content of the project:
- Design and implement a service robot with two functions: chatting and talking
- Apply computer vision to detect whether the user is wearing a mask and to identify user information
- Apply natural language processing in a virtual voice assistant to communicate with humans
- Apply the Natural Language Toolkit (NLTK) to build a chatbot that communicates with humans
- Build a database and collect additional data while communicating with users
4. Final product: a complete service robot able to recognize users with high accuracy and to communicate with them based on a given knowledge database
CHAIR OF THE PROGRAM                    ADVISOR
(Sign with full name)                   (Sign with full name)
Faculty for High Quality Training – HCMC University of Technology and Education
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, August 6th, 2022

ADVISOR’S EVALUATION SHEET

Student name: Nguyen Tuan Thanh    Student ID: 17151028
Major: Automation and Control Engineering Technology
Project title: Service robot for students based on computer vision and natural language processing
Advisor: Assoc. Prof. Dr. Le My Ha

EVALUATION
1. Content of the project:
- Design and implement a service robot with two functions: chatting and talking
- Apply computer vision to detect whether the user is wearing a mask and to identify user information
- Apply natural language processing in a virtual voice assistant to communicate with humans
- Apply the Natural Language Toolkit (NLTK) to build a chatbot that communicates with humans
- Build a database and collect additional data while communicating with users
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Overall evaluation: (Excellent, Good, Fair, Poor)
6. Mark: ………… (in words: )
Ho Chi Minh City, August 6th, 2022
ADVISOR
(Sign with full name)
Faculty for High Quality Training – HCMC University of Technology and Education
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, August 6th, 2022

PRE-DEFENSE EVALUATION SHEET

Student name: Nguyen Tuan Thanh    Student ID: 17151028
Major: Automation and Control Engineering Technology
Project title: Service robot for students based on computer vision and natural language processing
Name of Reviewer:

EVALUATION
1. Content of the project:
- Design and implement a service robot with two functions: chatting and talking
- Apply computer vision to detect whether the user is wearing a mask and to identify user information
- Apply natural language processing in a virtual voice assistant to communicate with humans
- Apply the Natural Language Toolkit (NLTK) to build a chatbot that communicates with humans
- Build a database and collect additional data while communicating with users
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Overall evaluation: (Excellent, Good, Fair, Poor)
6. Mark: ………… (in words: )
Ho Chi Minh City, August 6th, 2022
REVIEWER
(Sign with full name)
Faculty for High Quality Training – HCMC University of Technology and Education
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
EVALUATION SHEET OF DEFENSE COMMITTEE MEMBER
Student name: Nguyen Tuan Thanh    Student ID: 17151028
Major: Automation and Control Engineering Technology
Project title: Service robot for students based on computer vision and natural language processing
Name of Defense Committee Member:
EVALUATION
1. Content of the project:
- Design and implement a service robot with two functions: chatting and talking
- Apply computer vision to detect whether the user is wearing a mask and to identify user information
- Apply natural language processing in a virtual voice assistant to communicate with humans
- Apply the Natural Language Toolkit (NLTK) to build a chatbot that communicates with humans
- Build a database and collect additional data while communicating with users
2. Strengths:
3. Weaknesses:
4. Overall evaluation: (Excellent, Good, Fair, Poor)
5. Mark: ………… (in words: )
Ho Chi Minh City, August 6th, 2022
COMMITTEE MEMBER
(Sign with full name)
ACKNOWLEDGEMENT
In the process of completing this graduation project, in addition to my own understanding, I have received a great deal of support and dedicated help.

First, I would like to express my deep gratitude to Associate Professor Dr. Le My Ha, who has been a teacher, a supporter, and an inspiration for me to complete this thesis. He oriented me toward the right topic and approach, and gave objective feedback that helped me when defending the work in front of the council. I therefore feel very fortunate to have worked with him.

Next, I would like to thank the Faculty of Electrical and Electronic Engineering as well as the Faculty for High Quality Training for imparting useful knowledge during my four years at the university. This knowledge plays a fundamental role in the implementation of my graduation thesis.

In addition, I would also like to thank the Intelligent Systems Laboratory (ISLAB) of the Faculty of Electrical and Electronic Engineering for supporting me with facilities as well as useful knowledge during the completion of the project. My sincere thanks also go to my friend Tran Thanh Hung, who supported and guided me in developing this topic.

Finally, I would like to thank my family for always supporting, caring for, and motivating me to complete the project in the best possible way.

Ho Chi Minh City, August 6th, 2022
Student
Table of Contents
CHAPTER 1: INTRODUCTION 1
1.1 Define a problem 1
1.2 Project objectives 2
1.3 Project task 2
1.4 Project scopes 2
1.5 Approach and research 2
1.6 Project description 2
CHAPTER 2: LITERATURE REVIEW 4
2.1 Survey of robots being used in service industry 4
2.1.1 Mission of robots in the service industry 4
2.1.2 Pepper robot 4
2.2 Background of face recognition system 6
2.2.1 Concept 6
2.2.2 Structure and procedure for face recognition 6
2.2.3 Face Detection 8
2.3 Color spaces in image processing 9
2.3.1 RGB color space (Red-Green-Blue) 9
2.3.2 HSV color space (Hue-Saturation-Value) 9
2.4 Histogram of Oriented Gradients algorithm 10
2.5 Support Vector Machine algorithm 11
2.6 Background of speech recognition system 13
2.6.1 Concept 13
2.6.2 Speech Recognition 13
2.6.3 Applications 15
2.7 Framework and libraries 17
2.7.1 Framework Pytorch 17
2.7.2 Pandas 18
2.7.3 Numpy 18
2.8 Voice Assistant 18
2.9 ChatBot 19
CHAPTER 3: SYSTEM DESIGN AND CONSTRUCTION 22
3.1 Requirements of the system 22
3.2 System description 22
3.2.1 The block diagram of the system 22
3.2.2 The function of each block 22
3.3 System design 23
3.3.1 Face detection: 23
3.3.2 Face recognition and identification: 24
3.3.3 Face mask detection 28
3.3.4 Speech recognition and voice assistant 30
3.3.5 Chatbot 30
CHAPTER 4: EXPERIMENT RESULTS, FINDINGS AND ANALYSIS 36
4.1 Face detection 36
4.2 Face recognition and identification 37
4.2.1 Training image data 37
4.2.2 Performing the face recognition 37
4.3 Face mask detection 38
4.4 Speech recognition and voice assistant 40
4.5 Chatbot 43
4.5.1 Create Training Data 43
4.5.2 NLP Basics 44
4.5.3 Complete chatbot 45
4.6 User interface 46
CHAPTER 5: CONCLUSIONS AND DIRECTIONS OF DEVELOPMENT 47
5.1 Conclusion 47
5.2 Direction of development 47
REFERENCES 48
ABBREVIATIONS
NLP: Natural Language Processing
OpenCV: Open Source Computer Vision Library
HOG: Histogram of Oriented Gradients
SVM: Support Vector Machine
Q&A: Question and answer
List of figures
Figure 2 1 Pepper robot working in a mobile store 5
Figure 2 2 A typical procedure for a face recognition model 7
Figure 2 3 Face and eye detection 8
Figure 2 4 RGB color space (Red-Green-Blue) 9
Figure 2 5 HSV color space (Hue-Saturation-Value) 10
Figure 2 6 Applications of HOG 11
Figure 2 7 An example of support vector in 2-Dimensional data 11
Figure 2 8 Margins describing in a plane 12
Figure 2 9 An example of linearly non separable dataset 12
Figure 2 10 Speech Recognition 14
Figure 2 11 Implementation of Speech Recognition 14
Figure 2 12 Interface of Windows Speech Recognition 16
Figure 2 13 Interface of Voice-To-Text Facebook Messenger 16
Figure 2 14 Interface of Google Speech to Text 17
Figure 2 15 Pytorch and TensorFlow Frameworks from 2017 to 2021 [10] 17
Figure 2 16 Reading CSV files with Pandas [11] 18
Figure 2 17 Example illustrating some functions in Numpy 18
Figure 2 18 Market share of voice assistants in the US, May 2018 [12] 19
Figure 2 19 Illustration for Chatbot 20
Figure 3 1 Block diagram of service robot designing by student 22
Figure 3 2 Face detection with 6 landmarks and multi-face support [17] 23
Figure 3 3 Training Process of face recognition 24
Figure 3 4 Five features of Haar cascade method [18] (a) Edge features (b) Line features (c) Four-rectangle feature 24
Figure 3 5 Cascade structure for Haar classifiers [18] 25
Figure 3 6 Sliding window in grayscale image [19] 27
Figure 3 7 Image meshing and histogram calculation [19] 27
Figure 3 8 Face recognition and identification processing 28
Figure 3 9 Face mask detection process 29
Figure 3 10 Plotting all the milestone central issues of an individual's face on a white
foundation can provide us with a best guess of the shape [20] 29
Figure 3 11 The structure of training data [22] 31
Figure 3 12 Example of training data bag of words [22] 32
Figure 3 13 Example of NLP preprocessing pipeline [22] 32
Figure 3 14 Structure of Feed Forward Neural Network [23] 33
Figure 3 15 The simplest form of perceptron [23] 34
Figure 3 16 Chatbot training structure [22] 34
Figure 4 1 Six facial features are displayed when human face is detected and frame rate is measured 36
Figure 4 2 Detecting multiple faces in the same frame 37
Figure 4 3 The process of training image data 37
Figure 4 4 Username recognition and display 38
Figure 4 5 Detect 68 landmarks on user's face 39
Figure 4 6 Bounding the mouth and warning when the user is not wearing a mask 39
Figure 4 7 When the user wears a mask, the system will not give an alert 40
Figure 4 8 Identify and answer questions from users when the question is in the data set .41
Figure 4 9 Identify and answer questions from users when the question is not in the data set 41
Figure 4 10 Save unknown questions to unknown question sheet in excel 42
Figure 4 11 Relative calculation of response speed of gtts library 42
Figure 4 12 Relative calculation of response speed of pyttsx3 library 43
Figure 4 13 Training data made by the student 43
Figure 4 14 Tokenize all questions from data file 44
Figure 4 15 Lowercase all word tokenized and remove characters 44
Figure 4 16 All words after remove duplicate word and sorted 45
Figure 4 17 Example of the bag of words for all patterns 45
Figure 4 18 Chatbot interface 46
Figure 4 19 User interface designed by the student 46
List of Tables
Table 2-1 Specifications of Pepper robot 5
Table 2-2 The speech recognition packages in Python 15
Table 4-1 Sample collects data from students 40
ABSTRACT
With the advancement of science and technology, robots are gradually replacing humans at work or helping them in daily life. Similarly, to make it more convenient to answer students' daily questions, this project designs a service robot that combines computer vision and natural language processing for this purpose. With the traditional ways of getting questions answered, students can go to school personnel or message student forums to ask about the problem they are facing. These forms often take a lot of time because the response time is long, the number of staff is limited, and the number of students asking questions is large. Therefore, this project proposes a solution that replaces the traditional question-answering forms with a robot capable of consulting and answering students' questions through two forms of communication: talking and chatting. In talking mode, the robot recognizes the user, recognizes the question by voice, and processes it to give an appropriate answer. In chatting mode, the user enters a question into the chat box, and the robot then processes it and gives an appropriate answer. As described above, this project offers the convenience of answering students' questions quickly, saving human resources for the school, and at the same time objectively capturing students' questions.
Keywords: service robot, computer vision, natural language processing.
CHAPTER 1: INTRODUCTION

1.1 Define a problem
In the field of education, in addition to imparting useful knowledge, it is also necessary to listen to and answer students' questions in the most effective way. Usually, the school sets up counseling teams or online forums for students to give their opinions or ask about unclear issues. For the form of Q&A with a counselor, the school sets up a team in charge of this task. The advantage of this form is that students can easily communicate and receive the right answers with more focus. For the online form of asking through forums, the university also has to hire staff to reply to messages and answer students' questions. This form is convenient, and students can even get answers through it without having to go to school. On the other hand, these two forms have disadvantages such as long waiting times for counseling, inflexible counseling hours, a limited number of consultants, and a large number of students. Figure 1.1 reflects the fact that students have to queue to receive advice from the school.
Figure 1 1 Students line up to wait for their turn for advice from the school
In addition, due to the impact of the Covid-19 pandemic, human-to-human communication has become increasingly difficult. Given the above problems, a robot could be a suitable solution for reducing the limitations of the two forms above. Such a device can effectively handle students' inquiries through two forms, talking and chatting, by using computer vision and natural language processing. Therefore, this thesis is proposed with the name “Service robot for students based on computer vision and natural language processing”.
1.2 Project objectives
With the essential need to serve students in answering the problems they encounter, this thesis was created to build a service robot with two functions, talking and chatting. This robot is capable of recognizing and warning when the user is not wearing a mask, of storing user information, and of communicating by voice or text depending on the intended use of the user.
1.3 Project task
The project is implemented with the following main contents:
Task 1: Collecting inquiries from students in the university
Task 2: Surveying methods for face detection and face recognition
Task 3: Surveying methods for speech recognition and processing
Task 4: Researching virtual assistants and chatbots
Task 5: Researching natural language processing methods
Task 6: Write the outlines to summarize the requirements of the project, design the block diagram of the system, and explain the functions of the blocks
Task 7: Designing software interfaces to interact with users
Task 8: Test, evaluate, and calibrate the entire system
Task 9: Write the project report
1.4 Project scopes
This project was created only to answer the questions of students on campus, in Vietnamese, through a software interface; the accuracy of the answers depends on the variety of the collected data and is suitable for a low-noise environment.
1.5 Approach and research
Approach:
– Reach out to the research object
– List the challenges that can be encountered when solving the problem
– Survey, evaluate and select algorithms, thereby forming the suitable system
1.6 Project description
The project is presented in 5 chapters as follows:
Chapter 1: INTRODUCTION
Introducing the research content of the topic, setting out the objectives and tasks that the topic needs to achieve, as well as clearly identifying the specific subject and scope of research for the topic.
Chapter 2: LITERATURE REVIEW
A general presentation of the subject of study, the algorithms used, and the knowledge involved in the system training process.
Chapter 3: SYSTEM DESIGN AND CONSTRUCTION
Detailing the functionality of each working block, explaining specifically the improvements used in system development, and describing the functionality of the interface and software.
Chapter 4: EXPERIMENT RESULTS, FINDINGS AND ANALYSIS
Giving the test results that have been achieved, proving the system's ability to complete the work.
Chapter 5: CONCLUSIONS AND DIRECTIONS OF DEVELOPMENT
Summarizing the solved problems and pointing out the remaining problems, thereby giving directions to solve them.
CHAPTER 2: LITERATURE REVIEW
In this chapter, the student introduces the application of robots in industry, the theory of face recognition and speech recognition, and their applications. Besides, the student also introduces the PyTorch framework, a popular framework for Machine Learning problems, and some other libraries.
2.1 Survey of robots being used in service industry
It is necessary to first describe robots in order to talk about their purposes. A robot is, in the simplest words, a machine designed to do difficult actions or jobs automatically. Some robots are designed to resemble humans, and these are called androids, but many robots do not take such a form.

Modern robots may employ artificial intelligence (AI) and speech recognition technologies, and they may be fully or partially autonomous. The industrial robots used in factories or production lines are an example of how most robots are programmed to carry out certain jobs with remarkable precision.
2.1.1 Mission of robots in the service industry
Robots have been a prominent technology trend in the hospitality sector, in part because self-service and automation concepts are becoming more and more important to the client experience. The usage of robots can result in advancements in efficiency, accuracy, and even speed.

For example, chatbots allow a hotel or travel company to provide 24/7 support through online chat or instant messaging services, even when staff would be unavailable, delivering extremely swift response times. Meanwhile, a robot used during the check-in process can speed up the entire process, reducing congestion.
2.1.2 Pepper robot
Pepper is a semi-humanoid robot manufactured by SoftBank Robotics (formerly Aldebaran Robotics), designed with the ability to read emotions. It was introduced at a conference on 5 June 2014 and was showcased in SoftBank Mobile phone stores in Japan beginning the next day. Pepper's ability to recognize emotion is based on detection and analysis of facial expressions and voice tones. To do so, Pepper has been equipped with hardware such as:
20 degrees of freedom for normal and expressive movements
Speech recognition and voice assistant in 15 languages
Perception modules
Touch sensors, LEDs and microphones
Infrared sensors, bumpers, an inertial unit, 2D and 3D cameras, and sonars
Figure 2.1 shows a robot called Pepper working in a mobile store [1]
Figure 2 1 Pepper robot working in a mobile store
● Specifications:
The robot's head has four microphones, two HD cameras (in the mouth and forehead), and a 3-D depth sensor (behind the eyes). There is a gyroscope in the torso and touch sensors in the head and hands. The mobile base has two sonars, six lasers, three bumper sensors, and a gyroscope.

It is able to run the existing content in the app store designed for SoftBank's Nao robot. Some necessary information about the robot is shown in the specifications in Table 2.1.
Table 2-1 Specifications of Pepper robot

Dimensions      Height: 1.20 meters (4 ft); Depth: 425 millimeters (17 in); Width: 485 millimeters (19 in)
Battery         Capacity: 30.0 Ah / 795 Wh
Display         10.1-inch touch display
Head            Mic × 4, RGB camera × 2, 3D sensor × 1, Touch sensor × 3
Legs            Sonar sensor × 2, Laser sensor × 6, Bumper sensor × 3, Gyro sensor × 1
Moving parts    Degrees of motion: Head (2°), Shoulder (2° L&R), Elbow (2 rotations L&R), Wrist (1° L&R), Hand with 5 fingers (1° L&R), Hip (2°), Knee (1°), Base (3°)
2.2.2 Structure and procedure for face recognition
Generally, a face recognition system is often described as a process that involves four stages, as shown in Figure 2.2: face detection, face alignment, feature extraction, and finally face recognition.
Figure 2 2 A typical procedure for a face recognition model
From the figure above, it can be concluded that a face recognition model consists of four stages, as described in detail below.
Face detection: As can be seen from the chart, the input of face detection is a sequence of images captured from a video stream. The detected faces may need to be tracked across multiple frames using a face tracking component. While face detection provides a coarse estimate of the location and scale of the face, face landmarking localizes facial landmarks (e.g., eyes, nose, mouth, and facial outline). This may be accomplished by a landmarking module or face alignment module. In short, face detection locates one or more faces in the image and marks them with a bounding box [2].

Face alignment: This stage is performed to normalize the face geometrically and photometrically. This is necessary because state-of-the-art recognition methods are expected to recognize face images with varying pose and illumination. The geometrical normalization process transforms the face into a standard frame by face cropping. Warping or morphing may be used for more elaborate geometric normalization. The photometric normalization process normalizes the face based on properties such as illumination and gray scale [2].

Feature extraction: This stage is vital for face recognition. Face feature extraction is performed on the normalized face to extract salient information that is useful for distinguishing faces of different persons and is robust with respect to the geometric and photometric variations. The extracted face features are used for face matching, which is described in the next stage [2].

Feature matching: The final stage performs matching of the face against one or more known faces in a prepared database. The matcher outputs 'yes' or 'no' for 1:1 verification. In the case of 1:N identification, the output is the identity of the input face when the top match is found with sufficient confidence, or unknown when the top match score is below a threshold. The main challenge in this stage of face recognition is to find a suitable similarity metric for comparing facial features [2].
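As an illustration of this matching stage, the following minimal sketch compares an extracted feature vector against a small database of known embeddings using Euclidean distance; the names, vectors, and threshold are made-up placeholders rather than values used in this project.

import numpy as np

# Hypothetical database of known face embeddings (values are illustrative only)
known_faces = {
    "student_A": np.array([0.12, 0.80, 0.33, 0.54]),
    "student_B": np.array([0.91, 0.05, 0.47, 0.22]),
}
query = np.array([0.10, 0.78, 0.35, 0.50])  # feature vector extracted from the input face

# 1:N identification: pick the closest known face, then apply a distance threshold
name, dist = min(
    ((n, np.linalg.norm(query - emb)) for n, emb in known_faces.items()),
    key=lambda item: item[1],
)
threshold = 0.3  # illustrative confidence threshold
print(name if dist < threshold else "unknown", f"(distance = {dist:.3f})")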
2.2.3 Face Detection

Figure 2 3 Face and eye detection

The algorithms must be trained on huge data sets with hundreds of thousands of both positive and negative images in order to help ensure accuracy. The algorithms' capacity to identify faces in a picture, and to locate where they are, increases with training.

The methods used in face detection:
Knowledge-based, or rule-based, methods describe a face based on rules. The challenge of this approach is the difficulty of coming up with well-defined rules.

Feature-invariant methods use features such as a person's eyes or nose to detect a face.

Template-matching methods are based on comparing images with standard face patterns or features that have been stored previously and correlating the two to detect a face. Unfortunately, these methods do not address variations in pose, scale, and shape.

Appearance-based methods employ statistical analysis and machine learning to find the relevant characteristics of face images. This method, also used in feature extraction for face recognition, is divided into sub-methods.
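To make the appearance-based family concrete, the sketch below runs OpenCV's pre-trained frontal-face Haar cascade, a detector of this family that is discussed further in Chapter 3, on a single image; the input and output file names are placeholders, and the detection parameters are typical values rather than project-tuned ones.

import cv2

# Load the Haar cascade classifier that ships with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("student.jpg")               # placeholder input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # the detector works on grayscale images

# scaleFactor and minNeighbors are common illustrative values
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detected.jpg", frame)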
2.3 Color spaces in image processing

2.3.1 RGB color space (Red-Green-Blue)
The RGB color model is an additive model in which red, green, and blue light are combined in different ways to form other colors. There, colors are represented as one or more integer decimal values. The RGB color model is represented in Figure 2.4.
Figure 2 4 RGB color space (Red-Green-Blue)
If each color channel is encoded with 1 byte (8 bits), with values in the range [0, 255], then we have a 24-bit color image, and 2^8 × 2^8 × 2^8 = 16,777,216 colors can be encoded (about 16 million colors). For example, some basic colors represented in the RGB color space are: [0; 0; 0] is black, [255; 255; 255] is white, [255; 0; 0] is red, [0; 255; 0] is green, and [0; 0; 255] is blue.
2.3.2 HSV color space (Hue-Saturation-Value)
The HSV color space is also known as HSI (Hue-Saturation-Intensity) or HSL (Hue-Saturation-Lightness). It is based on visual color properties such as tint, shade, and tone; in other words, color, purity, and brightness. Figure 2.5 shows a brief description of the HSV color space.
Figure 2 5 HSV color space (Hue-Saturation-Value)
● Hue: the color tone, which runs from 0 to 360.
● Saturation: the degree of purity of the color, that is, how much white is added to the pure color. The value of S is in the range [0, 255], where S = 255 is the purest color, completely non-white. In other words, the larger the S, the purer the color.
● Value: also known as Intensity or Lightness, with values in the range [0, 255], where V = 0 is completely dark (black) and V = 255 is completely bright. In other words, the larger the V, the brighter the color.
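As a small illustration of working with these color spaces, the following sketch converts an image from OpenCV's default BGR channel ordering to HSV; note that OpenCV stores hue in the range [0, 179] for 8-bit images rather than 0 to 360, and the input file name is a placeholder.

import cv2
import numpy as np

img = cv2.imread("sample.jpg")              # placeholder input image, loaded as BGR
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # H in [0, 179], S and V in [0, 255] for 8-bit images
h, s, v = cv2.split(hsv)
print("mean hue:", np.mean(h), "mean saturation:", np.mean(s), "mean value:", np.mean(v))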
2.4 Histogram of Oriented Gradients algorithm
HOG (Histogram of Oriented Gradients) [5] is an algorithm that generates a feature descriptor to detect objects. From a photo, two important matrices that preserve image information are extracted: gradient magnitude and gradient orientation. These two pieces of information are combined into a histogram distribution, where the gradient magnitudes are accumulated into bins according to the gradient orientation. Finally, the HOG feature vector representing the histogram is obtained. Some applications of HOG are shown in Figure 2.6.
Figure 2 6 Applications of HOG
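A minimal sketch of extracting this descriptor with scikit-image's HOG implementation follows; the parameter values are common illustrative choices rather than the ones used in this project, and the input file name is a placeholder.

from skimage import color, io
from skimage.feature import hog

image = color.rgb2gray(io.imread("person.jpg"))  # HOG is computed on a grayscale image
features, hog_image = hog(
    image,
    orientations=9,           # number of gradient-orientation bins
    pixels_per_cell=(8, 8),   # cell size used to accumulate histograms
    cells_per_block=(2, 2),   # block size used for local normalization
    visualize=True,           # also return an image visualizing the descriptor
)
print("HOG descriptor length:", features.shape[0])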
2.5 Support Vector Machine algorithm
Supervised learning algorithms such as SVM are used to solve both classification and regression problems; however, SVM is mainly employed for classification problems in Machine Learning. Using a function known as a kernel, SVMs can effectively perform non-linear classification in addition to linear classification by implicitly mapping their inputs into high-dimensional feature spaces.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in Figure 2.7.
Figure 2 7 An example of support vector in 2-Dimensional data
For 1-dimensional data, the support vector classifier is a point. Similarly, for 2-dimensional data, the support vector classifier is a line, and for 3-dimensional data, a support vector classifier is a plane. For 4-dimensional data or more, the support vector classifier is a hyperplane.
In geometry, a hyperplane is a subspace whose dimension is one less than that of its ambient space. If a space is 3-dimensional, then its hyperplanes are the 2-dimensional planes, while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines. This notion can be used in any general space in which the concept of the dimension of a subspace is defined [3].
Figure 2 8 Margins describing in a plane
The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly, as shown in Figure 2.8.
However, data is rarely as clean as the simple example above. A dataset will often look more like the jumbled balls below, which represent a linearly non-separable dataset. To classify a dataset like this, it is necessary to move away from a 2D view of the data to a 3D view, as shown in Figure 2.9.
Figure 2 9 An example of linearly non separable dataset
Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane, as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.
SVM can work well on smaller and cleaner datasets with high accuracy. Because it uses only a subset of the training points, it gives efficient results. Despite these advantages, there are also some drawbacks when applying the SVM algorithm. Firstly, the algorithm is not suitable for larger datasets, which makes the training time longer. Secondly, it is less effective on noisier datasets with overlapping classes.
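To illustrate the linear and kernel cases described above, the following sketch trains a linear SVM and an RBF-kernel SVM on a toy non-linearly separable dataset; scikit-learn is used here purely for illustration and is not necessarily the library used in this project.

from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# A toy 2-D dataset that is not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear SVM versus an SVM with a non-linear (RBF) kernel
linear_clf = svm.SVC(kernel="linear").fit(X_train, y_train)
rbf_clf = svm.SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear_clf.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_clf.score(X_test, y_test))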
2.6 Background of speech recognition system

2.6.1 Concept
The process of speech recognition is intricate. The voice output signal is analog. Signal samples are feature-extracted by sampling, quantizing, and coding to create a digital signal. These characteristics serve as the input to the identification process. The recognition system then produces the recognition outcome.
Some difficult factors for the speech recognition problem are:
- Speakers pronounce words quickly or slowly.
- Spoken words often differ in length.
- The same person may say the same word with different pronunciations and endings, giving different analysis results.
- Each person has their own voice, expressed through pitch, loudness, intensity, and timbre. Noise from the environment and the receiving equipment also has a considerable effect on recognition efficiency.
Speech-to-text recognition and conversion systems are widely researched and developed by domestic and international scientists.

2.6.2 Speech Recognition

Speech recognition is the process by which spoken audio is captured and converted into text or another suitable form, as shown in Figure 2.10.
Figure 2 10 Speech Recognition
Although speech recognition appears quite futuristic, it is already commonplace. We can speak a question or a request during automated phone calls, and voice recognition is also used by virtual assistants such as Siri or Alexa to converse with us naturally.

Speech recognition in Python uses algorithms that model speech in terms of both language and sound. In order to extract the more important parts of speech, such as words and sentences, acoustic modeling is utilized to distinguish the phonemes and phonetics in our speech.
Figure 2 11 Implementation of Speech Recognition
As shown in Figure 2.11, speech recognition begins by using a microphone to transform the sound energy produced by the speaker into electrical energy. This electrical energy is subsequently converted from analog to digital and eventually to text.

The system separates the audio data into sounds and then uses algorithms to analyze the sounds and determine which word is most likely to fit the audio. Neural networks and natural language processing [6, 7] are used for all of this. The accuracy of voice recognition can be increased by identifying temporal patterns using hidden Markov models.
For the speech recognition task, Python supports many packages. Table 2.2 outlines some of these packages and highlights their specialty.
Table 2-2 The speech recognition packages in Python

Package                   Functionality
Apiai                     Includes natural language processing for identifying a speaker's intent
Google-cloud-speech       Offers basic speech-to-text conversion
Speech Recognition        Offers easy audio processing and microphone accessibility
Watson-developer-cloud    An Artificial Intelligence API that makes creating, debugging, running, and deploying APIs easy; it can be used to perform basic speech recognition tasks
Table 2.2 provides information about packages available in Python; among them, the package that stands out in terms of ease of use is Speech Recognition [6, 8].

Recognizing speech requires audio input, and Speech Recognition makes retrieving this input really easy. Instead of having to build scripts for accessing microphones and processing audio files from scratch, Speech Recognition can have us up and running in just a few minutes, as the short sketch after the list of advantages below illustrates.
The Speech Recognition package offers the following advantages [6, 8]:
Easy speech recognition from the microphone
Makes it easy to transcribe an audio file
It also lets us save audio data into an audio file
It also shows us recognition results in an easy-to-understand format
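A minimal sketch of this package is given below: it captures audio from the microphone and sends it to Google's free web recognizer; the Vietnamese language code is an assumption based on the project scope rather than a setting taken from the project code.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # reduce the effect of background noise
    print("Speak now...")
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio, language="vi-VN")
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("Recognition service error:", e)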
2.6.3 Applications
● Windows Speech Recognition
As seen in Figure 2.12, the "Windows Speech Recognition" application, born in 2009 and built into Microsoft Windows 7, Windows 8, and Windows 10, can identify speech to manage and control software and apps on the Windows operating system to speed up the user experience.
Figure 2 12 Interface of Windows Speech Recognition
The ability to manage and control computer software and applications, as well as to produce text from voice, are the application's key capabilities. However, there are still many issues with this recognizer to be fixed, such as the need to memorize commands before using it, trouble accurately separating voices, and the fact that the recognizer is ineffective and cannot yet recognize Vietnamese.
● Voice-To-Text Facebook Messenger
Facebook Messenger launched the “Voice-To-Text” feature in 2013. As seen in Figure 2.13, this program identifies the voice and turns it into a text message that is transmitted to the receiver via the Facebook Messenger application's text message input.

Along with other benefits, such as the ability to recognize speech quite precisely, employing Facebook's data warehouse eliminates the requirement for prior training. However, this tool does not support Vietnamese and simply converts voice to text.
Figure 2 13 Interface of Voice-To-Text Facebook Messenger
● Google Speech to Text
Google created Google Speech to Text around two years ago. The program integrates into the Chrome browser, works on a variety of platforms, including Windows, iOS, and Android, and can recognize lengthy texts. The Google Speech to Text API is displayed in Figure 2.14.
Figure 2 14 Interface of Google Speech to Text
This tool was launched in 2017 and has significantly improved on the disadvantages of its predecessors, with good language conversion and support for Vietnamese. Figure 2.14 shows one of the well-known voice recognition tools, which demonstrates the usefulness of voice recognition worldwide. Based on the above research, the student decided to use Google's voice recognition tool in this project.
2.7 Framework and libraries

2.7.1 Framework Pytorch
PyTorch [9] is a Python-based library for creating Deep Learning models and using them in various applications. PyTorch is not just a Deep Learning library but a package for scientific computing, as the official documentation of PyTorch [9] mentions:

“It's a Python-based scientific computing package targeted at two sets of audiences:
1. A replacement for NumPy to use the power of GPUs
2. A deep learning research platform that provides maximum flexibility and speed.”
PyTorch [9] is similar to Python in that it is designed with a focus on ease of use, and even users with very basic programming knowledge can use it in Deep Learning related projects. Figure 2.15 compares two popular frameworks for Machine Learning problems, TensorFlow and PyTorch, in the period from 2017 to 2021.
Figure 2 15 Pytorch and TensorFlow Frameworks from 2017 to 2021 [10]
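As a small illustration of the two audiences quoted above, the sketch below uses PyTorch as a GPU-capable tensor library with automatic differentiation; the tensor shapes and values are illustrative only.

import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Tensors behave much like NumPy arrays but can live on the GPU
x = torch.randn(3, 4, device=device, requires_grad=True)
w = torch.randn(4, 2, device=device, requires_grad=True)

y = (x @ w).sum()  # build a small computation graph
y.backward()       # autograd computes x.grad and w.grad automatically

print("gradient with respect to x:\n", x.grad)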