HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Question Answering in Vietnamese
NGUYEN THI MUNG
Mung.NT211261M@sis.hust.edu.vn
School of Information and Communication Technology
Supervisor: PhD Nguyen Thi Thu Trang
School: Information and Communication Technology
Hanoi, 04/2023
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Question Answering in Vietnamese
NGUYEN THI MUNG
Mung.NT211261M@sis.hust.edu.vn
School of Information and Communication Technology
Supervisor: PhD Nguyen Thi Thu Trang
Supervisor's signature
School: Information and Communication Technology
Hanoi, 04/2023
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
CONFIRMATION OF MASTER'S THESIS REVISIONS
Full name of the thesis author: Nguyễn Thị Mừng
Thesis title: Hỏi đáp tiếng Việt tự động (Automatic Question Answering in Vietnamese)
Major: Data Science
1. Clarifying the contributions of the thesis
- In the revised version, the author has presented the contributions of the thesis more clearly through the following contents:
- The author added Section 2 - Thesis's contributions in the conclusion chapter (page 46).
- The author clearly presented the necessity of a data-building process for the question-answering problem, in which speech input is considered (pages 17-18).
- The author clarified how training data and test data are evaluated in the proposed process (pages 28-29, 31-32).
2. Clarifying the algorithms used
- The author described more clearly how the question-answering problem is solved via the text-classification approach, specifying how class labels are created and how the answer is found once the class label of a question is determined (page 36).
- The author presented more clearly how answers are found using the similarity-comparison model and how the two approaches are evaluated on the same scale (pages 38-39).
3. Further explanation of the experiments and algorithm choices
- The author added the hyperparameters of the models used in the text-classification approach, including Random Forest, SVM, LSTM, PhoBERT, and the similarity-comparison model SBERT (pages 39-41).
- The parameters of the deep learning models were kept the same as in the original models. The machine learning models incorporated tuning during training to automatically find the most suitable parameter set.
4. Removing unnecessary presentation
- The author removed the presentation of question groups (page 18 of the unrevised thesis and page 19 of the revised thesis).
- The author removed the Naive Bayes algorithm from the previous thesis (page 7 of both versions, pages 39 and 41 of the unrevised thesis, and pages 43-44 of the revised thesis).
5. Revising the presentation of formulas
- The author revised the presentation of formulas following the university's template (page
- The author removed the chapter numbering for Chapter 1 and Chapter 5. Specifically, Chapter 1 was changed from "Chapter 1. Introduction" to "Introduction", and Chapter 5 was changed from "Chapter 5. Conclusion and future works" to "Conclusion and future works" (pages 1 and 46).
GRADUATION THESIS ASSIGNMENT
Affiliation: Hanoi University of Science and Technology
Nguyen Thi Mung - hereby warrants that the work and presentation in this thesis were performed by myself under the supervision of PhD Nguyen Thi Thu Trang. All the results presented in this thesis are truthful and are not copied from any other works. All references in this thesis, including images, tables, figures, and quotes, are clearly and fully documented in the bibliography. I will take full responsibility for even one copy that violates school regulations.
Student
Signature and Name
Nguyen Thi Mung
... sometimes like a friend, so that we can easily confide and share our difficulties. Under her guidance, I feel that I have improved a lot.
I would like to express my sincere thanks to the leadership and teachers at Hanoi University of Science and Technology in general, and in the School of Information and Communication Technology in particular, for giving me the opportunity to study in a new environment, useful and memorable in my student life.
I also want to thank my brothers and sisters, friends, students in laboratory 914, and partners. Thank you to everyone for their detailed guidance, enthusiastic help, and encouragement during my time in the laboratory as well as during my thesis work. Along with that, I would like to thank my friends inside and outside the School of Information and Communication Technology for their interest, sharing, and help in the past time.
Finally, I would like to express my sincere thanks to my family. I thank my family for always loving, caring, and being a spiritual support, a great source of motivation for me to overcome my difficulties and challenges. I thank Mr. Tuan - my love, for being there to encourage me in the most stressful times. In the process of making my graduation thesis, even though I have tried my best, it is still inevitable that mistakes will be made. I look forward to receiving suggestions from teachers and friends so that I will not encounter these errors in the future.
Once again, I sincerely thank you!
ABSTRACT
Finding information is becoming more and more challenging as the amount of knowledge on the Internet is getting bigger and bigger. Conventional search engines only return lists of short paragraphs or related links, which makes it difficult for users, especially those who lack experience and search skills. Therefore, it is essential to build a Question Answering system capable of quickly giving an accurate answer to a question. For this reason, the author proposes the topic "Question Answering in Vietnamese", with the goal of building a question-answering system applicable to Vietnamese, especially considering input through the human voice. Previous studies have solved the problem with many different approaches, among which the approach of using similar questions makes it easy to store data and deploy the system. Data is an important factor in ensuring the output quality of the system. The thesis proposes a process of building data based on similar questions, including two main steps: collecting data through two systems, named the Written Collection System and the Speech Collection System, and applying this process to building data for the Digital Transformation domain, with the initial question-answer pairs provided by the Ministry of Information and Communications. Based on the built data, the thesis also evaluates question-answering models in two approaches: classification and comparing the similarity between questions. The results show that the models have high accuracy, from 82-94%, in which the SVM model has the highest accuracy. At the same time, the model size is not too large and the prediction time is fast, which is suitable for deployment in practice. The evaluation results also show that the Automatic Speech Recognition (ASR) module affects the quality of the model by 3.9 to 10%. In the future, the thesis aims to expand the initial questions based on the available documents and, at the same time, partially automate and create tools to support the data quality controller in evaluating the data for the model.
2.2.3 Training data building process
2.2.4 Test data building process
3.1 Vietnamese Question Answering problem
3.1.2 Similarity questions problem
LIST OF TABLES
Table 1 Some examples of questions in the initial dataset
Table 2 Information about question length
Table 3 Some examples of short and long questions
Table 4 Special questions
Table 5 Data collection campaigns information
Table 6 Information about data collected through campaigns
Table 7 Model's hyperparameters
Table 8 Confusion matrix
Table 9 Evaluation results of experimented models
Table 10 Size and average prediction time
LIST OF FIGURES
Figure 1 Approaches to the QA problem
Figure 2 An example of a decision tree
Figure 8 The structure of the RNN network
Figure 9 An example of the input in the BERT model
Figure 13 The data collection interface
Figure 14 Training data evaluation process
Figure 15 Speech data collection process
Figure 16 Speech data collection interface
Figure 17 Distribution of the number of words in a sentence with the Written Collection System
Figure 18 Distribution of word count in the data collected by the speech system
Figure 19 Text classification model architecture
Figure 20 Similarity comparison model SBERT
LIST OF ACRONYMS
INTRODUCTION
In this chapter, the thesis presents the reasons for choosing the topic, based on the analysis of actual needs as well as previous studies on question-answering systems in Vietnamese and in the world. Along with that, this chapter also gives the aim and scope of the topic, the research orientation, and the layout of the thesis.
1. Problem Formulation
Finding information is becoming more and more challenging as the amount of knowledge on the Internet is getting bigger and bigger. Conventional search engines only return lists of short paragraphs or related links, which makes it difficult for users, especially those who lack experience and search skills. Therefore, it is essential to build a question-answering system capable of giving an accurate answer to a question quickly. Question Answering (QA) is a large branch in the field of natural language processing (NLP), which takes as input a question in natural language form, possibly text or sound, and then gives the corresponding answer [1].
Classification of the QA sysiem
There arc many ways to classify QA systems Based on the data source, we can divide the QA problem into three main categories structured data, semi- structured data, and unstructured dala [2] A knowledge graph is a representation
of structured dala Semi-structured data is usually presented mi the form of sls or
tables And unstructured dala is ofien represented as text in watural language such
as sentences, paragraphs, documents, etc Based on the domain, the question- answering, system is divided into twa main types: open-domain QA system and
closed-domain QA system [2] The goal of an open-domain system is to answer
questions in many different fields, based on data mining from rich information sources such as Wikipedia, Web Search, Meanwhile, a closed-domain system is geared towards answering a question for a particular domain The mumber of questions in a closed domain system is smaller with limited resources and
participalory conslruction by a leam of expenenecd experts in that field
Approaches for solving QA problems
Previous studies have solved the QA problem in many different ways. According to our knowledge, the approaches to this problem can be divided into four main groups, as described in Figure 1 [3].
Figure 1 Approaches to the QA problem
Figure 1 describes the approaches that previous studies have used to solve the QA problem, including (i) the traditional approach, (ii) Information Retrieval (IR) combined with Machine Reading Comprehension (MRC), (iii) using a knowledge base (KB), and (iv) based on similar questions with Question Entailment (QE) [3].
With the first approach, the QA problem is solved by a pipeline consisting of three main components: Question Processing, Document Retrieval, and Answer Extraction [1]. First, the user's question will be analyzed and processed by the Question Processing component. The task of this component is to understand the user's question and generate the query as input for the next component. At the same time, this component also exploits the content of the question to be able to provide useful information such as the question type, entities, and other important information, helping to increase the accuracy of the answer extraction process [4]. An 8-step pipeline in this module, including entity labeling, POS tagging, linguistic trimming heuristics, dependency parsing, sentiment analysis, and generating patterns for queries with ranking, was introduced by Usbeck, Ngomo, Bühmann, and Unger in their study [5]. After the question has been analyzed by the first component, the Document Retrieval component will rely on that analysis to search for related documents, usually texts or paragraphs, based on an IR module or Web Search
Engines [6]. Finally, the Answer Extraction component will search for and return the final answer based on those documents. To extract the answers, the research is usually based on the extraction of real information available in the documents [7] [8] [9], combined with previously analyzed answer-type information [10]. Some studies generate latent information from answers and questions, and then use matching technologies such as surface text pattern matching [11] [12], word or phrase matching [6], and syntactic structure matching [7] [13] [14]. Deploying a QA system in this approach helps to control the system in a better way, but this is a rather complicated task because it requires a combination of many natural language processing and information retrieval technologies.
The development of deep learning technologies has allowed data processing with a large amount of computation, which makes the research directions of QA problems based on reading comprehension more widely studied [1]. MRC is the problem of finding an answer to a question in natural language, based on a given passage. This passage will be selected among many text fragments in the database under the evaluation of the IR component, using document querying technologies. To solve the MRC problem, based on the answers, there are two main research directions: (i) generating the answer (Generative MRC) and (ii) extracting the answer from the passage (Extractive MRC). In the first direction, the answer will be generated automatically based on the input information. This is also how people read and understand the content of a passage and give their answers, so the answer will be more natural and closer to human answers. However, this also makes the construction of the training data more difficult and the quality assessment of the model more complicated. Some datasets built for this problem are the English datasets NarrativeQA [15] and Natural Questions [16], and the Chinese DuReader [17]. LSTM [18], ELMo [19], and GPT-2 [20] are popular models that researchers use to solve MRC problems in this direction. Unlike the first direction, in the second direction, the answer is part of the input passage. This makes evaluation easier and data construction less expensive. With the appearance of large datasets such as CNN/Daily Mail [21], MS MARCO [22], RACE [23], and SQuAD 2.0 [24], the studies following this approach achieved very good results. Notably, the BiDAF model [25] solves the problem by representing the text at different levels, while QANet [26] combines CNN and self-attention. Language models with encoder-decoder architectures such as BERT [27], XLM-R [28], and T5 [29], which allow encoding the question and the corresponding passage and returning the start and end positions of the answer through the decoder, also achieved high results on the above datasets. For Vietnamese, there are two open datasets, UIT-ViQuAD [30] and UIT-ViNewsQA [31], for problems following this research direction. The research direction of MRC models has shown the ability of computers to exploit information from documents to give answers. However, building data in this direction requires a lot of work and effort. The storage and management of documents are also very expensive.
Research following a KB approach uses structured data, in the form of a knowledge graph or SQL databases, representing facts and relationships between entities [32]. Berant et al. built the WEBQUESTIONS dataset, using the Google Suggest API to generate actual questions that start with wh- and have a unique entity [33]. However, the questions in WEBQUESTIONS were still quite simple, so Berant and Talmor later improved this dataset with COMPLEXWEBQUESTIONS in [34]. Compared with WEBQUESTIONS, the questions in COMPLEXWEBQUESTIONS can have more entities and contain more types of questions, such as compound, association, and comparison. In research on building a query system answering questions related to a particular product, Frank et al. [35] and Li et al. [36] use product attributes to form a knowledge graph for the system. For Vietnamese, Phan and Nguyen [37] build a knowledge graph in terms of triads (subject, verb, object) and transform the input question into the corresponding form. Dat et al. [38] used intermediate representative elements, including information about the question structure, question-type keywords, and semantic relationships between keywords, to build the knowledge in their system. Studies in this approach follow two main methods: (i) IR, searching for answers by sorting possible candidates [39], and (ii) semantic analysis to convert natural language queries into queries usable in search engines [40]. However, this approach requires a rather complex knowledge system and a lot of effort in maintaining and expanding that knowledge system.
Finally, studies following the QE approach use question-answer pairs as the source of knowledge for the system. This method builds on the definition of similar questions, which can be answered in the same way. With a user input question, the task of the QA problem is to find a question similar to it and return the corresponding answer [3]. [41] builds question-answer pairs from the Frequently Asked Questions (FAQs) of the Microsoft Download Center, combining AND/OR search techniques and combinatorial search techniques to create a full list of results for a related search. The CMU OAQA [42] uses a Bidirectional Recurrent Neural Network (BiRNN) combined with an attention mechanism to predict the similarity between two questions. For Vietnamese, T. Minh Triet and colleagues published the dataset UIT-ViCoV19QA [43], including 4,500 question-answer pairs related to the Covid epidemic, collected from FAQs on the official websites of healthcare organizations in Vietnam and around the world. With this approach, QA systems can easily build and store data, especially in closed systems serving the needs of a particular organization. However, data building in this approach often requires the involvement of experts in that field.
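The core retrieval step shared by systems in this approach can be illustrated with a deliberately simple sketch: score the user's question against each stored question and return the answer of the best match. The FAQ pairs below are invented, and word-overlap (Jaccard) similarity is only a stand-in for the learned similarity models discussed above:

```python
def words(question):
    """Crude normalization: lowercase, strip '?', split on whitespace."""
    return set(question.lower().replace("?", "").split())

def jaccard(a, b):
    """Word-overlap similarity between two questions, in [0, 1]."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

# Hypothetical FAQ-style question-answer pairs (the system's knowledge).
faq = [
    ("What is digital transformation?",
     "It is the adoption of digital technology across an organization."),
    ("How do I reset my password?",
     "Use the 'Forgot password' link on the login page."),
]

def answer(user_question):
    """Find the stored question most similar to the user's; return its answer."""
    _, best_answer = max(faq, key=lambda qa: jaccard(user_question, qa[0]))
    return best_answer

print(answer("what is digital transformation"))
```

A deployed system would replace the Jaccard scorer with a trained similarity model, but the overall find-the-nearest-question logic stays the same.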
In addition, the input of a QA system can be text or speech. With speech input, there are two main approaches for the system to understand the user's question: (i) training an end-to-end model and (ii) modularizing each component separately. With the first approach, the system is a single unit that takes audio as input and returns the corresponding response. [44] proposes SpeechBERT, a joint audio-and-text pre-trained model trained from both audio and text data. However, building high-quality training data for such systems is difficult and expensive. In the second approach, the system is separated into independent modules. First, the ASR module is responsible for converting the speech to text; then the generated text is passed to the next processing module to give the answer. With this approach, the system has a modular division, so it is easy to control the quality of each module. However, building such a pipeline system will also be more complicated.
Based on the analysis of the approaches to the QA problem, and at the same time, based on the need for building a Virtual Assistant for the Digital Transformation domain of the Ministry of Information and Communications, the thesis aims to research and evaluate QA models based on similarity questions, considering the input of the system in speech form.
2. Goal and scope
Based on the studies in section 1, the thesis has two main objectives: (i) building and enriching datasets on the Digital Transformation domain and (ii) evaluating QA models based on the built data.
With the first goal, the thesis aims to build a process of collecting data for the QA problem, with the initial data being question-answer pairs in the Digital Transformation domain. The data after going through this process will be used as training data and testing data for the corresponding QA model, with the user input in speech. Although tested and implemented in the Digital Transformation data domain, the proposed process should be general, extensible, and applicable to other data domains. This will be a sample process for individuals, organizations, and subsequent researchers to apply in designing and building their own QA systems.
With the data that has been built, the second goal of the thesis is to evaluate QA models on this dataset. The test results will contribute to proving the quality of the built data and also serve as a basis for evaluating the feasibility of deploying QA models in practice.
3. Solution orientation
From the objectives given in section 2, our proposed solution orientations for the thesis are as follows: (i) building a process of collecting training and testing data for the QA model and (ii) evaluating QA models based on the built data.
The data will be collected in written and spoken form, with the participation of collaborators guided by experts in the field of Digital Transformation. In particular, the Written Collection System will support the collection of training data, and the Speech Collection System will support the building of test data for the model.
Besides building data, the thesis will also test different models to solve the QA problem in two main approaches: text classification and similarity comparison. These are the methods that fit the built data. Based on the experimental results, the thesis will evaluate the feasibility of these methods when deployed in practice.
4. Outline
The rest of the thesis is presented as follows.
Chapter 1 presents an overview of the theoretical background related to the QA problem, focusing on models for text classification and for assessing the similarity of questions. This is the basis for evaluating the solutions in the next chapters.
Next, Chapter 2 presents studies on data construction based on available question-answer pairs. Based on those studies, the thesis presents a proposal for the data-building process to solve the QA problem, specifically applied in the Digital Transformation domain.
Finally, Chapter 3 presents the evaluation results obtained on the built dataset. This will be the basis for considering the feasibility of implementing the QA model in practice.
CHAPTER 1. THEORETICAL BACKGROUND
In this chapter, the thesis presents the theoretical background used in the thesis, from which readers can grasp the basic concepts. The theoretical background presented includes basic classification algorithms, feature extraction, and the BERT language model.
1.1 Text classification algorithms
1.1.1 Random Forest
Random Forest [46] is a set of many Decision Trees [47]. The number of trees in the forest can be up to hundreds or thousands. A Decision Tree is a structured hierarchical tree made up of sequences of rules. In the case of a boy who wants to play soccer, Figure 2 is an example of a decision tree. It describes a tree that decides whether to go play soccer or stay at home based on the weather, humidity, and wind. If the weather is sunny and the humidity is normal, the boy is more likely to go to soccer. And if it rains and there are strong winds, it is likely that he will choose to stay at home.
The key point in building a Decision Tree lies in the Iterative Dichotomiser 3 (ID3) algorithm [45]. Usually, data will have many different attributes. In the original example, the data includes a lot of information about the weather (sunny/rainy), humidity (high/normal), and wind strength (strong/light). With these attributes, ID3 determines their order at each decision step. The best attribute is selected through a measurement. A division is considered good if the data at that step is entirely in one class. On the contrary, if the data is still mixed together in a large proportion, the division is not really good.
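The "measurement" mentioned above is, in ID3, information gain: the reduction in entropy achieved by splitting on an attribute. The sketch below illustrates this on an invented toy version of the weather table (ID3 itself also recurses on each branch, which is omitted here):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (0 for a pure split)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on the attribute at attr_index."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: (outlook, humidity, wind) -> go play soccer or stay home
rows = [
    ("sunny", "normal", "light"), ("sunny", "high", "light"),
    ("rainy", "high", "strong"), ("rainy", "normal", "strong"),
    ("sunny", "normal", "strong"), ("rainy", "high", "light"),
]
labels = ["go", "go", "stay", "stay", "go", "stay"]

gains = [information_gain(rows, labels, i) for i in range(3)]
best = max(range(3), key=lambda i: gains[i])
print(gains, best)
```

In this toy table the outlook attribute alone separates the two classes, so its information gain equals the full entropy of the labels and ID3 would test it first.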
The forest generates decision trees randomly, depending on the judgments in the learning process. The final decision results are aggregated based on the judgments of all the decision trees present in the forest.
1.1.2 Support Vector Machine
Support Vector Machine (SVM) [48] is a linear classification model, used to divide data into distinct classes. Consider the binary classification problem with N data points:

{(x_1, y_1), ..., (x_N, y_N)}

where x_i is an input vector represented in the space X ⊂ R^d and y_i ∈ {1; −1} is the class label corresponding to the input vector, in which y_i = 1 means the data point belongs to the positive class and y_i = −1 means the data point belongs to the negative class.
The goal of SVM is to define a linear classifying function between the two classes:

f(x) = ⟨w, x⟩ + b    (Eq. 1.1)

where w is the weight vector of the attributes, and b is a real numeric value. Based on the function f(x), we determine the output label for a data point as follows:

y_i = 1 if ⟨w, x_i⟩ + b ≥ 0; y_i = −1 if ⟨w, x_i⟩ + b < 0    (Eq. 1.2)
Suppose (x⁺, 1) is a point in the positive class and (x⁻, −1) is a point in the negative class, closest to the separating hyperplane H₀. Let H₊, H₋ be two parallel hyperplanes, where H₊ passes through (x⁺, 1) and is parallel to H₀, and H₋ passes through (x⁻, −1) and is parallel to H₀. The margin is the distance between the two hyperplanes H₊ and H₋. In order to minimize the error in the classifying process, we need to choose the hyperplane with the largest margin; such a hyperplane is called the maximum margin hyperplane.
The distance from x⁺ to H₀ is:

|⟨w, x⁺⟩ + b| / ‖w‖ = 1 / ‖w‖

Therefore, the problem of determining the maximum margin between the two hyperplanes is reduced to determining w and b so that margin = 2 / ‖w‖ reaches its maximum value and satisfies the conditions:

⟨w, x_i⟩ + b ≥ 1 if y_i = 1
⟨w, x_i⟩ + b ≤ −1 if y_i = −1

(since x⁺, x⁻ are the closest points to the separating hyperplane and belong to H₊ and H₋). This is equivalent to the optimization problem:

(w, b) = argmin_{w,b} ‖w‖² / 2, subject to 1 − y_i(⟨w, x_i⟩ + b) ≤ 0    (Eq. 1.7)
The above algorithm is applied in the case of linear data. For non-linear data, SVM uses kernel functions to transform the data into a new space, in which the resulting data is linearly separable. Some common kernel functions are the linear, polynomial, Radial Basis Function (RBF), and sigmoid kernels [45].
With the multiclass classification problem using SVM, there are many ways to return to the binary classification problem. Among them, the most commonly used method is one-vs-rest (also known as one-vs-all or one-against-all) [45]. Specifically, if there are C classes, we will build C models, in which each model corresponds to a class. Each of these models helps to distinguish whether a data point belongs to that class or not, or calculates the probability that a point falls into that class. The final result can be the class with the highest probability.
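Eq. 1.2 and the one-vs-rest scheme can be sketched directly in code. The weight vectors and class names below are hypothetical illustrations; in practice, w and b come out of the margin-maximization problem above (typically solved by a library):

```python
def svm_predict(w, b, x):
    """Binary SVM decision rule (Eq. 1.2): sign of f(x) = <w, x> + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def one_vs_rest_predict(models, x):
    """One-vs-rest: score x under each class's binary model, keep the best."""
    scores = {cls: sum(wi * xi for wi, xi in zip(w, x)) + b
              for cls, (w, b) in models.items()}
    return max(scores, key=scores.get)

# Hypothetical per-class models (C = 2 classes -> 2 binary models).
models = {
    "greeting": ([1.0, -0.5], 0.1),
    "question": ([-0.3, 0.8], -0.2),
}

print(svm_predict([2.0, -1.0], 0.5, [1.0, 1.0]))  # 2 - 1 + 0.5 = 1.5 -> 1
print(one_vs_rest_predict(models, [0.2, 0.9]))    # prints "question"
```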
1.1.3 Neural network
Neural networks are made up of single neurons, called perceptrons, which are inspired by biological human neurons. Figure 3 depicts the structure of a biological neuron.
Figure 3 A biological neuron
Each biological neuron consists of three main components: (i) the cell body is the bulge of the neuron; it contains the cell nucleus, plays a role in providing nutrition to the neuron, can generate nerve impulses, and can receive nerve impulses transmitted to the neuron; (ii) dendrites are short branches that develop from the cell body, whose function is to receive nerve impulses from other neurons and transmit them to the cell body; and (iii) the axon is a long single nerve fiber, responsible for transmitting signals from the cell body to other neurons. Inspired by this, the artificial neuron is designed with the structure depicted in Figure 4.
Figure 4 An artificial neuron
Each neuron has inputs corresponding to the dendrites, a processor using activation functions corresponding to the cell body, and a neuron output corresponding to the axon. The activation function is usually a nonlinear function such as the sigmoid, tanh, ReLU, or sign function [49].
Many neurons combine together to become neural networks. Neural networks usually have three kinds of layers: the input layer receives input data from the dataset, the output layer shows the predicted value of the model for the input data, and the hidden layers are the layers between the input and output layers.
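The structure just described can be sketched as a forward pass: each neuron computes a weighted sum of its inputs (the dendrites), applies a nonlinear activation (the cell body), and emits one output (the axon). The weights below are arbitrary illustrative values, not trained ones:

```python
import math

def sigmoid(z):
    """A common nonlinear activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then an activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(x, hidden_layer, output_layer):
    """Minimal network: input layer -> one hidden layer -> output layer."""
    h = [neuron(x, w, b) for w, b in hidden_layer]
    return [neuron(h, w, b) for w, b in output_layer]

# Hypothetical weights for a 2-input, 2-hidden-neuron, 1-output network.
hidden = [([0.5, -0.4], 0.1), ([0.3, 0.8], -0.2)]
output = [([1.2, -0.7], 0.05)]
y = forward([1.0, 0.5], hidden, output)
print(y)  # a single prediction in (0, 1)
```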
1.1.4 Recurrent Neural Network
With conventional neural networks, all inputs are independent of each other, so there is no chain link between them. In word processing problems, the order of words in a document is very important. Based on this, the Recurrent Neural Network (RNN) determines the value of the next element based on previous calculations. Figure 5 depicts the structure of the RNN network.
Figure 5 The structure of the RNN network
The computation inside the network at each step is based on the input at that step and the hidden state at the previous step, through some function like tanh or ReLU. The output at each step usually uses the softmax function.
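That recurrence can be sketched in a few lines: the hidden state h_t is computed from the current input x_t and the previous state h_{t-1} as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The small weight matrices below are illustrative, not trained:

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One RNN step: the new hidden state mixes the current input with
    the previous hidden state, so earlier inputs influence later steps."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    pre = [p + q + r for p, q, r in
           zip(matvec(W_xh, x_t), matvec(W_hh, h_prev), b)]
    return [math.tanh(p) for p in pre]

# Illustrative weights for a 2-dim input and a 2-dim hidden state.
W_xh = [[0.5, 0.1], [-0.2, 0.4]]
W_hh = [[0.3, -0.1], [0.2, 0.5]]
b = [0.0, 0.1]

h = [0.0, 0.0]                          # initial hidden state
for x_t in [[1.0, 0.0], [0.0, 1.0]]:    # a two-step input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b)
print(h)  # final hidden state summarizing the whole sequence
```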
1.1.5 Long Short-Term Memory
The theory has shown that for distant steps, the RNN has a problem of long-range dependence, i.e., the network can only remember a small interval. This happens because of the vanishing gradient: the value of the gradient gets smaller as it propagates down to the lower layers, so the update performed by Gradient Descent does not change the weights of those layers much, making them unable to converge, and the RNN will not get good results. The Long Short-Term Memory network (LSTM) [49] was born to overcome this limitation. LSTM also has a sequence architecture similar to RNNs, but instead of having only one layer of neural networks, it has up to four layers that interact with each other in a very special way.
Figure 6 The architecture of LSTM
A special feature of LSTM is the cell state, the line that runs across the top of the diagram. This is considered the network's memory. At each cell, the LSTM can add or remove the necessary information through three gates: the forget gate, the input gate, and the output gate, respectively, as shown in the figure. The forget gate decides what information is unnecessary and should be discarded in this state. The input gate indicates what information should be added to the cell state. The output gate decides the output of this cell. With such a structure, the LSTM network has the ability to remember more distant states, thereby achieving better efficiency than the RNN network.
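The three gates can be sketched for a scalar input and a scalar state (real LSTMs are vector-valued, and the parameter values here are arbitrary illustrations, not learned ones):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM cell step. f, i, o are the forget, input, and output gates;
    c is the cell state (the 'memory line' running across the top)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])   # what to forget
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])   # what to add
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])   # what to expose
    c_cand = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])
    c = f * c_prev + i * c_cand          # keep part of the old memory, add new
    h = o * math.tanh(c)                 # output gate filters the new memory
    return h, c

# Arbitrary illustrative parameters (all set to 0.5 for brevity).
params = {k: 0.5 for k in
          ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wc", "uc", "bc")}

h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The key difference from the plain RNN step is the additive update of c: the forget gate can keep old information flowing unchanged, which is what lets gradients survive over long sequences.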
1.2 BERT language model
BERT [50] stands for Bidirectional Encoder Representations from Transformers. It is a pre-trained model that learns bidirectional contextual representation vectors of words, which are then transferred to other problems in the field of natural language processing. Compared with Word Embedding models, BERT is a breakthrough in representing a word as a numeric vector based on the word's context.
The architecture of the BERT model is a multilayer architecture consisting of several bidirectional Transformer encoder layers. The BERT Transformer uses a two-way attention mechanism, while the GPT Transformer uses one-way attention (unnatural, inconsistent with the way language appears), where all words pay attention only to the left context. A two-direction Transformer is often referred to as a Transformer encoder, while versions of the Transformer using only the left-hand context are often referred to as a Transformer decoder, because they can be used to generate text. The comparison between BERT, OpenAI GPT, and ELMo is shown visually in Figure 7.
Figure 7 BERT, OpenAI GPT and ELMo
There are two tasks used to train the BERT model: Masked LM and Next Sentence Prediction [27].
Masked LM
To train a representation model based on bidirectional context, a simple approach is to mask some random input tokens and then predict only the masked tokens; this task is called "masked LM" (MLM). In this case, the hidden vectors in the last layer corresponding to the masked tokens are fed into a softmax layer over the entire vocabulary for prediction. Google researchers masked 15% of all tokens (drawn from the WordPiece dictionary) in each sentence at random and predicted only the masked words. Figure 8 shows the BERT training scheme under the masked LM task.
Although this allows us to obtain a bidirectional training model, two disadvantages exist. The first is that we are creating a mismatch between pre-training and fine-tuning, because the [MASK] tokens are never seen during model refinement. To mitigate this, the selected words are not always replaced with the [MASK] token. Instead, the training data generator chooses 15% of tokens at random and performs the following steps. For example, with the sentence "con_chó của tôi đẹp quá" (my dog is so beautiful) and "đẹp" (beautiful) as the word chosen to mask: in 80% of cases the selected word is replaced with the [MASK] token, giving "con_chó của tôi [MASK] quá" (my dog is so [MASK]); in 10% of cases the selected word is replaced by a random word, for example "con_chó của tôi máy_tính quá" (my dog is so computer); in the remaining 10% of cases the sentence is kept unchanged as "con_chó của tôi đẹp quá" (my dog is so beautiful).
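The 80/10/10 replacement scheme above can be sketched in a few lines of Python. This is a simplified illustration, not the original BERT implementation; the helper name and toy vocabulary are made up.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=42):
    """Pick ~15% of tokens; of those, 80% -> [MASK], 10% -> a random
    word from the vocabulary, 10% -> kept unchanged. Returns the
    corrupted tokens and the indices the model must predict."""
    rng = random.Random(seed)
    out = list(tokens)
    targets = []
    for idx in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(idx)
            r = rng.random()
            if r < 0.8:
                out[idx] = "[MASK]"          # 80%: mask token
            elif r < 0.9:
                out[idx] = rng.choice(vocab)  # 10%: random replacement
            # else: 10%: keep the original token unchanged
    return out, targets

sentence = ["con_chó", "của", "tôi", "đẹp", "quá"]
vocab = ["máy_tính", "quạt", "nhà", "chạy"]
corrupted, targets = mask_tokens(sentence, vocab)
```

Note that the loss is computed only at the `targets` positions, which is why MLM predicts only 15% of tokens per batch.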
The Transformer encoder does not know which word it will be asked to predict or which word has been replaced by a random word, so it is forced to keep a contextual representation of every input token. Also, replacing 1.5% of all tokens with a random word does not seem to affect the model's ability to understand the language. The second disadvantage of using MLM is that only 15% of tokens are predicted in each batch, which suggests that more pre-training steps may be needed for the model to converge.
Next Sentence Prediction
Many important tasks in natural language processing, such as Question Answering, require understanding the relationship between two text sentences, which is not directly captured by language modeling. To train the model to understand the relationship between sentences, we build a model that predicts the next sentence based on the current sentence; the training data can be any corpus. Specifically, when choosing sentence A and sentence B for each training sample, there is a 50% chance that sentence B is the actual sentence following sentence A, and a 50% chance that it is a random sentence from the corpus.
In order for the model to distinguish between the two sentences, we mark the beginning of the first sentence with the token [CLS] and the end of each sentence with [SEP]. Figure 9 shows an example of the input representation.
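The NSP sampling and the [CLS]/[SEP] input format described above can be sketched as follows. This is a simplified illustration with a toy corpus; real BERT inputs also carry segment and position embeddings.

```python
import random

def make_nsp_example(doc, rng):
    """With probability 0.5, sentence B really follows sentence A
    (label IsNext); otherwise B is a random sentence from the corpus
    (label NotNext). The pair is wrapped with [CLS] and [SEP]."""
    i = rng.randrange(len(doc) - 1)
    a = doc[i]
    if rng.random() < 0.5:
        b, label = doc[i + 1], "IsNext"
    else:
        b, label = rng.choice(doc), "NotNext"
    tokens = ["[CLS]"] + a.split() + ["[SEP]"] + b.split() + ["[SEP]"]
    return tokens, label

corpus = ["chuyển đổi số là gì", "học máy là gì", "tôi muốn bật quạt"]
tokens, label = make_nsp_example(corpus, random.Random(0))
```

The hidden vector at the [CLS] position is what the NSP classifier reads, which is also why [CLS] is commonly reused as a whole-sentence representation downstream.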
1.3 Feature extraction
Feature extraction is the selection of text attributes and their vectorization into a vector space that can be easily processed by computers. In the following, we present some popular feature extraction methods.
1.3.1 Term Frequency — Inverse Document Frequency (TF-IDF)
The TF-IDF value [51] of a word represents the importance of that word in a document. TF (Term Frequency) is the frequency of occurrence of a word in a document, calculated according to Equation 1.8:

TF(t, d) = f(t, d) / Σ_{t' ∈ d} f(t', d)    (Eq 1.8)

where f(t, d) is the number of occurrences of the word t in the document d; the denominator is the total number of words in the document.

IDF (Inverse Document Frequency) is the inverse frequency of a word in a corpus. The purpose of IDF is to reduce the weight of words that appear often in the text but do not carry much meaning. The formula for calculating IDF is as follows:

IDF(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )    (Eq 1.9)

where |D| is the total number of documents in the set D, and the denominator is the number of documents in D that contain the word t. The TF-IDF value is then calculated as follows:

TF-IDF(t, d, D) = TF(t, d) · IDF(t, D)    (Eq 1.10)
Words with high TF-IDF values are those that appear frequently in one document and rarely in others. This value helps us filter out common words and retain high-value words (the keywords of the document). TF-IDF is a simple way to vectorize textual data, but the dimension of the vector equals the vocabulary size, which increases the computational load. Furthermore, word representation using TF-IDF cannot represent words outside the dictionary and cannot capture relationships between words.
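The TF, IDF and TF-IDF formulas above can be implemented directly. The following sketch uses plain Python with toy tokenized documents, for illustration only.

```python
import math
from collections import Counter

def tf(t, d):
    """Term frequency of word t in document d (a list of tokens):
    occurrences of t divided by the total number of words in d."""
    return Counter(d)[t] / len(d)

def idf(t, docs):
    """Inverse document frequency: log of the corpus size over the
    number of documents that contain t (t must occur somewhere)."""
    n_containing = sum(1 for d in docs if t in d)
    return math.log(len(docs) / n_containing)

def tf_idf(t, d, docs):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return tf(t, d) * idf(t, docs)

docs = [["chuyển", "đổi", "số", "là", "gì"],
        ["học", "máy", "là", "gì"],
        ["chuyển", "đổi", "số", "ở", "Việt", "Nam"]]
# "là" appears in 2 of 3 documents, so its IDF is low;
# "học" appears in only 1, so its TF-IDF in that document is higher.
```

In practice one would use a library implementation (which typically adds smoothing and normalization), but the core computation is exactly this product.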
1.3.2 Word2Vec

The CBOW model uses the context around each word as input and tries to predict the word that corresponds to that context. The architecture of the model is shown in Figure 10.
Figure 10. One-word CBOW model structure
The input, i.e. the context word, is a vector x = [x1 x2 ... xV] encoded as one-hot with size V, the size of the dictionary, where xi = 1 if i is the position of the word in the dictionary and xi = 0 otherwise. The hidden layer consists of N neurons, and the output layer is also a one-hot vector y = [y1 y2 ... yV] of size V. W_{VxN} is the weight matrix between the input and the hidden layer; W'_{NxV} is the weight matrix between the hidden layer and the output layer. The neurons of the hidden layer simply copy the weighted sum of the inputs to the next layer; there are no activations such as sigmoid, tanh or ReLU. The only nonlinearity is the softmax calculation in the output layer. By predicting the target word, we learn a vector representation of that word. SkipGram works similarly to the CBOW model but in the opposite direction: the SkipGram model takes a word as input and tries to predict the context around it.
Compared with TF-IDF, Word2Vec models can capture semantic relationships between words: the distance between pairs of words with similar meanings is approximately the same, for example "king" - "queen" and "man" - "woman". However, Word2Vec still cannot represent words that are not in the dictionary.
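The one-word CBOW forward pass described above (linear hidden layer, softmax output) can be sketched in NumPy with toy sizes. This is illustrative only, not a trained model.

```python
import numpy as np

def cbow_forward(x_onehot, W_in, W_out):
    """One-word CBOW forward pass: a one-hot input of size V, a linear
    hidden layer of size N (no activation), and a softmax output over
    the vocabulary. W_in is V x N, W_out is N x V."""
    h = W_in.T @ x_onehot      # hidden layer: selects the input word's row of W_in
    u = W_out.T @ h            # scores over the vocabulary
    e = np.exp(u - u.max())    # softmax (the only nonlinearity)
    return e / e.sum()

V, N = 6, 3                    # toy vocabulary and hidden sizes
rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, N)) * 0.1
W_out = rng.normal(size=(N, V)) * 0.1
x = np.zeros(V); x[2] = 1.0    # one-hot context word
y = cbow_forward(x, W_in, W_out)
```

After training, the rows of W_in serve as the learned word vectors, which is where the "king" - "queen" style regularities live.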
Thus, CHAPTER 1 has presented the basic theoretical background for the following chapters. Next, in CHAPTER 2, the author presents the process of building data for the QA problem.
CHAPTER 2 BUILDING A VIETNAMESE QA DATASET
In this chapter, the thesis presents how to build data for the Vietnamese Question Answering problem. The construction process uses two data collection systems: the Written Collection System and the Speech Collection System.
2.1 Overview
With the proliferation of documents, the need to find information increases day by day. A simple way to meet this need is to develop frequently asked questions (FAQs) related to the organization, field, or issue that the organization wants to convey. For example, the Microsoft Download Center provides FAQs so that users can look up problems related to Microsoft products and services [53]. The Ministry of Information and Communications also released the "Cẩm nang chuyển đổi số" (Digital Transformation Handbook) [54], which includes questions and answers in the field of Digital Transformation, based on the speeches of the Minister of Information and Communications, Nguyen Manh Hung.
However, as the number of questions grows, finding questions similar to the problem one is interested in takes a lot of time and effort. Search engines only return results for the words contained in the original question. But in daily communication, humans have many different ways to express a given request. For example, in the field of smart homes, to ask how to turn on the fan, we can ask directly "Làm sao để bật quạt" (How do I turn on the fan) or say "Tôi muốn cho quạt chạy thì làm thế nào" (I want the fan to run, how do I do it). In the latter example, the verb "bật" (turn on) has been replaced with "chạy" (run). If only word matching is used, the system may confuse the question with a different one because of the missing information. Although expressed in two different ways, both questions can have the same answer with instructions on how to use a fan in a smart home; the two questions in this case can be considered "similar" to each other. Therefore, approaching the QA system by developing similar questions for FAQs is a softer approach, helping the system understand input questions more flexibly.
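A tiny sketch makes the limitation of plain word matching concrete: simple word-overlap (Jaccard) similarity between the two fan questions above is low even though they ask the same thing. The Jaccard metric here is an illustrative stand-in, not the thesis's method.

```python
def jaccard(q1, q2):
    """Word-overlap similarity between two questions: the size of the
    shared word set divided by the size of the combined word set."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b)

q1 = "làm sao để bật quạt"
q2 = "tôi muốn cho quạt chạy thì làm thế nào"
sim = jaccard(q1, q2)   # only "làm" and "quạt" are shared
```

The two questions share only 2 of 12 distinct words, so any purely lexical system scores them as dissimilar; this is exactly the gap a similar-question dataset is meant to close.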
For the QA system to work effectively in this approach, building a high-quality similarity dataset is essential. However, organizations do not always have data large enough for training the QA model. Several methods have been proposed to enhance the data collection capabilities of a QA system. One method is to use machine translation to automatically generate questions similar to the questions in the system's database. However, this method still has many limitations, because the translation results of current machine translation methods have not yet reached high accuracy, so the generated data does not really meet the requirements of a quality QA system. Another method is to collect data from various sources, such as discussion sites, online teaching sites, question-and-answer forums, and online conversations between humans. However, collecting and processing data from these sources requires special skills and tools, and the reliability of the collected data must also be ensured. Therefore, to build QA systems most effectively, organizations need a clear and reasonable data collection process, combining different methods to collect quality and reliable data for training the QA model.
The data collected through the organization's process should be divided into two categories: training data and testing data. Training data needs to cover enough cases and the basic questions that the organization receives. At the same time, testing data also needs to be created to evaluate the capabilities of the QA model and ensure that it meets the requirements and can work correctly in different scenarios. In particular, with user input via speech, the QA model needs to be evaluated under the influence of the ASR module. This is especially important for live mobile QA systems, where the use of speech is common. However, the ASR process can face many problems, such as unclear utterances, sound disturbance, noise, and different pronunciations, all of which affect the ability of the QA model to give the correct answer. Therefore, evaluating the QA model under the influence of the ASR module is an important factor in evaluating the operability of the QA model with speech input.
When building data for an AI model, the data is usually split in a certain ratio, for example 80:20 or 70:30, to produce training data and test data. The training data is used to train the QA model to understand the user's input question and give the corresponding answer. The test data is used to evaluate the quality of the QA model. If the input of the model is speech and data is built from speech using the method above, the cost of data construction becomes extremely expensive, because speech data itself takes a lot of work and time to collect and evaluate, and it is not easy to refine into a high-quality dataset.
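The 80:20 split mentioned above can be sketched as follows. This is an illustrative helper, not part of the proposed process.

```python
import random

def split_dataset(pairs, train_ratio=0.8, seed=42):
    """Shuffle the question-answer pairs and cut at the given ratio,
    returning (training data, test data)."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

pairs = [(f"question {i}", f"answer {i}") for i in range(10)]
train, test = split_dataset(pairs)
```

Fixing the seed makes the split reproducible, which matters when several models must be compared on the same held-out questions.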
On the other hand, current QA systems mostly follow the direction of modularization, in which the system uses an ASR module to convert the speech received from the user into text and then analyzes this text to understand the user's question. Therefore, we can build only text data to train the model and then evaluate the trained model on speech data to assess the true quality of the model under the influence of the ASR module. This approach minimizes the cost of data building, because text can be constructed faster and at a lower cost than speech. Based on this, the thesis proposes a data-building process consisting of two main steps: step 1 builds training data, focusing on text data, and step 2 builds test data, with the data received as speech.
2.2 The process of building a Vietnamese QA dataset
2.2.1 Initial dataset
The initial dataset, provided by the Ministry of Information and Communications, consists of question and answer pairs in the field of Digital Transformation. Each question pair is assigned to a group for management purposes. The dataset includes 194 question-answer pairs, divided into 9 groups. These questions were built on the basis of the speeches of the Minister of Information and Communications, Nguyen Manh Hung, and relate to basic concepts, planning mechanisms, policies, and strategies, including issues of digital transformation in Vietnam and around the world. Questions in different groups are clearly differentiated; however, some questions within the same group can have a high degree of ambiguity. Some examples of question-answer pairs are given in Table 1.
Table 1. Some examples of questions in the initial dataset

No. 1
Group: Câu hỏi chung về chuyển đổi số (General questions about digital transformation)
Question: Chuyển đổi số là gì? (What is digital transformation?)
Answer: Chuyển đổi số là bước phát triển tiếp theo của tin học hóa, có được nhờ sự tiến bộ vượt bậc của những công nghệ mới mang tính đột phá, gọi chung là công nghệ số. (Digital transformation is the next development step of computerization, made possible by the remarkable progress of breakthrough new technologies, collectively known as digital technology.)

No. 2
Group: Câu hỏi chung về chuyển đổi số (General questions about digital transformation)
Question: Có thể nói rõ hơn về chuyển đổi số được không? (Can you be more specific about digital transformation?)
Answer: Có thể tham khảo "Cẩm nang chuyển đổi số" do Bộ Thông tin và Truyền thông xuất bản, hoặc truy cập website dx.mic.gov.vn để tìm hiểu thêm thông tin chi tiết. (You can refer to "Cẩm nang chuyển đổi số" (the Digital Transformation Handbook) published by the Ministry of Information and Communications, or visit the website dx.mic.gov.vn for more detailed information.)

No. 3
Group: Làm rõ một số khái niệm có liên quan đến chuyển đổi số (Clarification of some concepts related to digital transformation)
Question: Học máy là gì? (What is machine learning?)
Answer: Học máy là một nhánh nghiên cứu của trí tuệ nhân tạo và khoa học máy tính, tập trung vào việc sử dụng dữ liệu và thuật toán để bắt chước cách con người học. (Machine learning is a branch of artificial intelligence and computer science that focuses on using data and algorithms to mimic how humans learn.)
In Table 1, the first question "Chuyển đổi số là gì" (What is digital transformation?) and the second "Có thể nói rõ hơn về chuyển đổi số được không" (Can you be more specific about digital transformation) are both in the group "Câu hỏi chung về chuyển đổi số" (General questions about digital transformation). When asking such questions, questioners are all looking to learn about the concept of digital transformation. However, the first question can be answered succinctly, while the second needs a more detailed explanation, or a pointer to a useful source of information that the questioner can consult to better understand digital transformation. Questions number 1 and 3 belong to two different groups, "Câu hỏi chung về chuyển đổi số" and "Làm rõ một số khái niệm có liên quan đến chuyển đổi số" (Clarification of some concepts related to digital transformation). These questions are clearly separated in terms of semantics, as the first asks about the concept of digital transformation while the third deals with the concept of machine learning.
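The grouped question-answer structure of the dataset described above can be represented, for illustration, by a simple record type. The field names here are hypothetical, not taken from the thesis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QAPair:
    """One entry of the initial dataset: a question-answer pair
    assigned to a management group."""
    group: str
    question: str
    answer: str

dataset = [
    QAPair("Câu hỏi chung về chuyển đổi số", "Chuyển đổi số là gì?", "..."),
    QAPair("Làm rõ một số khái niệm có liên quan", "Học máy là gì?", "..."),
]
groups = {p.group for p in dataset}   # group labels used for management
```

With this shape, grouping questions for ambiguity analysis is a simple key lookup over `group`.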
The questions in the initial dataset also vary in length. Information about the length of the questions in this dataset is given in Table 2.

Table 2. Information about question length