HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Question Answering in Vietnamese
NGUYEN THI MUNG
Mung.NT211261M@sis.hust.edu.vn
School of Information and Communication Technology
Supervisor: PhD Nguyen Thi Thu Trang
School: Information and Communication Technology
Hanoi, 04/2023
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Question Answering in Vietnamese
NGUYEN THI MUNG
Mung.NT211261M@sis.hust.edu.vn
School of Information and Communication Technology
Supervisor: PhD Nguyen Thi Thu Trang
Supervisor's signature
School: Information and Communication Technology
Hanoi, 04/2023
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
CONFIRMATION OF MASTER'S THESIS REVISIONS
Full name of the thesis author: Nguyễn Thị Mừng
Thesis title: Hỏi đáp tiếng Việt tự động (Automatic Question Answering in Vietnamese)
Major: Data Science
1. Clarifying the contributions of the thesis
- In the revised version, the author has presented the contributions of the thesis more clearly through the following contents:
- The author added Section 2 - Thesis's contributions in the conclusion chapter (page 46).
- The author clearly presented the necessity of a data-building process for the question-answering problem, in which speech input is considered (pages 17-18).
- The author clarified how training data and test data are evaluated in the proposed process (pages 28-29, 31-32).
2. Clarifying the algorithms used
- The author described more clearly how the question-answering problem is solved via the text-classification approach, specifying how class labels are created and how the answer is found once the class label of a question is determined (page 36).
- The author presented more clearly how answers are found using the similarity-comparison model and how the two approaches are evaluated on the same scale (pages 38-39).
3. Further explanation of the experiments and algorithm choices
- The author added the hyperparameters of the models used in the text-classification approach, including Random Forest, SVM, LSTM, PhoBERT, and the similarity-comparison model SBERT (pages 39-41).
- The parameters of the deep learning models were kept the same as in the original models. The machine learning models incorporated tuning during training to automatically find the most suitable parameter set.
4. Removing unnecessary presentation
- The author removed the presentation of question groups (page 18 of the unrevised thesis and page 19 of the revised thesis).
- The author removed the Naive Bayes algorithm from the previous thesis (page 7 of both versions, pages 39 and 41 of the unrevised thesis, and pages 43-44 of the revised thesis).
5. Revising the presentation of formulas
- The author revised the presentation of formulas following the university's template (page
- The author removed the chapter numbering for Chapter 1 and Chapter 5. Specifically, Chapter 1 was changed from "Chapter 1. Introduction" to "Introduction", and Chapter 5 was changed from "Chapter 5. Conclusion and future works" to "Conclusion and future works" (pages 1 and 46).
GRADUATION THESIS ASSIGNMENT
Affiliation: Hanoi University of Science and Technology
Nguyen Thi Mung - hereby warrants that the work and presentation in this thesis were performed by myself under the supervision of PhD Nguyen Thi Thu Trang. All the results presented in this thesis are truthful and are not copied from any other works. All references in this thesis, including images, tables, figures, and quotes, are clearly and fully documented in the bibliography. I will take full responsibility for even one copy that violates school regulations.
Student
Signature and Name
Nguyen Thi Mung
... sometimes like a friend, so that we can easily confide and share our difficulties. Under her guidance, I feel that I have improved a lot.
I would like to express my sincere thanks to the leadership and teachers at Hanoi University of Science and Technology in general, and in the School of Information and Communication Technology in particular, for giving me the opportunity to study in a new environment, useful and memorable in my student life.
I also want to thank my brothers and sisters, friends, students in laboratory 914, and partners. Thank you to everyone for their detailed guidance, enthusiastic help, and encouragement during my time in the laboratory as well as during my thesis work. Along with that, I would like to thank my friends inside and outside the School of Information and Communication Technology for their interest, sharing, and help in the past time.
Finally, I would like to express my sincere thanks to my family. I thank my family for always loving, caring, and being a spiritual support, a great source of motivation for me to overcome my difficulties and challenges. I thank Mr. Tuan - my love, for being there to encourage me in the most stressful times. In the process of making my graduation thesis, even though I have tried my best, it is still inevitable that mistakes will be made. I look forward to receiving suggestions from teachers and friends so that I will not encounter these errors in the future.
Once again, I sincerely thank you!
ABSTRACT
Finding information is becoming more and more challenging as the amount of knowledge on the Internet is getting bigger and bigger. Conventional search engines only return lists of short paragraphs or related links, which makes it difficult for users, especially those who lack experience and search skills. Therefore, it is essential to build a Question Answering system capable of quickly giving an accurate answer to a question. For this reason, the author proposes the topic "Question Answering in Vietnamese", with the goal of building a question-answering system applicable to Vietnamese, especially considering input through the human voice. Previous studies have solved the problem with many different approaches, among which the approach of using similar questions makes it easy to store data and deploy the system. Data is an important factor in ensuring the output quality of the system. The thesis proposes a process of building data based on similar questions, including two main steps: collecting data through two systems, named the Written Collection System and the Speech Collection System, and applying this process to building data for the Digital Transformation domain, with the initial question-answer pairs provided by the Ministry of Information and Communications. Based on the built data, the thesis also evaluates question-answering models in two approaches: classification and comparing the similarity between questions. The results show that the models have high accuracy, from 82-94%, in which the SVM model has the highest accuracy. At the same time, the model size is not too large and the prediction time is fast, which is suitable for deployment in practice. The evaluation results also show that the Automatic Speech Recognition (ASR) module affects the quality of the model by 3.9 to 10%. In the future, the thesis aims to expand the initial questions based on the available documents and, at the same time, partially automate and create tools to support the data quality controller in evaluating the data for the model.
2.2.3 Training data building process
2.2.4 Test data building process
3.1 Vietnamese Question Answering problem
3.1.2 Similarity questions problem
LIST OF TABLES
Table 1 Some examples of questions in the initial dataset
Table 2 Information about question length
Table 3 Some examples of short and long questions
Table 4 Special questions
Table 5 Data collection campaigns information
Table 6 Information about data collected through campaigns
Table 7 Model's hyperparameters
Table 8 Confusion matrix
Table 9 Evaluation results of experimented models
Table 10 Size and average prediction time
LIST OF FIGURES
Figure 1 Approaches to the QA problem
Figure 2 An example of a decision tree
Figure 8 The structure of the RNN network
Figure 9 An example of the input in the BERT model
Figure 13 The data collection interface
Figure 14 Training data evaluation process
Figure 15 Speech data collection process
Figure 16 Speech data collection interface
Figure 17 Distribution of the number of words in a sentence with the Written Collection System
Figure 18 Distribution of word count in the data collected by the speech system
Figure 19 Text classification model architecture
Figure 20 Similarity comparison model SBERT
LIST OF ACRONYMS
INTRODUCTION
In this chapter, the thesis presents the reasons for choosing the topic, based on the analysis of actual needs as well as previous studies on question-answering systems in Vietnamese and in the world. Along with that, this chapter also gives the aim and scope of the topic, the research orientation, and the layout of the thesis.
1. Problem Formulation
Finding information is becoming more and more challenging as the amount of knowledge on the Internet is getting bigger and bigger. Conventional search engines only return lists of short paragraphs or related links, which makes it difficult for users, especially those who lack experience and search skills. Therefore, it is essential to build a question-answering system capable of giving an accurate answer to a question quickly. Question Answering (QA) is a large branch in the field of natural language processing (NLP), which takes as input a question in natural language form, possibly text or sound, and then gives the corresponding answer [1].
Classification of the QA sysiem
There arc many ways to classify QA systems Based on the data source, we can divide the QA problem into three main categories structured data, semi- structured data, and unstructured dala [2] A knowledge graph is a representation
of structured dala Semi-structured data is usually presented mi the form of sls or
tables And unstructured dala is ofien represented as text in watural language such
as sentences, paragraphs, documents, etc Based on the domain, the question- answering, system is divided into twa main types: open-domain QA system and
closed-domain QA system [2] The goal of an open-domain system is to answer
questions in many different fields, based on data mining from rich information sources such as Wikipedia, Web Search, Meanwhile, a closed-domain system is geared towards answering a question for a particular domain The mumber of questions in a closed domain system is smaller with limited resources and
participalory conslruction by a leam of expenenecd experts in that field
Approaches for solving QA problems
Previous studies have solved the QA problem in many different ways. According to our knowledge, the approaches to this problem can be divided into four main groups, as described in Figure 1 [3].
Figure 1 Approaches to the QA problem
Figure 1 describes the approaches that previous studies have used to solve the QA problem, including (i) the traditional approach, (ii) Information Retrieval (IR) combined with Machine Reading Comprehension (MRC), (iii) using a knowledge base (KB), and (iv) based on similar questions with Question Entailment (QE) [3].
With the first approach, the QA problem is solved by a pipeline consisting of three main components: Question Processing, Document Retrieval, and Answer Extraction [1]. First, the user's question will be analyzed and processed by the Question Processing component. The task of this component is to understand the user's question and generate the query as input for the next component. At the same time, this component also exploits the content of the question to be able to provide useful information such as the question type, entities, and other important information, helping to increase the accuracy of the answer extraction process [4]. An 8-step pipeline in this module, including entity labeling, POS tagging, linguistic trimming heuristics, dependency parsing, sentiment analysis, and generating patterns for queries with ranking, was introduced by Usbeck, Ngomo, Bühmann, and Unger in their study [5]. After the question has been analyzed by the first component, the Document Retrieval component will rely on that analysis to search for related documents, usually texts or paragraphs, based on an IR module or Web Search
Engines [6]. Finally, the Answer Extraction component will search for and return the final answer based on those documents. To extract the answers, the research is usually based on the extraction of real information available in the documents [7] [8] [9], combined with previously analyzed answer-type information [10]. Some studies generate latent information from answers and questions, and then use matching technologies such as surface text pattern matching [11] [12], word or phrase matching [6], and syntactic structure matching [7] [13] [14]. Deploying a QA system in this approach helps to control the system in a better way, but this is a rather complicated task because it requires a combination of many natural language processing and information retrieval technologies.
The development of deep learning technologies has allowed data processing with a large amount of computation, which makes the research directions of QA problems based on reading comprehension more widely studied [1]. MRC is the problem of finding an answer to a question in natural language, based on a given passage. This passage will be selected among many text fragments in the database under the evaluation of the IR component, using document querying technologies. To solve the MRC problem, based on the answers, there are two main research directions: (i) generating the answer (Generative MRC) and (ii) extracting the answer from the passage (Extractive MRC). In the first direction, the answer will be generated automatically based on the input information. This is also how people read and understand the content of a passage and give their answers, so the answer will be more natural and closer to human answers. However, this also makes the construction of the training data more difficult and the quality assessment of the model more complicated. Some datasets built for this problem are the English datasets NarrativeQA [15] and Natural Questions [16], and the Chinese DuReader [17]. LSTM [18], ELMo [19], and GPT-2 [20] are popular models that researchers use to solve MRC problems in this direction. Unlike the first direction, in the second direction, the answer is part of the input passage. This makes evaluation easier and data construction less expensive. With the appearance of large datasets such as CNN/Daily Mail [21], MS MARCO [22], RACE [23], and SQuAD 2.0 [24], the studies following this approach achieved very good results. Notably, the BiDAF model [25] solves the problem by representing the text at different levels, while QANet [26] combines CNN and self-attention. Language models with encoder-decoder architectures such as BERT [27], XLM-R [28], and T5 [29], which allow encoding the question and the corresponding passage and returning the start and end positions of the answer through the decoder, also achieved high results on the above datasets. For Vietnamese, there are two open datasets, UIT-ViQuAD [30] and UIT-ViNewsQA [31], for problems following this research direction. The research direction of MRC models has shown the ability of computers to exploit information from documents to give answers. However, building data in this direction requires a lot of work and effort. The storage and management of documents are also very expensive.
Research following a KB approach uses structured data, in the form of a knowledge graph or SQL databases, representing facts and relationships between entities [32]. Berant et al. built the WEBQUESTIONS dataset, using the Google Suggest API to generate actual questions that start with wh- and have a unique entity [33]. However, the questions in WEBQUESTIONS were still quite simple, so Berant and Talmor later improved this dataset with COMPLEXWEBQUESTIONS in [34]. Compared with WEBQUESTIONS, the questions in COMPLEXWEBQUESTIONS can have more entities and contain more types of questions, such as compound, association, and comparison. In research on building a query system answering questions related to a particular product, Frank et al. [35] and Li et al. [36] use product attributes to form a knowledge graph for the system. For Vietnamese, Phan and Nguyen [37] build a knowledge graph in terms of triads (subject, verb, object) and transform the input question into the corresponding form. Dat et al. [38] used intermediate representative elements, including information about the question structure, question-type keywords, and semantic relationships between keywords, to build the knowledge in their system. Studies in this approach follow two main methods: (i) IR, searching for answers by sorting possible candidates [39], and (ii) semantic analysis to convert natural language queries into queries usable in search engines [40]. However, this approach requires a rather complex knowledge system and a lot of effort in maintaining and expanding that knowledge system.
Finally, studies following the QE approach use question-answer pairs as the source of knowledge for the system. This method builds on the definition of similar questions, which can be answered in the same way. With a user input question, the task of the QA problem is to find a question similar to it and return the corresponding answer [3]. [41] builds question-answer pairs from the Frequently Asked Questions (FAQs) of the Microsoft Download Center, combining AND/OR search techniques and combinatorial search techniques to create a full list of results for a related search. The CMU OAQA [42] uses a Bidirectional Recurrent Neural Network (BiRNN) combined with an attention mechanism to predict the similarity between two questions. For Vietnamese, T. Minh Triet and colleagues published the dataset UIT-ViCoV19QA [43], including 4,500 question-answer pairs related to the Covid epidemic, collected from FAQs on the official websites of healthcare organizations in Vietnam and around the world. With this approach, QA systems can easily build and store data, especially in closed systems serving the needs of a particular organization. However, data building in this approach often requires the involvement of experts in that field.
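The core retrieval step shared by systems in this approach can be illustrated with a deliberately simple sketch: score the user's question against each stored question and return the answer of the best match. The FAQ pairs below are invented, and word-overlap (Jaccard) similarity is only a stand-in for the learned similarity models discussed above:

```python
def words(question):
    """Crude normalization: lowercase, strip '?', split on whitespace."""
    return set(question.lower().replace("?", "").split())

def jaccard(a, b):
    """Word-overlap similarity between two questions, in [0, 1]."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

# Hypothetical FAQ-style question-answer pairs (the system's knowledge).
faq = [
    ("What is digital transformation?",
     "It is the adoption of digital technology across an organization."),
    ("How do I reset my password?",
     "Use the 'Forgot password' link on the login page."),
]

def answer(user_question):
    """Find the stored question most similar to the user's; return its answer."""
    _, best_answer = max(faq, key=lambda qa: jaccard(user_question, qa[0]))
    return best_answer

print(answer("what is digital transformation"))
```

A deployed system would replace the Jaccard scorer with a trained similarity model, but the overall find-the-nearest-question logic stays the same.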
In addition, the input of a QA system can be text or speech. With speech input, there are two main approaches for the system to understand the user's question: (i) training an end-to-end model and (ii) modularizing each component separately. With the first approach, the system is a single unit that takes audio as input and returns the corresponding response. [44] proposes SpeechBERT, a joint audio-and-text pre-trained model trained from both audio and text data. However, building high-quality training data for such systems is difficult and expensive. In the second approach, the system is separated into independent modules. First, the ASR module is responsible for converting the speech to text; then the generated text is passed to the next processing module to give the answer. With this approach, the system has a modular division, so it is easy to control the quality of each module. However, building such a pipeline system will also be more complicated.
Based on the analysis of the approaches to the QA problem, and at the same time, based on the need for building a Virtual Assistant for the Digital Transformation domain of the Ministry of Information and Communications, the thesis aims to research and evaluate QA models based on similarity questions, considering the input of the system in speech form.
2. Goal and scope
Based on the studies in section 1, the thesis has two main objectives: (i) building and enriching datasets on the Digital Transformation domain and (ii) evaluating QA models based on the built data.
With the first goal, the thesis aims to build a process of collecting data for the QA problem, with the initial data being question-answer pairs in the Digital Transformation domain. The data after going through this process will be used as training data and testing data for the corresponding QA model, with the user input in speech. Although tested and implemented in the Digital Transformation data domain, the proposed process should be general, extensible, and applicable to other data domains. This will be a sample process for individuals, organizations, and subsequent researchers to apply in designing and building their own QA systems.
With the data that has been built, the second goal of the thesis is to evaluate QA models on this dataset. The test results will contribute to proving the quality of the built data and also serve as a basis for evaluating the feasibility of deploying QA models in practice.
3. Solution orientation
From the objectives given in section 2, our proposed solution orientations for the thesis are as follows: (i) building a process of collecting training and testing data for the QA model and (ii) evaluating QA models based on the built data.
The data will be collected in written and spoken form, with the participation of collaborators guided by experts in the field of Digital Transformation. In particular, the Written Collection System will support the collection of training data, and the Speech Collection System will support the building of test data for the model.
Besides building data, the thesis will also test different models to solve the QA problem in two main approaches: text classification and similarity comparison. These are the methods that fit the built data. Based on the experimental results, the thesis will evaluate the feasibility of these methods when deployed in practice.
4. Outline
The rest of the thesis is presented as follows.
Chapter 1 presents an overview of the theoretical background related to the QA problem, focusing on models for text classification and for assessing the similarity of questions. This is the basis for evaluating the solutions in the next chapters.
Next, Chapter 2 presents studies on data construction based on available question-answer pairs. Based on those studies, the thesis presents a proposal for the data-building process to solve the QA problem, specifically applied in the Digital Transformation domain.
Finally, Chapter 3 presents the evaluation results obtained on the built dataset. This will be the basis for considering the feasibility of implementing the QA model in practice.
CHAPTER 1. THEORETICAL BACKGROUND
In this chapter, the thesis presents the theoretical background used in the thesis, from which readers can grasp the basic concepts. The theoretical background presented includes basic classification algorithms, feature extraction, and the BERT language model.
1.1 Text classification algorithms
1.1.1 Random Forest
Random Forest [46] is a set of many Decision Trees [47]. The number of trees in the forest can be up to hundreds or thousands. A Decision Tree is a structured hierarchical tree made up of sequences of rules. In the case of a boy who wants to play soccer, Figure 2 is an example of a decision tree. It describes a tree that decides whether to go play soccer or stay at home based on the weather, humidity, and wind. If the weather is sunny and the humidity is normal, the boy is more likely to go to soccer. And if it rains and there are strong winds, it is likely that he will choose to stay at home.
The key point in building a Decision Tree lies in the Iterative Dichotomiser 3 (ID3) algorithm [45]. Usually, data will have many different attributes. In the original example, the data includes a lot of information about the weather (sunny/rainy), humidity (high/normal), and wind strength (strong/light). With these attributes, ID3 determines their order at each decision step. The best attribute is selected through a measurement. A division is considered good if the data at that step is entirely in one class. On the contrary, if the data is still mixed together in a large proportion, the division is not really good.
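The "measurement" mentioned above is, in ID3, information gain: the reduction in entropy achieved by splitting on an attribute. The sketch below illustrates this on an invented toy version of the weather table (ID3 itself also recurses on each branch, which is omitted here):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (0 for a pure split)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on the attribute at attr_index."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: (outlook, humidity, wind) -> go play soccer or stay home
rows = [
    ("sunny", "normal", "light"), ("sunny", "high", "light"),
    ("rainy", "high", "strong"), ("rainy", "normal", "strong"),
    ("sunny", "normal", "strong"), ("rainy", "high", "light"),
]
labels = ["go", "go", "stay", "stay", "go", "stay"]

gains = [information_gain(rows, labels, i) for i in range(3)]
best = max(range(3), key=lambda i: gains[i])
print(gains, best)
```

In this toy table the outlook attribute alone separates the two classes, so its information gain equals the full entropy of the labels and ID3 would test it first.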
The forest generates decision trees randomly, depending on the judgments in the learning process. The final decision results are aggregated based on the judgments of all the decision trees present in the forest.
1.1.2 Support Vector Machine
Support Vector Machine (SVM) [48] is a linear classification model, used to divide data into distinct classes. Consider the binary classification problem with N data points:

{(x_1, y_1), ..., (x_N, y_N)}

where x_i is an input vector represented in the space X ⊂ R^d and y_i ∈ {1; −1} is the class label corresponding to the input vector, in which y_i = 1 means the data point belongs to the positive class and y_i = −1 means the data point belongs to the negative class.
The goal of SVM is to define a linear classifying function between the two classes:

f(x) = ⟨w, x⟩ + b    (Eq. 1.1)

where w is the weight vector of the attributes, and b is a real numeric value. Based on the function f(x), we determine the output label for a data point as follows:

y_i = 1 if ⟨w, x_i⟩ + b ≥ 0; y_i = −1 if ⟨w, x_i⟩ + b < 0    (Eq. 1.2)
Suppose (x⁺, 1) is a point in the positive class and (x⁻, −1) is a point in the negative class, closest to the separating hyperplane H₀. Let H₊, H₋ be two parallel hyperplanes, where H₊ passes through (x⁺, 1) and is parallel to H₀, and H₋ passes through (x⁻, −1) and is parallel to H₀. The margin is the distance between the two hyperplanes H₊ and H₋. In order to minimize the error in the classifying process, we need to choose the hyperplane with the largest margin; such a hyperplane is called the maximum margin hyperplane.
The distance from x⁺ to H₀ is:

|⟨w, x⁺⟩ + b| / ‖w‖ = 1 / ‖w‖

Therefore, the problem of determining the maximum margin between the two hyperplanes is reduced to determining w and b so that margin = 2 / ‖w‖ reaches its maximum value and satisfies the conditions:

⟨w, x_i⟩ + b ≥ 1 if y_i = 1
⟨w, x_i⟩ + b ≤ −1 if y_i = −1

(since x⁺, x⁻ are the closest points to the separating hyperplane and belong to H₊ and H₋). This is equivalent to the optimization problem:

(w, b) = argmin_{w,b} ‖w‖² / 2, subject to 1 − y_i(⟨w, x_i⟩ + b) ≤ 0    (Eq. 1.7)
The above algorithm is applied in the case of linear data. For non-linear data, SVM uses kernel functions to transform the data into a new space, in which the resulting data is linearly separable. Some common kernel functions are the linear, polynomial, Radial Basis Function (RBF), and sigmoid kernels [45].
With the multiclass classification problem using SVM, there are many ways to return to the binary classification problem. Among them, the most commonly used method is one-vs-rest (also known as one-vs-all or one-against-all) [45]. Specifically, if there are C classes, we will build C models, in which each model corresponds to a class. Each of these models helps to distinguish whether a data point belongs to that class or not, or calculates the probability that a point falls into that class. The final result can be the class with the highest probability.
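Eq. 1.2 and the one-vs-rest scheme can be sketched directly in code. The weight vectors and class names below are hypothetical illustrations; in practice, w and b come out of the margin-maximization problem above (typically solved by a library):

```python
def svm_predict(w, b, x):
    """Binary SVM decision rule (Eq. 1.2): sign of f(x) = <w, x> + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def one_vs_rest_predict(models, x):
    """One-vs-rest: score x under each class's binary model, keep the best."""
    scores = {cls: sum(wi * xi for wi, xi in zip(w, x)) + b
              for cls, (w, b) in models.items()}
    return max(scores, key=scores.get)

# Hypothetical per-class models (C = 2 classes -> 2 binary models).
models = {
    "greeting": ([1.0, -0.5], 0.1),
    "question": ([-0.3, 0.8], -0.2),
}

print(svm_predict([2.0, -1.0], 0.5, [1.0, 1.0]))  # 2 - 1 + 0.5 = 1.5 -> 1
print(one_vs_rest_predict(models, [0.2, 0.9]))    # prints "question"
```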
1.1.3 Neural network
Neural networks are made up of single neurons, called perceptrons, which are inspired by biological human neurons. Figure 3 depicts the structure of a biological neuron.
Figure 3 A biological neuron
Each biological neuron consists of three main components: (i) the cell body is the bulge of the neuron; it contains the cell nucleus, plays a role in providing nutrition to the neuron, can generate nerve impulses, and can receive nerve impulses transmitted to the neuron; (ii) dendrites are short branches that develop from the cell body, whose function is to receive nerve impulses from other neurons and transmit them to the cell body; and (iii) the axon is a long single nerve fiber, responsible for transmitting signals from the cell body to other neurons. Inspired by this, the artificial neuron is designed with the structure depicted in Figure 4.
Figure 4 An artificial neuron
Each neuron has inputs corresponding to the dendrites, a processor using activation functions corresponding to the cell body, and a neuron output corresponding to the axon. The activation function is usually a nonlinear function such as the sigmoid, tanh, ReLU, or sign function [49].
Many neurons combine together to become neural networks. Neural networks usually have three kinds of layers: the input layer receives input data from the dataset, the output layer shows the predicted value of the model for the input data, and the hidden layers are the layers between the input and output layers.
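The structure just described can be sketched as a forward pass: each neuron computes a weighted sum of its inputs (the dendrites), applies a nonlinear activation (the cell body), and emits one output (the axon). The weights below are arbitrary illustrative values, not trained ones:

```python
import math

def sigmoid(z):
    """A common nonlinear activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then an activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(x, hidden_layer, output_layer):
    """Minimal network: input layer -> one hidden layer -> output layer."""
    h = [neuron(x, w, b) for w, b in hidden_layer]
    return [neuron(h, w, b) for w, b in output_layer]

# Hypothetical weights for a 2-input, 2-hidden-neuron, 1-output network.
hidden = [([0.5, -0.4], 0.1), ([0.3, 0.8], -0.2)]
output = [([1.2, -0.7], 0.05)]
y = forward([1.0, 0.5], hidden, output)
print(y)  # a single prediction in (0, 1)
```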
1.1.4 Recurrent Neural Network
With conventional neural networks, all inputs are independent of each other, so there is no chain link between them. In word processing problems, the order of words in a document is very important. Based on this, the Recurrent Neural Network (RNN) determines the value of the next element based on previous calculations. Figure 5 depicts the structure of the RNN network.
Figure 5 The structure of the RNN network
The computation inside the network at each step is based on the input at that step and the hidden state at the previous step, through some function like tanh or ReLU. The output at each step usually uses the softmax function.
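That recurrence can be sketched in a few lines: the hidden state h_t is computed from the current input x_t and the previous state h_{t-1} as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The small weight matrices below are illustrative, not trained:

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One RNN step: the new hidden state mixes the current input with
    the previous hidden state, so earlier inputs influence later steps."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    pre = [p + q + r for p, q, r in
           zip(matvec(W_xh, x_t), matvec(W_hh, h_prev), b)]
    return [math.tanh(p) for p in pre]

# Illustrative weights for a 2-dim input and a 2-dim hidden state.
W_xh = [[0.5, 0.1], [-0.2, 0.4]]
W_hh = [[0.3, -0.1], [0.2, 0.5]]
b = [0.0, 0.1]

h = [0.0, 0.0]                          # initial hidden state
for x_t in [[1.0, 0.0], [0.0, 1.0]]:    # a two-step input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b)
print(h)  # final hidden state summarizing the whole sequence
```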
1.1.5 Long Short-Term Memory
The theory has shown that for distant steps, the RNN has a problem of long-range dependence, i.e., the network can only remember a small interval. This happens because of the vanishing gradient: the value of the gradient gets smaller as it propagates down to the lower layers, so the update performed by Gradient Descent does not change the weights of those layers much, making them unable to converge, and the RNN will not get good results. The Long Short-Term Memory network (LSTM) [49] was born to overcome this limitation. LSTM also has a sequence architecture similar to RNNs, but instead of having only one layer of neural networks, it has up to four layers that interact with each other in a very special way.
Figure 6 The architecture of LSTM
A special feature of LSTM is the cell state, the line that runs across the top of the diagram. This is considered the network's memory. At each cell, the LSTM can add or remove the necessary information through three gates: the forget gate, the input gate, and the output gate, respectively, as shown in the figure. The forget gate decides what information is unnecessary and should be discarded in this state. The input gate indicates what information should be added to the cell state. The output gate decides the output of this cell. With such a structure, the LSTM network has the ability to remember more distant states, thereby achieving better efficiency than the RNN network.
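The three gates can be sketched for a scalar input and a scalar state (real LSTMs are vector-valued, and the parameter values here are arbitrary illustrations, not learned ones):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM cell step. f, i, o are the forget, input, and output gates;
    c is the cell state (the 'memory line' running across the top)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])   # what to forget
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])   # what to add
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])   # what to expose
    c_cand = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])
    c = f * c_prev + i * c_cand          # keep part of the old memory, add new
    h = o * math.tanh(c)                 # output gate filters the new memory
    return h, c

# Arbitrary illustrative parameters (all set to 0.5 for brevity).
params = {k: 0.5 for k in
          ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wc", "uc", "bc")}

h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The key difference from the plain RNN step is the additive update of c: the forget gate can keep old information flowing unchanged, which is what lets gradients survive over long sequences.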
1.2 BERT language model
BERT [50] stands for Bidirectional Encoder Representations from Transformers. It is a pre-trained model that learns bidirectional contextual representation vectors of words, which are then transferred to other problems in the field of natural language processing. Compared with Word Embedding models, BERT is a breakthrough in representing a word as a numeric vector based on the word's context.
The architecture of the BERT model is a multilayer architecture consisting of several bidirectional Transformer encoder layers. The BERT Transformer uses a two-way attention mechanism, while the GPT Transformer uses one-way attention (unnatural, inconsistent with the way language appears), where all words pay attention only to the left context. A two-direction Transformer is often referred to as a Transformer encoder, while versions of the Transformer using only the left-hand context are often referred to as a Transformer decoder, because they can be used to generate text. The comparison between BERT, OpenAI GPT, and ELMo is shown visually in Figure 7.
Figure 7 BERT, OpenAI GPT and ELMo
There are two tasks used to train the BERT model: Masked LM and Next Sentence Prediction [27].
Masked LM
To train a representation model based on bidirectional context, a simple approach is to mask some random input tokens and then predict only the masked tokens; this task is called "masked LM" (MLM). In this case, the hidden vectors in the last layer corresponding to the masked tokens are fed into a softmax layer over the entire vocabulary for prediction. Google researchers masked 15% of all tokens (drawn from the WordPiece dictionary) in each sentence at random and predicted only the masked words. Figure 8 shows the BERT training scheme under the masked LM task.
Although this allows us to obtain a bidirectional training model, two disadvantages exist. The first is that we are creating a mismatch between pre-training and fine-tuning, because the [MASK] tokens are never seen during model refinement. To mitigate this, the selected words are not always replaced with the [MASK] token. Instead, the training data generator chooses 15% of tokens at random and performs the following steps. For example, with the sentence "con_chó của tôi đẹp quá" (my dog is so beautiful) and "đẹp" (beautiful) as the word chosen to mask: in 80% of cases the selected word is replaced with the [MASK] token, giving "con_chó của tôi [MASK] quá" (my dog is so [MASK]); in 10% of cases the selected word is replaced by a random word, for example "con_chó của tôi máy_tính quá" (my dog is so computer); in the remaining 10% of cases the sentence is kept unchanged as "con_chó của tôi đẹp quá" (my dog is so beautiful).
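The 80/10/10 replacement scheme above can be sketched in a few lines of Python. This is a simplified illustration, not the original BERT implementation; the helper name and toy vocabulary are made up.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=42):
    """Pick ~15% of tokens; of those, 80% -> [MASK], 10% -> a random
    word from the vocabulary, 10% -> kept unchanged. Returns the
    corrupted tokens and the indices the model must predict."""
    rng = random.Random(seed)
    out = list(tokens)
    targets = []
    for idx in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(idx)
            r = rng.random()
            if r < 0.8:
                out[idx] = "[MASK]"          # 80%: mask token
            elif r < 0.9:
                out[idx] = rng.choice(vocab)  # 10%: random replacement
            # else: 10%: keep the original token unchanged
    return out, targets

sentence = ["con_chó", "của", "tôi", "đẹp", "quá"]
vocab = ["máy_tính", "quạt", "nhà", "chạy"]
corrupted, targets = mask_tokens(sentence, vocab)
```

Note that the loss is computed only at the `targets` positions, which is why MLM predicts only 15% of tokens per batch.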
The Transformer encoder does not know which word it will be asked to predict or which word has been replaced by a random word, so it is forced to keep a contextual representation of every input token. Also, replacing 1.5% of all tokens with a random word does not seem to affect the model's ability to understand the language. The second disadvantage of using MLM is that only 15% of tokens are predicted in each batch, which suggests that more pre-training steps may be needed for the model to converge.
Next Sentence Prediction
Many important tasks in natural language processing, such as Question Answering, require understanding the relationship between two text sentences, which is not directly captured by language modeling. To train the model to understand the relationship between sentences, we build a model that predicts the next sentence based on the current sentence; the training data can be any corpus. Specifically, when choosing sentence A and sentence B for each training sample, there is a 50% chance that sentence B is the actual sentence following sentence A, and a 50% chance that it is a random sentence from the corpus.
In order for the model to distinguish between the two sentences, we mark the beginning of the first sentence with the token [CLS] and the end of each sentence with [SEP]. Figure 9 shows an example of the input representation.
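The NSP sampling and the [CLS]/[SEP] input format described above can be sketched as follows. This is a simplified illustration with a toy corpus; real BERT inputs also carry segment and position embeddings.

```python
import random

def make_nsp_example(doc, rng):
    """With probability 0.5, sentence B really follows sentence A
    (label IsNext); otherwise B is a random sentence from the corpus
    (label NotNext). The pair is wrapped with [CLS] and [SEP]."""
    i = rng.randrange(len(doc) - 1)
    a = doc[i]
    if rng.random() < 0.5:
        b, label = doc[i + 1], "IsNext"
    else:
        b, label = rng.choice(doc), "NotNext"
    tokens = ["[CLS]"] + a.split() + ["[SEP]"] + b.split() + ["[SEP]"]
    return tokens, label

corpus = ["chuyển đổi số là gì", "học máy là gì", "tôi muốn bật quạt"]
tokens, label = make_nsp_example(corpus, random.Random(0))
```

The hidden vector at the [CLS] position is what the NSP classifier reads, which is also why [CLS] is commonly reused as a whole-sentence representation downstream.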
1.3 Feature extraction
Feature extraction is the selection of text attributes and their vectorization into a vector space that can be easily processed by computers. In the following, we present some popular feature extraction methods.
1.3.1 Term Frequency — Inverse Document Frequency (TF-IDF)
The TF-IDF value [51] of a word represents the importance of that word in a document. TF (Term Frequency) is the frequency of occurrence of a word in a document, calculated according to Equation 1.8:

TF(t, d) = f(t, d) / Σ_{t' ∈ d} f(t', d)    (Eq 1.8)

where f(t, d) is the number of occurrences of the word t in the document d; the denominator is the total number of words in the document.

IDF (Inverse Document Frequency) is the inverse frequency of a word in a corpus. The purpose of IDF is to reduce the weight of words that appear often in the text but do not carry much meaning. The formula for calculating IDF is as follows:

IDF(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )    (Eq 1.9)

where |D| is the total number of documents in the set D, and the denominator is the number of documents in D that contain the word t. The TF-IDF value is then calculated as follows:

TF-IDF(t, d, D) = TF(t, d) · IDF(t, D)    (Eq 1.10)
Words with high TF-IDF values are those that appear frequently in one document and rarely in others. This value helps us filter out common words and retain high-value words (the keywords of the document). TF-IDF is a simple way to vectorize textual data, but the dimension of the vector equals the vocabulary size, which increases the computational load. Furthermore, word representation using TF-IDF cannot represent words outside the dictionary and cannot capture relationships between words.
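The TF, IDF and TF-IDF formulas above can be implemented directly. The following sketch uses plain Python with toy tokenized documents, for illustration only.

```python
import math
from collections import Counter

def tf(t, d):
    """Term frequency of word t in document d (a list of tokens):
    occurrences of t divided by the total number of words in d."""
    return Counter(d)[t] / len(d)

def idf(t, docs):
    """Inverse document frequency: log of the corpus size over the
    number of documents that contain t (t must occur somewhere)."""
    n_containing = sum(1 for d in docs if t in d)
    return math.log(len(docs) / n_containing)

def tf_idf(t, d, docs):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return tf(t, d) * idf(t, docs)

docs = [["chuyển", "đổi", "số", "là", "gì"],
        ["học", "máy", "là", "gì"],
        ["chuyển", "đổi", "số", "ở", "Việt", "Nam"]]
# "là" appears in 2 of 3 documents, so its IDF is low;
# "học" appears in only 1, so its TF-IDF in that document is higher.
```

In practice one would use a library implementation (which typically adds smoothing and normalization), but the core computation is exactly this product.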
1.3.2 Word2Vec

The CBOW model uses the context around each word as input and tries to predict the word that corresponds to that context. The architecture of the model is shown in Figure 10.
Figure 10. One-word CBOW model structure
The input, i.e. the context word, is a vector x = [x1 x2 ... xV] encoded as one-hot with size V, the size of the dictionary, where xi = 1 if i is the position of the word in the dictionary and xi = 0 otherwise. The hidden layer consists of N neurons, and the output layer is also a one-hot vector y = [y1 y2 ... yV] of size V. W_{VxN} is the weight matrix between the input and the hidden layer; W'_{NxV} is the weight matrix between the hidden layer and the output layer. The neurons of the hidden layer simply copy the weighted sum of the inputs to the next layer; there are no activations such as sigmoid, tanh or ReLU. The only nonlinearity is the softmax calculation in the output layer. By predicting the target word, we learn a vector representation of that word. SkipGram works similarly to the CBOW model but in the opposite direction: the SkipGram model takes a word as input and tries to predict the context around it.
Compared with TF-IDF, Word2Vec models can capture semantic relationships between words: the distance between pairs of words with similar meanings is approximately the same, for example "king" - "queen" and "man" - "woman". However, Word2Vec still cannot represent words that are not in the dictionary.
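The one-word CBOW forward pass described above (linear hidden layer, softmax output) can be sketched in NumPy with toy sizes. This is illustrative only, not a trained model.

```python
import numpy as np

def cbow_forward(x_onehot, W_in, W_out):
    """One-word CBOW forward pass: a one-hot input of size V, a linear
    hidden layer of size N (no activation), and a softmax output over
    the vocabulary. W_in is V x N, W_out is N x V."""
    h = W_in.T @ x_onehot      # hidden layer: selects the input word's row of W_in
    u = W_out.T @ h            # scores over the vocabulary
    e = np.exp(u - u.max())    # softmax (the only nonlinearity)
    return e / e.sum()

V, N = 6, 3                    # toy vocabulary and hidden sizes
rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, N)) * 0.1
W_out = rng.normal(size=(N, V)) * 0.1
x = np.zeros(V); x[2] = 1.0    # one-hot context word
y = cbow_forward(x, W_in, W_out)
```

After training, the rows of W_in serve as the learned word vectors, which is where the "king" - "queen" style regularities live.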
Thus, CHAPTER 1 has presented the basic theoretical background for the following chapters. Next, in CHAPTER 2, the author presents the process of building data for the QA problem.
CHAPTER 2 BUILDING A VIETNAMESE QA DATASET
In this chapter, the thesis presents how to build data for the Vietnamese Question Answering problem. The construction process uses two data collection systems: the Written Collection System and the Speech Collection System.
2.1 Overview
With the proliferation of documents, the need to find information increases day by day. A simple way to meet this need is to develop frequently asked questions (FAQs) related to the organization, field, or issue that the organization wants to convey. For example, the Microsoft Download Center provides FAQs so that users can look up problems related to Microsoft products and services [53]. The Ministry of Information and Communications also released the "Cẩm nang chuyển đổi số" (Digital Transformation Handbook) [54], which includes questions and answers in the field of Digital Transformation, based on the speeches of the Minister of Information and Communications, Nguyen Manh Hung.
However, as the number of questions grows, finding questions similar to the problem one is interested in takes a lot of time and effort. Search engines only return results for the words contained in the original question. But in daily communication, humans have many different ways to express a given request. For example, in the field of smart homes, to ask how to turn on the fan, we can ask directly "Làm sao để bật quạt" (How do I turn on the fan) or say "Tôi muốn cho quạt chạy thì làm thế nào" (I want the fan to run, how do I do it). In the latter example, the verb "bật" (turn on) has been replaced with "chạy" (run). If only word matching is used, the system may confuse the question with a different one because of the missing information. Although expressed in two different ways, both questions can have the same answer with instructions on how to use a fan in a smart home; the two questions in this case can be considered "similar" to each other. Therefore, approaching the QA system by developing similar questions for FAQs is a softer approach, helping the system understand input questions more flexibly.
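A tiny sketch makes the limitation of plain word matching concrete: simple word-overlap (Jaccard) similarity between the two fan questions above is low even though they ask the same thing. The Jaccard metric here is an illustrative stand-in, not the thesis's method.

```python
def jaccard(q1, q2):
    """Word-overlap similarity between two questions: the size of the
    shared word set divided by the size of the combined word set."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b)

q1 = "làm sao để bật quạt"
q2 = "tôi muốn cho quạt chạy thì làm thế nào"
sim = jaccard(q1, q2)   # only "làm" and "quạt" are shared
```

The two questions share only 2 of 12 distinct words, so any purely lexical system scores them as dissimilar; this is exactly the gap a similar-question dataset is meant to close.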
For the QA system to work effectively in this approach, building a high-quality similarity dataset is essential. However, organizations do not always have data large enough for training the QA model. Several methods have been proposed to enhance the data collection capabilities of a QA system. One method is to use machine translation to automatically generate questions similar to the questions in the system's database. However, this method still has many limitations, because the translation results of current machine translation methods have not yet reached high accuracy, so the generated data does not really meet the requirements of a quality QA system. Another method is to collect data from various sources, such as discussion sites, online teaching sites, question-and-answer forums, and online conversations between humans. However, collecting and processing data from these sources requires special skills and tools, and the reliability of the collected data must also be ensured. Therefore, to build QA systems most effectively, organizations need a clear and reasonable data collection process, combining different methods to collect quality and reliable data for training the QA model.
The data collected through the organization's process should be divided into two categories: training data and testing data. Training data needs to cover enough cases and the basic questions that the organization receives. At the same time, testing data also needs to be created to evaluate the capabilities of the QA model and ensure that it meets the requirements and can work correctly in different scenarios. In particular, with user input via speech, the QA model needs to be evaluated under the influence of the ASR module. This is especially important for live mobile QA systems, where the use of speech is common. However, the ASR process can face many problems, such as unclear utterances, sound disturbance, noise, and different pronunciations, all of which affect the ability of the QA model to give the correct answer. Therefore, evaluating the QA model under the influence of the ASR module is an important factor in evaluating the operability of the QA model with speech input.
When building data for an AI model, the data is usually split in a certain ratio, for example 80:20 or 70:30, to produce training data and test data. The training data is used to train the QA model to understand the user's input question and give the corresponding answer. The test data is used to evaluate the quality of the QA model. If the input of the model is speech and data is built from speech using the method above, the cost of data construction becomes extremely expensive, because speech data itself takes a lot of work and time to collect and evaluate, and it is not easy to refine into a high-quality dataset.
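The 80:20 split mentioned above can be sketched as follows. This is an illustrative helper, not part of the proposed process.

```python
import random

def split_dataset(pairs, train_ratio=0.8, seed=42):
    """Shuffle the question-answer pairs and cut at the given ratio,
    returning (training data, test data)."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

pairs = [(f"question {i}", f"answer {i}") for i in range(10)]
train, test = split_dataset(pairs)
```

Fixing the seed makes the split reproducible, which matters when several models must be compared on the same held-out questions.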
On the other hand, current QA systems mostly follow the direction of modularization, in which the system uses an ASR module to convert the speech received from the user into text and then analyzes this text to understand the user's question. Therefore, we can build only text data to train the model and then evaluate the trained model on speech data to assess the true quality of the model under the influence of the ASR module. This approach minimizes the cost of data building, because text can be constructed faster and at a lower cost than speech. Based on this, the thesis proposes a data-building process consisting of two main steps: step 1 builds training data, focusing on text data, and step 2 builds test data, with the data received as speech.
2.2 The process of building a Vietnamese QA dataset
2.2.1 Initial dataset
The initial dataset, provided by the Ministry of Information and Communications, consists of question and answer pairs in the field of Digital Transformation. Each question pair is assigned to a group for management purposes. The dataset includes 194 question-answer pairs, divided into 9 groups. These questions were built on the basis of the speeches of the Minister of Information and Communications, Nguyen Manh Hung, and relate to basic concepts, planning mechanisms, policies, and strategies, including issues of digital transformation in Vietnam and around the world. Questions in different groups are clearly differentiated; however, some questions within the same group can have a high degree of ambiguity. Some examples of question-answer pairs are given in Table 1.
Table 1. Some examples of questions in the initial dataset

No. 1
Group: Câu hỏi chung về chuyển đổi số (General questions about digital transformation)
Question: Chuyển đổi số là gì? (What is digital transformation?)
Answer: Chuyển đổi số là bước phát triển tiếp theo của tin học hóa, có được nhờ sự tiến bộ vượt bậc của những công nghệ mới mang tính đột phá, gọi chung là công nghệ số. (Digital transformation is the next development step of computerization, made possible by the remarkable progress of breakthrough new technologies, collectively known as digital technology.)

No. 2
Group: Câu hỏi chung về chuyển đổi số (General questions about digital transformation)
Question: Có thể nói rõ hơn về chuyển đổi số được không? (Can you be more specific about digital transformation?)
Answer: Có thể tham khảo "Cẩm nang chuyển đổi số" do Bộ Thông tin và Truyền thông xuất bản, hoặc truy cập website dx.mic.gov.vn để tìm hiểu thêm thông tin chi tiết. (You can refer to "Cẩm nang chuyển đổi số" (the Digital Transformation Handbook) published by the Ministry of Information and Communications, or visit the website dx.mic.gov.vn for more detailed information.)

No. 3
Group: Làm rõ một số khái niệm có liên quan đến chuyển đổi số (Clarification of some concepts related to digital transformation)
Question: Học máy là gì? (What is machine learning?)
Answer: Học máy là một nhánh nghiên cứu của trí tuệ nhân tạo và khoa học máy tính, tập trung vào việc sử dụng dữ liệu và thuật toán để bắt chước cách con người học. (Machine learning is a branch of artificial intelligence and computer science that focuses on using data and algorithms to mimic how humans learn.)
In Table 1, the first question "Chuyển đổi số là gì" (What is digital transformation?) and the second "Có thể nói rõ hơn về chuyển đổi số được không" (Can you be more specific about digital transformation) are both in the group "Câu hỏi chung về chuyển đổi số" (General questions about digital transformation). When asking such questions, questioners are all looking to learn about the concept of digital transformation. However, the first question can be answered succinctly, while the second needs a more detailed explanation, or a pointer to a useful source of information that the questioner can consult to better understand digital transformation. Questions number 1 and 3 belong to two different groups, "Câu hỏi chung về chuyển đổi số" and "Làm rõ một số khái niệm có liên quan đến chuyển đổi số" (Clarification of some concepts related to digital transformation). These questions are clearly separated in terms of semantics, as the first asks about the concept of digital transformation while the third deals with the concept of machine learning.
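The grouped question-answer structure of the dataset described above can be represented, for illustration, by a simple record type. The field names here are hypothetical, not taken from the thesis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QAPair:
    """One entry of the initial dataset: a question-answer pair
    assigned to a management group."""
    group: str
    question: str
    answer: str

dataset = [
    QAPair("Câu hỏi chung về chuyển đổi số", "Chuyển đổi số là gì?", "..."),
    QAPair("Làm rõ một số khái niệm có liên quan", "Học máy là gì?", "..."),
]
groups = {p.group for p in dataset}   # group labels used for management
```

With this shape, grouping questions for ambiguity analysis is a simple key lookup over `group`.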
The questions in the initial dataset also vary in length. Information about the length of the questions in this dataset is given in Table 2.

Table 2. Information about question length