
Question Answering in Vietnamese




DOCUMENT INFORMATION

Basic information

Title: Question Answering in Vietnamese
Author: Nguyễn Thị Mừng
Supervisor: PhD. Nguyễn Thị Thu Trang
University: Hanoi University of Science and Technology
Major: Data Science
Document type: Master's thesis
Year of publication: 2023
City: Hà Nội
Format
Number of pages: 64
File size: 2.04 MB


Structure

  • 1. Problem Formulation
  • 2. Goal and scope
  • 3. Solution orientation
  • 4. Outline
  • CHAPTER 1. THEORETICAL BACKGROUND
    • 1.1 Text classification algorithms
    • 1.2 BERT language model
    • 1.3 Feature extraction
  • CHAPTER 2. BUILDING A VIETNAMESE QA DATASET
    • 2.1 Overview
    • 2.2 The process of building a Vietnamese QA dataset
    • 2.3 Data analysis
    • 2.4 Data Disclosure
  • CHAPTER 3. VIETNAMESE QUESTION ANSWERING MODEL
    • 3.1 Vietnamese Question Answering problem
    • 3.2 Experiment setup
    • 3.3 Results and evaluations
    • 1. Conclusion
    • 2. Thesis’s contributions
    • 3. Future works

Content

Problem Formulation

As the volume of information on the Internet continues to grow, finding relevant data has become increasingly difficult. Traditional search engines often provide only brief paragraphs or related links, posing challenges for users, particularly those with limited search skills. To address this issue, it is crucial to develop a question-answering (QA) system that can deliver accurate answers swiftly. QA is a significant area within natural language processing (NLP) that processes questions posed in natural language, whether in text or audio format, and provides the appropriate responses.

Classification of the QA system

QA systems can be classified based on data sources into three main categories: structured data, semi-structured data, and unstructured data. Structured data is represented by knowledge graphs, while semi-structured data typically appears as lists or tables. Unstructured data, on the other hand, is often found in natural language formats such as sentences and documents. Additionally, QA systems are categorized by domain into open-domain and closed-domain systems. Open-domain QA systems aim to answer questions across various fields by mining data from extensive sources like Wikipedia and web searches, whereas closed-domain systems focus on specific domains, addressing a limited number of questions with resources curated by experts in that area.

Approaches for solving QA problems

Previous research has addressed the question-answering (QA) problem through various methods. Based on our analysis, these approaches can be categorized into four primary groups, as illustrated in Figure 1 [3].

Figure 1 Approaches to the QA problem

Previous studies have addressed the QA problem through various approaches, including the traditional method, Information Retrieval (IR) combined with Machine Reading Comprehension (MRC), knowledge-based (KB) techniques, and strategies utilizing Question Entailment (QE) based on similar questions.

The traditional approach addresses the QA problem through a pipeline with three key components: Question Processing, Document Retrieval, and Answer Extraction. Initially, the Question Processing component analyzes the user's question to generate a query for the next stage, while also extracting useful information such as question type and entities to enhance answer accuracy. This component employs an 8-step pipeline, which includes entity labeling, POS tagging, linguistic trimming heuristics, dependency parsing, sentiment analysis, and query pattern generation, as outlined by Usbeck, Ngomo, Bühmann, and Unger. Following the analysis, the Document Retrieval component utilizes the processed information to search for relevant documents, typically texts or paragraphs, through an Information Retrieval module or Web Search.

The Answer Processing component retrieves the final answer by searching through the relevant documents. This process typically involves extracting factual information found within those documents.
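To make this pipeline concrete, the following is a minimal sketch in Python. It is not taken from the thesis; all function names, the toy scoring, and the corpus are hypothetical placeholders for the three components described above.

```python
# Minimal sketch of a traditional QA pipeline (all components are hypothetical).
from dataclasses import dataclass

@dataclass
class ProcessedQuestion:
    query: str          # query string handed to document retrieval
    question_type: str  # e.g. "definition", "person", "location"
    entities: list      # named entities found in the question

def process_question(question: str) -> ProcessedQuestion:
    """Question Processing: analyze the question and build a retrieval query."""
    # A real system would run NER, POS tagging, dependency parsing, etc.
    return ProcessedQuestion(query=question.lower(), question_type="definition", entities=[])

def retrieve_documents(pq: ProcessedQuestion, corpus: list, k: int = 3) -> list:
    """Document Retrieval: rank corpus passages by naive word overlap with the query."""
    scored = [(sum(w in doc.lower() for w in pq.query.split()), doc) for doc in corpus]
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]

def extract_answer(pq: ProcessedQuestion, passages: list) -> str:
    """Answer Extraction: pick the best passage (a real system extracts a span from it)."""
    return passages[0] if passages else ""

corpus = ["Chuyển đổi số là giai đoạn phát triển tiếp theo của tin học hóa.",
          "Học máy là một nhánh của trí tuệ nhân tạo."]
pq = process_question("Chuyển đổi số là gì?")
print(extract_answer(pq, retrieve_documents(pq, corpus)))
```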

Recent studies indicate that certain research generates latent information from questions and answers, utilizing matching technologies like surface text pattern matching, word or phrase matching, and syntactic structure matching. Implementing a question-answering (QA) system based on this approach enhances system control; however, it presents challenges due to the need to integrate various natural language processing and information retrieval technologies.

The advancement of deep learning technologies has significantly enhanced data processing capabilities, leading to a broader exploration of question answering (QA) problems rooted in reading comprehension. Machine Reading Comprehension (MRC) involves identifying answers to questions posed in natural language, utilizing a selected passage from a vast database. This passage is chosen by the Information Retrieval (IR) component, employing document querying techniques.

To address the MRC problem, researchers focus on two primary approaches: Generative MRC, which involves automatically generating answers based on the input information, and Extractive MRC, which entails extracting answers directly from the text. The generative approach mimics human reading and comprehension, resulting in more natural and contextually relevant answers. However, it poses challenges in training data construction and model quality assessment. Notable datasets for this research include NarrativeQA, the Natural Questions dataset, and the Chinese DuReader.

Researchers have applied popular models such as LSTM, ELMo, and GPT-2 to these generative MRC challenges. In the second direction, extractive MRC, the answer is a span embedded within the input passage, which simplifies evaluation and reduces data construction costs. The emergence of extensive datasets, including CNN/Daily Mail, MS MARCO, RACE, and SQuAD 2.0, has further facilitated advancements in this area.

Recent studies have demonstrated significant success using various models for text representation and question answering. The BiDAF model addresses this challenge by utilizing multiple levels of text representation, while QANet combines convolutional neural networks (CNNs) with self-attention mechanisms. Additionally, language models with encoder-decoder architectures, such as BERT, XLM-R, and T5, excel at encoding questions together with their corresponding passages and accurately identifying the start and end positions of answers. For Vietnamese, two open datasets, UIT-ViQuAD [30] and UIT-ViNewsQA [31], are available for further research and development.

These results highlight the potential of computers to extract information from documents in order to provide accurate answers. However, developing datasets for this research area demands significant effort and resources, making document storage and management costly.
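As an illustration of the extractive MRC setting (this is background, not the approach ultimately adopted in this thesis), a span-extraction model can be queried through the Hugging Face transformers pipeline API; the checkpoint name below is a placeholder and would need to be replaced with a real Vietnamese or multilingual reading-comprehension model.

```python
# Sketch: extractive MRC with a span-prediction model (model name is a placeholder).
from transformers import pipeline

qa = pipeline("question-answering", model="some-multilingual-qa-checkpoint")  # hypothetical name

context = ("Chuyển đổi số là giai đoạn phát triển tiếp theo của tin học hóa, "
           "nhờ vào sự tiến bộ của các công nghệ số.")
result = qa(question="Chuyển đổi số là gì?", context=context)

# The pipeline returns the extracted answer span, its character offsets, and a confidence score.
print(result["answer"], result["start"], result["end"], result["score"])
```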

Research has also adopted a knowledge-based (KB) approach utilizing structured data, such as knowledge graphs or SQL databases, to represent facts and relationships between entities. Berant et al. developed the WEBQUESTIONS dataset using the Google Suggest API to generate simple "wh" questions with unique entities. To extend this, Berant and Talmor introduced the COMPLEXWEBQUESTIONS dataset, which features more complex questions involving multiple entities and various types, including compound, association, and comparison questions. In the context of application query systems for product-related inquiries, Frank et al. and Li et al. utilized product attributes to create knowledge graphs. For the Vietnamese language, Phan and Nguyen constructed a knowledge graph based on triads (subject, verb, object) used to transform input questions accordingly. Dat et al. incorporated intermediate representative elements, such as question structure, type, keywords, and semantic relationships, to build system knowledge. Research in this area primarily follows two methods: information retrieval (IR) for ranking potential answers, and semantic analysis to convert natural language queries into search-engine-compatible queries [40]. However, this approach requires building a rather complex knowledge system and considerable effort to maintain and expand it.

Other studies use the Question Entailment (QE) approach, leveraging question-answer pairs as the knowledge source of the system. This method focuses on identifying similar questions that can be answered in the same way: when a user inputs a question, the QA system's task is to find a corresponding question and return the relevant answer. For instance, research by [41] generates question-answer pairs from the Frequently Asked Questions (FAQs) of the Microsoft Download Center, employing AND/OR and combinatorial search techniques to compile comprehensive search results. The CMU OAQA system [42] employs a Bidirectional Recurrent Neural Network (BiRNN) with an attention mechanism to assess question similarity. Additionally, T. Minh Triet and colleagues introduced the UIT-ViCoV19QA dataset [43], which contains 4,500 question-answer pairs related to the Covid-19 epidemic, sourced from the FAQs of healthcare organizations in Vietnam and worldwide. This approach facilitates the efficient construction and storage of data in QA systems, particularly within closed systems tailored to specific organizational needs, although it often necessitates expert involvement in the respective field.

A QA system can process input in the form of text or speech. For speech input, there are two primary methods for understanding user questions: (i) utilizing an end-to-end model, and (ii) modularizing each component independently. The end-to-end model functions as a single unit that directly accepts audio input.

Goal and scope

The thesis aims to achieve two primary objectives: first, to develop and enhance datasets related to the Digital Transformation domain, and second, to assess QA models utilizing the constructed data.

The thesis aims to establish a data collection process for the QA problem, focusing on question-answer pairs in the Digital Transformation domain. The processed data will serve as training and testing data for a QA model that accepts user input via speech. While the process is tested within the Digital Transformation domain, it is designed to be general, extensible, and applicable to various other data domains. This framework will provide a valuable reference for individuals, organizations, and future researchers in designing and developing their own QA systems.

With the data that has been built, the second goal of the thesis is to evaluate QA models on this dataset. Testing QA models on the constructed data will validate its quality and provide a foundation for assessing the practical deployment of QA models.

Solution orientation

Our proposed solutions for the thesis, based on the objectives outlined in section 2, include: (i) developing a systematic process for collecting training and testing data for the QA model, and (ii) assessing QA models using the data we have compiled.

Data will be gathered in both written and spoken formats, with the assistance of collaborators under the guidance of Digital Transformation experts. Specifically, the Written Collection System will facilitate the gathering of training data, while the Speech Collection System will aid in generating test data for the model.

Besides building data, the thesis will also test different models for solving the QA problem. The problem can be addressed through two primary approaches: text classification and similarity comparison, both of which are well suited to the constructed data. The thesis will assess the practicality of these approaches based on experimental results when implemented in real-world scenarios.
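As a rough sketch of the similarity-comparison idea, the snippet below matches a user question against stored FAQ questions using a TF-IDF representation (introduced in Chapter 1) and cosine similarity; the FAQ entries and the 0.3 threshold are invented for illustration and are not the thesis's implementation.

```python
# Sketch: matching a user question against FAQ questions by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq_questions = [
    "Chuyển đổi số là gì?",
    "Học máy là gì?",
    "Truy cập dữ liệu mở quốc gia ở đâu?",
]
faq_answers = ["answer 1", "answer 2", "answer 3"]  # aligned with faq_questions

vectorizer = TfidfVectorizer()
faq_matrix = vectorizer.fit_transform(faq_questions)

def answer(user_question: str) -> str:
    vec = vectorizer.transform([user_question])
    scores = cosine_similarity(vec, faq_matrix)[0]
    best = scores.argmax()
    # Fall back when no stored question is similar enough (threshold is arbitrary here).
    return faq_answers[best] if scores[best] > 0.3 else "No similar question found."

print(answer("Có thể nói rõ hơn về chuyển đổi số được không?"))
```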

Outline

The rest of the thesis is presented as follows.

Chapter 1 presents an overview of the theoretical background related to the QA problem, focusing on models for text classification and for assessing question similarity. This is the basis for evaluating the solutions in the following chapters.

Chapter 2 explores research on constructing data from existing question-answer pairs. Building on these studies, the thesis proposes a data-building process aimed at addressing the QA problem, with a specific focus on its application in the Digital Transformation domain.

Chapter 3 discusses the evaluation results derived from the constructed dataset, which will serve as a foundation for assessing the practical feasibility of implementing the QA model.

Finally, the last chapter presents the conclusions reached by the thesis and discusses future development directions.

The following are the details of each part of the thesis.

THEORETICAL BACKGROUND

Text classification algorithms

Random Forest is an ensemble method that consists of numerous Decision Trees, potentially numbering in the hundreds or thousands. Each Decision Tree is a hierarchical structure that follows a sequence of rules to make decisions. For instance, a Decision Tree can determine whether a boy should play soccer or stay at home by considering factors such as weather, humidity, and wind conditions.

Figure 2 An example of a decision tree

When the weather is sunny with normal humidity, the boy is inclined to play soccer, whereas rainy conditions with strong winds lead him to prefer staying indoors.

The key point in building a Decision Tree lies in the Iterative Dichotomiser 3 (ID3) algorithm. ID3 handles datasets with multiple attributes, such as weather conditions (sunny or rainy), humidity levels (high or normal), and wind strength (strong or light). It determines the order in which attributes are considered at each decision step by selecting the best attribute according to a specific measurement. A split is deemed effective if the resulting data is entirely classified into one category; conversely, if the data remains largely mixed, the split is considered suboptimal.

The forest randomly generates decision trees from random subsets of the training data during the learning process, and the final decision is derived by aggregating the judgments of all decision trees within the forest.
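A brief sketch of this idea with scikit-learn, using toy data that mirrors the soccer example above; the data encoding and hyperparameters are invented for illustration.

```python
# Sketch: a Random Forest on the toy "play soccer" example (data invented).
from sklearn.ensemble import RandomForestClassifier

# Features: [weather (0=sunny, 1=rainy), humidity (0=normal, 1=high), wind (0=light, 1=strong)]
X = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0]]
y = [1, 1, 0, 0, 1, 0]  # 1 = play soccer, 0 = stay at home

# Each of the 100 trees is trained on a random bootstrap sample with random feature subsets;
# the forest prediction aggregates (majority-votes) the individual tree predictions.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[0, 0, 0]]))  # sunny, normal humidity, light wind -> likely "play"
```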

Support Vector Machine (SVM) [48] is a linear classification model used to divide data into distinct classes. Consider the binary classification problem with training pairs \( (x_i, y_i) \), where \( x_i \) denotes an input vector within the space \( X \subseteq \mathbb{R}^d \), and \( y_i \in \{1, -1\} \) represents the corresponding class label. Here, \( y_i = 1 \) indicates that the data point belongs to the positive class, while \( y_i = -1 \) signifies that it belongs to the negative class.

The goal of SVM is to define a linear classifying function that separates the two classes:

\[ f(x) = w^{T}x + b \]

where \( w \) is the weight vector over the attributes and \( b \) is a real-valued bias. Based on \( f(x) \), the output of the model is determined as follows:

\[ y = \operatorname{sign}(f(x)) = \begin{cases} 1 & \text{if } w^{T}x + b \geq 0 \\ -1 & \text{otherwise} \end{cases} \]

Let \( (x^{+}, 1) \) be the point in the positive class and \( (x^{-}, -1) \) the point in the negative class that lie closest to the separating hyperplane \( H_{0}: w^{T}x + b = 0 \). Two hyperplanes parallel to \( H_{0} \) are defined: \( H_{+}: w^{T}x + b = 1 \), passing through \( x^{+} \), and \( H_{-}: w^{T}x + b = -1 \), passing through \( x^{-} \). The margin, i.e. the distance between these two hyperplanes, plays a crucial role in minimizing classification errors: the optimal classifier is the hyperplane that maximizes this margin, known as the maximum-margin hyperplane.

The distance from \( x^{+} \) to \( H_{0} \) is \( \frac{|w^{T}x^{+} + b|}{\lVert w \rVert} = \frac{1}{\lVert w \rVert} \), and likewise the distance from \( x^{-} \) to \( H_{0} \) is \( \frac{1}{\lVert w \rVert} \). The margin between the two hyperplanes \( H_{+} \) and \( H_{-} \) is therefore

\[ \mathit{margin} = \frac{2}{\lVert w \rVert} \]

Determining the maximum margin thus reduces to finding \( w \) and \( b \) such that \( \frac{2}{\lVert w \rVert} \) reaches its maximum value while satisfying the condition

\[ y_{i}(w^{T}x_{i} + b) \geq 1 \quad \text{for all } i \]

(since \( x^{+} \) and \( x^{-} \) are the closest points to the separating hyperplane, they lie exactly on \( H_{+} \) and \( H_{-} \)).

The algorithm is effective for linearly separable data; for non-linear data, SVM uses kernel functions to map the data into a new space where it becomes linearly separable. Common kernel functions include the linear, polynomial, Radial Basis Function (RBF), and sigmoid kernels.

In multiclass classification using Support Vector Machines (SVM), a prevalent approach to simplify the problem is the one-vs-rest method, also referred to as one-vs-all or one-against-all.

In a multi-class classification scenario with \( C \) classes, we develop \( C \) distinct models, each dedicated to a specific class. Each model determines whether a given data point belongs to its class, or computes the probability of the data point belonging to that class. Ultimately, the class with the highest probability is selected as the final outcome.
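A short sketch of the one-vs-rest strategy with scikit-learn; the three-class toy data is invented for illustration.

```python
# Sketch: one-vs-rest multiclass classification with linear SVMs (toy data).
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.2], [0.5, 0.5], [0.6, 0.4]]
y = [0, 0, 1, 1, 2, 2]  # three classes -> three binary "class vs. rest" SVMs are trained

clf = OneVsRestClassifier(LinearSVC()).fit(X, y)
# decision_function returns one score per class; the highest-scoring class is predicted.
print(clf.predict([[0.1, 0.95]]), clf.decision_function([[0.1, 0.95]]))
```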

Neural networks are made up of single neurons, called perceptrons, which are inspired by biological human neurons. Figure 3 depicts the structure of a biological neuron.

Each biological neuron comprises three key components: the cell body, which houses the nucleus, provides nutrition, and generates and receives nerve impulses; dendrites, which are short extensions that receive impulses from other neurons and relay them to the cell body; and axons, which are long fibers that transmit signals from the cell body to other neurons. This structure serves as the inspiration for the design of artificial neurons, as illustrated in Figure 4.

Each artificial neuron consists of inputs corresponding to dendrites, a processing unit corresponding to the cell body, and outputs corresponding to axons. The processing is performed by activation functions, which are typically nonlinear, such as the sigmoid, tanh, ReLU, and sign functions.

Neural networks are formed by the combination of multiple neurons and typically consist of three layers: the input layer, which receives data from the dataset; the output layer, which displays the model's predicted values; and the hidden layer, situated between the input and output layers.

Conventional neural networks treat all inputs as independent, lacking any interconnections. However, in word processing tasks, the sequence of words is crucial. To address this, Recurrent Neural Networks (RNNs) predict the next element by utilizing information from prior calculations. The structure of the RNN network is illustrated in Figure 5.

Figure 5 The structure of the RNN network

At each step in the network, computations are performed using the current input and the hidden state from the previous step, typically employing functions such as tanh or ReLU. The output generated at this stage is commonly processed through the softmax function.
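In the standard formulation (the notation here is my own, chosen to match the description above), the hidden state and the output at step \( t \) are computed as

\[ h_{t} = \tanh\left( W_{xh} x_{t} + W_{hh} h_{t-1} + b_{h} \right), \qquad y_{t} = \operatorname{softmax}\left( W_{hy} h_{t} + b_{y} \right) \]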

Recurrent Neural Networks (RNNs) struggle with long-range dependencies, as they can only retain information over short intervals due to the vanishing gradient problem. This phenomenon causes gradient values to diminish as they propagate back through the network's layers, resulting in minimal weight updates during Gradient Descent and preventing convergence. To address these limitations, Long Short-Term Memory networks (LSTMs) were developed. LSTMs maintain a sequential architecture similar to RNNs but incorporate four interacting layers within each cell, allowing for improved memory retention and performance.

Figure 6 The architecture of LSTM
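As a rough sketch of how an LSTM could be applied to the question-classification setting discussed later, the following PyTorch snippet is illustrative only; the vocabulary and layer sizes are assumptions, and the 194 output classes simply mirror the number of original questions in the dataset of Chapter 2, not the thesis's actual configuration.

```python
# Sketch: an LSTM-based text classifier in PyTorch (sizes are illustrative).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=194):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (1, batch, hidden_dim), final hidden state
        return self.fc(h_n[-1])               # class logits, one per original question

model = LSTMClassifier()
logits = model(torch.randint(0, 10000, (2, 12)))  # a batch of 2 questions, 12 tokens each
print(logits.shape)  # torch.Size([2, 194])
```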

BERT language model

BERT, or Bidirectional Encoder Representations from Transformers, is a pre-trained model that generates bidirectional contextual representation vectors for words, enhancing its application in natural language processing tasks. Compared with traditional Word Embedding models, BERT significantly advances the representation of words as numeric vectors by taking their context into account.

The BERT model features a multilayer architecture composed of multiple bidirectional Transformer encoder layers, utilizing two-way attention mechanisms. In contrast, the GPT Transformer employs one-way attention, which is less natural and inconsistent with typical language use, as it only considers the left context. The bidirectional Transformer is commonly known as a Transformer encoder, while a variant that uses only the left-hand context is referred to as a Transformer decoder, suitable for text generation. A visual comparison of BERT, OpenAI GPT, and ELMo is shown in Figure 7.

Figure 7 BERT, OpenAI GPT and ELMo

BERT is pre-trained with two tasks: Masked LM and Next Sentence Prediction [27].

To develop a representation model that utilizes bidirectional context, we employ a straightforward method of masking random input tokens and predicting only the concealed tokens, a process referred to as "masked language modeling" (MLM). In this approach, the hidden vectors from the last layer that correspond to the masked tokens are passed through a softmax layer over the entire vocabulary for prediction. Google researchers demonstrated this technique by randomly masking 15% of all WordPiece tokens within a sentence and predicting only the masked words. The BERT training scheme for the masked LM task is illustrated in Figure 8.

While this bidirectional training scheme offers benefits, it also presents two significant drawbacks. The primary issue is the mismatch created between pre-training and fine-tuning, as the [MASK] token never appears during fine-tuning.

To mitigate this, not all selected words are replaced with the [MASK] token. Specifically, 15% of tokens are randomly selected for modification. For instance, in the sentence "con_chó của tôi đẹp quá" (my dog is so beautiful), the word "đẹp" (beautiful) may be masked, resulting in "con_chó của tôi [MASK] quá" (my dog is so [MASK]). Of the selected words, 80% are replaced with the [MASK] token, 10% are substituted with random words, such as "con_chó của tôi máy_tính quá" (my dog is so computer), and the final 10% remain unchanged, preserving the original sentence "con_chó của tôi đẹp quá" (my dog is so beautiful).

Because the Transformer encoder cannot know which words it will be asked to predict or which have been randomly replaced, it must maintain a contextual representation of every input token. Notably, replacing 1.5% of all tokens with random words does not significantly impact the model's language understanding. However, a limitation of the Masked Language Model (MLM) is that only 15% of tokens are predicted in each batch, so more pre-training steps may be necessary for convergence compared with other pre-training objectives.
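A small sketch of the 80/10/10 masking rule described above, with tokenization simplified to whitespace splitting and a toy vocabulary; this is an illustration, not BERT's actual implementation.

```python
# Sketch: BERT-style masking - 15% of tokens selected; of those,
# 80% -> [MASK], 10% -> random word, 10% -> left unchanged.
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    masked, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:
            labels.append(tok)                       # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))  # e.g. "máy_tính"
            else:
                masked.append(tok)                   # kept as-is, but still a prediction target
        else:
            masked.append(tok)
            labels.append(None)                      # not selected -> no prediction target
    return masked, labels

vocab = ["con_chó", "của", "tôi", "đẹp", "quá", "máy_tính"]
print(mask_tokens("con_chó của tôi đẹp quá".split(), vocab))
```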

In natural language processing, tasks like Question Answering require an understanding of the relationship between two text sentences, which is not captured by language modeling alone. To learn this relationship, a model is trained to predict whether one sentence follows another, and the training data can be generated from any corpus. For each training sample, sentences A and B are selected such that 50% of the time B is the sentence that actually follows A, while the other 50% of the time B is a randomly chosen sentence from the corpus.

To enable the model to differentiate between the two sentences, the input starts with the token [CLS] and each sentence ends with the token [SEP]. An example input for the BERT model is illustrated in Figure 9.

Figure 9 An example of the input in the BERT model
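For illustration, the Hugging Face tokenizer inserts these special tokens automatically when given a sentence pair; using the multilingual BERT checkpoint here is my own choice for the example, not the model used in the thesis.

```python
# Sketch: how [CLS] and [SEP] are added when encoding a sentence pair for BERT.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

encoded = tokenizer(
    "Chuyển đổi số là gì?",                                                   # sentence A
    "Chuyển đổi số là giai đoạn phát triển tiếp theo của tin học hóa.",       # sentence B
    return_tensors="pt",
)

# The decoded sequence shows the layout: [CLS] A [SEP] B [SEP]
print(tokenizer.decode(encoded["input_ids"][0]))
```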

Feature extraction

Feature extraction involves selecting text attributes and converting them into a vector space for efficient computer processing. This section explores several popular feature extraction techniques.

Term Frequency – Inverse Document Frequency (TF-IDF)

The TF-IDF value of a word indicates its significance within a document. It is derived from the term frequency (TF), which measures how often a word appears in that document:

\[ TF(t, d) = \frac{f(t, d)}{\sum_{t' \in d} f(t', d)} \]

where \( f(t, d) \) is the number of occurrences of the word \( t \) in the document \( d \), and the denominator is the total number of words in the document.

IDF, or Inverse Document Frequency, measures the rarity of a word within a corpus, aiming to diminish the weight of frequently occurring terms that carry little meaning. It is computed from the total number of documents in the corpus \( D \), denoted \( |D| \), divided by the number of documents in \( D \) that contain the word \( t \):

\[ IDF(t, D) = \log \frac{|D|}{|\{ d \in D : t \in d \}|} \]

The TF-IDF value is then the product of the two:

\[ TFIDF(t, d, D) = TF(t, d) \times IDF(t, D) \]

Words with high TF-IDF values are those that frequently appear in one document while being less common in others, allowing us to filter out common terms and focus on significant keywords. Although TF-IDF is an effective method for vectorizing textual data, it increases the computational load because the vector size is proportional to the vocabulary size. Additionally, TF-IDF struggles to represent out-of-vocabulary words and fails to capture relationships between words.

Word2Vec is a technique that transforms words into a lower-dimensional vector space while maintaining their semantic relationships. This method can be implemented through two approaches: Skip-Gram and Continuous Bag of Words (CBOW).

The CBOW model predicts a target word based on the surrounding context. Its architecture is illustrated in Figure 10.

Figure 10 One-word CBOW model structure

The input context word is represented as a one-hot encoded vector \( \mathbf{X} = [x_1, x_2, \ldots, x_V] \), where \( V \) denotes the size of the dictionary. In this encoding, \( x_i = 1 \) at the position of the corresponding dictionary word, while all other positions are set to \( x_i = 0 \). The model features a hidden layer comprising \( N \) neurons, and the output layer is also a one-hot vector \( \mathbf{Y} = [y_1, y_2, \ldots, y_V] \) of the same size \( V \).

The weight matrix \( W_{V \times N} \) connects the input layer to the hidden layer, while \( W'_{N \times V} \) links the hidden layer to the output layer. In this architecture, the hidden-layer neurons simply copy the weighted sum of their inputs without applying activation functions such as sigmoid, tanh, or ReLU. The only nonlinearity is the softmax computation in the output layer. Through predicting the target word, the model learns a vector representation for each word. The Skip-Gram model reverses this setup: it takes a word as input and aims to predict its surrounding context.

Word2Vec models capture semantic relationships between words more effectively than TF-IDF. For instance, the offset between the vectors of "king" and "queen" is approximately equal to the offset between "man" and "woman". However, Word2Vec still does not solve the problem of representing words that are not in the dictionary.

CHAPTER 1 established the foundational theoretical concepts essential for understanding the subsequent chapters. CHAPTER 2 will now outline the methodology for constructing data for the QA problem.

BUILDING A VIETNAMESE QA DATASET

Overview

As the volume of documents continues to grow, the demand for accessible information is rising. One effective solution is to create frequently asked questions (FAQs) that address key topics relevant to an organization or its field. For instance, the Microsoft Download Center offers FAQs to help users troubleshoot issues with Microsoft products and services. Similarly, the Ministry of Information and Communications has published the "Cẩm nang chuyển đổi số" (Digital Transformation Handbook) [54], which includes questions and answers on the field of Digital Transformation, based on the speeches of the Minister of Information and Communications Nguyen Manh Hung.

As the number of questions increases, locating inquiries similar to the one of interest becomes increasingly time-consuming and labor-intensive. Search engines typically provide results based solely on the exact words in the original question. However, in everyday communication, people express requests in various ways. For instance, when asking how to turn on a fan in a smart-home context, one might ask "How do I turn on the fan?" or phrase it as "I want the fan to run; how do I do that?". This illustrates the diversity of language and expression.

In the second phrasing, the term "bật" (turn on) has been replaced with "chạy" (run), which can confuse a QA system that relies only on word matching, especially when information is missing. Despite the different expressions, both questions can be answered with the same instructions for operating a fan in a smart home, so the two questions are "similar". Thus, collecting similar questions for FAQs offers a more flexible approach and enhances the system's understanding of user inquiries.

To ensure the effectiveness of a QA system, it is crucial to build a high-quality similarity dataset. However, organizations often face challenges due to insufficient training data for the QA model. Various methods have been proposed to improve data collection for QA systems, including the use of machine translation to automatically generate questions similar to those in the system's database. However, current machine translation methods still exhibit limitations: their accuracy is not yet sufficient, so the generated data may not meet the quality requirements of an effective QA system. Collecting data from diverse sources, such as discussion forums and online teaching platforms, is another viable method for enhancing the dataset.

To build effective question-and-answer (QA) systems, organizations must implement a clear and efficient data collection process that integrates various methods. This is essential for gathering high-quality and reliable data from websites, question-and-answer forums, and online conversations. In addition, collecting and processing data from these sources requires specialized skills and tools to ensure the reliability of the information gathered.

The organization's data collection process should divide data into training and testing sets. Training data must cover a sufficient number of cases and the fundamental questions posed within the organization. Testing data, in turn, is essential for assessing the QA model's capabilities and ensuring it functions correctly across various scenarios. In particular, for user input via speech, the QA model's performance must be evaluated in conjunction with the ASR module, which is crucial for live mobile QA systems. The ASR process may encounter challenges such as unclear utterances, background noise, and varying pronunciations, which can hinder the QA model's accuracy. Thus, evaluating the QA model's performance in combination with the ASR module is vital for ensuring its effectiveness with speech input.

When constructing data for an AI model, it is common to divide the dataset into specific ratios, such as 80:20 or 70:30, to create training and test datasets. The training data is essential for teaching the QA model to comprehend user inquiries and provide accurate responses, while the test data serves to assess the model's performance and quality.
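For reference, a common way to carry out such a split (here 80:20) with scikit-learn is shown below; the question list and labels are placeholders invented for the example.

```python
# Sketch: an 80:20 train/test split of question-label pairs.
from sklearn.model_selection import train_test_split

questions = ["Chuyển đổi số là gì?", "Học máy là gì?", "NGSP là gì?", "LGSP là gì?"]
labels = [0, 1, 2, 3]  # index of the original question each paraphrase belongs to

X_train, X_test, y_train, y_test = train_test_split(
    questions, labels, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))
```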

Building a QA model that takes speech as input can be prohibitively expensive due to the extensive resources required for data collection and evaluation. The process of gathering and refining speech data is labor-intensive and time-consuming, making it challenging to fine-tune and develop a high-quality dataset.

Current QA systems therefore primarily adopt a modularization approach, utilizing an ASR module to convert user speech into text for analysis. This method allows text data to be used to train the models, while their performance is evaluated on speech data, thereby measuring the model's quality as influenced by the ASR module. This strategy significantly reduces data-building costs, as text data can be generated more quickly and affordably than speech data. Consequently, the proposed data-building process consists of two key steps: the first step focuses on constructing training data primarily from text, while the second step creates test data from the received speech input.

The process of building a Vietnamese QA dataset

The initial dataset, provided by the Ministry of Information and Communications, consists of corresponding question and answer pairs. It comprises 194 question-answer pairs organized into 9 management groups focused on Digital Transformation. These questions are derived from the speeches of Minister of Information and Communications Nguyen Manh Hung, addressing fundamental concepts, planning mechanisms, policies, and strategies related to digital transformation in Vietnam and globally.

While questions in distinct groups are clearly differentiated, there can be significant ambiguity among questions within the same group. Examples of these question-answer pairs are presented in Table 1.

Table 1 Some examples of questions in the initial dataset

Num: 1
Group: Câu hỏi chung về chuyển đổi số (General questions about digital transformation)
Question: Chuyển đổi số là gì? (What is digital transformation?)
Answer: Chuyển đổi số là giai đoạn phát triển tiếp theo của tin học hóa, nhờ vào sự tiến bộ vượt bậc của các công nghệ mới đột phá, được gọi chung là công nghệ số. (Digital transformation is the next development of computerization, made possible by the remarkable progress of breakthrough new technologies, collectively known as digital technology.)

Num: 2
Group: Câu hỏi chung về chuyển đổi số (General questions about digital transformation)
Question: Có thể nói rõ hơn về chuyển đổi số được không? (Can you be more specific about digital transformation?)
Answer: Bạn có thể tìm hiểu thêm về chuyển đổi số qua cuốn Cẩm nang Chuyển đổi số do Bộ Thông tin và Truyền thông phát hành, hoặc truy cập trang web dx.mic.gov.vn để có thông tin chi tiết. (You can refer to "Cẩm nang chuyển đổi số" (the Digital Transformation Handbook) published by the Ministry of Information and Communications, or visit the website dx.mic.gov.vn for more detailed information.)

Num: 3
Group: Làm rõ một số khái niệm có liên quan đến chuyển đổi số (Clarifying some concepts related to digital transformation)
Question: Học máy là gì? (What is machine learning?)
Answer: Học máy là một nhánh nghiên cứu của trí tuệ nhân tạo và khoa học máy tính tập trung vào sử dụng dữ liệu và các thuật toán để bắt chước cách con người học. (Machine Learning is a branch of research in Artificial Intelligence and Computer Science that focuses on using data and algorithms to mimic how humans learn.)

In Table 1, the first question, "Chuyển đổi số là gì?" (What is digital transformation?), and the second, "Có thể nói rõ hơn về chuyển đổi số được không?" (Can you be more specific about digital transformation?), belong to the same group of general questions about digital transformation. Both express a desire to grasp the fundamental concept of digital transformation; one can be answered succinctly, while the other calls for a fuller explanation or a pointer to useful resources.

In contrast, the first and third questions come from distinct groups: general questions about digital transformation and clarification of related concepts, respectively. The first asks for the definition of digital transformation, while the third asks about machine learning, so there is a clear semantic distinction between the two.

The questions in the initial dataset also vary in length. Information about the length of the questions in this dataset is given in Table 2.

Table 2 Information about question length

As shown in Table 2, the average question length is 11 syllables. The shortest questions are of the form "NGSP là gì?" (What is NGSP?) and "LGSP là gì?" (What is LGSP?). The dataset primarily consists of short and medium-length questions, with 25% containing fewer than 7 syllables and 50% having 10 syllables or fewer; the longest questions reach up to 42 syllables. Short questions typically address concepts, purposes, roles, or methods for solving specific problems, while longer questions may combine multiple simple sentences or include complex conceptual terms. Examples of both short and long questions can be found in Table 3.

Table 3 Some examples of short and long questions

Question: Cổng dịch vụ công quốc gia là gì? (What is the National Public Service Portal?)
Answer: Cổng Dịch vụ công Quốc gia là nền tảng tích hợp thông tin về dịch vụ công trực tuyến, cung cấp tình hình và kết quả giải quyết thủ tục hành chính. Cổng này hoạt động dựa trên việc kết nối và truy xuất dữ liệu từ các hệ thống thông tin một cửa điện tử ở cấp bộ và cấp tỉnh, cùng với các giải pháp hỗ trợ nghiệp vụ và kỹ thuật do Văn phòng Chính phủ quản lý và xây dựng. (The National Public Service Portal is a platform that consolidates information on online public services, settlement statuses, and outcomes of administrative procedures. It operates by connecting to and retrieving data from ministry-level and provincial-level electronic one-stop information systems, together with supporting business and technical solutions developed and managed by the Government Office.)

Question: Truy cập dữ liệu mở quốc gia ở đâu? (Where can national open data be accessed?)
Answer: Có thể truy cập dữ liệu mở của cơ quan nhà nước tại Cổng dữ liệu quốc gia tại địa chỉ data.gov.vn. (Open data of state agencies can be accessed at the National Data Portal at data.gov.vn.)

Question: Vai trò của kinh tế số là gì? (What is the role of the digital economy?)
Answer: Kinh tế số không chỉ nâng cao năng suất lao động mà còn thúc đẩy tăng trưởng kinh tế bền vững và bao trùm. Việc sử dụng tri thức thay vì chỉ dựa vào tài nguyên giúp mở rộng cơ hội cho nhiều người, đồng thời chi phí tham gia vào nền kinh tế số cũng thấp hơn. (The digital economy enhances labor productivity and drives sustainable, inclusive economic growth: it relies on knowledge rather than resources, and its lower participation costs create opportunities for a broader range of people.)

Question: Một trong những mục tiêu của Chương trình Chuyển đổi số quốc gia là xây dựng một môi trường số nhân văn và bao trùm. Mục tiêu này cần được hiểu như thế nào? (One of the goals of the National Digital Transformation Program is to create a humane, widespread digital environment. How should this goal be understood?)
Answer: Một trong những thế mạnh của các nền tảng số là khả năng mở rộng và tiếp cận người dùng. Do vậy, người dân ở khắp mọi miền tổ quốc đều có thể được tiếp cận dịch vụ một cách bình đẳng. Chuyển đổi số mang lại ý nghĩa nhân văn sâu sắc, giúp người dân ở vùng sâu, vùng xa và biên giới hải đảo tiếp cận dịch vụ y tế và giáo dục tốt nhất. (One of the strengths of digital platforms is scalability and user reach. Therefore, people in all parts of the country can have equal access to services. People in remote areas and on border islands can still use the best health and education services; that is the humane meaning of digital transformation.)

Question: Hiện nay, có bao nhiêu doanh nghiệp Việt Nam đáp ứng các tiêu chí và chỉ tiêu kỹ thuật để đánh giá và lựa chọn giải pháp nền tảng điện toán đám mây? (How many Vietnamese enterprises currently meet the criteria and technical indicators for evaluating and selecting cloud computing platform solutions?)
Answer: Hiện nay, có 05 doanh nghiệp Việt Nam, bao gồm Viettel, VNG, CMC, VNPT và VCCorp, đã đáp ứng đầy đủ bộ tiêu chí của Bộ Thông tin và Truyền thông. (Currently, there are 05 Vietnamese enterprises, namely Viettel, VNG, CMC, VNPT and VCCorp, which have met the full set of criteria issued by the Ministry of Information and Communications.)

Data analysis

We conducted two data collection campaigns as outlined in section 2.2, utilizing a combination of the collection systems mentioned above. Details regarding these campaigns are provided in Table 5.

Table 5 Data collection campaigns information

The data collection was conducted through two campaigns. The first campaign used 174 question-answer pairs as the initial dataset; the second campaign expanded this dataset to 194 pairs, all provided by experts in the field of Digital Transformation from the Ministry of Information and Communications. Collaborators, aged between 19 and 23, were recruited so as to ensure a diverse representation of speech from various regions, with balanced proportions of gender and regional accents. In the first campaign, 27 collaborators participated, comprising 15 men and 12 women, with accents distributed as 12 Northern, 6 Southern, and 9 Central. The second campaign saw an increase in participation, with 32 collaborators involved in the data construction: 15 men and 17 women. The regional proportions in the second campaign are also more evenly distributed, with 12 collaborators speaking with a Northern accent, 10 with a Central accent, and 12 with a Southern accent. Once collected, the data is checked and evaluated by data experts before the QA models are trained and deployed.

Through the two campaigns, the dataset for the QA problem was built with the information described in Table 6.

Table 6 Information about data collected through campaigns

Written Collection System | Speech Collection System

Through the two data collection campaigns, a total of 5,799 similar questions were built on top of the 194 original questions. Each original question has an average of 29 similar questions, with a minimum of 14 and a maximum of 56. With the speech data collection system, for the same 194 original questions, a total of 2,909 data items were collected, each including audio and text. Each original question has an average of 14 collected items, with a minimum of 5 and a maximum of 29. Regarding sentence length, Figure 17 shows the distribution of the data collected with the Written Collection System.

Figure 17 Distribution of the number of words in a sentence with the Written Collection System

As depicted in Figure 17, most sentences range from 6 to 18 words, with a significant concentration around 10 words. Very short and very long sentences are less common, which is consistent with the fact that 75% of the original questions consist of 14 words or fewer.

For the data collected through the Speech Collection System, the distribution of data length is illustrated in Figure 18.

Figure 18 Distribution of word count in the data collected by the speech system

Similar to the written system, the sentences collected by the speech system range from 3 to 48 words, with most falling between 6 and 20 words. The transcribed sentences differ slightly from the originals due to the influence of the ASR module. Overall, the distribution aligns closely with that of the written system.

Data Disclosure

Through the two data collection systems, the thesis has built a dataset for building and deploying the QA system. The dataset includes:

(i) training data, built from the similar questions contributed through the written system;

The test data is categorized into two types: (ii) User test data, which is created from the labeling of audio content during recording, and (iii) ASR test data, which consists of transcribed content generated by the output of the ASR module during the recording process.

The dataset, detailed in Table 6, comprises 5,799 similar questions and 194 original questions used for training, along with 2,909 data points in each test dataset. The data is stored in JSON format, encoded in UTF-8.
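A sketch of how such a file might be loaded is given below; the file name and the field names ("original_question", "similar_questions", "answer") are hypothetical, since the actual schema is not described in this extract.

```python
# Sketch: loading the UTF-8 JSON dataset (file name and field names are hypothetical).
import json

with open("dx_qa_dataset.json", encoding="utf-8") as f:
    data = json.load(f)

for item in data:
    original = item["original_question"]      # hypothetical field
    paraphrases = item["similar_questions"]   # hypothetical field
    answer = item["answer"]                   # hypothetical field
    print(original, len(paraphrases))
```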

To the best of the author's knowledge, this is the first published dataset for a QA problem in the Digital Transformation data domain. Moreover, the data construction process has the potential to be applied to various other data domains beyond Digital Transformation.

VIETNAMESE QUESTION ANSWERING MODEL


References

[1] F. Zhu, W. Lei, C. Wang, J. Zheng, S. Poria, and T.-S. Chua, “Retrieving and Reading: A Comprehensive Survey on Open-domain Question Answering.” arXiv, May 08, 2021. Accessed: Apr. 02, 2023. [Online]. Available: http://arxiv.org/abs/2101.00774
[2] Z. Huang et al., “Recent Trends in Deep Learning Based Open-Domain Textual Question Answering Systems,” IEEE Access, vol. 8, pp. 94341–
[3] Q. Jin et al., “Biomedical Question Answering: A Survey of Approaches and Challenges.” arXiv, Sep. 08, 2021. doi: 10.48550/arXiv.2102.05281
[4] D. Moldovan, M. Pasca, S. Harabagiu, and M. Surdeanu, “Performance Issues and Error Analysis in an Open-Domain Question Answering System,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 33–40. doi: 10.3115/1073083.1073091
[5] R. Usbeck, A.-C. N. Ngomo, L. Bühmann, and C. Unger, “Hawk – hybrid question answering using linked data,” in The Semantic Web. Latest Advances and New Domains: 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31–June 4, 2015, Proceedings 12, Springer, 2015, pp. 353–368.
[7] J. Kupiec, “MURAX: a robust linguistic approach for question answering using an on-line encyclopedia,” in Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’93, Pittsburgh, Pennsylvania, United States: ACM Press, 1993, pp. 181–190. doi: 10.1145/160688.160717
[8] Z. Zheng, “AnswerBus question answering system,” in Proceedings of the Second International Conference on Human Language Technology Research, San Diego, California: Association for Computational Linguistics, 2002, pp. 399–404. doi: 10.3115/1289189.1289238
[9] D. Mollá, M. van Zaanen, and D. Smith, “Named Entity Recognition for Question Answering,” in Proceedings of the Australasian Language Technology Workshop 2006, Sydney, Australia, Nov. 2006, pp. 51–58. Accessed: Apr. 02, 2023. [Online]. Available: https://aclanthology.org/U06-1009
[10] M. Wang, “A Survey of Answer Extraction Techniques in Factoid Question Answering,” Comput. Linguist., vol. 1, no. 1.
[12] D. Ravichandran and E. Hovy, “Learning surface text patterns for a Question Answering System,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 41–47. doi: 10.3115/1073083.1073092
[13] R. Sun, J. Jiang, Y. F. Tan, H. Cui, T.-S. Chua, and M.-Y. Kan, “Using Syntactic and Semantic Relation Analysis in Question Answering.”
[14] D. Shen, G.-J. M. Kruijff, and D. Klakow, “Exploring Syntactic Relation Patterns for Question Answering,” in Second International Joint Conference on Natural Language Processing: Full Papers, 2005. doi: 10.1007/11562214_45
[15] T. Kočiský et al., “The NarrativeQA Reading Comprehension Challenge.” arXiv, Dec. 19, 2017. Accessed: Apr. 08, 2023. [Online]. Available: http://arxiv.org/abs/1712.07040
[16] T. Kwiatkowski et al., “Natural Questions: A Benchmark for Question Answering Research,” Trans. Assoc. Comput. Linguist., vol. 7, pp. 452–466, 2019. doi: 10.1162/tacl_a_00276
[17] W. He et al., “DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications,” in Proceedings of the Workshop on Machine Reading for Question Answering, Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 37–46. doi: 10.18653/v1/W18-2605
[18] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” 2014.
[20] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners.”
[21] K. M. Hermann et al., “Teaching Machines to Read and Comprehend.” arXiv, Nov. 19, 2015. Accessed: Apr. 09, 2023. [Online]. Available: http://arxiv.org/abs/1506.03340
[22] P. Bajaj et al., “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.” arXiv, Oct. 31, 2018. doi: 10.48550/arXiv.1611.09268
[53] “Frequently Asked Questions - Microsoft Download Center.” https://www.microsoft.com/en-us/download/faq.aspx (accessed Apr. 17, 2023).
