A Community-Based Vietnamese Question Answering System


Quan Hung Tran, Nien Dinh Nguyen, Kien Duc Do, Thinh Khanh Nguyen, Dang Hai Tran, Minh Le Nguyen, and Son Bao Pham

Abstract. Recent Vietnamese QA systems have not used data crawled from community web services as a resource. In this paper, we use such community-based resources to build a Vietnamese question answering system named VnCQAs. Our system comprises three modules, which build the database of question-answer pairs, analyze questions, and choose the best answer, respectively. Experimental results show that our system achieves promising performance.

1 Introduction

Nowadays, community web services play a crucial role in helping users find the responses they seek, especially in the technology domain. Users often post their queries on Yahoo! Answers, technology web forums or Facebook to get help and advice based on others' personal experience. However, queries are often complex and contain multiple sub-questions, while others' feedback and comments may miss valuable information or address only part of these queries. For example, the question “Có nên mua Samsung Galaxy S4 không?” (Should I buy a Samsung Galaxy S4?) expects an answer about individual opinions rather than the specifications of the phone itself.

Quan Hung Tran · Nien Dinh Nguyen · Kien Duc Do · Thinh Khanh Nguyen · Dang Hai Tran · Son Bao Pham

Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi

e-mail: {quanth_55,niennd_55,kiendd_55,thinhnk_55,dangth,sonpb}@vnu.edu.vn

Minh Le Nguyen

School of Information Science, Japan Advanced Institute of Science and Technology

e-mail: nguyenml@jaist.ac.jp

V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering, Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_10


Assuming that we have a collection of users' queries from community web services and a corresponding collection of feedback and comments, building a community-based question answering (cQA) system that returns the best answer for each user's whole query is challenging. The task involves three key research problems: how to construct the database of question-answer pairs, how to analyze the questions in users' queries, and how to produce the best answer. Related research addresses question identification [5, 16, 6, 15], question similarity [2], question generation [17], question analysis [10, 9], answer summarization [3] and answer re-ranking [13].

At present, most Vietnamese QA systems have not used community web services as a resource for such research. Existing Vietnamese QA systems [8, 14, 7, 12] are usually rule-based or grammar-based and use structured databases or crawled web pages. One Vietnamese QA system that does use community data is described by Dang et al. [1]. Their system responds to a new question by finding similar questions from Yahoo! Answers. However, it does not return the answers to those similar questions, and its reported accuracy is not high.

In this paper, we present a community-based Vietnamese question answering system named VnCQAs. Our system addresses two issues: domain adaptability and the inability to answer complex questions. Furthermore, VnCQAs uses machine learning techniques to improve accuracy. The system contains three main modules, Database Construction, Question Analysis and Answer Selection, which are responsible for building the database, analyzing questions and choosing the best answer, respectively. Figure 1 illustrates an example (see footnote 1) with the input question “nên mua Ipad hay laptop” (Should I buy an iPad or a laptop?). The output includes the best available answer and related questions. Users can find the answers to the related questions by clicking the corresponding links.

The rest of the paper is organized as follows. Section 2 gives an overview of the whole system and describes its modules. Section 3 presents the experimental results. Section 4 presents the conclusion and future work.

2 System Architecture

In this section, we introduce the VnCQAs system architecture (shown in Figure 2) and briefly describe all modules in the system. When a new question is presented to the system, the system finds the most similar questions in the database. The answers to these similar questions are called candidate answers. These candidate answers are then processed to output the final answer.

We build the system with three modules: Database Construction, Question Analysis and Answer Selection.

The Database Construction module is responsible for building the database of question-answer pairs.

1 The online demonstration is available at: http://150.65.242.39:8080/VNQA/


Fig. 1 The system user interface

Fig. 2 The system architecture

The Question Analysis module analyzes the question and extracts useful information about it, such as keywords, question type and synonyms of its words.

The Answer Selection module processes the candidate answers and returns the final answer for the given question.


2.1 Database Construction Module

This module extracts question-answer pairs from the community data to construct the database, in two steps: question detection and answer detection (as shown in Figure 3).

Fig. 3 The database construction module

Our main sources of question-answer pairs are threads collected from well-known Vietnamese technology forums such as Vatgia, VnZoom and Tinhte. Each site consists of a series of threads; each thread has a specific topic and is further divided into posts. Typically one of the posts presents the question, and some of the other posts contain the answer. Because the data crawled from these sites have different layouts, we standardize them by parsing them into our predefined XML format for later processing.

Figure 4 shows a thread with the question “Mấy anh cho em hỏi tại sao khi em vào My computer hay bị treo vài giây” (Why does my computer stop responding for a few seconds when I open the My Computer folder?). The only suitable answer is the last post: “cũng có thể do phần mềm AV và cũng có thể là do 1 phần mềm nào trong máy nên bạn kiểm tra lại máy xem nhé.” (It may be because of the AV software or some other software on the machine, so you should check your computer again.)

2.1.1 Question Detection

In question detection, we use a machine learning model to classify whether a post is a question post or not. The features used by the model are sequence patterns based on a generalized form of the text. For example, the sentence “Subnet Mask dùng để làm gì?” (What is Subnet Mask used for?) can be represented in sequence form as “Np V E làm gì”, in which Np, V and E are the part-of-speech (POS) tags of the respective words.

We define a set of question words that appear frequently in questions, e.g., giúp (help), phân biệt (distinguish), đánh giá (evaluate), làm sao (how to do), tại sao (why). These question words are kept in their original form and arranged into 18 groups:


Fig. 4 Thread XML example

• q0000: gì (what)

• q0001: nào, nào là (which)

• q0002: ai (who)

• q0003: đâu (where)

• q0004: hay (or)

• q0005: sao, vì sao, tại sao (why)

• q0006: làm sao (how to do)

• q0007: làm gì (what to do)

• q0008: ra sao, thế nào, như thế nào (how)

• q0009: bao nhiêu (how many), bao lâu (how long), bao xa (how far)

• q0010: không (not), chưa (not yet)

• q0100: giúp (help), tư vấn (advise), dạy (teach), hướng dẫn (instruct), chỉ dẫn (instruct)

• q0101: hỏi (ask), thắc mắc (worry)

• q0102: khắc phục (overcome)

• q0103: vấn đề (problem)

• q0104: cách (solution)

• q0105: so sánh (compare), đánh giá (evaluate)

• q0106: phân biệt (distinguish)

The other words in a question are replaced by their POS tags to make the sequence more general. The sequence patterns are extracted using the PrefixSpan algorithm [15]. After that, we select the patterns that contain question words. We then apply the “multiple minimum supports” method [4] to guarantee the quality of the patterns.
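The generalization and pattern-mining steps of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: questions are assumed to arrive already word-segmented and POS-tagged, only a small subset of the question words is included, a single minimum support value stands in for the multiple-minimum-supports method, and the names `generalize`, `prefixspan` and `question_patterns` are hypothetical. The POS tags in the sample data are illustrative only.

```python
from collections import defaultdict

# Illustrative subset of the question words listed above (assumption:
# multi-word question phrases arrive as single tokens).
QUESTION_WORDS = {"gì", "làm gì", "tại sao", "làm sao", "giúp", "hỏi"}


def generalize(tagged_tokens):
    """Keep question words verbatim; replace other words by their POS tag."""
    return [word.lower() if word.lower() in QUESTION_WORDS else tag
            for word, tag in tagged_tokens]


def prefixspan(sequences, min_support):
    """Return (pattern, support) for every frequent subsequence."""
    results = []

    def mine(prefix, projected):
        # projected holds one (sequence index, start offset) entry for
        # each sequence that contains the current prefix.
        occurrences = defaultdict(list)
        for seq_idx, start in projected:
            seen = set()
            for pos in range(start, len(sequences[seq_idx])):
                item = sequences[seq_idx][pos]
                if item not in seen:  # first occurrence after the prefix
                    seen.add(item)
                    occurrences[item].append((seq_idx, pos + 1))
        for item in sorted(occurrences):
            new_projected = occurrences[item]
            if len(new_projected) >= min_support:
                pattern = prefix + [item]
                results.append((pattern, len(new_projected)))
                mine(pattern, new_projected)

    mine([], [(i, 0) for i in range(len(sequences))])
    return results


def question_patterns(patterns):
    """Keep only the patterns that contain at least one question word."""
    return [(p, s) for p, s in patterns if QUESTION_WORDS & set(p)]


# Toy data; the tag "X" on the question phrase is a placeholder that is
# never used, because question words are kept in their original form.
tagged = [
    [("Subnet Mask", "Np"), ("dùng", "V"), ("để", "E"), ("làm gì", "X")],
    [("RAM", "Np"), ("dùng", "V"), ("để", "E"), ("làm gì", "X")],
    [("Máy", "N"), ("bị", "V"), ("treo", "V")],
]
sequences = [generalize(s) for s in tagged]
frequent = question_patterns(prefixspan(sequences, min_support=2))
```

In this sketch the support of a pattern is the number of sequences that contain it as a (not necessarily contiguous) subsequence, which is the usual PrefixSpan definition.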

2.1.2 Answer Detection

After finding the questions in the previous step, we detect the corresponding answers for each question by classifying the remaining posts with an SVM model, using the following features:

• Does the post belong to the author of the thread?

• Does the post contain a quote of the questioner?

• Does the post contain a quote of other users?

• The relative position of the post in the thread.

• The relative length of the post compared to others in the thread.

• Similarity between the post and the detected question.

• The proportion of nouns, verbs and pronouns that the post contains.

If the post is from the question's owner, it is unlikely to be the answer. Otherwise, a post that quotes the questioner is highly likely to be the answer.
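The feature set above might be computed along the following lines. This is a sketch only: the `Post` structure, the use of Jaccard word overlap as the similarity measure, and the exact definitions of relative position and relative length are assumptions, since the text does not specify them. The POS-proportion feature is omitted because it needs a Vietnamese POS tagger, and the resulting feature dictionaries would still have to be passed to an SVM implementation.

```python
from dataclasses import dataclass


@dataclass
class Post:
    author: str
    text: str
    quoted_authors: tuple = ()  # authors whose posts are quoted in this post


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for the unspecified measure."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def answer_features(thread, question_idx, post_idx):
    """Build the feature dictionary for one candidate post in a thread."""
    question, post = thread[question_idx], thread[post_idx]
    thread_author = thread[0].author  # assumption: first post opens the thread
    mean_len = sum(len(p.text.split()) for p in thread) / len(thread)
    return {
        "by_thread_author": post.author == thread_author,
        "quotes_questioner": question.author in post.quoted_authors,
        "quotes_other_user": any(a != question.author
                                 for a in post.quoted_authors),
        # 0.0 for the first post, 1.0 for the last post of the thread.
        "relative_position": (post_idx / (len(thread) - 1)
                              if len(thread) > 1 else 0.0),
        # Length in words divided by the mean post length of the thread.
        "relative_length": (len(post.text.split()) / mean_len
                            if mean_len else 0.0),
        "question_similarity": jaccard(post.text, question.text),
    }


thread = [
    Post("an", "tại sao máy bị treo"),
    Post("an", "ai giúp em với"),
    Post("binh", "bạn kiểm tra lại máy xem", quoted_authors=("an",)),
]
last_post = answer_features(thread, question_idx=0, post_idx=2)
follow_up = answer_features(thread, question_idx=0, post_idx=1)
```

Here the last post quotes the questioner and comes from a different user, which matches the heuristic described above for likely answers, while the follow-up post by the thread author does not.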

2.2 Question Analysis Module

The question analysis module extracts important information from the questions, which is used to find similar questions in the later module. This module has three steps: question classification, question keyword identification, and similar word identification, as presented in Figure 5. Figure 6 shows an example of the data extracted by the question analysis module from the question “Làm thế nào để tạo vùng nhớ ảo thay thế RAM” (How to create virtual memory to replace RAM).

2.2.1 Question Classification

We classify questions because a stored question whose type differs from that of the input question is unlikely to be similar to it. Moreover, the question type also provides constraints for verifying the answers.

We categorize questions into three classes, Fact, Solution and Explanation, using a machine learning method with three sets of features: unigrams, bigrams, and similarity. The unigram and bigram features are Boolean values, while the similarity feature, which measures the similarity between two questions, is estimated using phrasal overlap.
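The Boolean unigram and bigram features can be illustrated with a short sketch. It assumes whitespace tokenization (real Vietnamese text would need word segmentation) and uses hypothetical helper names; the phrasal-overlap similarity feature is left out because its exact formula is not given here.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(questions):
    """Collect every unigram and bigram seen in the training questions."""
    vocab = set()
    for question in questions:
        tokens = question.lower().split()
        vocab.update(ngrams(tokens, 1))
        vocab.update(ngrams(tokens, 2))
    return sorted(vocab)


def boolean_features(question, vocab):
    """1 if the n-gram occurs in the question, 0 otherwise."""
    tokens = question.lower().split()
    present = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
    return [1 if gram in present else 0 for gram in vocab]


vocab = build_vocabulary(["làm thế nào", "tại sao"])
vector = boolean_features("tại sao máy treo", vocab)
```

N-grams that never occurred in the training questions (here "máy" and "treo") simply have no position in the vector.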

2.2.2 Keyword Identification

Questions that are likely to be similar usually share the same set of keywords. In addition, many questions in online forums and QA sites contain unnecessary words and phrases; removing these makes it easier to find similar questions.


Fig. 5 The question analysis module

Fig. 6 An example of analyzing question

Keyword identification aims to find the most important words in a question. We compute a score for each word appearing in the question corpus using the term frequency-inverse document frequency (tf*idf) weighting scheme. We then use a threshold to determine whether a word is a keyword.
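A minimal sketch of threshold-based keyword selection with tf*idf is given below. The particular tf and idf variants, the threshold value, the treatment of unseen words and the function names are all assumptions made for illustration; the text does not state which variants or threshold the system uses.

```python
import math
from collections import Counter


def idf_table(corpus):
    """idf(w) = log(N / df(w)) over a corpus of tokenized questions."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def keywords(tokens, idf, threshold):
    """Return the words of one question whose tf*idf score passes the threshold."""
    term_freq = Counter(tokens)
    scores = {
        # Words unseen in the corpus get idf 0.0 here (an assumption).
        word: (count / len(tokens)) * idf.get(word, 0.0)
        for word, count in term_freq.items()
    }
    return {word for word, score in scores.items() if score >= threshold}


corpus = [
    ["cho", "em", "hỏi", "cách", "tạo", "ram", "ảo"],
    ["cho", "em", "hỏi", "tại", "sao", "máy", "treo"],
    ["cho", "em", "hỏi", "giá", "ipad"],
]
idf = idf_table(corpus)
selected = keywords(corpus[0], idf, threshold=0.1)
```

In this toy corpus the opening phrase "cho em hỏi" appears in every question, so its idf is zero and it is dropped, which mirrors the goal of removing unnecessary words.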

2.2.3 Similar Word Identification

To improve the system's ability to find similar questions, we also use a synonym dictionary to retrieve words that have the same meaning as the original words in the input question.


Fig. 7 An example of giving the answer

2.3 Answer Selection Module

The answer selection module is responsible for finding similar questions and their corresponding candidate answers in the database, and finally gives the best answer. Figure 7 shows an example of how an input question is processed in this module.

As shown in Figure 8, after the input question is analyzed by the question analysis module, we use the extracted information as the input for finding similar questions with Lucene (see footnote 2). For each candidate answer corresponding to a similar question, we then apply a supervised learning approach to estimate a classification confidence score. This score is used to re-rank the list of candidate answers, and the candidate answer with the highest score is selected as the final answer.

We consider the triplet (Qnew, Qpast, A), where Qnew is the input question, Qpast is a similar question and A is a candidate answer for Qpast. A triplet is classified as satisfied if the answer A can be used to respond to the question Qnew; otherwise, it is classified as unsatisfied. We employ supervised learning with the following features:

• Text length

• Number of question marks

• Number of stopwords

• IDF statistics

• Query clarity

• Cosine similarity

• Topic model

Fig. 8 The answer selection module

2 http://lucene.apache.org/
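The re-ranking of (Qnew, Qpast, A) triplets described in this section can be sketched as follows. The scoring function here is only a stand-in: it uses the bag-of-words cosine similarity between Qnew and Qpast in place of the classifier confidence, and the names `cosine_similarity`, `rerank` and `toy_score` are hypothetical. Whitespace tokenization is also an assumption.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rerank(q_new, candidates, score):
    """Sort (q_past, answer) candidates by descending score for q_new."""
    scored = [(score(q_new, q_past, answer), q_past, answer)
              for q_past, answer in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def toy_score(q_new, q_past, answer):
    # Stand-in for the classifier confidence: uses only one of the
    # listed features (cosine similarity) and ignores the answer text.
    return cosine_similarity(q_new, q_past)


candidates = [
    ("cách tạo ram ảo", "answer about virtual RAM"),
    ("nên mua ipad hay laptop cho sinh viên", "answer about iPad vs laptop"),
]
ranked = rerank("nên mua ipad hay laptop", candidates, toy_score)
best_answer = ranked[0][2]
```

A trained classifier would replace `toy_score`; the sorting and top-1 selection stay the same.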

3 Evaluation

We evaluate our system on two tasks: finding similar questions and giving the correct answer for each test question.

We collect the community data from three well-known Vietnamese technology forums: Vatgia, VnZoom and Tinhte. For each question, we assign a score ranging from 0 to 4 to each candidate answer, where exact answers are given a score of 4 and irrelevant answers a score of 0.

The evaluation data for finding similar questions consists of 1704 questions obtained from the database construction module. These questions are checked by hand to ensure that no question is similar to any other. We then paraphrase each question into 3 different versions with the same meaning. The paraphrased questions and their corresponding candidate answers are indexed in Lucene as mentioned in Section 2.3. We use the 1704 original questions as input to test the system: if the returned question is one of the 3 paraphrases of the input, we count it as a good result; otherwise we count it as a bad result. The accuracy of finding similar questions is presented in Table 1.
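The evaluation protocol for similar-question finding amounts to a top-1 accuracy over the original questions. A small sketch, with made-up identifiers in place of real questions and a hypothetical `retrieve` function standing in for the Lucene lookup:

```python
def similar_question_accuracy(test_cases, retrieve):
    """Percentage of inputs whose top returned question is one of its paraphrases."""
    hits = sum(1 for original, paraphrases in test_cases
               if retrieve(original) in paraphrases)
    return 100.0 * hits / len(test_cases)


# Toy data: each original question maps to its three paraphrases.
test_cases = [
    ("q1", {"q1_a", "q1_b", "q1_c"}),
    ("q2", {"q2_a", "q2_b", "q2_c"}),
]
# Fake retrieval results: q1 is matched correctly, q2 is not.
fake_index = {"q1": "q1_b", "q2": "q1_a"}
accuracy = similar_question_accuracy(test_cases, fake_index.get)
```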


Table 1 The accuracy results for finding the similar questions

Method                                  Accuracy (%)
Cosine similarity                       80.86
Cosine similarity + Question analysis   87.61

From the 1704 questions, we choose 605 questions that each have an exact answer as input questions to test the performance of giving the correct answer. We consider a returned answer satisfied if it matches the exact answer that we assigned. The accuracy of finding correct answers is shown in Table 2.

Table 2 The accuracy result for finding the answer

Method                                  Accuracy (%)
Baseline using the default of Lucene    59.66

4 Conclusion

In this paper, we proposed a community-based question answering system for Vietnamese. Our system consists of three modules: database construction, question analysis and answer selection. The database construction module creates the database of question-answer pairs, in which each question corresponds to its candidate answers. The question analysis module is responsible for extracting useful information such as keywords, question types and synonyms. The answer selection module uses the information extracted from the input question to find similar questions in the database, and then re-ranks the list of corresponding candidate answers to give the best answer. Experimental results are promising: the question analysis module improves the accuracy of finding similar questions from 80.86% to 87.61%, and the answer selection module achieves an accuracy of 71.19%, which is 11.53 percentage points higher than the baseline using the default of Lucene.

In the future, we will extend the question analysis module with additional features based on the dependency tree [11]. We will also expand the database to handle a wider range of questions and improve the answer selection module.

Acknowledgment. This work is partially supported by the Research Grant from Vietnam National University, Hanoi, No. QG.14.04.
