A Community-Based Vietnamese Question Answering System
Quan Hung Tran, Nien Dinh Nguyen, Kien Duc Do, Thinh Khanh Nguyen, Dang Hai Tran, Minh Le Nguyen, and Son Bao Pham
Abstract. Most recent Vietnamese QA systems have not used data crawled from community web services as a resource. In this paper, we take this community-based resource into account to build a Vietnamese question answering system named VnCQAs. Our system comprises three modules for building the database of question-answer pairs, analyzing questions, and choosing the best answer, respectively. Experimental results show that our system achieves promising performance.
1 Introduction
Nowadays, community web services play a crucial role in helping users find the responses they seek, especially in the technology domain. Users often post their queries on Yahoo! Answers, technology web forums or Facebook to find help as well as advice based on others’ personal experience. However, queries are often complex and contain multiple sub-questions, whilst others’ feedback and comments miss valuable information or only deal with a part of these queries. For example, the question “Có nên mua Samsung Galaxy S4 không?” (Should I buy a Samsung Galaxy S4?) expects an answer about individual opinions instead of the specifications of the phone itself.
Quan Hung Tran · Nien Dinh Nguyen · Kien Duc Do · Thinh Khanh Nguyen ·
Dang Hai Tran · Son Bao Pham
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
e-mail: {quanth_55,niennd_55,kiendd_55,thinhnk_55,
dangth,sonpb}@vnu.edu.vn
Minh Le Nguyen
School of Information Science
Japan Advanced Institute of Science and Technology
e-mail: nguyenml@jaist.ac.jp
© Springer International Publishing Switzerland 2015
V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering,
Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_10
Assuming that we have a collection of users’ queries from community web services and a corresponding collection of feedback and comments, building a community-based question answering (cQA) system that returns the best answer to each user’s whole query is a challenging task. This is because the task involves the key research problems of how to construct the database of question-answer pairs, how to analyze questions from users’ queries, and how to produce the best answer. Regarding these problems, existing research addresses question identification [5, 16, 6, 15], question similarity [2], question generation [17], question analysis [10, 9], answer summarization [3] and answer re-ranking [13].
At this time, most recent Vietnamese QA systems have not used community web services as a resource for such research. Existing Vietnamese QA systems [8, 14, 7, 12] are usually rule- or grammar-based and utilize structured databases or crawled web pages. Additionally, there is a Vietnamese QA system that used community data, described by Dang et al. [1]. Their system responds to a new question by finding similar questions from Yahoo! Answers. However, it did not return the answers to those similar questions, and the reported accuracy was not high.
In this paper, we present a community-based Vietnamese question answering system, namely VnCQAs. Our system addresses the issues of domain adaptability and the inability to answer complex questions. Furthermore, VnCQAs uses machine learning techniques to obtain high accuracy. Our system contains three main modules: Database Construction, Question Analysis and Answer Selection, which are responsible for building the database, analyzing questions and choosing the best answer, respectively. Figure 1 illustrates an example¹ with the input question “nên mua Ipad hay laptop” (Should I buy an iPad or a laptop?). The output includes the best available answer and related questions. Users can find the answers to the related questions by clicking the corresponding links.
The paper is organized as follows. Section 2 gives an overview of the whole system and describes its modules. Section 3 describes the experimental results. Section 4 presents the conclusion and future work.
2 System Architecture
In this section, we introduce the VnCQAs system architecture (as displayed in Figure 2) and briefly describe all modules in the system. When a new question is presented to the system, the system finds the most similar questions in the database. The answers to these similar questions are called candidate answers. These candidate answers are then processed to output the final answer.
We build the system with three modules: Database Construction, Question Analysis, and Answer Selection.
The Database Construction module is responsible for building the database of question-answer pairs.
¹ The online demonstration is available at: http://150.65.242.39:8080/VNQA/
Fig. 1 The system user interface
Fig. 2 The system architecture
The Question Analysis module analyzes the question and extracts useful information about it, such as keywords, question type and synonyms of its words.
The Answer Selection module processes the candidate answers and returns the final answer for the given question.
2.1 Database Construction Module
This module extracts question-answer pairs from the community data to construct the database in two steps: question detection and answer detection (as shown in Figure 3).
Fig. 3 The database construction module
Our main community sources of question-answer pairs are threads collected from well-known Vietnamese technology forums such as Vatgia, VnZoom and Tinhte. These sites consist of series of threads, where each thread has a specific topic and is further divided into posts. Typically, one of the posts presents the question, and some other posts contain the answer. Furthermore, the community data we crawled from the sites have different layouts; therefore, we standardize the data by parsing them into our predefined XML format for later processing.
Figure 4 shows the question “Mấy anh cho em hỏi tại sao khi em vào My computer hay bị treo vài giây” (Why does the computer stop responding for a few seconds when I open the My Computer folder?). The only suitable answer is the last post: “cũng có thể do phần mềm AV và cũng có thể là do 1 phần mềm nào trong máy nên bạn kiểm tra lại máy xem nhé.” (Maybe it is because of the AV software or maybe because of some other software, so you should check your computer again.)
2.1.1 Question Detection
In question detection, we use a machine learning model to classify whether a post is the question post or not. The features used for learning are sequence patterns based on a generalized form of the text. For example, the sentence “Subnet Mask dùng để làm gì?” (What is Subnet Mask used for?) can be represented in sequence form as “Np V E làm gì”, in which Np, V and E are the part-of-speech (POS) tags of the respective words.
We define question words as words that appear many times across questions, e.g., giúp (help), phân biệt (distinguish), đánh giá (evaluate), làm sao (how to do), tại sao (why). These question words are kept in their original form and arranged into 18 groups, namely:
Fig. 4 Thread XML example
• q0000: gì (what)
• q0001: nào, nào là (which)
• q0002: ai (who)
• q0003: đâu (where)
• q0004: hay (or)
• q0005: sao, vì sao, tại sao (why)
• q0006: làm sao (how to do)
• q0007: làm gì (what to do)
• q0008: ra sao, thế nào, như thế nào (how)
• q0009: bao nhiêu (how many), bao lâu (how long), bao xa (how far)
• q0010: không (not), chưa (not yet)
• q0100: giúp (help), tư vấn (advise), dạy (teach), hướng dẫn (instruct), chỉ dẫn (instruct)
• q0101: hỏi (ask), thắc mắc (worry)
• q0102: khắc phục (overcome)
• q0103: vấn đề (problem)
• q0104: cách (solution)
• q0105: so sánh (compare), đánh giá (evaluate)
• q0106: phân biệt (distinguish)
Other words in the question are replaced by their POS tags to make the sequence more general. The sequence patterns are extracted using the PrefixSpan algorithm [15]. After that, we select the patterns that contain the question words. We then apply the method called “Multiple minimum supports” [4] to guarantee the quality of the patterns.
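The generalization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the post has already been word-segmented and POS-tagged by an external tool, the function name `generalize` is hypothetical, the `QUESTION_WORDS` set covers only a few of the 18 groups, and the tag `X` given to the question word is a placeholder.

```python
# Illustrative subset of the question words listed above (assumption: the real
# system uses all 18 groups).
QUESTION_WORDS = {"gì", "làm gì", "làm sao", "tại sao", "vì sao", "thế nào", "giúp", "hỏi"}


def generalize(tagged_tokens):
    """Turn a POS-tagged sentence into its generalized sequence form.

    tagged_tokens: list of (word, pos_tag) pairs, assumed to come from an
    external Vietnamese word segmenter and POS tagger.
    Question words are kept in their original form; every other word is
    replaced by its POS tag.
    """
    sequence = []
    for word, tag in tagged_tokens:
        if word.lower() in QUESTION_WORDS:
            sequence.append(word.lower())
        else:
            sequence.append(tag)
    return " ".join(sequence)


# The example sentence from the text; "X" is a placeholder tag that is never used
# because "làm gì" is a question word.
tagged = [("Subnet Mask", "Np"), ("dùng", "V"), ("để", "E"), ("làm gì", "X")]
print(generalize(tagged))
```

Sequences produced this way would then be the input to the pattern mining step; the mining itself is not shown here.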
2.1.2 Answer Detection
After finding the questions in the previous step, we detect the corresponding answers for each question by classifying the remaining posts using an SVM model with the following set of features:
• Does the post belong to the author of the thread?
• Does the post contain a quote from the questioner?
• Does the post contain a quote from other users?
• The relative position of the post in the thread.
• The relative length of the post compared to others in the thread.
• Similarity between the post and the detected question.
• The proportion of nouns, verbs and pronouns that the post contains.
If the post is from the question’s owner, it is unlikely to be the answer. Otherwise, a remaining post that contains a quote from the questioner often has a high probability of being the answer.
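The feature list above could be turned into a numeric vector along the following lines. This is only a sketch under stated assumptions: the post dictionary keys (`author`, `text`, `quoted_authors`) are hypothetical, the first post is assumed to belong to the thread author, Jaccard word overlap stands in for the unspecified similarity measure, and the POS-proportion feature is omitted because it needs an external tagger. The resulting vectors would be passed to an SVM, which is not shown.

```python
def jaccard(text_a, text_b):
    """Word-overlap similarity (an assumed stand-in for the paper's measure)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def post_features(thread, index, question_text):
    """Build a feature vector for thread[index].

    thread: list of posts, each a dict with hypothetical keys
            'author', 'text' and (optionally) 'quoted_authors'.
    """
    post = thread[index]
    asker = thread[0]["author"]  # assumption: the first post is by the thread author
    quoted = post.get("quoted_authors", [])
    longest = max(len(p["text"]) for p in thread) or 1
    return [
        float(post["author"] == asker),            # written by the thread author?
        float(asker in quoted),                    # quotes the questioner?
        float(any(a != asker for a in quoted)),    # quotes other users?
        index / max(len(thread) - 1, 1),           # relative position in the thread
        len(post["text"]) / longest,               # relative length
        jaccard(post["text"], question_text),      # similarity to the question
    ]


thread = [
    {"author": "user_a", "text": "tại sao máy bị treo vài giây"},
    {"author": "user_b", "text": "bạn kiểm tra lại máy xem", "quoted_authors": ["user_a"]},
]
print(post_features(thread, 1, thread[0]["text"]))
```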
2.2 Question Analysis Module
The question analysis module aims to extract important information from the questions for finding similar questions in the later module. This module involves three steps: question classification, question keyword identification, and similar word identification, as presented in Figure 5. Figure 6 shows an example of the data extracted by the question analysis module from the question “Làm thế nào để tạo vùng nhớ ảo thay thế RAM” (How to create virtual memory to replace RAM).
2.2.1 Question Classification
We classify questions because a question that is classified as a different type from the original question is unlikely to be a similar question. Moreover, the question type also provides constraints for verifying the answers.
We categorize the questions into 3 classes, Fact, Solution and Explanation, using a machine learning method with a set of features: Unigrams, Bigrams, and Similarity. The unigram and bigram features are boolean values, while the similarity feature, which measures the similarity between two questions, is estimated using phrasal overlap.
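As an illustration of the boolean unigram and bigram features, the sketch below marks which vocabulary terms occur in a question. It assumes whitespace-separated tokens and a vocabulary fixed in advance; the function names are hypothetical, and the phrasal-overlap similarity feature is left out because the text does not give its formula.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def boolean_features(tokens, vocabulary):
    """1 if the unigram or bigram from the vocabulary occurs in the question, else 0."""
    present = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical vocabulary mixing unigrams and bigrams.
vocabulary = ["làm", "thế nào", "tại sao"]
tokens = "làm thế nào để tạo vùng nhớ ảo thay thế RAM".split()
print(boolean_features(tokens, vocabulary))
```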
2.2.2 Keyword Identification
Questions that are likely to be similar usually share the same set of keywords. Besides, many questions in online forums and QA sites contain unnecessary words and phrases; removing these helps to improve the ability to find similar questions.
Fig. 5 The question analysis module
Fig. 6 An example of analyzing a question
Keyword identification aims to find the most important words in a question. We compute a score for each word appearing in the question corpus using the term frequency - inverse document frequency (tf*idf) weighting scheme. Then we use a threshold to determine whether the word is a keyword or not.
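The thresholded tf*idf selection can be sketched as below. The exact tf and idf variants and the threshold value are not given in the text, so the smoothed idf, the length-normalized tf and the threshold used here are assumptions, as is the function name.

```python
import math
from collections import Counter


def tfidf_keywords(question_tokens, corpus, threshold):
    """Return the words of a question whose tf*idf score reaches the threshold.

    corpus: list of token lists, one per question in the question corpus.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each word once per question

    term_freq = Counter(question_tokens)
    keywords = []
    for word, count in term_freq.items():
        tf = count / len(question_tokens)
        # Smoothed idf (an assumption); words found in every question score 0.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word]))
        if tf * idf >= threshold:
            keywords.append(word)
    return keywords


corpus = [
    "how to install windows".split(),
    "how to create virtual memory".split(),
    "how to replace ram".split(),
]
print(tfidf_keywords("how to create virtual memory".split(), corpus, threshold=0.1))
```

In this toy corpus, "how" and "to" occur in every question, so their idf is zero and they are filtered out regardless of the threshold.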
2.2.3 Similar Word Identification
To improve the performance of our system in finding similar questions, we also use a synonym dictionary to return the words that have the same meaning as the original words in the input question.
Fig. 7 An example of giving the answer
2.3 Answer Selection Module
The answer selection module is responsible for finding the similar questions with their corresponding candidate answers in the database, and finally it gives the best answer. Figure 7 shows an example of how an input question is processed in this module.
As shown in Figure 8, after the input question is analyzed by the question analysis module, we use the extracted information as the input for finding the similar questions using Lucene². For each candidate answer corresponding to a similar question, we then apply supervised learning to estimate a classification confidence score. The score of each candidate answer is used to re-rank the list of candidate answers, and finally the candidate answer with the highest score is selected as the final answer.
We consider the following triplet: (Qnew, Qpast, A), where Qnew is the original question, Qpast is the similar question and A is the candidate answer for Qpast. Each triplet is classified as satisfied if the answer A can be used to respond to the question Qnew. Otherwise, the triplet is classified as unsatisfied. We employ supervised learning with the following set of features:
• Text length
• Number of question marks
• Number of stopwords
• IDF statistics
• Query clarity
² http://lucene.apache.org/
Fig. 8 The answer selection module
• Cosine similarity
• Topic model
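The re-ranking step over (Qnew, Qpast, A) triplets can be illustrated with the sketch below. It is not the authors' classifier: `score_fn` is a hypothetical stand-in for the confidence score of the trained model, and in the usage example it is replaced by the bag-of-words cosine similarity between Qnew and Qpast, which is only one of the listed features.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_answer(q_new, candidates, score_fn):
    """Re-rank candidates by score and return (best_answer, ranked_candidates).

    candidates: list of (q_past, answer) pairs retrieved for q_new.
    score_fn:   callable (q_new, q_past, answer) -> float; a hypothetical
                stand-in for the classifier's confidence score.
    """
    ranked = sorted(candidates, key=lambda c: score_fn(q_new, c[0], c[1]), reverse=True)
    best = ranked[0][1] if ranked else None
    return best, ranked


def similarity_score(q_new, q_past, answer):
    # Placeholder scoring: question-to-question cosine similarity only.
    return cosine_similarity(q_new.split(), q_past.split())


candidates = [
    ("how to replace ram", "answer about RAM"),
    ("buy ipad or laptop", "answer about iPad vs laptop"),
]
best, ranked = select_answer("should i buy ipad or laptop", candidates, similarity_score)
print(best)
```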
3 Evaluation
We evaluate our system on two tasks: finding similar questions and giving the correct answer for each test question.
We collect the community data from three well-known Vietnamese technology forums: Vatgia, VnZoom and Tinhte. For each question, we assign a score ranging from 0 to 4 to each candidate answer corresponding to the question, in which exact answers are given a score of 4 and irrelevant answers are given a score of 0.
The evaluation data for finding similar questions consists of 1704 questions obtained from the database construction module. These questions are checked by hand to ensure that each question is not similar to other questions. Then we paraphrase each question into 3 different versions with the same meaning. The paraphrased questions with the corresponding candidate answers are indexed into Lucene as mentioned in Section 2.3. We use the 1704 original questions as the input questions for testing our system performance. If the returned question is one of the 3 paraphrased questions, we count this as a good result; otherwise we count it as a bad result. The accuracy for finding the similar questions is presented in Table 1.
Table 1 The accuracy results for finding the similar questions
Method                                   Accuracy (%)
Cosine similarity                        80.86
Cosine similarity + Question analysis    87.61
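The similar-question evaluation protocol amounts to the following computation, shown here as a sketch. The data structures and identifiers are hypothetical; the toy data is made up for illustration and is unrelated to the reported figures.

```python
def retrieval_accuracy(returned, paraphrases):
    """Percentage of input questions whose returned question is one of their paraphrases.

    returned:    dict mapping an original question id to the id of the question
                 returned by the system.
    paraphrases: dict mapping an original question id to the set of ids of its
                 paraphrased versions.
    """
    if not returned:
        return 0.0
    good = sum(1 for qid, hit in returned.items() if hit in paraphrases.get(qid, set()))
    return 100.0 * good / len(returned)


# Toy data: 4 input questions, 3 of which retrieve one of their own paraphrases.
paraphrases = {q: {f"{q}a", f"{q}b", f"{q}c"} for q in (1, 2, 3, 4)}
returned = {1: "1a", 2: "4c", 3: "3b", 4: "4a"}
print(retrieval_accuracy(returned, paraphrases))
```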
From the 1704 questions, we choose 605 questions, each of which has an exact answer, as the input questions to test the performance of giving the correct answer. We consider a returned answer a satisfied answer if it matches the exact answer that we assigned. The accuracy for giving the correct answers is shown in Table 2.
Table 2 The accuracy result for finding the answer
Method                                   Accuracy (%)
Baseline using the default of Lucene     59.66
4 Conclusion
In this paper, we proposed a community-based question answering system for Vietnamese. Our system consists of three modules: database construction, question analysis and answer selection. The database construction module creates the database of question-answer pairs, in which each question corresponds to its candidate answers. The question analysis module is responsible for extracting useful information such as keywords, question types and synonyms. The answer selection module uses the information extracted from the input question to find similar questions in the database, and then re-ranks the list of corresponding candidate answers to give the best answer. Experimental results are promising: the question analysis module improves the accuracy of finding similar questions from 80.86% to 87.61%, and the answer selection module achieves an accuracy of 71.19%, which is 11.53 percentage points higher than the baseline using the default of Lucene.
In the future, we will extend the question analysis module with additional features based on the dependency tree [11]. We will also expand the database to be able to deal with a wider range of questions and improve the answer selection module.
Acknowledgment. This work is partially supported by the Research Grant from Vietnam National University, Hanoi No. QG.14.04.