Question Analysis for a Community Based Vietnamese Question Answering System



Quan Hung Tran, Minh Le Nguyen, and Son Bao Pham

Abstract. This paper describes our approach to analyzing questions in our community-based Vietnamese question answering system (VnCQAs), focusing on two subtasks: question classification and keyword identification. Question classification employs machine learning with a feature that measures the similarity between two questions, while keyword identification uses dependency-tree-based features. Experimental results are promising: question classification obtains an accuracy of 95.7% and keyword identification obtains an accuracy of 85.8%. Furthermore, these two subtasks improve the accuracy of finding similar questions in our VnCQAs by 6.75 percentage points.

Quan Hung Tran · Son Bao Pham

Faculty of Information Technology

University of Engineering and Technology

Vietnam National University, Hanoi

e-mail: {quanth_55,sonpb}@vnu.edu.vn

Minh Le Nguyen

School of Information Science

Japan Advanced Institute of Science and Technology

e-mail: nguyenml@jaist.ac.jp

V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering, Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_51

Question answering systems usually have a module that analyzes questions in order to extract important information such as keywords, question types or semantic constraints. In this research, we focus on two subtasks of question analysis: question classification and keyword identification. Identifying important words in a set of documents is an important task in information retrieval and question answering, with two main approaches: using corpus-based statistics for term weighting [7, 9] and employing supervised methods [3, 18, 10]. Question classification aims to classify questions into several pre-defined classes in order to find suitable answers. Li and Roth [8] proposed a two-layer taxonomy with 6 coarse classes and 50 fine-grained classes, while Bu et al. [4] introduced a six-type taxonomy. Furthermore, regarding the methods for classifying questions, some studies employed rule-based approaches [6] or machine learning algorithms [4, 1], while others combined rule-based and machine-learning-based techniques [5]. Recently, some question analysis techniques have been examined for Vietnamese [11, 13, 12, 16]. However, these studies experimented on standard corpora, where spelling is generally good. The question analysis for our VnCQAs system, on the other hand, has to deal with noisy data from community-based resources. In this paper, we propose dependency-tree-based features for finding keywords. We also introduce a new feature, called the “similarity feature”, for classifying questions. To the best of our knowledge, this is the first time question analysis has been adapted to Vietnamese community data.

The paper is organized as follows: Section 2 briefly describes the architecture of our VnCQAs system and gives an overview of our approach to analyzing questions. Question classification and keyword identification are introduced in Sections 3 and 4, respectively. Section 5 gives the experimental results, and Section 6 concludes.

The architecture of our question answering system [17] is shown in Figure 1. It includes three modules: Database Construction, Question Analysis and Answer Selection. The database construction module builds the database of question-answer pairs, while the question analysis module extracts useful information such as keywords, question types and synonyms. The answer selection module finds the questions in the database that are most similar to the input question; each similar question corresponds to a candidate answer. The candidate answers are then processed to output the best answer.

In this paper, we focus on the question analysis module, which has two main tasks: Question Classification and Keyword Identification. Furthermore, in this module, we also use a dictionary of 6626 entities to find synonyms in the question. Figure 2 shows an example of the question analysis module for the question “Làm thế nào để tạo vùng nhớ ảo thay thế RAM” (How to create virtual memory to replace RAM).

3.1 Question Types

We classify questions into three types, Fact, Explanation and Solution, according to the main purpose of the questioner.


Fig. 1 The system architecture

Fig. 2 An example for the question analysis module

• Fact: The question is only about objects and resources, and the expected answer is about general facts and/or attributes. E.g., for the question “Tấm dán màn hình từ tính là gì?” (What is a magnetic screen sticker?), the object is “Tấm dán màn hình từ tính” (the magnetic screen sticker), and the expected attribute in the answer is its definition.

• Explanation: The question requires an explanation or an opinion, e.g., “Vì sao điện thoại của mình hay bị mất sóng?” (Why does my phone frequently lose signal?).


• Solution: The question asks for a solution to a problem, e.g., “Chỉ cho em cách vào facebook trên Iphone?” (How to access Facebook from an iPhone?).

3.2 Methodology

In the VnCQAs system, we use support vector machines (SVMs) to learn the classifier (as shown in Figure 3) with a set of features that includes unigrams and bigrams. Unigram and bigram features are common in natural language processing tasks; the value of each unigram/bigram feature is a boolean indicating whether that unigram/bigram occurs in the question.

Furthermore, another feature we use for training the SVM model is the similarity feature, whose value represents a measure of similarity between two questions and is estimated using the phrasal overlap [2, 15]:

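\[ \mathrm{overlap}_{\mathrm{phrase}}(s_1, s_2) = \sum_{i=1}^{n} \sum_{m} i^{2} \]

summed over the m phrasal overlaps of each length i (up to n words), where m is the number of i-word phrases that appear in both sentences of the pair.

\[ \mathrm{sim}_{\mathrm{overlap,phrase}}(s_1, s_2) = \tanh\left( \frac{\mathrm{overlap}_{\mathrm{phrase}}(s_1, s_2)}{|s_1| + |s_2|} \right) \]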
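The phrasal overlap measure can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes that questions are already tokenized into word lists, that each distinct shared i-word phrase is counted once, and that n is the length of the shorter question. The function names are hypothetical.

```python
import math


def ngrams(tokens, n):
    """Return the set of distinct n-word phrases in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrasal_overlap(s1, s2):
    """Sum i**2 over every distinct i-word phrase shared by s1 and s2.

    Assumption: n (the longest phrase considered) is the length of the
    shorter sentence, and repeated occurrences are counted once.
    """
    total = 0
    for i in range(1, min(len(s1), len(s2)) + 1):
        shared = ngrams(s1, i) & ngrams(s2, i)  # m = len(shared)
        total += len(shared) * i * i
    return total


def phrasal_similarity(s1, s2):
    """Normalise the overlap by the sentence lengths and squash with tanh."""
    if not s1 and not s2:
        return 0.0
    return math.tanh(phrasal_overlap(s1, s2) / (len(s1) + len(s2)))
```

For the token lists of "how to delete messages on iphone" and "delete messages on android", the shared phrases are three unigrams, two bigrams and one trigram, so the overlap is 3·1 + 2·4 + 1·9 = 20 and the similarity is tanh(20 / 10) ≈ 0.964.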

Fig. 3 The question classification
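The boolean unigram/bigram features described in this section can be sketched as follows. This is only an illustration under assumptions: questions are pre-tokenized word lists, the vocabulary is built from the training questions, and the helper names are hypothetical. The resulting vectors would then be passed to an SVM trainer, which is not shown here.

```python
def ngram_set(tokens, n):
    """Distinct n-grams of a token list, stored as tuples of words."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_vocabulary(questions):
    """Collect every unigram and bigram seen in the tokenized questions."""
    vocab = set()
    for tokens in questions:
        vocab |= ngram_set(tokens, 1) | ngram_set(tokens, 2)
    return sorted(vocab)  # fixed order, so feature positions are stable


def boolean_features(tokens, vocabulary):
    """1 if the unigram/bigram occurs in the question, otherwise 0."""
    present = ngram_set(tokens, 1) | ngram_set(tokens, 2)
    return [1 if term in present else 0 for term in vocabulary]
```

Storing n-grams as tuples rather than joined strings keeps a multi-syllable Vietnamese word such as "tin nhắn" (one token) distinct from a bigram of two separate tokens.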

This section describes keyword identification using a machine learning technique with dependency-tree-based features.


4.1 Keyword Definition

We define the keywords as:

• The most informative words in the question: the set of keywords contains most of the information in the question (e.g., topics, main objects and actions).

• The words that can be used to distinguish different questions: two questions that have the same set of keywords are likely to be similar.

E.g., the question “Hỏi cách xóa tin nhắn trên iphone?” (How to delete messages on iphone?) has the keywords “cách” (how), “xóa” (delete), “tin nhắn” (messages) and “iphone”. The main verb “hỏi” (ask) is not considered a keyword because it does not carry any important information and cannot be used to distinguish among questions.

4.2 Methodology

We use the dependency tree to find keywords on the premise that, to decide whether a word is informative, we should take into account its relationships with the other words in the question. We obtain these relationships with the Vietnamese dependency parser [14]. For each question, the dependency parser creates a tree that contains the tree structure, the relations between the words and other information such as part-of-speech tags (as shown in Figure 4, where the question is “How do I delete messages in Iphone?”).

Fig. 4 An example of the dependency tree

The features are then extracted from the tree and used for training the SVM model, which classifies each word as a keyword or not (as presented in Figure 5). These features are grouped as follows:

• The part of speech (POS) tag of each word: POSW

• The part of speech (POS) tag of the parent word of each word in the dependency tree: POSP

• Unigrams

• The common dependent words: CDW

Fig. 5 Dependency tree method work flow

The POSW feature is used because words with certain POS tags (e.g., Noun, Verb, and Adjective) are more likely to be keywords of a sentence. The POSP feature helps to capture that words that are children of verbs are more likely to be the object of the question. The CDW feature is employed because the children of several words (e.g., “cách” (solution)) have a high chance of being keywords. To implement the CDW feature, we use a map that stores each word together with an integer counting the number of times that word is the parent of a keyword. A manually chosen threshold is then used to identify which words are frequently the parents of keywords.
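The CDW map and threshold described above can be sketched in Python as follows. The input format (triples of word, parent word and keyword label), the feature dictionary layout and all function names are assumptions made for illustration; the paper does not specify them, and the parser output is represented here by hand-written values.

```python
from collections import Counter


def count_keyword_parents(tagged_words):
    """Count how often each word is the dependency parent of a keyword.

    tagged_words: iterable of (word, parent_word, is_keyword) triples,
    where parent_word is None for the root of the tree.
    """
    counts = Counter()
    for _word, parent, is_keyword in tagged_words:
        if is_keyword and parent is not None:
            counts[parent] += 1
    return counts


def frequent_keyword_parents(counts, threshold):
    """Words whose count reaches the manually chosen threshold."""
    return {word for word, count in counts.items() if count >= threshold}


def word_features(word, pos, parent, parent_pos, frequent_parents):
    """Feature groups for one word: POSW, POSP, unigram and CDW."""
    return {
        "POSW": pos,
        "POSP": parent_pos,
        "unigram": word,
        "CDW": 1 if parent in frequent_parents else 0,
    }
```

With the hypothetical training triples ("xóa", "cách", True), ("vào", "cách", True) and ("tin nhắn", "xóa", True), the map stores cách → 2 and xóa → 1, so a threshold of 2 keeps only "cách" as a frequent parent of keywords.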

5.1 Question Classification Evaluation

We use a set of 1013 manually tagged questions in the technology domain with the three types mentioned above: Fact, Explanation, and Solution. The tagged questions come from the database construction module of our VnCQAs system. These questions are kept in their original form; no modifications are made. However, some of the questions in online forums are not understandable because they lack the information or context needed to understand them. These questions are removed from the data set. The question distribution is shown in Figure 6.

To train the SVM model, we use the LIBLINEAR and SVM-SMO algorithms with a 10-fold cross-validation scheme. The experiments were conducted on a Windows PC with a Core i7 CPU and 8GB of RAM. The highest accuracy (95.7%) is achieved with the combination of phrasal overlap, unigram and bigram features (as shown in Table 1).
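As a rough illustration of the 10-fold cross-validation scheme, the sketch below splits a data set into k contiguous folds and averages accuracy over them. It is a simplified assumption, not the evaluation code used in the experiments: it does not shuffle or stratify, and `train_fn` and `predict_fn` are hypothetical placeholders for whichever SVM trainer is used.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def cross_validated_accuracy(items, labels, train_fn, predict_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, items[i]) == labels[i]
                       for i in test_idx)
    return correct / len(items)
```

For the 1013 tagged questions, this split gives three folds of 102 questions and seven folds of 101.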

Fig. 6 The distribution of questions in the tagged data set

Table 1 The question classification accuracy on the community data

Features Accuracy (SVM - LIBLinear) Accuracy (SVM-SMO)

Although the accuracy of our question classification is around 95.7%, our corpus differs from the corpora of other published works, so it is hard to directly compare our method to other available methods on the question classification task. To make a meaningful comparison with other methods, we evaluate our method on the TREC corpus in Vietnamese [16]. Our accuracy is comparable to that of the approach of Tran et al. [16] under the same 10-fold cross-validation scheme (as shown in Table 2).

Table 2 The question classification accuracy on the TREC data

Classes                               Our method   Tran's method
Coarse classes classification         85.0%        86.0%
Fine-grained classes classification   84.9%        84.7%

5.2 Keyword Identification Evaluation

To test the performance of keyword identification, we use a set of 753 words tagged from the sentences in our database (as shown in Figure 7, where the question is “How do I delete messages in Iphone?”).

For comparison, we implement a baseline that identifies keywords using the term frequency-inverse document frequency (TF-IDF) method.


Fig. 7 An example of keywords

Fig. 8 The TF-IDF method’s accuracy

The TF-IDF score of each word in a sentence is calculated, and a threshold is chosen to decide whether a word is a keyword. The accuracy of the TF-IDF method is presented in Figure 8.
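A thresholded TF-IDF baseline of this kind might look like the sketch below. The paper does not state which TF-IDF variant or threshold was used, so this assumes relative term frequency, idf = log(N / df) and a caller-supplied threshold; the function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(documents):
    """Inverse document frequency, log(N / df), for every term seen."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_keywords(tokens, idf, threshold):
    """Mark a word as a keyword when its TF-IDF score reaches the threshold."""
    tf = Counter(tokens)
    scores = {t: (tf[t] / len(tokens)) * idf.get(t, 0.0) for t in tf}
    return {t for t, score in scores.items() if score >= threshold}
```

In a toy corpus where "ask" occurs in every question, its idf is log(1) = 0, so it is never selected, which mirrors the observation in Section 4.1 that “hỏi” (ask) is not a keyword.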

Our method outperforms the TF-IDF method, as the accuracies in Table 3 show.

Table 3 The accuracy of the keyword identification

Features Accuracy (SVM - LIBLinear) Accuracy (SVM-SMO)

5.3 Question Analysis Evaluation

In this section, we evaluate the contribution of question analysis to our VnCQAs system by measuring the improvement in finding similar questions. To evaluate the ability of the VnCQAs system to find similar questions, we use a set of 1704 questions that are checked by hand to ensure that no question is similar to any other. We then paraphrase each question into 3 different versions. The paraphrased questions must be close in meaning to the original questions.

We use the 1704 original questions as input questions for testing our system's performance. If the returned question is one of the 3 paraphrased questions, we count it as a good result; otherwise we count it as a bad result. Table 4 shows the accuracy improvement in finding similar questions.

Table 4 The question analysis evaluation

Method                                  Accuracy
Cosine similarity                       80.86%
Cosine similarity + question analysis   87.61%
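The evaluation protocol for the cosine-similarity baseline can be sketched as below. This is an assumption-laden illustration rather than the system's code: questions are bags of words, the database is a dictionary from a hypothetical question id to its tokens, and the single most similar entry is taken as the returned question.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieval_accuracy(originals, database, accepted_ids):
    """Share of input questions whose best match is one of their paraphrases.

    originals: list of token lists (the input questions)
    database: dict mapping question id -> token list
    accepted_ids: list of sets, the paraphrase ids for each input question
    """
    good = 0
    for tokens, accepted in zip(originals, accepted_ids):
        best = max(database,
                   key=lambda qid: cosine_similarity(tokens, database[qid]))
        good += best in accepted  # good result if a paraphrase is returned
    return good / len(originals)
```

The difference between the two rows of Table 4, 87.61 − 80.86 = 6.75 percentage points, is the improvement attributed to question analysis.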

In this paper, we described the question analysis module of our VnCQAs system, covering two subtasks: question classification and keyword identification. We classify questions into three types, Fact, Explanation and Solution, using support vector machines (SVMs) with a set of features: unigrams, bigrams, and similarity. Our classification accuracy is high even though we have to deal with noisy community data. Furthermore, on the Vietnamese TREC corpus, we obtain competitive accuracy. For the keyword identification subtask, we used a machine learning method with dependency-tree-based features and achieved an accuracy of 85.8%, which outperforms the TF-IDF method.

In the future, we will improve the size and quality of the data sets used in both subtasks. We will also examine other methods to further improve the accuracy of question analysis.


Acknowledgment. This work is partially supported by the Research Grant from Vietnam National University, Hanoi, No. QG.14.04.

References

[1] Paliwal, M., Kumar, U.A.: Neural networks and statistical techniques: A review of applications. Expert Systems with Applications 36(1), 2–17 (2009)

[2] Banerjee, S., Pedersen, T.: Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI 2003, pp. 805–810 (2003)

[3] Bendersky, M., Croft, W.B.: Discovering key concepts in verbose queries. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 491–498 (2008)

[4] Bu, F., Zhu, X., Hao, Y., Zhu, X.: Function-based question classification for general QA. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP 2010, pp. 1119–1128 (2010)

[5] Huang, Z., Thint, M., Qin, Z.: Question classification using head words and their hypernyms. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, pp. 927–936 (2008)

[6] Hui, Z., Liu, J., Ouyang, L.: Question classification based on an extended class sequential rule model. In: Proceedings of 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, pp. 938–946. Asian Federation of Natural Language Processing (November 2011)

[7] Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(4), 721–735 (2009)

[8] Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics, COLING 2002, vol. 1, pp. 1–7 (2002)

[9] Luhn, H.P.: A business intelligence system. IBM J. Res. Dev. 2(4), 314–319 (1958)

[10] Luo, X., Raghavan, H., Castelli, V., Maskey, S., Florian, R.: Finding what matters in questions. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 878–887. Association for Computational Linguistics (June 2013)

[11] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Vietnamese Question Answering System. In: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, KSE 2009, pp. 26–32 (2009)

[12] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Semantic Approach for Question Analysis. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds.) IEA/AIE 2012. LNCS, vol. 7345, pp. 156–165. Springer, Heidelberg (2012)

[13] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: Systematic Knowledge Acquisition for Question Analysis. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 406–412 (2011)

[14] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.-T., Le Nguyen, M.: From treebank conversion to automatic dependency parsing for Vietnamese. In: Métais, E., Roche, M., Teisseire, M. (eds.) NLDB 2014. LNCS, vol. 8455, pp. 196–207. Springer, Heidelberg (2014)

[15] Ponzetto, S.P., Strube, M.: Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research 30(1), 181–212 (2007)
