Question Analysis for a Community Based Vietnamese Question Answering System
Quan Hung Tran, Minh Le Nguyen, and Son Bao Pham
Abstract. This paper describes our approach to analyzing questions in our community-based Vietnamese question answering system (VnCQAs), in which we focus on two subtasks: question classification and keyword identification. The question classification employs machine learning with a feature that represents a measure of similarity between two questions, while the keyword identification uses dependency-tree-based features. Experimental results are promising: the question classification obtains an accuracy of 95.7% and the keyword identification an accuracy of 85.8%. Furthermore, these two subtasks improve the accuracy of finding similar questions in our VnCQAs by 6.75%.
Quan Hung Tran · Son Bao Pham
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
e-mail: {quanth_55,sonpb}@vnu.edu.vn
Minh Le Nguyen
School of Information Science
Japan Advanced Institute of Science and Technology
e-mail: nguyenml@jaist.ac.jp
© Springer International Publishing Switzerland 2015, p. 641
V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering,
Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_51
Question answering systems usually have a module for analyzing questions in order to extract important information such as keywords, question types or semantic constraints. In this research, we focus on two subtasks of question analysis: question classification and keyword identification. Identifying important words from a set of documents is an important task in information retrieval and question answering, with two main approaches: using corpus-based statistics for term weighting [7, 9] and employing supervised methods [3, 18, 10]. Question classification aims to classify questions into several pre-defined classes in order to seek suitable answers. Li et al. [8] proposed a two-layer taxonomy with 6 coarse classes and 50 fine-grained classes, while Bu et al. [4] introduced a six-type taxonomy. Furthermore, regarding the methods for classifying questions, some studies employed rule-based approaches [6] and machine learning algorithms [4, 1], while others combined rule-based and machine-learning-based techniques [5].
Recently, some question analysis techniques have been examined for Vietnamese [11, 13, 12, 16]. However, these studies experimented on a standard corpus, where the words' spellings are generally good. The question analysis for our VnCQAs system, on the other hand, has to deal with noisy data from community-based resources. In this paper, we propose dependency-tree-based features for finding keywords. We also introduce a new feature called the “similarity feature” for classifying questions. To the best of our knowledge, this is the first time question analysis has been adapted to Vietnamese community data.
The paper is organized as follows: in section 2, we briefly describe the architecture of our VnCQAs system. Section 3 presents the overview of our approach for analyzing questions, while the question classification and the keyword identification are introduced in sections 4 and 5, respectively. Section 6 gives the experimental results, and the conclusion is given in section 7.
The architecture of our question answering system [17] is shown in Figure 1. It includes three modules: Database Construction, Question Analysis and Answer Selection. The database construction module builds the database of question-answer pairs, while the question analysis module extracts useful information such as keywords, question types and synonyms. The answer selection module finds the questions in the database that are most similar to the input question; each similar question corresponds to a candidate answer. The candidate answers are then processed to output the best answer.
In this paper, we focus on analyzing questions in the question analysis module with two main tasks: Question Classification and Keyword Identification. Furthermore, in this module, we also use a dictionary of 6626 entities to find the synonyms in the question. Figure 2 shows an example for the question analysis module with the question: “Làm thế nào để tạo vùng nhớ ảo thay thế RAM” (How to create virtual memory to replace RAM).
3.1 Question Types
We classify questions into three types, Fact, Explanation and Solution, according to the main purpose of the questioner.
Fig. 1 The system architecture
Fig. 2 An example for the question analysis module
• Fact: The question is only about objects and resources; the expected answer is about general facts and/or attributes. E.g., for the question “Tấm dán màn hình từ tính là gì?” (What is a magnetic screen sticker?), the object is “Tấm dán màn hình từ tính” (the magnetic screen sticker), and the expected attribute in the answer is its definition.
• Explanation: The question requires explanations or opinions, e.g., “Vì sao điện thoại của mình hay bị mất sóng?” (Why does my phone frequently lose signal?).
• Solution: The question asks for the solution to a problem, e.g., “Chỉ cho em cách vào facebook trên Iphone?” (How to access Facebook from iPhone?).
3.2 Methodology
In the VnCQAs system, we use support vector machines (SVMs) for learning the classification (as shown in Figure 3) with a set of features that includes unigrams and bigrams. The unigram and bigram features are common in natural language processing tasks. The value of each unigram/bigram feature is a boolean indicating whether that unigram/bigram occurs in the question or not.
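The boolean unigram/bigram encoding described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names and the toy questions are hypothetical, and the questions are assumed to be already split into tokens (Vietnamese word segmentation is not handled here).

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(questions):
    """Collect every unigram and bigram seen in the training questions."""
    vocab = set()
    for tokens in questions:
        vocab.update(ngrams(tokens, 1))
        vocab.update(ngrams(tokens, 2))
    return sorted(vocab)


def boolean_features(tokens, vocab):
    """One 0/1 value per vocabulary n-gram: 1 if it occurs in the question."""
    present = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
    return [1 if gram in present else 0 for gram in vocab]


# Hypothetical, pre-tokenized toy questions (assumption for illustration).
train_questions = [["làm", "thế", "nào"], ["vì", "sao"]]
vocab = build_vocabulary(train_questions)
vector = boolean_features(["vì", "sao", "thế"], vocab)
```

Each question becomes a fixed-length 0/1 vector over the training vocabulary, which is the form a linear SVM expects; n-grams that never occurred in training are simply ignored.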
Furthermore, another feature we use for training the SVM model is the similarity feature. Its value represents a measure of similarity between two questions and is estimated using the phrasal overlap [2, 15]:

$$\mathrm{overlap}_{phrase}(s_1, s_2) = \sum_{i=1}^{n} \sum_{m} i^2$$

for m phrasal i-word overlaps, where m is the number of i-word phrases appearing in both sentences of the pair. The overlap is then normalized by the sentence lengths:

$$\mathrm{sim}_{overlap,phrase}(s_1, s_2) = \tanh\left(\frac{\mathrm{overlap}_{phrase}(s_1, s_2)}{|s_1| + |s_2|}\right)$$

Fig. 3 The question classification
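As a rough illustration of the phrasal overlap measure, the sketch below counts the distinct i-word phrases shared by two tokenized questions and weights each by i squared. It is a simplification under stated assumptions: every shared i-gram is counted, including sub-phrases of longer shared phrases, whereas the measure in [2] extracts the longest overlaps first. The function names are hypothetical.

```python
import math


def ngram_set(tokens, n):
    """Set of distinct n-word phrases in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrasal_overlap(s1, s2):
    """Sum of i^2 over every distinct i-word phrase shared by s1 and s2.

    Simplifying assumption: sub-phrases of a longer shared phrase are
    counted as well, so this is only an approximation of the measure.
    """
    total = 0
    for i in range(1, min(len(s1), len(s2)) + 1):
        shared = ngram_set(s1, i) & ngram_set(s2, i)
        total += len(shared) * i * i
    return total


def phrasal_similarity(s1, s2):
    """tanh of the overlap normalized by the summed sentence lengths."""
    if not s1 and not s2:
        return 0.0
    return math.tanh(phrasal_overlap(s1, s2) / (len(s1) + len(s2)))


q1 = ["a", "b", "c"]
q2 = ["a", "b", "d"]
overlap = phrasal_overlap(q1, q2)   # shares "a", "b" and the phrase "a b"
similarity = phrasal_similarity(q1, q2)
```

Squaring the phrase length rewards long shared phrases much more than the same number of isolated shared words, and the tanh keeps the feature value in the range [0, 1).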
This section describes the keyword identification, which uses a machine learning technique with dependency-tree-based features.
4.1 Keyword Definition
We define the keywords as:
• The most informative words in the question: the set of keywords contains most of the information (e.g., topics, main objects and actions).
• The words that can be used to distinguish different questions: two questions that have the same set of keywords are likely to be similar.
E.g., the question “Hỏi cách xóa tin nhắn trên iphone?” (How to delete messages on iPhone?) has the keywords “cách” (how), “xóa” (delete), “tin nhắn” (messages) and “iphone”. The main verb “hỏi” (ask) is not considered a keyword because it does not carry any important information and cannot be used to distinguish among questions.
4.2 Methodology
We use the dependency tree to find keywords on the premise that, to decide whether a word is informative, we should take into account its relationships with the other words in the question. The tree is produced by the Vietnamese dependency parser [14]. For each question, the parser creates a tree that contains the tree structure, the relations between the words and other information such as part-of-speech tags (as shown in Figure 4 for the question “How do I delete messages in iPhone?”).
Fig. 4 An example of the dependency tree
The features are then extracted from the tree and used for training the SVM model, which classifies each word as a keyword or not (as presented in Figure 5). These features are grouped as follows:
• The part of speech (POS) tag of each word: POSW
• The part of speech (POS) tag of the parent word of each word in the dependency tree: POSP
• Unigrams
• The common dependent words: CDW
The POSW feature is used because words with certain POS tags (e.g., Noun, Verb, and Adjective) are more likely to be keywords of a sentence. The POSP feature helps to capture that words that are children of verbs are more likely to be the object of the question. The CDW feature is employed because the children of certain words (e.g., “cách” (solution)) have a high chance of being keywords. In the implementation of the CDW feature, we use a map that stores each word together with an integer counting the number of times that word is the parent of a keyword. A manually chosen threshold is then used to identify which words are frequently the parents of keywords.
Fig. 5 Dependency tree method work flow
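The map-and-threshold idea behind the CDW feature could look like the sketch below. Everything concrete in it is an assumption made for illustration: the function names, the `(word, parent, is_keyword)` record layout, the toy dependency relations and the threshold value are hypothetical and are not taken from the system.

```python
from collections import Counter


def count_keyword_parents(tagged_words):
    """Count how often each word is the dependency parent of a keyword.

    tagged_words: iterable of (word, parent_word, is_keyword) records,
    where parent_word is None for the root of the tree.
    """
    counts = Counter()
    for _word, parent, is_keyword in tagged_words:
        if is_keyword and parent is not None:
            counts[parent] += 1
    return counts


def common_parents(counts, threshold):
    """Words that are the parent of a keyword at least `threshold` times."""
    return {word for word, count in counts.items() if count >= threshold}


def cdw_feature(parent, parents):
    """1 if the word's parent is a common parent of keywords, else 0."""
    return 1 if parent in parents else 0


# Hypothetical training records; the parent links are made up.
records = [
    ("hỏi", None, False),
    ("cách", "hỏi", True),
    ("xóa", "cách", True),
    ("tin nhắn", "xóa", True),
    ("trên", "xóa", False),
    ("iphone", "trên", True),
    ("đổi", "cách", True),
]
parent_counts = count_keyword_parents(records)
frequent = common_parents(parent_counts, threshold=2)
```

With these toy records only “cách” reaches the threshold, so a word whose parent is “cách” would receive a CDW value of 1.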
5.1 Question Classification Evaluation
We use a set of 1013 manually tagged questions in the technology domain with the three types mentioned above: Fact, Explanation, and Solution. The tagged questions come from the database construction module of our VnCQAs system. These questions are kept in their original form; no modifications are made. However, some of the questions from online forums are not understandable because they lack the information or context needed to be understood. These questions are removed from the data set. The question distribution is shown in Figure 6.
To learn the SVM model, we use the LIBLinear and SVM-SMO algorithms with a 10-fold cross-validation scheme. The experiments were conducted on a Windows PC with a Core i7 CPU and 8 GB of RAM. The highest accuracy (95.7%) is achieved with the combination of phrasal overlap, unigram and bigram features (as shown in Table 1).
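For readers unfamiliar with the protocol, a 10-fold cross-validation loop can be sketched in plain Python as below. This is a generic illustration, not the toolkit used in the experiments: the fold assignment, the random seed and the `train_fn`/`predict_fn` callables are assumptions, and a trivial majority-class model stands in for the SVM.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, folds[held_out]


def cross_validated_accuracy(items, labels, train_fn, predict_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, pool the results."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, items[i]) == labels[i]
                       for i in test_idx)
    return correct / len(items)


# Stand-in classifier (assumption): always predicts the majority label.
def train_majority(_items, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, _item):
    return model


splits = list(k_fold_indices(1013, k=10))
toy_accuracy = cross_validated_accuracy(
    list(range(20)), ["Solution"] * 20, train_majority, predict_majority)
```

With 1013 questions, each fold holds out 101 or 102 questions, and every question is used for testing exactly once.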
Although the accuracy of the question classification is around 95.7%, our corpus is different from the corpora of other published works, so it is hard to directly compare our method with other available methods for the question classification task. To make a meaningful comparison, we also evaluate our method on the TREC corpus in Vietnamese [16]. Our accuracy is comparable to that of the approach of Tran et al. [16] under the same 10-fold cross-validation scheme (as shown in Table 2).
Fig. 6 The distribution of questions in the tagged data set
Table 1 The question classification accuracy on the community data
Features    Accuracy (SVM - LIBLinear)    Accuracy (SVM-SMO)
Table 2 The question classification accuracy on the TREC data
Classes                                Our method    Tran's method
Coarse classes classification          85.0 (%)      86.0 (%)
Fine-grained classes classification    84.9 (%)      84.7 (%)
5.2 Keyword Identification Evaluation
To test the performance of the keyword identification, we use a set of 753 words tagged from the sentences in our database (as shown in Figure 7 for the question “How do I delete messages in iPhone?”).
To make a comparison, we implement a baseline for identifying the keywords using the term frequency - inverse document frequency (TF-IDF) method. The TF-IDF score of each word in a sentence is calculated, and a threshold is chosen to decide whether a word is a keyword or not. The accuracy results of the TF-IDF method are presented in Figure 8.
Fig. 7 An example of keywords
Fig. 8 The TF-IDF method's accuracy
Our method outperforms the TF-IDF method, as can be seen from the accuracies in Table 3.
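A thresholded TF-IDF baseline of the kind described above might be sketched as follows. The exact TF-IDF variant and threshold used in the experiments are not stated, so the weighting here (relative term frequency times the natural log of inverse document frequency), the function names, the toy corpus and the threshold value are all assumptions.

```python
import math
from collections import Counter


def idf_table(documents):
    """Inverse document frequency log(N / df) for every word in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_keywords(tokens, idf, threshold):
    """Words of one question whose TF-IDF score reaches the threshold.

    Words missing from the IDF table get a score of 0 in this sketch.
    """
    tf = Counter(tokens)
    scores = {word: (count / len(tokens)) * idf.get(word, 0.0)
              for word, count in tf.items()}
    return {word for word, score in scores.items() if score >= threshold}


# Hypothetical, pre-tokenized toy corpus (assumption for illustration).
corpus = [
    ["hỏi", "cách", "xóa", "tin_nhắn"],
    ["hỏi", "cách", "cài", "facebook"],
    ["hỏi", "vì_sao", "mất", "sóng"],
]
idf = idf_table(corpus)
keywords = tfidf_keywords(corpus[0], idf, threshold=0.2)
```

In this toy corpus “hỏi” appears in every question, so its IDF is zero and it can never pass the threshold, which matches the intuition that it does not help distinguish questions. Unlike the dependency-tree features, the score ignores how a word relates to the other words in the question.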
5.3 Question Analysis Evaluation
Table 3 The accuracy of the keyword identification
Features    Accuracy (SVM - LIBLinear)    Accuracy (SVM-SMO)
In this section, we evaluate the contribution of the question analysis to our VnCQAs system by measuring the improvement in finding similar questions. To evaluate the ability of the VnCQAs system to find similar questions, we use a set of 1704 questions, which are checked by hand to ensure that no question is similar to any other. Then we paraphrase each question into 3 different versions. The paraphrased questions must be close in meaning to the original questions.
We use the 1704 original questions as the input questions for testing our system's performance: if the returned question is one of the 3 paraphrased questions, we count it as a good result; otherwise we count it as a bad result. Table 4 shows the accuracy improvement in finding similar questions.
Table 4 The question analysis evaluation
Cosine similarity 80.86 (%)
Cosine Similarity + Question Analysis 87.61 (%)
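The evaluation protocol above can be illustrated with a small sketch: each original question is matched against a pool of paraphrases by bag-of-words cosine similarity, and a hit is counted when the top match is one of its own paraphrases. The data layout, the function names and the toy questions are assumptions; how the real system combines cosine similarity with the question analysis output is not shown here.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def retrieval_accuracy(originals, database, paraphrase_of):
    """Fraction of originals whose best match is one of their paraphrases.

    originals: {question_id: tokens}; database: {paraphrase_id: tokens};
    paraphrase_of: {paraphrase_id: question_id}.
    """
    hits = 0
    for qid, tokens in originals.items():
        best = max(database, key=lambda pid: cosine(tokens, database[pid]))
        hits += paraphrase_of[best] == qid
    return hits / len(originals)


# Hypothetical toy data (assumption for illustration).
originals = {"q1": ["xóa", "tin", "nhắn", "iphone"],
             "q2": ["vào", "facebook", "iphone"]}
database = {"p1": ["cách", "xóa", "tin", "nhắn"],
            "p2": ["truy", "cập", "facebook"]}
paraphrase_of = {"p1": "q1", "p2": "q2"}
accuracy = retrieval_accuracy(originals, database, paraphrase_of)

# Improvement reported in Table 4, in percentage points.
improvement = round(87.61 - 80.86, 2)
```

The difference between the two rows of Table 4 is 87.61 − 80.86 = 6.75 percentage points, which is the improvement quoted in the abstract.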
In this paper, we described the question analysis module of our VnCQAs system with its two subtasks: question classification and keyword identification. We classify questions into three types, Fact, Explanation and Solution, using support vector machines (SVMs) with a set of features: unigrams, bigrams, and similarity. Our classification accuracy is high even though we have to deal with noisy community data. Furthermore, on the Vietnamese TREC corpus, we obtain competitive accuracy. For the keyword identification subtask, we used a machine learning method with dependency-tree-based features and achieved an accuracy of 85.8%, which outperforms the TF-IDF method.
In the future, we will improve the size and quality of the data sets used in both subtasks. We will also examine other methods for further improving the accuracy of question analysis.
Acknowledgment. This work is partially supported by the Research Grant from Vietnam National University, Hanoi No. QG.14.04.
References
[1] Paliwal, M., Kumar, U.A.: Neural networks and statistical techniques: A review of applications. Expert Systems with Applications 36(1), 2–17 (2009)
[2] Banerjee, S., Pedersen, T.: Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI 2003, pp. 805–810 (2003)
[3] Bendersky, M., Croft, W.B.: Discovering key concepts in verbose queries. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 491–498 (2008)
[4] Bu, F., Zhu, X., Hao, Y., Zhu, X.: Function-based question classification for general qa. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP 2010, pp. 1119–1128 (2010)
[5] Huang, Z., Thint, M., Qin, Z.: Question classification using head words and their hypernyms. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, pp. 927–936 (2008)
[6] Hui, Z., Liu, J., Ouyang, L.: Question classification based on an extended class sequential rule model. In: Proceedings of 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, pp. 938–946. Asian Federation of Natural Language Processing (November 2011)
[7] Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(4), 721–735 (2009)
[8] Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics, COLING 2002, vol. 1, pp. 1–7 (2002)
[9] Luhn, H.P.: A business intelligence system. IBM J. Res. Dev. 2(4), 314–319 (1958)
[10] Luo, X., Raghavan, H., Castelli, V., Maskey, S., Florian, R.: Finding what matters in questions. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 878–887. Association for Computational Linguistics (June 2013)
[11] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Vietnamese Question Answering System. In: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, KSE 2009, pp. 26–32 (2009)
[12] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Semantic Approach for Question Analysis. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds.) IEA/AIE 2012. LNCS, vol. 7345, pp. 156–165. Springer, Heidelberg (2012)
[13] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: Systematic Knowledge Acquisition for Question Analysis. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 406–412 (2011)
[14] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.-T., Le Nguyen, M.: From treebank conversion to automatic dependency parsing for Vietnamese. In: Métais, E., Roche, M., Teisseire, M. (eds.) NLDB 2014. LNCS, vol. 8455, pp. 196–207. Springer, Heidelberg (2014)
[15] Ponzetto, S.P., Strube, M.: Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research 30(1), 181–212 (2007)