Ripple Down Rules for Question Analysis
Nguyen Quoc Dat, K16 Computer Science Master Course, Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN QUOC DAT
RIPPLE DOWN RULES FOR QUESTION ANALYSIS
MASTER THESIS
Hanoi - 2011
Ripple Down Rules for Question Analysis
Nguyen Quoc Dat
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Dr Pham Bao Son
A thesis submitted in fulfillment of the requirements
for the degree of Master of Science in Computer Science
August 2011
Table of Contents
2 Literature review
2.1 Question analysis in question answering systems
2.1.1 Question classification
2.1.2 Pattern-matching based analysis
2.1.3 Syntactic-based analysis
2.1.4 Semantic-based analysis
2.1.5 Annotation-based question analysis in question answering systems
2.2 GATE
2.2.1 Information Extraction in GATE
2.2.2 JAPE
2.3 Single Classification Ripple Down Rules
3 Our Question Answering System Architecture
3.1 Introduction
3.2 Preprocessing module
3.3 Syntactic analysis module
3.3.1 Noun phrases detection
3.3.2 Question-phrases detection
3.3.3 Relations detection
3.4 Semantic analysis module
3.5 Answer retrieval component
4 Systematic Knowledge Acquisition for Question Analysis
4.1 Recall Intermediate Representation of an input question
4.2 Rule language
4.3 Knowledge Acquisition Process
5.1 Question Analysis for Vietnamese
5.2 Question Analysis for English
A Definitions of question-class types
B Definitions of question-structures
C Intermediate Representation Elements of English questions
D Embedding Java code in JAPE
Ripple Down Rules for Question Analysis
Nguyen Quoc Dat
K16 Computer Science Master Course
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
datnq@vnu.edu.vn
Pham Bao Son
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
sonpb@vnu.edu.vn
Abstract
To the best of our knowledge, all published work on turning a natural language question into an explicit intermediate representation in question answering systems uses a rule-based approach. We believe this is because of the complexity of the representation and the variety of question types, and because no publicly available corpus of a decent size exists. Existing rule-based approaches do not discuss the process of creating rules, and it is clear that manually creating rules in an ad-hoc manner is very expensive and error-prone. This thesis first describes, in detail, a method to convert Vietnamese natural language questions into intermediate representation elements over semantic annotations via grammar rules. More importantly, it proposes a language-independent approach to creating those rules manually, in a way that maintains consistency between rules and keeps the effort of creating a new rule independent of the size of the current rule set. Experimental results are promising and indicate that our language-independent approach is easy to adapt to a new domain and a new language.
Keywords
Question Answering System; Ripple Down Rules; Question Analysis
PUBLICATIONS
• Dat Quoc Nguyen, Dai Quoc Nguyen and Son Bao Pham. Systematic Knowledge Acquisition for Question Analysis. Proc. of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), pp. 406-412.
• Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham and Dang Duc Pham. Ripple Down Rules for Part-Of-Speech Tagging. Proc. of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2011), Springer-Verlag LNCS, part I, pp. 190-201.
• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Vietnamese question answering system. Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, pp. 26–32.
I INTRODUCTION
The rapid growth of online information accessible to human users calls for more support from advanced information retrieval (IR) technologies to find the expected information. This brings new challenges for building IR systems, especially search engines and question answering systems. Most current search engines return a ranked list of related documents for each user query (in our case, a query is a question), and the user has to scan these documents to obtain the desired information. The goal of question answering systems, in contrast, is to exploit natural language processing to give exact answers to the user’s questions, so that the user does not have to scan any document.
The natural language question analysis component is the first component in any question answering system. This component creates an intermediate representation of the input question, which is expressed in natural language, to be used by the rest of the system. To the best of our knowledge, all published work on translating a natural language question into an explicit intermediate representation in question answering systems uses a rule-based approach. In existing rule-based approaches, because of the complexity of the representation and the variety of question structure types, manually creating rules in an ad-hoc manner takes a lot of time and effort and is error-prone. For example, many rule-based approaches, such as the approach to processing English questions described in Aqualog [1] and the one for handling Vietnamese questions presented in [2], manually define a list of sequence pattern structures to analyze questions. As rules are created in an ad-hoc manner, these approaches share a common difficulty in managing interaction among rules and keeping them consistent.
In this thesis, we first introduce a method to analyze Vietnamese questions in the natural language question analysis component. Natural language questions are transformed into intermediate representation elements, which include the structure of the question, the class of the question, the keywords in the question and the semantic constraints between them. The transformation proceeds through preprocessing, syntactic analysis and semantic analysis over semantic annotations, via JAPE grammar rules on the GATE framework [3].
More importantly, we present a language-independent approach that uses the Ripple Down Rules [4][5][6] knowledge acquisition methodology to acquire rules in a systematic manner, where consistency between rules is maintained while unintended interaction among rules is avoided.
In section II, we review related work. We describe our overall system architecture in section III and present our knowledge acquisition approach for question analysis in section IV. We describe our experiments in section V. Discussion and conclusion are presented in section VI.
II RELATED WORKS
A Question analysis in question answering systems
Early natural language interface to database (NLIDB) systems used pattern-matching techniques to process the user’s question and generate the corresponding answer [7]. A common technique for parsing input questions in NLIDB approaches is syntax analysis, where a natural language question is directly mapped to a database query (such as SQL) through grammar rules. Nguyen and Le [8] introduced an NLIDB question answering system for Vietnamese that employs semantic grammars. Their system includes two main modules: QTRAN and TGEN. QTRAN (Query Translator) maps a natural language question to an SQL query, while TGEN (Text Generator) generates answers based on the query result tables. QTRAN uses limited context-free grammars to parse the user’s question into a syntax tree via the CYK algorithm.
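The CYK step mentioned here can be illustrated with a minimal recognizer. The sketch below is not QTRAN's implementation: the toy grammar, the English tokens and the function name cyk_recognize are assumptions made for illustration, the grammar is assumed to be in Chomsky normal form, and the code only decides whether a token sequence is derivable rather than building a syntax tree.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy grammar in Chomsky normal form (not taken from QTRAN):
#   S  -> V NP
#   NP -> Det N
LEXICAL: Dict[str, Set[str]] = {
    "list": {"V"},
    "all": {"Det"},
    "students": {"N"},
}
BINARY: Dict[Tuple[str, str], Set[str]] = {
    ("V", "NP"): {"S"},
    ("Det", "N"): {"NP"},
}


def cyk_recognize(
    tokens: List[str],
    lexical: Dict[str, Set[str]],
    binary: Dict[Tuple[str, str], Set[str]],
    start: str = "S",
) -> bool:
    """Return True if `tokens` can be derived from `start`."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][k] holds the nonterminals that derive tokens[i : i + k + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][0] = set(lexical.get(token, ()))
    for span in range(2, n + 1):              # length of the covered span
        for i in range(n - span + 1):         # start position of the span
            for split in range(1, span):      # length of the left part
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for (b, c), heads in binary.items():
                    if b in left and c in right:
                        table[i][span - 1] |= heads
    return start in table[0][n - 1]


accepted = cyk_recognize(["list", "all", "students"], LEXICAL, BINARY)
rejected = cyk_recognize(["all", "list", "students"], LEXICAL, BINARY)
```

A full parser would additionally store back-pointers in each table cell so that the syntax tree can be read off after the table is filled.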
Recently, some question answering systems that use semantic annotations have achieved good results in natural language question analysis. A well-known annotation-based framework is GATE [3], which has been used in many question answering systems, especially for the natural language question analysis module, such as Aqualog [1], QuestIO [9] and the one presented in [2].
Aqualog is an ontology-based question answering system for English and is the basis for the development of our system. Aqualog takes a natural language question and an ontology as its input, and returns an answer to the user based on the semantic analysis of the question and the corresponding elements in the ontology. Aqualog’s architecture can be described as a waterfall model in which the Linguistic Component maps a natural language question to an intermediate triple-based representation called a Query-Triple. The Relation Similarity Service then takes a Query-Triple and processes it into queries expressed with respect to the input ontology, called Onto-Triples.
Aqualog performs semantic and syntactic analysis of the input question using processing resources provided by GATE [3], such as word segmentation, sentence segmentation and part-of-speech tagging. When a question is asked, the task of the Linguistic Component is to transform the natural language question into a Query-Triple with the format (generic term, relation, second term). Through the use of Java Annotation Patterns Engine (JAPE) grammars in GATE [3], AquaLog identifies terms and their relationship. The Relation Similarity Service uses Query-Triples to create Ontology-Query-Triples, where each term in the Query-Triples is matched with elements in the ontology.
In our own work, we reported an approach to convert Vietnamese natural language questions into intermediate representation elements in the form of query-tuples (Question-structure, Question-class, Term1, Relation, Term2, Term3) based on semantic annotations via JAPE grammars [10]. The selected query-tuple type is more complex, aiming to cover a wider variety of question types in different languages. In addition, we proposed a language-independent approach to acquire JAPE rules in a systematic manner, which avoids unintended interaction among rules [11]. Phan and Nguyen [2] presented an approach to syntactically and semantically map Vietnamese questions into triple-like structures of Subject, Verb and Object, also using JAPE grammars.
B Single Classification Ripple Down Rules
Ripple Down Rules (RDR) [4][5][6] were developed to allow users to incrementally add rules to an existing rule-based system while systematically controlling interactions between rules and ensuring consistency among existing rules.
A Single Classification Ripple Down Rules (SCRDR) [4][5][6] tree is a binary tree with two distinct types of edges, typically called except and if-not edges. Associated with each node in the tree is a rule. A rule has the form: if α then β, where α is called the condition and β is called the conclusion.
Cases in SCRDR are evaluated by passing a case (in our setting, a question to be classified) to the root of the tree. At any node N in the tree, if the condition of N’s rule is satisfied by the case, the case is passed on to the exception child of N using the except link, if that link exists. In contrast, if the condition of N’s rule is not satisfied by the case, the case is passed on to N’s if-not child. The conclusion given by this process is the conclusion of the last node in the RDR tree that fired (that is, whose condition was satisfied by the case). To ensure that a conclusion is always given, the root node typically contains a trivial condition that is always satisfied. This node is called the default node.
A new node is added to an SCRDR tree when the evaluation process returns the wrong conclusion. The new node is attached to the last node in the evaluation path of the given case with the except link if that last node is the fired rule. Otherwise, it is attached with the if-not link.
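The evaluation and node-addition procedures described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names Node, evaluate and add_rule, the conclusion labels and the string-prefix conditions are all hypothetical and stand in for the grammar-based conditions used in practice.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Case = str  # assumption: a case is simply the question text


@dataclass
class Node:
    condition: Callable[[Case], bool]
    conclusion: str
    except_child: Optional["Node"] = None   # followed when the rule fires
    if_not_child: Optional["Node"] = None   # followed when the rule does not fire


def evaluate(root: Node, case: Case) -> Tuple[Optional[str], Optional[Node], Node]:
    """Return (conclusion, last fired node, last node on the evaluation path)."""
    node: Optional[Node] = root
    last_fired: Optional[Node] = None
    last_visited: Node = root
    while node is not None:
        last_visited = node
        if node.condition(case):
            last_fired = node
            node = node.except_child
        else:
            node = node.if_not_child
    conclusion = last_fired.conclusion if last_fired is not None else None
    return conclusion, last_fired, last_visited


def add_rule(root: Node, case: Case,
             condition: Callable[[Case], bool], conclusion: str) -> Node:
    """Attach a corrective rule for a case that was given the wrong conclusion."""
    _, last_fired, last_visited = evaluate(root, case)
    new_node = Node(condition, conclusion)
    if last_visited is last_fired:
        last_visited.except_child = new_node    # last node fired: except link
    else:
        last_visited.if_not_child = new_node    # last node did not fire: if-not link
    return new_node


# Default node: trivial condition that is always satisfied.
root = Node(lambda q: True, "DEFAULT")
add_rule(root, "who is the dean", lambda q: q.startswith("who"), "PERSON")
add_rule(root, "where is the library", lambda q: q.startswith("where"), "PLACE")
add_rule(root, "how many students are there", lambda q: q.startswith("how many"), "COUNT")
add_rule(root, "how many classes are there", lambda q: "classes" in q, "COUNT_OF_CLASSES")
```

In this toy tree the last rule becomes an exception of the COUNT rule, so it is only ever consulted for cases that already satisfy the COUNT condition. That locality is what keeps a new rule from interacting with unrelated parts of the rule set.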
RDR-based approaches have been used to tackle NLP tasks such as POS tagging [12], text classification and information extraction [13].
III OUR QUESTION ANSWERING SYSTEM ARCHITECTURE
In this section, we introduce our ontology-based question answering system for Vietnamese, the first of its kind, and describe in detail the system’s front-end component, which performs syntactic and semantic analysis of natural language questions on the GATE framework.
The architecture of our question answering system is shown in figure 1. It includes two components: the natural language question analysis engine and the answer retrieval component.
The question analysis component consists of three modules: preprocessing, syntactic analysis and semantic analysis. It takes the user question as input and returns a query-tuple representing the question in a compact form. The role of this intermediate representation is to provide structured information about the input question for later processing, such as retrieving answers. The answer retrieval component includes two main modules: Ontology mapping and Answer extraction. It takes an intermediate representation produced by the question analysis component and an ontology as its input to generate semantic answers.
We wrapped existing linguistic processing modules for Vietnamese, such as word segmentation and the part-of-speech tagger [14], as GATE plug-ins. The results of these modules are annotations capturing information such as sentences, words, nouns and verbs. Each annotation has a set of feature-value pairs. For example, a word has a feature category storing its part-of-speech tag. This information can then be reused for further processing in subsequent modules. New modules are specifically designed to handle Vietnamese questions using JAPE grammars over existing linguistic annotations.
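To make the idea of annotations with feature-value pairs concrete, the following is a small sketch of the data model. It does not use the GATE API: the Annotation class, the helper functions, the annotation type name "Word" and the tag values "N", "V" and "Np" are assumptions for illustration. Only the feature name category comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """A typed span of text carrying feature-value pairs."""
    ann_type: str
    start: int
    end: int
    features: Dict[str, str] = field(default_factory=dict)


def annotate(text: str, phrase: str, ann_type: str, **features: str) -> Annotation:
    """Annotate the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Annotation(ann_type, start, start + len(phrase), dict(features))


def select(annotations: List[Annotation], ann_type: str, **features: str) -> List[Annotation]:
    """Return annotations of the given type whose features match all given values."""
    return [
        a for a in annotations
        if a.ann_type == ann_type
        and all(a.features.get(name) == value for name, value in features.items())
    ]


text = "sinh viên có quê quán ở Hà Nội"
annotations = [
    annotate(text, "sinh viên", "Word", category="N"),   # hypothetical tag values
    annotate(text, "có", "Word", category="V"),
    annotate(text, "Hà Nội", "Word", category="Np"),
]

# A later module can reuse the stored part-of-speech tags.
nouns = select(annotations, "Word", category="N")
noun_texts = [text[a.start:a.end] for a in nouns]
```

Because each module only adds annotations over the same text, later modules can match on features such as category without re-running the earlier processing.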
A Intermediate representation element
Aqualog [1] performs semantic and syntactic analysis of the input English question using processing resources provided by GATE [3]. When a question is asked, the task of the question analysis component is to transform the natural language question into a Query-Triple with the format (generic term, relation, second term). Through the use of JAPE grammars in GATE, AquaLog identifies terms and their relationship. The intermediate representation used in our approach is more complex, aiming to cover a wider variety of question types. It consists of a question-structure and one or more query-tuples in the following format:
(question-structure, question-class, Term1, Relation, Term2, Term3)
where Term1 represents a concept (object class), Term2 and Term3, if they exist, represent entities (objects), and Relation (property) is a semantic constraint between terms in the question. This representation is meant to capture the semantics of the question.
Simple questions have only one query-tuple, and their question-structure is the query-tuple’s question-structure. More complex questions, such as composite questions, have several sub-questions; each sub-question is represented by a separate query-tuple, and the question-structure captures this composition attribute.
Figure 1. Architecture of our question answering system.
Composite questions such as:
“danh sách tất cả các sinh viên của khoa công nghệ thông tin mà có quê quán ở Hà Nội?”
(“list all students in the Faculty of Information Technology whose hometown is Hanoi?”)
have a question-structure of type And with two query-tuples, where ? represents a missing element:
( UnknRel , List , sinh viên (student) , ? , khoa công nghệ thông tin (Faculty of Information Technology) , ? )
and
( Normal , List , sinh viên (student) , có quê quán (has hometown) , Hà Nội (Hanoi) , ? )
This representation is chosen so that it can represent a richer set of question types. Therefore, some terms or the relation in a tuple can be missing. We define the following question structures: Normal, UnknTerm, UnknRel, Definition, Compare, ThreeTerm, Clause, Combine, And, Or, When, Where, Who, Many, ManyClass, List and Entity.
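The intermediate representation can be pictured as a small data structure. The sketch below is one possible encoding, assuming Python dataclasses: the class names QueryTuple and IntermediateRepresentation and the render method are hypothetical, while the field order, the question-structure names and the example values are taken from the text above. A missing element is stored as None and printed as ?.

```python
from dataclasses import dataclass
from typing import List, Optional

# Question-structure names as listed in this section.
QUESTION_STRUCTURES = frozenset({
    "Normal", "UnknTerm", "UnknRel", "Definition", "Compare", "ThreeTerm",
    "Clause", "Combine", "And", "Or", "When", "Where", "Who", "Many",
    "ManyClass", "List", "Entity",
})


@dataclass(frozen=True)
class QueryTuple:
    question_structure: str
    question_class: str
    term1: Optional[str] = None      # concept (object class)
    relation: Optional[str] = None   # semantic constraint between terms
    term2: Optional[str] = None      # entity (object), if present
    term3: Optional[str] = None      # entity (object), if present

    def __post_init__(self) -> None:
        if self.question_structure not in QUESTION_STRUCTURES:
            raise ValueError(f"unknown question-structure: {self.question_structure}")

    def render(self) -> str:
        """Print the tuple in the notation of the text, with ? for missing elements."""
        parts = [self.question_structure, self.question_class,
                 self.term1, self.relation, self.term2, self.term3]
        return "(" + ", ".join("?" if p is None else p for p in parts) + ")"


@dataclass
class IntermediateRepresentation:
    question_structure: str          # structure of the whole question
    query_tuples: List[QueryTuple]   # one tuple per (sub-)question


# The composite example from the text: structure And with two query-tuples.
example = IntermediateRepresentation(
    "And",
    [
        QueryTuple("UnknRel", "List", term1="sinh viên",
                   term2="khoa công nghệ thông tin"),
        QueryTuple("Normal", "List", term1="sinh viên",
                   relation="có quê quán", term2="Hà Nội"),
    ],
)
rendered = [t.render() for t in example.query_tuples]
```

Keeping every field optional except the structure and class mirrors the remark that some terms or the relation in a tuple can be missing.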
B Preprocessing module
The preprocessing module generates TokenVn annotations, each representing a Vietnamese word with features such as its part-of-speech. Vietnamese is a monosyllabic language; hence, a word may contain more than one token.
However, the Vietnamese word segmentation module is not trained for the question domain.