Ripple down rules for question analysis



VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY



NGUYEN QUOC DAT

RIPPLE DOWN RULES FOR QUESTION ANALYSIS

MASTER THESIS

Hanoi - 2011


Ripple Down Rules for Question Analysis

Nguyen Quoc Dat

Faculty of Information Technology University of Engineering and Technology Vietnam National University, Hanoi

Supervised by

Dr Pham Bao Son

A thesis submitted in fulfillment of the requirements

for the degree of Master of Science in Computer Science

August 2011


Table of Contents

2 Literature review

2.1 Question analysis in question answering systems

2.1.1 Question classification

2.1.2 Pattern-matching based analysis

2.1.3 Syntactic-based analysis

2.1.4 Semantic-based analysis

2.1.5 Annotation-based question analysis in question answering systems

2.2 GATE

2.2.1 Information Extraction in GATE

2.2.2 JAPE

2.3 Single Classification Ripple Down Rules

3 Our Question Answering System Architecture

3.1 Introduction

3.2 Preprocessing module

3.3 Syntactic analysis module

3.3.1 Noun phrases detection

3.3.2 Question-phrases detection

3.3.3 Relations detection

3.4 Semantic analysis module

3.5 Answer retrieval component

4 Systematic Knowledge Acquisition for Question Analysis

4.1 Recall Intermediate Representation of an input question

4.2 Rule language

4.3 Knowledge Acquisition Process

5.1 Question Analysis for Vietnamese

5.2 Question Analysis for English

A Definitions of question-class types

B Definitions of question-structures

C Intermediate Representation Elements of English questions

D Embedding Java code in JAPE


Ripple Down Rules for Question Analysis

Nguyen Quoc Dat

K16 Computer Science Master Course

Faculty of Information Technology

University of Engineering and Technology

Vietnam National University, Hanoi

datnq@vnu.edu.vn

Pham Bao Son

Faculty of Information Technology University of Engineering and Technology Vietnam National University, Hanoi

sonpb@vnu.edu.vn

Abstract

For the task of turning a natural language question into an explicit intermediate representation in question answering systems, all published work so far uses rule-based approaches, to the best of our knowledge. We believe this is because of the complexity of the representation and the variety of question types, and because there is no publicly available corpus of a decent size. In these rule-based approaches, the process of creating rules is not discussed. It is clear that manually creating rules in an ad-hoc manner is very expensive and error-prone. This thesis first describes, in detail, a method to convert Vietnamese natural language questions into intermediate representation elements over semantic annotations via grammar rules. More importantly, it proposes a language-independent approach to creating those rules manually, in a way that maintains consistency between rules and keeps the effort of creating a new rule independent of the size of the current rule set. Experimental results are promising and suggest that our language-independent approach is easy to adapt to a new domain and a new language.

Keywords

Question Answering System; Ripple Down Rules; Question Analysis

PUBLICATIONS

• Dat Quoc Nguyen, Dai Quoc Nguyen and Son Bao Pham. Systematic Knowledge Acquisition for Question Analysis. Proc. of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), pp. 406-412.

• Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham and Dang Duc Pham. Ripple Down Rules for Part-Of-Speech Tagging. Proc. of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2011), Springer-Verlag LNCS, part I, pp. 190-201.

• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Vietnamese question answering system. Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, pp. 26–32.


I INTRODUCTION

The rapid growth of online information accessible to human users requires more support from advanced information retrieval (IR) technologies to find the expected information. This brings new challenges in building IR systems, especially search engines and question answering systems. Most current search engines return ranked lists of related documents for each user query (in our case, a query is a question), and the user has to scan these documents to obtain the desired information. The goal of question answering systems, in contrast, is to give exact answers to the user’s questions by exploiting natural language processing, so that the user does not have to scan any document.

The natural language question analysis component is the first component in any question answering system. It creates an intermediate representation of the input question, which is expressed in natural language, to be used in the rest of the system. For the task of translating a natural language question into an explicit intermediate representation in question answering systems, all published work so far uses rule-based approaches, to the best of our knowledge. In existing rule-based approaches, because of the complexity of the representation and the variety of question structure types, manually creating the rules in an ad-hoc manner is very expensive and error-prone, taking a lot of time and effort. For example, many rule-based approaches, such as the approach to processing English questions described in AquaLog [1] and the one for handling Vietnamese questions presented in [2], manually define a list of sequence pattern structures to analyze questions. As rules are created in an ad-hoc manner, these approaches share a common difficulty in managing interaction among rules and keeping them consistent.

In this thesis, we first introduce a method to analyze Vietnamese natural language questions in the question analysis component. Natural language questions are transformed into intermediate representation elements, which include the structure of the question, the class of the question, the keywords in the question and the semantic constraints between them. The transformation proceeds through preprocessing, syntactic analysis and semantic analysis over semantic annotations, using JAPE grammar rules on the GATE framework [3].

More importantly, we focus on presenting a language-independent approach that uses the Ripple Down Rules [4][5][6] knowledge acquisition methodology to acquire rules in a systematic manner, where consistency between rules is maintained and unintended interaction among rules is avoided.

In section II, we review related work, and in section III we describe our overall system architecture. We present our knowledge acquisition approach for question analysis in section IV and describe our experiments in section V. Discussion and conclusion are presented in section VI.


II RELATED WORKS

A Question analysis in question answering systems

Early natural language interface to database (NLIDB) systems used pattern-matching techniques to process the user’s question and generate the corresponding answer [7]. A common technique for parsing input questions in NLIDB approaches is syntactic analysis, where a natural language question is directly mapped to a database query (such as SQL) through grammar rules. Nguyen and Le [8] introduced an NLIDB question answering system for Vietnamese employing semantic grammars. Their system includes two main modules: QTRAN and TGEN. QTRAN (Query Translator) maps a natural language question to an SQL query, while TGEN (Text Generator) generates answers based on the query result tables. QTRAN uses limited context-free grammars to parse the user’s question into a syntax tree via the CYK algorithm.

Recently, some question answering systems that use semantic annotations have achieved good results in natural language question analysis. A well-known annotation-based framework is GATE [3], which has been used in many question answering systems, especially for the natural language question analysis module, such as AquaLog [1], QuestIO [9] and the one presented in [2].

AquaLog is an ontology-based question answering system for English and is the basis for the development of our system. AquaLog takes a natural language question and an ontology as its input, and returns an answer to the user based on the semantic analysis of the question and the corresponding elements in the ontology. AquaLog’s architecture can be described as a waterfall model in which the Linguistic Component maps a natural language question to a set of representations based on an intermediate triple called a Query-Triple. The Relation Similarity Service then takes a Query-Triple and processes it to produce queries with respect to the input ontology, called Onto-Triples.

AquaLog performs semantic and syntactic analysis of the input question using processing resources provided by GATE [3], such as word segmentation, sentence segmentation and part-of-speech tagging. When a question is asked, the task of the Linguistic Component is to translate the natural language question into a Query-Triple with the format (generic term, relation, second term). Through the use of Java Annotation Patterns Engine (JAPE) grammars in GATE [3], AquaLog identifies terms and their relationships. The Relation Similarity Service uses Query-Triples to create Ontology-Query-Triples, where each term in the Query-Triples is matched with elements in the ontology.

In our work, we reported an approach to convert Vietnamese natural language questions into intermediate representation elements in the form of query-tuples (Question-structure, Question-class, Term1, Relation, Term2, Term3), based on semantic annotations via JAPE grammars [10]. The selected query-tuple type is more complex, aiming to cover a wider variety of question types in different languages. In addition, we proposed a language-independent approach to acquire JAPE rules in a systematic manner which avoids unintended interaction among rules [11]. Phan and Nguyen [2] presented an approach to syntactically and semantically map Vietnamese questions into triple-like structures of Subject, Verb and Object, also using JAPE grammars.

B Single Classification Ripple Down Rules

Ripple Down Rules (RDR) [4][5][6] were developed to allow users to incrementally add rules to an existing rule-based system while systematically controlling interactions between rules and ensuring consistency among existing rules.

A Single Classification Ripple Down Rules (SCRDR) [4][5][6] tree is a binary tree with two distinct types of edges, typically called except and if-not edges. Each node in the tree is associated with a rule. A rule has the form: if α then β, where α is called the condition and β is called the conclusion.

Cases in SCRDR are evaluated by passing a case (in our setting, a question to be classified) to the root of the tree. At any node N in the tree, if the condition of N’s rule is satisfied by the case, the case is passed on to the exception child of N via the except link, if that link exists. In contrast, if the condition of N’s rule is not satisfied by the case, the case is passed on to N’s if-not child. The conclusion given by this process is the conclusion of the last node in the tree that fired (that is, whose condition was satisfied by the case). To ensure that a conclusion is always given, the root node typically contains a trivial condition which is always satisfied. This node is called the default node.
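The evaluation procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the thesis: the names `Node` and `evaluate`, the string-based conditions and the conclusions in the small example tree are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One SCRDR node: a rule 'if condition then conclusion' plus two links."""
    condition: Callable[[str], bool]
    conclusion: str
    except_child: Optional["Node"] = None   # followed when the rule fires
    if_not_child: Optional["Node"] = None   # followed when the rule does not fire


def evaluate(root: Node, case: str) -> Optional[str]:
    """Return the conclusion of the last node on the path whose rule fired."""
    node, fired = root, None
    while node is not None:
        if node.condition(case):
            fired = node
            node = node.except_child
        else:
            node = node.if_not_child
    return fired.conclusion if fired is not None else None


# Hypothetical tree: a default node with two refinements below it.
root = Node(lambda q: True, "Unknown")                      # default node
who = Node(lambda q: q.startswith("who"), "Who")
who.except_child = Node(lambda q: " and " in q, "And")      # exception to the 'who' rule
who.if_not_child = Node(lambda q: q.startswith("list"), "List")
root.except_child = who

for question in ("who teaches nlp", "list all students", "how many students"):
    print(question, "->", evaluate(root, question))
```

Because the default node always fires, `evaluate` returns a conclusion for every case whenever the root carries a trivially true condition.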

A new node is added to an SCRDR tree when the evaluation process returns the wrong conclusion. The new node is attached to the last node in the evaluation path of the given case: with the except link if that last node is the fired rule, and with the if-not link otherwise.
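The attachment rule can be illustrated with a minimal Python sketch. It is written under assumptions: `Node`, `trace` and `add_rule` are hypothetical names, the conditions are plain string predicates, and the check that the new rule covers the misclassified case is an addition for safety rather than something stated in the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Node:
    condition: Callable[[str], bool]
    conclusion: str
    except_child: Optional["Node"] = None
    if_not_child: Optional["Node"] = None


def trace(root: Node, case: str) -> Tuple[Optional[Node], Node]:
    """Return (last fired node, last node visited) on the evaluation path."""
    node, fired, last = root, None, root
    while node is not None:
        last = node
        if node.condition(case):
            fired, node = node, node.except_child
        else:
            node = node.if_not_child
    return fired, last


def add_rule(root: Node, case: str,
             condition: Callable[[str], bool], conclusion: str) -> Node:
    """Attach a correcting rule for a case that was given the wrong conclusion."""
    if not condition(case):
        raise ValueError("the new rule must be satisfied by the case it corrects")
    fired, last = trace(root, case)
    new_node = Node(condition, conclusion)
    if last is fired:
        last.except_child = new_node    # last node fired: add an exception
    else:
        last.if_not_child = new_node    # last node did not fire: add an if-not branch
    return new_node


# Hypothetical knowledge acquisition session starting from the default node.
root = Node(lambda q: True, "Unknown")
add_rule(root, "who teaches nlp", lambda q: q.startswith("who"), "Who")
add_rule(root, "list all students", lambda q: q.startswith("list"), "List")
```

In this session the first rule becomes an exception to the default node, and the second is attached through the if-not link of the first, because the path for the second case ends at a node that did not fire.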

RDR-based approaches have been used to tackle NLP tasks such as POS tagging [12], text classification and information extraction [13].


III OUR QUESTION ANSWERING SYSTEM ARCHITECTURE

In this section, we introduce our system, the first ontology-based question answering system for Vietnamese, and focus on describing, in detail, the system’s front-end component, which performs syntactic and semantic analysis on natural language questions on the GATE framework.

The architecture of our question answering system is shown in figure 1. It includes two components: the natural language question analysis engine and the answer retrieval component.

The question analysis component consists of three modules: preprocessing, syntactic analysis and semantic analysis. It takes the user question as input and returns a query-tuple representing the question in a compact form. The role of this intermediate representation is to provide structured information about the input question for later processing, such as retrieving answers. The answer retrieval component includes two main modules: Ontology mapping and Answer extraction. It takes an intermediate representation produced by the question analysis component and an ontology as its input to generate semantic answers.

We wrapped existing linguistic processing modules for Vietnamese, such as Word Segmentation and Part-of-speech tagger [14], as GATE plug-ins. The results of these modules are annotations capturing information such as sentences, words, nouns and verbs. Each annotation has a set of feature-value pairs. For example, a word has a feature category storing its part-of-speech tag. This information can then be reused for further processing in subsequent modules. New modules are specifically designed to handle Vietnamese questions using JAPE grammars over existing linguistic annotations.
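To make the idea of annotations with feature-value pairs concrete, here is a small Python sketch. It does not use GATE's actual API: the `Annotation` class, the `annotate_words` helper, the sample phrase and the tag values "N" and "V" are assumptions chosen for illustration. Only the feature name category comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Annotation:
    """A typed span over the text carrying feature-value pairs."""
    type: str
    start: int
    end: int
    features: Dict[str, str] = field(default_factory=dict)


def annotate_words(text: str, tagged_words: List[Tuple[str, str]]) -> List[Annotation]:
    """Create one Word annotation per (word, tag) pair, in order of appearance."""
    annotations, position = [], 0
    for word, tag in tagged_words:
        start = text.index(word, position)
        end = start + len(word)
        annotations.append(Annotation("Word", start, end, {"category": tag}))
        position = end
    return annotations


# Hypothetical output of word segmentation and part-of-speech tagging.
text = "sinh viên học"
tagged = [("sinh viên", "N"), ("học", "V")]
words = annotate_words(text, tagged)

# A later module reuses the stored feature, here to select the nouns.
nouns = [text[a.start:a.end] for a in words if a.features["category"] == "N"]
print(nouns)
```

The point of the design is that later modules never re-run tagging; they only read the features attached to earlier annotations.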

A Intermediate representation element

AquaLog [1] performs semantic and syntactic analysis of the input English question using processing resources provided by GATE [3]. When a question is asked, the task of the question analysis component is to translate the natural language question into a Query-Triple with the format (generic term, relation, second term). Through the use of JAPE grammars in GATE, AquaLog identifies terms and their relationships. The intermediate representation used in our approach is more complex, aiming to cover a wider variety of question types. It consists of a question-structure and one or more query-tuples in the following format:

(question-structure, question-class, Term1, Relation, Term2, Term3)

where Term1 represents a concept (object class); Term2 and Term3, if they exist, represent entities (objects); and Relation (property) is a semantic constraint between terms in the question. This representation is meant to capture the semantics of the question.

Simple questions have only one query-tuple, and the question-structure of the question is that of its query-tuple. More complex questions, such as composite questions, have several sub-questions; each sub-question is represented by a separate query-tuple, and the question-structure captures this composition.

Figure 1. Architecture of our question answering system.

A composite question such as:

“danh sách tất cả các sinh viên của khoa công nghệ thông tin mà có quê quán ở Hà Nội?”

(“list all students in the Faculty of Information Technology whose hometown is Hanoi?”)

has a question-structure of type And with two query-tuples, where ? represents a missing element:

(UnknRel, List, sinh viên [student], ?, khoa công nghệ thông tin [Faculty of Information Technology], ?) and (Normal, List, sinh viên [student], có quê quán [has hometown], Hà Nội [Hanoi], ?)

This representation is chosen so that it can represent a richer set of question types; as a consequence, some terms or the relation in a tuple can be missing. We define the following question-structures: Normal, UnknTerm, UnknRel, Definition, Compare, ThreeTerm, Clause, Combine, And, Or, When, Where, Who, Many, ManyClass, List and Entity.
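As an illustration of this representation, the following Python sketch encodes the composite example as data. The class names `QueryTuple` and `QuestionAnalysis`, the field names and the `missing` helper are assumptions introduced here; only the tuple layout, the list of question-structures and the example values are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

QUESTION_STRUCTURES = {
    "Normal", "UnknTerm", "UnknRel", "Definition", "Compare", "ThreeTerm",
    "Clause", "Combine", "And", "Or", "When", "Where", "Who", "Many",
    "ManyClass", "List", "Entity",
}


@dataclass(frozen=True)
class QueryTuple:
    """(question-structure, question-class, Term1, Relation, Term2, Term3)."""
    question_structure: str
    question_class: str
    term1: Optional[str] = None      # concept (object class)
    relation: Optional[str] = None   # semantic constraint between terms
    term2: Optional[str] = None      # entity (object)
    term3: Optional[str] = None      # entity (object)

    def __post_init__(self) -> None:
        if self.question_structure not in QUESTION_STRUCTURES:
            raise ValueError(f"unknown question-structure: {self.question_structure}")

    def missing(self) -> List[str]:
        """Names of the elements written as '?' in the text."""
        names = ("term1", "relation", "term2", "term3")
        return [name for name in names if getattr(self, name) is None]


@dataclass(frozen=True)
class QuestionAnalysis:
    """A question-structure plus one query-tuple per sub-question."""
    question_structure: str
    tuples: Tuple[QueryTuple, ...]


# The composite example: structure And with two query-tuples.
analysis = QuestionAnalysis(
    "And",
    (
        QueryTuple("UnknRel", "List", "sinh viên", None, "khoa công nghệ thông tin", None),
        QueryTuple("Normal", "List", "sinh viên", "có quê quán", "Hà Nội", None),
    ),
)
for query_tuple in analysis.tuples:
    print(query_tuple.question_structure, "missing:", query_tuple.missing())
```

Representing missing elements as `None` keeps every tuple the same length, so later components can treat all question-structures uniformly.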

B Preprocessing module

The preprocessing module generates TokenVn annotations, each representing a Vietnamese word with features such as part-of-speech. Vietnamese is a monosyllabic language; hence, a word may contain more than one token.

However, the Vietnamese word segmentation module is not trained for the question domain.
