NGUYEN QUOC DAT
RIPPLE DOWN RULES FOR QUESTION ANALYSIS
MASTER THESIS
Nguyen Quoc Dat
Faculty of Information Technology University of Engineering and Technology Vietnam National University, Hanoi
Supervised by
Dr Pham Bao Son
A thesis submitted in fulfillment of the requirements
for the degree of
Master of Science in Computer Science
August 2011
‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’
Hanoi, August 23rd, 2011
Signed
For the task of turning a natural language question into an explicit intermediate representation of the complexity in question answering systems, all published works so far use rule-based approaches, to the best of our knowledge. We believe that this is because of the complexity of the representation and the variety of question types, and because there are no publicly available corpora of a decent size. In these rule-based approaches, the process of creating rules is not discussed. It is clear that manually creating the rules in an ad-hoc manner is very expensive and error-prone. This thesis first describes an ad-hoc method to convert Vietnamese natural language questions into intermediate representation elements over semantic annotations via grammar rules. More importantly, this thesis focuses on proposing a language-independent approach to the process of creating those rules manually, in a way that consistency between rules is maintained and the effort to create a new rule is independent of the size of the current rule set. Experimental results are promising and show that our language-independent approach is easy to adapt to a new domain and a new language.
Publications:
• Dat Quoc Nguyen, Dai Quoc Nguyen and Son Bao Pham. Systematic Knowledge Acquisition for Question Analysis. In Proc. of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011).
• Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham and Dang Duc Pham. Ripple Down Rules for Part-Of-Speech Tagging. In Proc. of 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2011), Springer-Verlag LNCS, part I, pp. 190-201.
• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Vietnamese question answering system. In Proc. of the 2009 International Conference on Knowledge and Systems Engineering (KSE 2009), IEEE CS, pp. 26–32.
First and foremost, I would like to express my deepest gratitude to my supervisor, Dr Pham Bao Son, for his patient guidance and continuous support throughout the years. He always appears when I need help, and responds to queries helpfully and promptly.
I would like to give my honest appreciation to my brother, Nguyen Quoc Dai, for his great support.
I would like to specially thank Prof Bui The Duy and my colleagues for their help during my time at the Human Machine Interaction Laboratory, UET/Coltech.
I would also like to thank my friend, Nguyen Le Trang, for her kind help.
I sincerely acknowledge the Vietnam National University, Hanoi, NAFOSTED Vietnam, the Toshiba Foundation Scholarship, and especially Dr Pham Bao Son for financially supporting my master's study.
Finally, this thesis would not have been possible without the support and love of my mother and my father. Thank you!
1 Introduction
2.1 Question analysis in question answering systems
2.1.1 Question classification
2.1.2 Pattern-matching based analysis
2.1.3 Syntactic-based analysis
2.1.4 Semantic-based analysis
2.1.5 Annotation-based question analysis in question answering systems
2.2 GATE
2.2.1 Information Extraction in GATE
2.2.2 JAPE
2.3 Single Classification Ripple Down Rules
3 Our Question Answering System Architecture
3.1 Introduction
3.2 Preprocessing module
3.3 Syntactic analysis module
3.3.1 Noun phrases detection
3.3.2 Question-phrases detection
3.3.3 Relations detection
3.4 Semantic analysis module
3.5 Answer retrieval component
4.1 Recall Intermediate Representation of an input question
4.2 Rule language
4.3 Knowledge Acquisition Process
2.1 Parse tree of question “which rock contains magnesium?”
2.2 The syntactic-semantic tree example
2.3 Aqualog’s architecture
2.4 GATE’s architecture
2.5 A set of Token annotations in GATE
3.1 Architecture of our question answering system
3.2 An example of intermediate representation element
3.3 An example of redefining the TokenVn annotation
3.4 NounPhrase annotations
3.5 QU-E-L-MC and QUTerm annotations
3.6 Relation between phrases
3.7 Relation annotations
3.8 Question structures
4.1 Question analyzer’s GUI
4.2 Question processing component to create the intermediate representation of question “trường đại học Công Nghệ có bao nhiêu sinh viên?” (“how many students are there in the College of Technology?”)
C.1 Question-structure of Definition
C.2 Question-structure of UnknTerm
C.3 Question-structure of UnknRel
C.4 Question-structure of Normal
C.5 Question-structure of Affirm
C.6 Question-structure of ThreeTerm
C.7 Question-structure of Affirm_3Term
C.8 Question-structure of And
C.9 Question-structure of And (2)
C.10 Question-structure of And (3)
C.11 Question-structure of And (4)
C.12 Question-structure of Or
C.13 Question-structure of Clause
C.14 Question-structure of Clause (2)
2.1 Countries table
3.1 JAPE grammar for identifying Vietnamese noun phrases
5.1 Number of exception rules in layers in our SCRDR KB
5.2 Number of rules corresponding with each question-structure type in the knowledge base for Vietnamese
5.3 Number of correctly analyzed questions
5.4 Error results
5.5 Number of exception rules in layers in our English SCRDR KB
5.6 Number of rules corresponding with each question-structure type in the knowledge base for English
IR Information Retrieval
GATE General Architecture for Text Engineering
JAPE Java Annotation Patterns Engine
ANNIE A Nearly-New Information Extraction
RDR Ripple Down Rules
SCRDR Single Classification Ripple Down Rules
QC Question Classification
SVM Support Vector Machine
SRW Semantically Related Words
NLIDB Natural Language Interface to DataBase
POS Part-of-Speech
NLP Natural Language Processing
LHS Left-hand-side
RHS Right-hand-side
GUI Graphical User Interface
The rapid growth of online information accessible to human users requires more support from advanced information retrieval (IR) technologies to find the expected information. This brings new challenges for building IR systems, especially search engines and question answering systems. Most current search engines return ranked lists of related documents for each user query (in our case, a query is a question), and the user has to scan these documents to obtain the desired information. The goal of question answering systems, in contrast, is to give exact answers to the user's questions by exploiting natural language processing, so that the user does not have to scan any document.
The natural language question analysis component is the first component in any question answering system. This component creates an intermediate representation of the input question, which is expressed in natural language, to be utilized in the rest of the system. For the task of translating a natural language question into an explicit intermediate representation of the complexity in question answering systems, all published works so far use rule-based approaches, to the best of our knowledge. In existing rule-based approaches, because of the complexity of the representation and the variety of question structure types, manually creating the rules in an ad-hoc manner is very expensive and error-prone, and takes a lot of time and effort. For example, many rule-based approaches, such as the approach to handling English questions described in Aqualog (Lopez et al., 2007) and the one for processing Vietnamese questions presented in (Phan and Nguyen, 2010), manually define a list of sequence pattern structures to analyze questions. As rules are created in an ad-hoc manner, these approaches share common difficulties in managing interaction between rules and keeping consistency among them.
In this thesis, we first introduce an ad-hoc approach to processing Vietnamese natural language questions in the natural language question analysis component. Natural language questions are transformed into intermediate representation elements, which include the construction of the question, the class of the question, the keywords in the question and the semantic constraints between them, through processes such as preprocessing, syntactic analysis and semantic analysis over semantic annotations via JAPE grammar rules on the GATE framework (Cunningham et al., 2002).
More importantly, we focus on presenting a language-independent approach utilizing the Ripple Down Rules (Compton and Jansen, 1988, 1990; Richards, 2009) knowledge acquisition methodology to acquire rules in a systematic manner, where consistency between rules is maintained while avoiding unintended interaction among rules.
This dissertation consists of 6 chapters. In chapter 2, we provide a literature review. In chapter 3, we describe our overall system architecture, in which we present our method to process Vietnamese questions. We propose our language-independent knowledge acquisition approach in chapter 4. We describe our experiments for both Vietnamese and English in chapter 5. Discussion and conclusion are presented in chapter 6.
Literature review
In this chapter, we review related work using rule-based approaches for question analysis in domain-specific question answering systems. Section 2.1 describes approaches that analyze natural language questions using pattern-matching (section 2.1.2), syntactic-based (section 2.1.3), semantic-based (section 2.1.4), and annotation-based (section 2.1.5) techniques. In addition, section 2.2 presents the GATE framework and its JAPE grammar that we have been working on, while section 2.3 covers background knowledge about Ripple Down Rules (RDR).
2.1 Question analysis in question answering systems
Question answering systems range from closed-domain systems (aiming to answer questions in a specific domain) to open-domain systems (aiming to answer any question asked). In our experience, open-domain systems focus on retrieving and ranking related documents corresponding to the input, while closed-domain systems focus on analyzing natural language questions to extract reliable terms. Therefore, our related work reviews rule-based question analysis approaches in domain-specific systems.
The natural language question analysis component is the first component in any question answering system. This component creates an intermediate representation of the input question, which is expressed in natural language, to be utilized in the rest of the system. The basis of the question parser is question classification. Subsequently, natural language question analysis techniques are used to identify keywords and semantic relations in input questions.
2.1.1 Question classification
Question Classification (QC) can be defined as the task of mapping a given question to one of k classes based on the possible types of the answers (Li and Roth, 2002). This classification provides semantic constraints based on the expected answers (Li and Roth, 2006).
The approach applied in early QC systems to identify the question class is based on the original regular expression model (Li, 2002). The main idea of this approach is to identify the class of the input question based on its sentence pattern, including question words, sequences of words and some terms representing particular question classes. These patterns are detected using regular expressions. A disadvantage of the regular-expression-based approach is the lack of semantic information in questions. To resolve the problem we have to build a vast, complete and precise set of patterns, which takes a great deal of time and effort. Currently, semantic information can be defined by using patterns over existing linguistic annotations in the GATE framework (Cunningham et al., 2002). More details about GATE are given in section 2.2, and the applications of GATE in analyzing natural language questions are presented in chapters 3 and 4.
Another approach to classifying questions, more flexible and automatic than the one based on regular expressions, is the use of language models. A language model (Jurafsky and Martin, 2008; Manning et al., 2008) is a probability distribution over word sequences. We build a language model for every class C of training questions. Given a new question Q, we calculate the conditional probability P(C|Q) for each C and select the class with the highest probability as the class that Q belongs to. The N-gram language model is built from statistics of word sequences of length N. Using an N-gram model causes imprecise probability estimates because of the occurrence of “zero probability N-grams”. To resolve this problem, several smoothing methods have been proposed to estimate probabilities more precisely. One of the most common smoothing methods suitable for handling “zero probability N-grams” is the Katz Back-Off method (Jurafsky and Martin, 2008; Manning and Schütze, 1999). Katz Back-Off smoothing also uses Good-Turing discounting (Jurafsky and Martin, 2008; Manning and Schütze, 1999). The key idea of the Katz back-off N-gram model is that when an N-gram has zero counts, we approximate it by backing off to the (N-1)-gram.
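The per-class language model idea can be sketched in a few lines of Python. This is a minimal illustration and not the models used in the cited work: it uses bigrams, assumes a uniform prior over classes, and replaces Katz back-off with a much simpler fixed-weight back-off to an add-one smoothed unigram estimate (so the scores are not properly normalized probabilities). The class labels, training questions and the weight `alpha` are made up for the example.

```python
import math
from collections import Counter


class BigramBackoffModel:
    """Bigram model for one question class, backing off to unigrams."""

    def __init__(self, questions, alpha=0.4):
        # alpha is an assumed fixed back-off weight; Katz back-off would
        # instead derive it from Good-Turing discounted counts.
        self.alpha = alpha
        self.unigrams = Counter()
        self.bigrams = Counter()
        for question in questions:
            tokens = ["<s>"] + question.lower().split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.total = sum(self.unigrams.values())
        self.vocab = len(self.unigrams) + 1  # one extra slot for unseen words

    def log_score(self, question):
        tokens = ["<s>"] + question.lower().split()
        score = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            if self.bigrams[(prev, word)] > 0:
                p = self.bigrams[(prev, word)] / self.unigrams[prev]
            else:
                # Zero-count bigram: back off to the (N-1)-gram estimate.
                p = self.alpha * (self.unigrams[word] + 1) / (self.total + self.vocab)
            score += math.log(p)
        return score


def classify(question, models):
    # With a uniform class prior, argmax P(C|Q) equals argmax P(Q|C).
    return max(models, key=lambda label: models[label].log_score(question))


training = {
    "LOCATION": ["where is the capital of vietnam", "where is hanoi"],
    "NUMBER": ["how many students are there", "how many rocks contain magnesium"],
}
models = {label: BigramBackoffModel(qs) for label, qs in training.items()}
predicted = classify("how many students study in hanoi", models)
print(predicted)
```

The unseen bigrams in the test question ("students study", "study in", "in hanoi") do not zero out the score, which is the point of backing off.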
Using machine learning methods for question classification, such as Support Vector Machines (SVM) (Zhang and Lee, 2003; Metzler and Croft, 2005; Huang et al., 2008) and Maximum Entropy models (Kocik, 2004; Huang et al., 2008), is more advanced than manual approaches. Building manual classifiers involves the tedious work of analyzing a very large number of questions and manually writing heuristic rules. In addition, mapping questions to defined classes relies on lexical elements; thus, the mapping model is very large. In contrast, a learned classifier is automatically constructed based on features of the questions. It is easier to adapt and reuse for a new domain in a short time than a human-constructed one. The performance of a learned classifier with suitable features and learning algorithms usually improves with more training data. The classification system applies a learning algorithm to one or more such features on training questions to create a classification model. The model is then used to return the class for a new input.
Table 2.1: Countries table
COUNTRY        CAPITAL
South Korea    Seoul
2.1.2 Pattern-matching based analysis
Closed-domain question answering systems are usually linked to relational databases and called natural language interfaces to databases. A natural language interface to a database (NLIDB) is a system that allows users to access information stored in a database by typing questions using natural language expressions (Androutsopoulos et al., 1995). Early NLIDB systems used pattern-matching techniques to process users' questions and generate corresponding answers. A common technique for parsing input questions in NLIDBs is syntactic analysis, where a natural language question is directly mapped to a database query (such as SQL) through grammar rules. Currently, semantic-grammar-based approaches have been applied in NLIDBs to analyze input.
Some early NLIDBs relied on the pattern-matching approach to respond to users' questions. To illustrate the approach, we consider a simple example of a database table (Androutsopoulos et al., 1995), shown in table 2.1, storing information about countries.
A pattern-matching system could use rules like:
Rule1: if “capital” <country> then return CAPITAL of row
       where COUNTRY = <country>
Rule2: if “capital” “country” then return CAPITAL and
       COUNTRY in each row
Rule1 means that if the input question contains the word “capital” followed by a country name appearing in the COUNTRY column, the system locates the row containing the name and returns the value at the intersection of the CAPITAL column and the located row. Rule2 means that if the input question contains the word “capital” followed by the word “country”, the system prints the capital of each country.
For example, questions like “What is the capital of Vietnam?” or “Name the capital of Vietnam” are handled by Rule1, and the system returns the same answer for both, while Rule2 is used to process questions such as “What is the capital of each country?” or “List the capital of every country”.
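The two rules can be sketched directly in Python. This is only an illustration of the pattern-matching idea, not code from any of the cited systems: the table is held in memory, matching is a case-insensitive substring test, and the Vietnam row is an assumed extra row added so that the example questions from the text have an answer.

```python
import re

# In-memory stand-in for the countries table; the Vietnam row is an
# illustrative assumption, not part of the table shown in the text.
COUNTRIES = [
    {"COUNTRY": "South Korea", "CAPITAL": "Seoul"},
    {"COUNTRY": "Vietnam", "CAPITAL": "Hanoi"},
]


def answer(question):
    """Apply Rule1, then Rule2; return an empty list if neither matches."""
    text = question.lower()
    position = text.find("capital")
    if position == -1:
        return []
    rest = text[position + len("capital"):]

    # Rule1: "capital" followed by a name found in the COUNTRY column.
    for row in COUNTRIES:
        if row["COUNTRY"].lower() in rest:
            return [row["CAPITAL"]]

    # Rule2: "capital" followed by the word "country".
    if "country" in re.findall(r"[a-z]+", rest):
        return [(row["CAPITAL"], row["COUNTRY"]) for row in COUNTRIES]

    return []


print(answer("What is the capital of Vietnam ?"))
print(answer("Name the capital of Vietnam"))
print(answer("List the capital of every country"))
```

Both phrasings of the Vietnam question hit Rule1 and produce the same answer, while any question that mentions "capital" in an unrelated sense would also trigger a rule, which illustrates the shallowness of the approach.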
Sneiders (2002) presented an NLIDB system using question patterns covering the conceptual model of the database. The input is converted into an SQL query using defined templates that contain entity slots, that is, free space for data instances representing the primary concepts of the question. Some open-domain systems presented in (Wu et al., 2003; Saxena et al., 2007) also used pattern-matching techniques to respond to users' requests.
The main advantage of the pattern-matching approach is its simplicity, and the system can perform well in certain applications. However, its shallowness often leads to bad results.
2.1.3 Syntactic-based analysis
In syntactic-based NLIDB systems, the user's question is syntactically parsed into a parse tree, and the tree is directly converted to an expression in the database query language (Androutsopoulos et al., 1995). LUNAR (Woods et al., 1972) is a typical example of this approach.
Syntax-based systems use grammar rules, such as a context-free grammar (Chomsky, 1957), to describe the syntactic structures of questions. The following example shows a simple set of grammar rules:
NP → Det N
Det → “what” | “which”
N → “rock” | “radiation” | “magnesium”
VP → V N
V → “contains” | “emits”
Using these rules, an NLIDB system could represent the syntactic structure of the example question “which rock contains magnesium?” as shown in figure 2.1.
Figure 2.1: Parse tree of question “which rock contains magnesium?”
Then, the NLIDB could map the parse tree of figure 2.1 to the database query below, in which X is a variable:
(for_every X (is_rock X)
             (contains X magnesium) ;
             (printout X))
The mapping process is performed through rules and is based entirely on the syntactic information of the parse tree. In general, syntax-based NLIDBs are usually coupled to application-specific database systems that provide database query languages. However, it is difficult to create translation rules that directly transform the syntactic tree into an expression in some database query language.
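The parse-then-map pipeline for the example question can be sketched as follows. This is a minimal Python illustration under stated assumptions: a top-level rule S → NP VP is assumed (it is not among the rules listed above), the parser only handles four-word questions of the form Det N V N, and the single translation rule is written by hand for this one tree shape.

```python
# Lexical rules taken from the example grammar in the text.
LEXICON = {
    "Det": {"what", "which"},
    "N": {"rock", "radiation", "magnesium"},
    "V": {"contains", "emits"},
}


def parse(question):
    """Parse 'Det N V N' into a nested tuple tree (assumes S -> NP VP)."""
    tokens = question.lower().rstrip("?").split()
    if len(tokens) != 4:
        raise ValueError("expected a question of the form: Det N V N")
    categories = ["Det", "N", "V", "N"]
    for token, category in zip(tokens, categories):
        if token not in LEXICON[category]:
            raise ValueError(f"{token!r} is not a known {category}")
    det, noun, verb, obj = tokens
    noun_phrase = ("NP", ("Det", det), ("N", noun))
    verb_phrase = ("VP", ("V", verb), ("N", obj))
    return ("S", noun_phrase, verb_phrase)


def to_query(tree):
    """Translate the parse tree into the query form shown in the text."""
    _, noun_phrase, verb_phrase = tree
    noun = noun_phrase[2][1]
    verb = verb_phrase[1][1]
    obj = verb_phrase[2][1]
    return f"(for_every X (is_{noun} X) ({verb} X {obj}) ; (printout X))"


tree = parse("which rock contains magnesium?")
print(tree)
print(to_query(tree))
```

Even in this toy form, the translation rule depends on the exact shape of the tree, which hints at why writing such rules for a realistic grammar is hard.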
The syntactic analysis approach is embedded in some open-domain systems such as FALCON (Harabagiu et al., 2000), the system of Harabagiu and colleagues (Harabagiu et al., 2003), and the system described in (Min and Tomek, 2006).
2.1.4 Semantic-based analysis
... to ease the mapping from the syntax tree to database objects.
Nguyen and Le (2008) introduced an NLIDB question answering system for Vietnamese employing semantic grammars. Their system includes two main modules: QTRAN and TGEN. QTRAN (Query Translator) maps a natural language question to an SQL query, while TGEN (Text Generator) generates answers based on the query result tables. QTRAN uses limited context-free grammars to analyze the user's question into a syntax tree via the CYK algorithm. The syntax tree is then converted into an SQL query by using a mapping dictionary to determine names of attributes in Vietnamese, names of attributes in the database and names of individuals stored in these attributes. The following semantic grammar rules are among those used in Nguyen and Le's system:
1. <conditions> → <selection condition> <conjunction> <conditions>
2. <conditions> → <joint condition> <conditions>
3. <joint condition> → <source> <negative> <SR>
4. <joint condition> → <SR> <source>
5. <negative> → ‘chưa’ (not yet) / ‘không’ (no)
6. <joint condition> → <negative> <SR> <source>
7. <source> → <quantity> <entity> <conditions>
8. <source> → <values>
9. <quantity> → <stress word> <number>
Figure 2.2 shows the syntactic-semantic tree for the question “Tìm các sinh viên học ít nhất 2 môn do giáo viên A dạy” (“Find all students who study at least 2 subjects taught by lecturer A”) using the listed semantic rules.
Figure 2.2: The syntactic-semantic tree example.
Some other systems are based on semantic grammar rules, such as Planes (Waltz, 1978) and Eufid (Templeton and Burger, 1983). Semantic grammar-based approaches were considered an engineering methodology that allows semantic knowledge to be easily included in the system. Nevertheless, since semantic grammars contain hard-wired knowledge oriented to a specific domain, it is difficult to port systems based on this approach to other knowledge domains: “a new semantic grammar has to be written whenever the NLIDB is configured for a new knowledge domain” (Androutsopoulos et al., 1995).
The PRECISE system (Popescu et al., 2003) maps the natural language question to a unique semantic interpretation by analyzing some lexicons and semantic constraints. Stratica et al. (2003) described a template-based system that translates an English question into an SQL query by matching the syntactic parse of the question to a set of fixed semantic templates. Some question answering systems process input using statistical parsers built from existing semantically labelled corpora, as shown in (Clark et al., 2004; Judge et al., 2005; Niknia and Hassanabadi, 2009).
Currently, many question answering systems make use of semantic information to analyze questions by utilizing syntactic-semantic interpretation rules that produce logical forms (Androutsopoulos et al., 1993; Moldovan et al., 2002; Benjamin et al., 2003; Atzeni et al., 2004; Cimiano et al., 2008). These systems first transform the input into an intermediate logical expression of high-level world concepts without any relation to the database/knowledge-base structure. The logical expression is then converted to an expression in the database/knowledge-base query language. The use of logic languages makes it possible to adapt to other domains as well as different query languages (Silakari et al., 2011).
2.1.5 Annotation-based question analysis in question answering systems
Recently, some question answering systems that use semantic annotations have achieved good results in natural language question analysis. A well-known annotation-based framework is GATE (General Architecture for Text Engineering) (Cunningham et al., 2002), which has been used in many question answering systems, such as the ontology-based AquaLog (Lopez et al., 2007) and QuestIO (Damljanovic et al., 2008) systems and Galea's open-domain system (Galea, 2003), especially for the natural language question analysis component.
AquaLog is an ontology-based question answering system for English and is the basis for the development of our system. AquaLog takes a natural language question and an ontology as its input, and returns an answer based on the semantic analysis of the question and the corresponding elements in the ontology. AquaLog's architecture, shown in figure 2.3, can be described as a waterfall model in which a natural language question is mapped by the Linguistic Component to a set of representations based on an intermediate triple called a Query-Triple. The Relation Similarity Service takes a Query-Triple and processes it to provide queries with respect to the input ontology, called Onto-Triples.
Figure 2.3: Aqualog’s architecture.
AquaLog performs semantic and syntactic analysis of the input question through the use of processing resources provided by GATE (Cunningham et al., 2002), such as word segmentation, sentence segmentation and part-of-speech tagging. When a question is asked, the task of the Linguistic Component is to translate the natural language question into a Query-Triple with the following format: (generic term, relation, second term). Through the use of Java Annotation Patterns Engine (JAPE) grammars in GATE (Cunningham et al., 2002), AquaLog identifies terms and their relationships. The Relation Similarity Service uses Query-Triples to create Ontology-Triples, where each term in the Query-Triples is matched with elements in the ontology.
In our work, we reported an approach to convert Vietnamese natural language questions into intermediate representation elements in the form of query-tuples (Question-structure, Question-class, Term1, Relation, Term2, Term3) based on semantic annotations via JAPE grammars (Nguyen et al., 2009). The selected query-tuple type is more complex, aiming to cover a wider variety of question types in different languages. In addition, we proposed a language-independent approach to acquire JAPE rules in a systematic manner which avoids unintended interaction among rules (Nguyen et al., 2011a). Phan and Nguyen (2010) presented an approach to syntactically and semantically map Vietnamese questions into triples of Subject, Verb and Object, also utilizing JAPE grammars.
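As a data structure, the query-tuple is simply a six-field record. The Python sketch below shows one possible shape; the field names follow the tuple given above, but the class itself, the helper method and the example values (in particular the question-class label "HowMany") are hypothetical placeholders and not output of the system described in this thesis.

```python
from typing import List, NamedTuple, Optional


class QueryTuple(NamedTuple):
    """(Question-structure, Question-class, Term1, Relation, Term2, Term3)."""

    question_structure: str
    question_class: str
    term1: Optional[str]
    relation: Optional[str]
    term2: Optional[str]
    term3: Optional[str]

    def terms(self) -> List[str]:
        # Collect the terms that are actually filled in.
        return [t for t in (self.term1, self.term2, self.term3) if t is not None]


# Hypothetical values for "how many students are there in the College of
# Technology?"; the labels are placeholders chosen for illustration only.
example = QueryTuple(
    question_structure="Normal",
    question_class="HowMany",
    term1="students",
    relation="in",
    term2="College of Technology",
    term3=None,
)
print(example.terms())
```

Compared with a three-field Query-Triple, the extra structure and class fields leave room for questions with more than two terms or with conjunctions.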
The START (Katz, 1997; Katz et al., 2005, 2006) question answering system also used natural language annotations (Katz, 1997), without utilizing GATE. The lexical database WordNet (Fellbaum, 1998) is an important natural language resource. After the appearance of WordNet, most question answering systems used it to provide information for analyzing questions.
2.2 GATE
• Visual Resources are components used for building graphical interfaces.
Figure 2.4: GATE’s architecture
Figure 2.5: A set of Token annotations in GATE.
Semantic constraints are processed in GATE via annotations. Semantic annotations are stored in layered structures called annotation sets. These annotation sets make up independent layers of annotations over the entire document's content. Figure 2.5 shows examples of annotation sets in GATE's graphical interface. An annotation is defined by (Tablan et al., 2004):
• Start node is a location in the document content defined by an offset.
• End node is a location in the document content defined by an offset.
• Type is a String value.
• Features set holds <attribute-name, attribute-value> pairs. The attribute names are Strings, while the values can be any Java objects.
• ID is an Integer value. All annotations’ IDs are unique inside an annotation set.
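The five parts of an annotation map naturally onto a small record type. The following Python sketch mirrors that definition for illustration only; it is not GATE's Java API, and the class and method names (`Annotation`, `AnnotationSet`, `add`, `get`, `covered_text`) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Annotation:
    id: int                      # unique inside its annotation set
    type: str                    # e.g. "Token", "Lookup"
    start: int                   # start offset in the document content
    end: int                     # end offset in the document content
    features: Dict[str, Any] = field(default_factory=dict)


class AnnotationSet:
    """One layer of annotations over a document's text."""

    def __init__(self, text: str):
        self.text = text
        self._annotations: List[Annotation] = []
        self._next_id = 0

    def add(self, type: str, start: int, end: int,
            features: Optional[Dict[str, Any]] = None) -> Annotation:
        if not 0 <= start <= end <= len(self.text):
            raise ValueError("offsets must lie inside the document")
        annotation = Annotation(self._next_id, type, start, end, dict(features or {}))
        self._next_id += 1  # guarantees unique IDs within this set
        self._annotations.append(annotation)
        return annotation

    def get(self, type: str) -> List[Annotation]:
        return [a for a in self._annotations if a.type == type]

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.start:annotation.end]


doc = AnnotationSet("which rock contains magnesium?")
first = doc.add("Token", 0, 5, {"string": "which", "kind": "word"})
second = doc.add("Token", 6, 10, {"string": "rock", "kind": "word"})
print(doc.covered_text(second), second.id)
```

Because annotations only store offsets, several independent sets can describe the same text without modifying it.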
2.2.1 Information Extraction in GATE
GATE (Cunningham et al., 2002) provides a set of processing resources, including the Sentence Splitter, Tokeniser, POS Tagger, Gazetteer and JAPE Transducer, that can be reused for natural language processing (NLP) tasks. A processing resource can be used individually or combined with others to make new modules for new applications. For example, many NLP tasks require a Tokeniser and a POS Tagger but no other resources specific to information extraction, such as the Named Entity Transducer. The combination of these resources in GATE creates an information extraction system named ANNIE (A Nearly-New Information Extraction).
• The ANNIE tokeniser separates the text into simple tokens such as numbers, punctuation and words, according to different types.
• The sentence splitter segments the text into sentences.
• The POS tagger produces a part-of-speech tag as an annotation on each word or symbol. Outputs of the English tokeniser, sentence splitter and POS tagger are used to create Token annotations, as illustrated in figure 2.5.
• The gazetteer stores the lookup lists of entities. Annotations of type Lookup are generated for each matching string in the text when the gazetteer runs on a document. Each Lookup annotation has a required feature majorType and an optional feature minorType.
• JAPE transducers are finite state transducers over annotations, used for building more complex annotations and incorporating the context of the document into the new annotations.
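The gazetteer step in the list above can be illustrated with a small Python sketch. It is not ANNIE's implementation: the lookup list is a hard-coded dictionary with made-up entries, only single-word entries are supported, and matching is case-insensitive, all of which are simplifying assumptions.

```python
import re

# Hypothetical lookup list: entry -> (majorType, optional minorType).
GAZETTEER = {
    "hanoi": ("location", "city"),
    "vietnam": ("location", "country"),
    "professor": ("jobtitle", None),
}


def lookup(text):
    """Return Lookup annotations for every word found in the gazetteer."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        entry = GAZETTEER.get(match.group().lower())
        if entry is None:
            continue
        major, minor = entry
        features = {"majorType": major}
        if minor is not None:
            features["minorType"] = minor  # minorType is optional
        annotations.append({
            "type": "Lookup",
            "start": match.start(),
            "end": match.end(),
            "features": features,
        })
    return annotations


text = "Hanoi is the capital of Vietnam"
for annotation in lookup(text):
    covered = text[annotation["start"]:annotation["end"]]
    print(covered, annotation["features"])
```

A later pattern-matching stage can then test features such as majorType on these Lookup annotations instead of matching raw strings.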
2.2.2 JAPE
JAPE allows us to recognize regular expressions over annotations on documents. A JAPE grammar holds a set of phases, each of which consists of a set of pattern/action rules of the form LHS --> RHS. The left-hand side (LHS) of a rule contains the annotation pattern to be recognized, which may carry regular expression operators such as ∗, ?, +. The right-hand side (RHS) outlines the action to be taken on the matched pattern and consists of annotation manipulation statements. Annotations matched on the LHS of a rule are referred to on the RHS by means of labels attached to pattern elements.
Options setting
At the beginning of each grammar, several options can be set:
• Control - this defines the method of rule matching. Among the control styles available to a grammar phase, three are described here: brill, first, and appelt. The brill style allows more than one rule to match the same region of the document. The first style fires a rule as soon as the first match is found. With the appelt style, only one rule can be fired for the same region, chosen according to the rules’ priorities.
• Debug - when set to true, if the grammar is running in appelt mode (control option) and there is more than one possible match, the conflicts are written to the messages window.
• Input - the input annotations must also be defined at the start of each grammar. If no annotations are defined, the default is Token, SpaceToken, and Lookup; therefore, by default only these annotations are considered when trying a match.
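The difference between the three control styles can be sketched as a selection step over candidate matches that start at the same position. The Python below is an illustration, not GATE code. Two points are assumptions: the candidates are listed in the order in which they were found (used for the first style), and appelt is modelled as preferring the longest match, then the highest priority, then the rule defined earliest, which is our understanding of the GATE documentation and should be checked against it.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    rule: str      # rule name
    length: int    # size of the matched region
    priority: int  # higher value wins under appelt
    order: int     # position of the rule in the grammar (0 = first)


def select(matches: List[Match], control: str) -> List[Match]:
    """Pick which candidate matches fire under a given control style."""
    if not matches:
        return []
    if control == "brill":
        # Every matching rule fires on the region.
        return list(matches)
    if control == "first":
        # Fire as soon as a match is found (assumes list is in discovery order).
        return [matches[0]]
    if control == "appelt":
        # Longest match, then highest priority, then earliest rule.
        best = max(matches, key=lambda m: (m.length, m.priority, -m.order))
        return [best]
    raise ValueError(f"unknown control style: {control}")


candidates = [
    Match("ShortRule", length=3, priority=10, order=0),
    Match("LongLowPriority", length=5, priority=1, order=1),
    Match("LongHighPriority", length=5, priority=20, order=2),
]
for style in ("brill", "first", "appelt"):
    print(style, [m.rule for m in select(candidates, style)])
```

With appelt, the high priority of ShortRule does not help it, because the longer matches are preferred before priority is consulted.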
Pattern matching
A pattern is specified in terms of one or more annotations and, optionally, the values of any or all of their features, in 3 main ways:
• Specify a string of text; for example, {Token.string == "of"} matches the string “of”.
• Specify the presence or absence of an annotation previously assigned from a gazetteer; for instance, {Lookup} matches a Lookup annotation.
• Specify the attributes (and values) of an annotation, such as {Token.kind == number}, {Token.length != 5}.
The following operators can be used:
• | - or
• ∗ - zero or more occurrences
• ? - zero or one occurrence
• + - one or more occurrences
An operator can operate on any pattern enclosed in round brackets. Every complete pattern to be annotated must be surrounded by round brackets and followed by a label. A label is denoted by a preceding colon. In the example ({Lookup.majorType == location}):loc, the label is loc. It is possible to have more than one pattern and corresponding label on the LHS of a rule. Nested patterns are also permitted. An example of nested patterns is shown below:
(
 ({Lookup.majorType == jobtitle}):jobtitle
 {TempPerson}
):person
MacrosMacros can also be used to avoid repeating the same information in the LHSs ofseveral rules Macros can themselves be employed inside other macros A macroutilized in one phase of a grammar does not need to be redefined in a later phase
Macro: NOUN(
{Token.category == NP} |{Token.category == NPS} |{Token.category == NNP} |{Token.category == NNPS} |{Token.category == NN} |{Token.category == NNS}|
{Token.category == CD}
)Rule: SimpleNP((NOUN)+):simpleNP 99K
Context
Context is used when a pattern should be identified only if it occurs in a certain situation, while the context itself is not part of the pattern to be annotated. Context before or after a pattern can also be indicated by enclosing it in round brackets. For example, the following rule annotates the pattern matched by the macro YEAR if it is preceded by the word “in” or “by”.
Rule: YearContext1
(
  {Token.string == "in"} |
  {Token.string == "by"}
)
(YEAR):date
-->
:date.Timex = {kind = "date", rule = "YearContext1"}
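The effect of the YearContext1 rule can be imitated outside GATE with a few lines of Python. This is only a rough sketch: the YEAR macro is not defined in this section, so a four-digit token is assumed here, and the function name year_in_context and the dictionary layout of the result are hypothetical.

```python
import re

def year_in_context(tokens):
    """Annotate a year token only when the preceding token is "in" or "by".

    The context token is checked but is not part of the annotated span,
    which mirrors the unlabelled context pattern of YearContext1.
    """
    annotations = []
    for i in range(1, len(tokens)):
        is_context = tokens[i - 1] in ("in", "by")
        is_year = re.fullmatch(r"\d{4}", tokens[i]) is not None  # assumed YEAR macro
        if is_context and is_year:
            annotations.append({
                "type": "Timex",
                "token_index": i,
                "features": {"kind": "date", "rule": "YearContext1"},
            })
    return annotations

timex = year_in_context("built in 1998 for 2000 people".split())
print(timex)
```

In this sentence only “1998” is annotated: “2000” is also four digits, but it is preceded by “for”, so the context condition fails.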
Simple JAPE rule
The RHS of a rule holds information for annotating the matched patterns. Information about the patterns is passed from the LHS of the rule by means of labels, and annotated with entity types to create new annotations. Features and corresponding values are then added to the annotations.
Macro: NOUN
(
  {Token.category == NP} |
  {Token.category == NPS} |
  {Token.category == NNP} |
  {Token.category == NNPS} |
  {Token.category == NN} |
  {Token.category == NNS} |
  {Token.category == CD}
)

Rule: SimpleNP1
(
  (NOUN)+
):simpleNP
-->
:simpleNP.NounPhrase = {kind = "simple", rule = "SimpleNP1"}
In the above rule, the label is simpleNP. The RHS of the rule is the part following the arrow. The label is transferred to the RHS of the rule, and the annotation type NounPhrase is added to the pattern. The annotation is given two optional features, kind and rule, with the values simple and SimpleNP1 respectively. The first provides more specific information about the annotation, indicating that it is a particular kind of NounPhrase. The second is mainly used for debugging, to keep track of which rule fired. The resulting annotation, its features and their corresponding values will all be displayed, together with the string and the offsets, in the GATE GUI.
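To make the roles of the macro, the label and the RHS concrete, the following Python sketch mimics what SimpleNP1 produces. It is an illustration rather than GATE's matching engine: tokens are assumed to be (string, category) pairs, each maximal run of NOUN tokens is taken as one match, and the function name simple_np is hypothetical.

```python
# The NOUN macro: the set of part-of-speech tags it accepts.
NOUN_TAGS = {"NP", "NPS", "NNP", "NNPS", "NN", "NNS", "CD"}

def simple_np(tokens):
    """Create one NounPhrase annotation per maximal run of NOUN tokens.

    tokens is a list of (string, category) pairs; offsets are token
    indexes with an exclusive end.
    """
    annotations = []
    start = None
    # A sentinel token with no category closes a run at the end of input.
    for i, (_, category) in enumerate(tokens + [("", None)]):
        if category in NOUN_TAGS:
            if start is None:
                start = i
        elif start is not None:
            annotations.append({
                "type": "NounPhrase",
                "start": start,
                "end": i,
                "features": {"kind": "simple", "rule": "SimpleNP1"},
            })
            start = None
    return annotations

tokens = [("the", "DT"), ("computer", "NN"), ("science", "NN"),
          ("is", "VBZ"), ("fun", "JJ")]
noun_phrases = simple_np(tokens)
print(noun_phrases)
```

Here the two NN tokens are covered by a single NounPhrase annotation carrying the kind and rule features, which corresponds to what the RHS of SimpleNP1 attaches to the span bound to the simpleNP label.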
Using Java in JAPE rules
A limitation of plain JAPE grammars is that they cannot add, remove or modify features of existing annotations, and they cannot delete existing annotations. Using Java code in the RHS of a JAPE rule resolves these problems. An example of employing Java code in a JAPE rule is described in appendix D.
2.3 Single Classification Ripple Down Rules
Ripple Down Rules (RDR) (Compton and Jansen, 1988, 1990; Richards, 2009) were developed to allow users to incrementally add rules to an existing rule-based system while systematically controlling interactions between rules and ensuring consistency among existing rules.
A Single Classification Ripple Down Rules (SCRDR) (Compton and Jansen, 1988, 1990; Richards, 2009) tree is a binary tree with two discrete types of edges that are typically called except and if-not edges. Associated with each node in a tree is a rule. A rule has the form: if α then β, where α is called the condition and β is called the conclusion.
Cases in SCRDR are evaluated by passing a case (for example, a question to be classified in our setting) to the root of the tree. At any node in the tree, if the condition of a node N’s rule is satisfied by the case, the case is passed on to the exception child of N using the except link, if the link exists. In contrast, if the condition of N’s rule is not satisfied by the case, the case is passed on to N’s if-not child. The conclusion given by this process is the conclusion of the last node in the RDR tree that fired (that is, whose condition was satisfied by the case). To ensure that a conclusion is always given, the root node typically contains a trivial condition which is always satisfied. This node is called the default node.
A new node is added to an SCRDR tree when the evaluation process returns the wrong conclusion. The new node is attached to the last node in the evaluation path of the given case with the except link if the last node is the fired rule. Otherwise, it is attached with the if-not link.
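The evaluation and update procedures described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the names Node, evaluate, add_rule and classify are hypothetical, cases are plain strings, conditions are arbitrary predicates, and the sample conditions are invented for the example.

```python
class Node:
    def __init__(self, condition, conclusion):
        self.condition = condition    # callable: case -> bool
        self.conclusion = conclusion
        self.except_child = None      # followed when the condition is satisfied
        self.if_not_child = None      # followed when the condition is not satisfied

def evaluate(root, case):
    """Return (last fired node, last node on the evaluation path)."""
    node, fired, last = root, None, None
    while node is not None:
        last = node
        if node.condition(case):
            fired = node
            node = node.except_child
        else:
            node = node.if_not_child
    return fired, last

def classify(root, case):
    fired, _ = evaluate(root, case)
    return fired.conclusion

def add_rule(root, case, condition, conclusion):
    """Attach a correcting rule at the end of the evaluation path of case."""
    fired, last = evaluate(root, case)
    new_node = Node(condition, conclusion)
    if last is fired:
        last.except_child = new_node   # the last node fired: except link
    else:
        last.if_not_child = new_node   # the last node did not fire: if-not link
    return new_node

# Default node: a trivial condition that is always satisfied.
root = Node(lambda q: True, "Unknown")
add_rule(root, "how many students", lambda q: q.startswith("how many"), "Many")
add_rule(root, "how many classes of students", lambda q: "classes" in q, "ManyClass")
add_rule(root, "who is the dean", lambda q: q.startswith("who"), "Who")

print(classify(root, "how many students"))
print(classify(root, "how many classes of students"))
print(classify(root, "who teaches this class"))
```

The second rule is attached with an except link because the last node on the path (the Many rule) fired, whereas the third rule is attached with an if-not link because the Many rule was reached but not satisfied. The sketch does not check that a new condition is consistent with previously seen cases, which a full RDR system would do.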
RDR-based approaches have been used to tackle NLP tasks such as POS tagging (Nguyen et al., 2011b), text classification and information extraction (Pham and Hoffmann, 2006).
Our Question Answering System Architecture
In this chapter, we introduce our system, the first ontology-based question answering system for Vietnamese, and focus on describing the system’s front-end component, which performs syntactic and semantic analysis of natural language questions on the GATE framework. The back-end component is responsible for making sense of the user’s query with respect to a target ontology using various concept-matching techniques between a natural language phrase and elements in the ontology. The front-end and back-end communicate through an intermediate representation of the question, which captures the semantic structure of the user’s question. Natural language questions are transformed into intermediate representation elements, which include the construction type of the question, the class of the question, the keywords in the question and the semantic constraints between them, through preprocessing, syntactic analysis and semantic analysis over semantic annotations.
The architecture of our question answering system is shown in figure 3.1. It includes two components: the Natural language question analysis engine and the Answer retrieval component.
The question analysis component consists of three modules: preprocessing, syntactic analysis and semantic analysis. It takes the user question as input and returns a query-tuple representing the question in a compact form. The role of this intermediate representation is to provide structured information about the input question for later processing such as retrieving answers.
The answer retrieval component includes two main modules: Ontology mapping and Answer extraction. It takes an intermediate representation produced by the question analysis component and an Ontology as its input to generate semantic answers.
Figure 3.1: Architecture of our question answering system
The intermediate representation element consists of a question-structure and one
or more query-tuples in the following format:
(question-structure, question-class, Term1, Relation, Term2, Term3)
where Term1 represents a concept (object class); Term2 and Term3, if they exist, represent entities (objects); and Relation (property) is a semantic constraint between terms in the question. This representation is meant to capture the semantics of the question.
Simple questions have only one query-tuple, and their question-structure is the query-tuple’s question-structure. More complex questions, such as composite questions, have several sub-questions; each sub-question is represented by a separate question-structure, and the question-structure captures this composition attribute.
The definitions of the question-structures Normal, UnknTerm, UnknRel, Definition, Compare, ThreeTerm, Clause, Combine, And, Or, Affirm, Affirm_3Term and Affirm_MoreTuples, and of the question categories HowWhy, YesNo, What, When, Where, Who, Many, ManyClass, List and Entity, can be found in appendixes B and A respectively.
Figure 3.2 shows the result returned when analyzing the question “số lượng giảng viên có học vị là tiến sĩ của khoa công nghệ thông tin ?” (“how many lecturers who are from Faculty of Information Technology have degree of Doctor of Philosophy ?”). The output contains the question-structure ThreeTerm and the query-tuple (ThreeTerm, ManyClass, giảng viên (lecturer), có học vị (has degree), tiến sĩ (Doctor of Philosophy), khoa công nghệ thông tin (Faculty of Information Technology)).
Figure 3.2: An example of intermediate representation element
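As an informal illustration of the query-tuple format, the example above can be written as a small Python record. The class name QueryTuple and its field names are assumptions made for this sketch only; the thesis defines the tuple only as an ordered list of six elements.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class QueryTuple:
    """(question-structure, question-class, Term1, Relation, Term2, Term3)."""
    question_structure: str
    question_class: str
    term1: Optional[str] = None      # concept (object class)
    relation: Optional[str] = None   # semantic constraint (property)
    term2: Optional[str] = None      # entity (object), if present
    term3: Optional[str] = None      # entity (object), if present

# The query-tuple reported for the example question in figure 3.2.
example = QueryTuple(
    question_structure="ThreeTerm",
    question_class="ManyClass",
    term1="giảng viên",                 # lecturer
    relation="có học vị",               # has degree
    term2="tiến sĩ",                    # Doctor of Philosophy
    term3="khoa công nghệ thông tin",   # Faculty of Information Technology
)
print(example)
```

Making the terms optional reflects the remark that Term2 and Term3 appear only if they exist in the question.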
We wrapped existing linguistic processing modules for Vietnamese, such as Word Segmentation and Part-of-speech tagger (Pham et al., 2009), as GATE plug-ins. The results of these modules are annotations capturing information such as sentences, words, nouns and verbs. Each annotation has a set of feature-value pairs. For example, a word has a feature category storing its part-of-speech tag. This information can then be reused for further processing in subsequent modules. New modules are specifically designed to handle Vietnamese questions using JAPE grammars over existing linguistic annotations.
3.2 Preprocessing module
The preprocessing module generates TokenVn annotations, each representing a Vietnamese word with features such as part-of-speech. Vietnamese is a monosyllabic language; hence, a word may contain more than one token.
However, the Vietnamese word segmentation module is not trained for the question domain. Question-words that are indicative of the question categories, such as “phải không” (is that | are there), “là bao nhiêu” (how many), “ở đâu” (where), “khi nào” (when) and “là cái gì” (what), are tagged as multiple TokenVn annotations. In this module, we identify them by embedding Java code in the RHS of JAPE rules (appendix D) and mark each of them as a single annotation with a corresponding type feature and its semantic category, such as HowWhy (cause | method), YesNo (true or false), What (something), When (time | date), Where (location), Many (number) and Who (person). This information is used in creating rules in the syntactic analysis module at a later stage.
Figure 3.3: An example of redefining the TokenVn annotation
For example, consider the question “Số lượng sinh viên học lớp khoa học máy tính mà có quê quán ở Hà Nội là bao nhiêu ?” (“how many students study in computer science class whose hometown is Hanoi?”) shown in figure 3.3. After redefinition, the question-word phrase “là bao nhiêu” (how many) is covered by a single TokenVn annotation; that is, “là bao nhiêu” is treated as one word. The word “Số lượng” (how many) is first recognized as a single TokenVn annotation whose category feature has the value “Na” (abstract noun). Because the word belongs to the set of words referring to question-classes (appendix A), it is redefined with the category feature “question-word” and an additional type feature with the value “Many”.
In addition, we mark phrases that refer to comparing-phrases (such as “lớn hơn” (greater than), “nhỏ hơn hoặc bằng” (less than or equal to), ...) or special-words (for example, abbreviations of some words in a special domain) with single TokenVn annotations.
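The redefinition step can be pictured with the following Python sketch, which merges consecutive tokens that spell out a known question phrase. It is not the Java code of appendix D: the phrase-to-type mapping is inferred from the glosses above, the tags “Nc”, “V” and “P” in the sample input are placeholders, and the function name merge_question_words is hypothetical.

```python
# Question phrases and their assumed type values (inferred from the glosses).
QUESTION_PHRASES = {
    "số lượng": "Many",
    "là bao nhiêu": "Many",
    "ở đâu": "Where",
    "khi nào": "When",
    "phải không": "YesNo",
    "là cái gì": "What",
}
MAX_SPAN = 3  # longest phrase, counted in tokens, that is tried

def merge_question_words(tokens):
    """Replace each run of tokens forming a question phrase by one token."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first.
        for span in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            text = " ".join(t["string"] for t in tokens[i:i + span])
            qtype = QUESTION_PHRASES.get(text.lower())
            if qtype is not None:
                merged.append({"string": text,
                               "category": "question-word",
                               "type": qtype})
                i += span
                break
        else:
            merged.append(tokens[i])  # not part of any question phrase
            i += 1
    return merged

tokens = [
    {"string": "Số lượng", "category": "Na"},
    {"string": "sinh viên", "category": "Nc"},
    {"string": "là", "category": "V"},
    {"string": "bao nhiêu", "category": "P"},
]
result = merge_question_words(tokens)
print(result)
```

In this run “Số lượng” keeps its span but is relabelled as a question-word of type Many, and “là” and “bao nhiêu” are merged into the single token “là bao nhiêu”.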
The syntactic analysis module is responsible for identifying noun phrases, question-phrases and the relations between them. The different modules communicate through annotations; for example, this module uses the TokenVn annotations, which are the result of the preprocessing module.
3.3.1 Noun phrases detection
Concepts and entities are normally expressed in noun phrases. Therefore, it is important that we can reliably detect noun phrases in order to generate the query-tuple. We use JAPE grammars to specify patterns over annotations as described in table 3.1.
Table 3.1: JAPE grammar for identifying Vietnamese noun phrases

( {TokenVn.category == "Pn"} )?        Quantity pronoun
( {TokenVn.category == "Nu"} |         Concrete noun
  {TokenVn.category == "Nn"} )?        Numeral noun
( {TokenVn.string == "cái"} |          “cái” (the)
  {TokenVn.string == "chiếc"} )?       “chiếc” (the)
( {TokenVn.category == "Nt"} )?        Classifier noun
( {TokenVn.category == "Nc"} |         Countable noun
  {TokenVn.category == "Ng"} |         Collective noun
  {TokenVn.category == "Nu"} |
  {TokenVn.category == "Na"} |         Abstract noun
  {TokenVn.category == "Np"} )+        Proper noun
( {TokenVn.category == "Aa"} |         Quality adjective
  {TokenVn.category == "An"} )?        Quantity adjective
( {TokenVn.string == "này"} |          “này” (this; these)
  {TokenVn.string == "kia"} |          “kia” (that; those)
  {TokenVn.string == "ấy"} |           “ấy” (that; those)
  {TokenVn.string == "đó"} )?          “đó” (that; those)
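The grammar in table 3.1 is a sequence of optional slots around one mandatory slot that may repeat. The Python sketch below encodes these slots and finds the longest match from each position with simple backtracking. It only approximates how GATE applies the grammar: the names SLOTS, longest_match and find_noun_phrases are hypothetical, and the part-of-speech tags in the sample question (including the placeholder tags “V”, “C” and “E”) are assumptions, not output of the actual tagger.

```python
def cat(*names):
    return lambda t: t["category"] in names

def word(*strings):
    return lambda t: t["string"] in strings

# (predicate, minimum, maximum) per slot; a maximum of None means unbounded.
SLOTS = [
    (cat("Pn"), 0, 1),
    (cat("Nu", "Nn"), 0, 1),
    (word("cái", "chiếc"), 0, 1),
    (cat("Nt"), 0, 1),
    (cat("Nc", "Ng", "Nu", "Na", "Np"), 1, None),
    (cat("Aa", "An"), 0, 1),
    (word("này", "kia", "ấy", "đó"), 0, 1),
]

def longest_match(tokens, i, slot=0, count=0):
    """Return the end index of the longest match starting at i, or None."""
    if slot == len(SLOTS):
        return i
    pred, lo, hi = SLOTS[slot]
    best = None
    # Option 1: consume tokens[i] in the current slot.
    if i < len(tokens) and (hi is None or count < hi) and pred(tokens[i]):
        best = longest_match(tokens, i + 1, slot, count + 1)
    # Option 2: move to the next slot once the minimum is reached.
    if count >= lo:
        alt = longest_match(tokens, i, slot + 1, 0)
        if alt is not None and (best is None or alt > best):
            best = alt
    return best

def find_noun_phrases(tokens):
    spans, i = [], 0
    while i < len(tokens):
        end = longest_match(tokens, i)
        if end is None:
            i += 1
        else:
            spans.append((i, end))
            i = end
    return spans

tagged = [("Số lượng", "question-word"), ("sinh viên", "Nc"), ("học", "V"),
          ("lớp", "Nc"), ("khoa học", "Nc"), ("máy tính", "Nc"), ("mà", "C"),
          ("có", "V"), ("quê quán", "Na"), ("ở", "E"), ("Hà Nội", "Np"),
          ("là bao nhiêu", "question-word")]
tokens = [{"string": s, "category": c} for s, c in tagged]
phrases = [" ".join(t["string"] for t in tokens[a:b])
           for a, b in find_noun_phrases(tokens)]
print(phrases)
```

Backtracking matters because a token tagged “Nu” can fill either the optional second slot or the mandatory noun slot; a purely greedy left-to-right matcher would reject a phrase consisting of a single “Nu” token.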
When a noun phrase is matched, a NounPhrase annotation is created to mark up the noun phrase. In addition, its type feature is used to identify the concept or entity contained in the noun phrase using the following heuristic:
If the noun phrase contains a single noun (not including numeral nouns) and does not contain a proper noun, it contains a concept. If the noun phrase contains a proper noun or contains at least three single nouns, it contains an entity. Otherwise, concepts and entities are determined using a manual dictionary. In this step, a manual dictionary is built that describes concepts and their corresponding synonyms in the Ontology.
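The heuristic above can be written down as a short Python function. This is a sketch under assumptions: which tags count as “single nouns” is not spelled out in this section, so the set COMMON_NOUNS below is a guess; the function name np_type is hypothetical; and the manual dictionary is modelled as a plain mapping from phrase text to “Concept” or “Entity”, with None returned when the phrase is not listed.

```python
# Assumed set of tags counted as "single nouns"; numeral nouns (Nn) are excluded.
COMMON_NOUNS = {"Nc", "Ng", "Nu", "Na"}

def np_type(categories, text, dictionary):
    """Decide whether a noun phrase denotes a concept or an entity.

    categories: part-of-speech tags of the tokens inside the noun phrase.
    dictionary: manual mapping from phrase text to "Concept" or "Entity".
    """
    nouns = sum(1 for c in categories if c in COMMON_NOUNS)
    has_proper = "Np" in categories
    if nouns == 1 and not has_proper:
        return "Concept"
    if has_proper or nouns >= 3:
        return "Entity"
    return dictionary.get(text)  # otherwise fall back to the manual dictionary

manual_dictionary = {"môn học": "Concept"}  # hypothetical entry
print(np_type(["Nc"], "sinh viên", manual_dictionary))
print(np_type(["Np"], "Hà Nội", manual_dictionary))
print(np_type(["Nc", "Nc", "Nc"], "lớp khoa học máy tính", manual_dictionary))
print(np_type(["Nc", "Nc"], "môn học", manual_dictionary))
```

With the assumed tags, “sinh viên” is typed as a concept, while “Hà Nội” and “lớp khoa học máy tính” are typed as entities; only the two-noun phrase reaches the dictionary lookup.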
Figure 3.4: NounPhrase annotations
Returning to the question “Số lượng sinh viên học lớp khoa học máy tính mà có quê quán ở Hà Nội là bao nhiêu ?” (“how many students study in computer science class whose hometown is Hanoi?”), there are four noun phrases: “sinh viên” (student), “lớp khoa học máy tính” (computer science class), “quê quán” (hometown) and “Hà Nội” (Hanoi), as indicated in figure 3.4.
3.3.2 Question-phrases detection
Question-phrases are detected via JAPE grammars by using the noun phrases and the question-words identified by the preprocessing module. QUTerm or QU-E-L-MC (Entity-List-ManyClass) annotations are generated to cover question-phrases, with a corresponding category feature that gives information about the question categories listed in appendix A.
For example, in the question “Số lượng sinh viên học lớp khoa học máy tính mà có quê quán ở Hà Nội là bao nhiêu ?” (“how many students study in computer science class whose hometown is Hanoi?”), the phrase “Số lượng sinh viên” (how many students) is annotated by a QU-E-L-MC annotation, while “là bao nhiêu” (how many) is covered by a QUTerm annotation (figure 3.5).
Figure 3.5: QU-E-L-MC and QUTerm annotations.
3.3.3 Relations detection
The next step is to identify relations between noun phrases, or between noun phrases and question-phrases. After analyzing a number of questions, we use the following four JAPE patterns to identify relation phrases:
(“có” [have|has]) ((Noun Phrase, type == Concept) | (Adjective)) (“là” [is|are])
When a phrase is matched by one of the relation patterns, a Relation annotation is created to mark up the relation. For example, as presented in figure 3.6, consider the following question:
“liệt kê tất cả các sinh viên có quê quán là Hà Nội?”
“list all students who have hometown of Hanoi?”
The phrase “có quê quán là” (have hometown of) is the relation phrase annotated by a Relation annotation linking the question-phrase “liệt kê tất cả các sinh viên” (list all students)