
A Vietnamese Text-based Conversational Agent

Nguyen Quoc Dai

Faculty of Information Technology University of Engineering and Technology Vietnam National University, Hanoi

Supervised by

Dr Pham Bao Son

A thesis submitted in fulfillment of the requirements

for the degree of Master of Science in Computer Science

November 2011


ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Hanoi, November 23rd, 2011

Signed


Abstract

The first step that a question answering system must perform is to transform an input question into an intermediate representation. Most published works so far use rule-based approaches to realize this transformation in question answering systems. Nevertheless, in existing rule-based approaches, manually creating the rules is error-prone and expensive in time and effort. In this thesis, we focus on introducing a rule-based approach that offers an intuitive way to create compact rules for extracting the intermediate representation of input questions. Experimental results are promising: our system achieves reasonable performance and demonstrates that it is straightforward to adapt to new domains and languages.

More importantly, this thesis introduces a Vietnamese text-based conversational agent architecture for a specific knowledge domain, which is integrated into a question answering system. When the question answering system fails to provide answers to user input, our conversational agent can step in and interact with users to provide answers. Experimental results are promising: our Vietnamese text-based conversational agent achieved positive feedback in a study conducted in the university academic regulation domain.

Publications:

• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Vietnamese Text-based Conversational Agent. In Proc. of the 25th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2012), Springer-Verlag LNAI, pp. 699-708.

• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Semantic Approach for Question Analysis. In Proc. of the 25th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2012), Springer-Verlag LNAI, pp. 156-165.

• Dat Quoc Nguyen, Dai Quoc Nguyen and Son Bao Pham. Systematic Knowledge Acquisition for Question Analysis. In Proc. of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), ACL Anthology, pp. 406-412.

• Dai Quoc Nguyen, Dat Quoc Nguyen, Khoi Trong Ma and Son Bao Pham. Automatic Ontology Construction from Vietnamese text. In Proceedings of the 7th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE’11), IEEE, pp. 485-488.

• Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham and Dang Duc Pham. Ripple Down Rules for Part-Of-Speech Tagging. In Proc. of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2011), Springer-Verlag LNCS, part I, pp. 190-201.

• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Vietnamese question answering system. In Proceedings of the 2009 International Conference on Knowledge and Systems Engineering (KSE 2009), IEEE CS, pp. 26-32.

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my supervisor, Dr Pham Bao Son, for his patient guidance and continuous support throughout the years. He always appears when I need help, and responds to queries so helpfully and promptly.

I would like to give my honest appreciation to my younger brother, Nguyen Quoc Dat, for his great support.

I would like to specially thank Prof. Bui The Duy and my colleagues for their help through my time at Human Machine Interaction Laboratory, UET/Coltech.

I sincerely acknowledge the Vietnam National University, Hanoi, the Toshiba Foundation Scholarship, and especially Dr Pham Bao Son for financially supporting my master’s study.

Finally, this thesis would not have been possible without the support and love of my mother and my father. Thank you!


To my family ~


Table of Contents

1 Introduction

1.1 A Semantic Approach for Question Analysis

1.2 A Vietnamese Text-based Conversational Agent

1.3 Thesis Organisation

2 Literature review

2.1 Text-based conversational agents

2.1.1 Using keywords for pattern matching

2.1.2 Using the sentence similarity measure for pattern matching

2.2 FrameScript Scripting Language

2.3 Question answering systems

3 Our Question Answering System Architecture

3.1 Vietnamese Question Answering System

3.1.1 Natural language question analysis component

3.1.2

3.2 Using FrameScript for question analysis

3.2.1

3.2.2

3.2.3

4 Text-based Conversational Agent for Vietnamese

4.1 Overview of architecture

4.2 Determining separate contexts

4.3 Identifying hierarchical contexts

for English question analysis

B Definitions of question-class types

C Definitions of question-structures


List of Figures

2.1 O’Shea et al.’s conversational agent framework

2.2 Aqualog’s architecture

3.1 Architecture of our question answering system

3.2 Architecture of the natural language question analysis component using FrameScript

4.1 Architecture of our Vietnamese text-based conversational agent


List of Tables

4.1 Script examples of subjects

4.2 Transformations between contexts

4.3 Order of transformation rules

4.4 Ordered transformation between contexts

5.1 List of transformations among contexts

5.2 Unsatisfying analysis

5.3 The satisfied degree of students

5.4 Number of rules corresponding with each question-structure type

5.5 Number of rules with conditional responses

5.6 Number of questions corresponding with each question-structure type

5.7 Error results

GUI Graphical User Interface

Chapter 1

Introduction

The goal of question answering systems is to give answers to the user’s questions instead of the ranked lists of related documents returned by most current search engines (Hirschman and Gaizauskas, 2001). The natural language question analysis component is the first component in any question answering system. This component creates an intermediate representation of the input question, which is expressed in natural language, to be utilized in the rest of the system. To the best of our knowledge, most published works so far use a rule-based approach for the complex task of translating a natural language question into an explicit intermediate representation. Some question answering systems such as (Lopez et al., 2007; Phan and Nguyen, 2010) manually defined a list of sequence rule structures to analyze questions. However, in these rule-based approaches, manually creating the rules is error-prone and expensive in time and effort.

In this thesis, we present an approach that returns an intermediate representation of a question via the FrameScript scripting language (McGill et al., 2003). Natural language questions are transformed into intermediate representation elements, which include the construction type of the question, the question class, the keywords in the question and the semantic constraints between them. FrameScript allows users to intuitively write rules that directly extract the output tuple.

A text-based conversational agent is a program that allows conversational interaction between human and machine using natural language through text. The text-based conversational agent uses scripts organized into contexts comprising hierarchically constructed rules. The rules consist of patterns and associated responses: the input is matched against the patterns and the corresponding responses are sent to the user as output.

We focus on the analysis of input text in building a conversational agent. Recently, the analysis of users’ statements has been developed following two main approaches: using keywords (ELIZA (Weizenbaum, 1983), ALICE (Wallace, 2001), ProBot (Sammut, 2001)) and using similarity measures (O’Shea et al., 2010; Graesser et al., 2004; Traum, 2006) for pattern matching. The keyword-based approaches usually utilize a scripting language to match the input statements, while the other approaches measure the similarity between the statements and patterns from the agent’s scripts.

In this thesis, we introduce a Vietnamese text-based conversational agent architecture on a specific knowledge domain. Our system aims to direct the user’s statement into an appropriate context. The contexts are structured in a hierarchy of scripts consisting of rules in the FrameScript language (McGill et al., 2003). In addition, our text-based conversational agent was constructed to be integrated into a Vietnamese question answering system. Our conversational agent not only provides information related to the user’s statement but also supplies the knowledge needed to support our question answering system when it is unable to find an answer.

The knowledge domain we used to build our text-based conversational agent is the academic regulation at Vietnam National University, Hanoi (VNU). The academic regulation book helps students to know the course programs, the regulation of examinations, and the discipline at VNU. However, most students prefer not to read the academic regulation book. Therefore, our contribution creates an interaction channel to offer the necessary information to students. Once students state what they are interested in within the academic regulation, our text-based conversational agent responds to these statements by providing the related information in detail. Furthermore, our conversational agent also interacts with students by asking whether they want to know other information.

1.3 Thesis Organisation

This dissertation consists of 6 chapters. In chapter 2, we provide a literature review. In chapter 3, we describe our Vietnamese question answering system architecture, in which we present a method for converting a natural language question into an intermediate representation. We propose our Vietnamese text-based conversational agent architecture in chapter 4. We describe our experiments and discussions in chapter 5, and present the conclusion in chapter 6.

Chapter 2

Literature review

In this chapter, we review related works using text-based approaches for conversational agents (CA). Section 2.1 describes approaches that construct rules to match the user’s natural language utterances by using keywords (section 2.1.1) and by using a sentence similarity measure (section 2.1.2). In addition, section 2.2 covers the basic background on the FrameScript scripting language that we have been working with, while section 2.3 reviews question answering systems for specific domains.

2.1 Text-based conversational agents

2.1.1 Using keywords for pattern matching

ELIZA (Weizenbaum, 1983) was one of the earliest text-based conversational agents. It is based on simple pattern matching that identifies keywords in the user’s statement. ELIZA then maps the user’s statement to an appropriate rule and generates an output response. The procedure by which ELIZA responds to a user input consists of five steps:

1. Identify the important keywords appearing in the user’s statement.

2. Define some minimal context within which the selected keyword occurs.

3. Determine an appropriate transformation rule.

4. Generate responses when the input text contains no keywords.

5. Provide an editing facility for scripts at the script writing level.

Transformation rules decompose a data string according to certain criteria and reassemble the decomposed string according to certain assembly specifications. The input is therefore analyzed with the decomposition rules triggered by keywords, and responses are generated with the reassembly rules associated with the selected decomposition rules. For example, given the input sentence:

It seems that you like me

this sentence is decomposed into the four parts:

(1) It seems that (2) you (3) like (4) me

by using the decomposition rule:

(0 YOU 1 ME)

The associated response might then be:

What makes you think I like you

by using the reassembly rule:

(WHAT MAKES YOU THINK I 3 YOU)

An integer 0 in the decomposition rule matches any number of words (including none), and a non-zero integer n indicates that exactly n words are matched, while the integer 3 in the above reassembly rule means that the third part of the decomposed sentence is inserted in its place in the reply. Keywords are defined in a dictionary; as the input sentence is scanned from left to right and a keyword is found, only the decomposition rules containing that keyword need to be tried.
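The decomposition and reassembly steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Weizenbaum’s implementation: the function names `decompose` and `reassemble` are hypothetical, and rules are written as Python lists rather than ELIZA list structures.

```python
from typing import List, Optional


def decompose(rule: List, words: List[str]) -> Optional[List[List[str]]]:
    """Split words into parts according to a decomposition rule.

    Rule elements: 0 matches any number of words, a positive integer n
    matches exactly n words, and a string matches that keyword itself.
    """
    if not rule:
        return [] if not words else None
    head, rest = rule[0], rule[1:]
    if isinstance(head, int) and head == 0:
        # Try the shortest possible match first, then grow it.
        for k in range(len(words) + 1):
            tail = decompose(rest, words[k:])
            if tail is not None:
                return [words[:k]] + tail
        return None
    if isinstance(head, int):
        if len(words) < head:
            return None
        tail = decompose(rest, words[head:])
        return None if tail is None else [words[:head]] + tail
    if words and words[0].upper() == head:
        tail = decompose(rest, words[1:])
        return None if tail is None else [[words[0]]] + tail
    return None


def reassemble(rule: List, parts: List[List[str]]) -> str:
    """Build a reply: integers index decomposed parts (1-based), strings are literal."""
    out: List[str] = []
    for item in rule:
        out.extend(parts[item - 1] if isinstance(item, int) else [item])
    return " ".join(out)


parts = decompose([0, "YOU", 1, "ME"], "It seems that you like me".split())
reply = reassemble(["WHAT", "MAKES", "YOU", "THINK", "I", 3, "YOU"], parts)
```

Here `parts` holds the four parts of the example sentence, and `reply` reuses the third part ("like") inside the fixed reassembly text.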

An ELIZA script consists mainly of a set of list structures of the following form:

(K ((D1) (R1,1) (R1,2) ... (R1,m1))
   ((D2) (R2,1) (R2,2) ... (R2,m2))
   ...
   ((Dn) (Rn,1) (Rn,2) ... (Rn,mn)))

where K is the keyword, Di is the i-th decomposition rule associated with K, and Ri,j is the j-th reassembly rule associated with the i-th decomposition rule. Any number of decomposition rules may be associated with a given keyword, and any number of reassembly rules with any specific decomposition rule, since there are no predetermined limitations.

ALICE (Wallace, 2001) is a text-based conversational agent (a chat robot) that utilizes an XML language called Artificial Intelligence Markup Language (AIML). AIML files consist of category tags representing rules; each category tag contains a pattern tag and a template tag. The categories are stored in a tree. The system searches for the pattern matching a user input by depth-first search in the tree, and produces the associated template as a response. For example, consider the two categories below:

<category>
<pattern>YES</pattern>
<that>DO YOU LIKE ROMANTIC MOVIES</that>
<template>What is your favourite romantic movie?</template>
</category>

<category>
<pattern>YES</pattern>
<that>DO YOU LIKE ACTION MOVIES</that>
<template>What is your favourite action movie?</template>
</category>

When the client says yes, the program must discover the robot’s previous utterance. If the robot asked "Do you like romantic movies?", the response sent in reply is "What is your favourite romantic movie?".
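The role of the previous utterance can be sketched as follows. This is only an illustration: it uses a flat dictionary lookup instead of ALICE’s tree search, and the names `CATEGORIES`, `normalize` and `respond` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (pattern, that) -> template; "that" is the robot's previous utterance.
CATEGORIES: Dict[Tuple[str, str], str] = {
    ("YES", "DO YOU LIKE ROMANTIC MOVIES"): "What is your favourite romantic movie?",
    ("YES", "DO YOU LIKE ACTION MOVIES"): "What is your favourite action movie?",
}


def normalize(text: str) -> str:
    """Upper-case the text and drop punctuation, keeping letters, digits and spaces."""
    kept = "".join(ch for ch in text.upper() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def respond(user_input: str, previous_utterance: str) -> Optional[str]:
    """Pick the template whose pattern and previous utterance both match."""
    return CATEGORIES.get((normalize(user_input), normalize(previous_utterance)))
```

The same input "yes" therefore yields different responses depending on what the robot said last, and no response when the previous utterance is not covered by any category.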

AIML is clever and simple, easy to implement, and a good start for beginners writing simple bots. However, it is difficult to write and debug more discriminating patterns, and it is very hard to know all the transformations available because AIML depends on self-modifying the input.

Sammut (Sammut, 2001) presented a text-based CA called ProBot that is able to extract data from users. ProBot’s scripts are typically organized into hierarchical contexts consisting of a number of organized rules to handle unexpected inputs. Building on ProBot’s scripts (Sammut, 2001), McGill et al. (McGill et al., 2003) built the rule system of the FrameScript scripting language (section 2.2). FrameScript (McGill et al., 2003) provides for the rapid prototyping of conversational interfaces and simplifies the writing of scripts.

2.1.2 Using the sentence similarity measure for pattern matching

O’Shea et al. (O’Shea et al., 2008, 2010) proposed a text-based conversational agent framework (shown in figure 2.1) using semantic analysis. All patterns in the scripts are natural language sentences. Pattern matching uses a sentence similarity measure (Li et al., 2006) to calculate the similarity between sentences from the scripts and the user input. The highest ranked sentence is selected and its associated response is sent as output.

Figure 2.1: O’Shea et al.’s conversational agent framework

Scripts used in the framework consist of contexts relating to a specific topic of conversation. Each context contains one or more rules, and each rule uses s to represent a natural language sentence and r to represent a response statement. For example, consider the following rule:

<Rule_01>

s: I’m a student

r: Which university do you study?

With a user’s statement:

I am a master student

or

I am a phd student

This input and the natural language sentences from the scripts are passed to the sentence similarity measure, which calculates a firing strength for each sentence pair in order to rank the sentences. In the above example, the highest ranked sentence is "I’m a student" and its associated response sent to the user is "Which university do you study?".
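The ranking step can be sketched in Python. Note the assumption: the `similarity` function below is a plain Jaccard word overlap used as a stand-in, not the sentence similarity measure of Li et al. (2006), and the second rule is a hypothetical addition so that there is something to rank against.

```python
import re
from typing import List, Set, Tuple


def tokens(sentence: str) -> Set[str]:
    """Lower-case the sentence and split it into a set of words."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap: a simple stand-in for a real sentence similarity measure."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def best_response(user_input: str, rules: List[Tuple[str, str]]) -> str:
    """Return the response r of the rule whose sentence s scores highest."""
    _, response = max(rules, key=lambda rule: similarity(user_input, rule[0]))
    return response


RULES = [
    ("I'm a student", "Which university do you study?"),
    ("I have a job", "Where do you work?"),  # hypothetical second rule
]
```

With these two rules, both "I am a master student" and "I am a phd student" rank the first rule highest, so its response is returned.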

The advantage of using a sentence similarity measure for pattern matching is that rule structures are simplified and reduced in size and complexity. By contrast, this approach cannot retrieve information from an input to insert into the response, as the keyword-based approaches presented in section 2.1.1 can.

Graesser et al. (Graesser et al., 2004) presented a conversational agent called AUTOTUTOR that matches input statements using Latent Semantic Analysis. Traum (Traum, 2006) adapted the effective question answering characters (Leuski et al., 2006) to build a conversational agent that also employs Latent Semantic Analysis for pattern matching.

2.2 FrameScript Scripting Language

FrameScript (McGill et al., 2003) is a language for creating multi-modal user interfaces. It is derived from Sammut’s ProBot (Sammut, 2001) and enables rule-based programming, frame representations and simple function evaluation. The FrameScript scripting language also provides a set of tools for representing knowledge and interacting with users and external devices.

Each script in FrameScript (McGill et al., 2003) includes a list of rules that are matched against user input and used to give the appropriate response. Rules are grouped into particular contexts of the form: context_name :: rule_set. The scripting rules in the FrameScript language consist of patterns and responses with the form:

pattern ==> response

Pattern expressions may contain two wildcard characters, * and ~: * matches 0 or more words, and ~ within a word indicates that 0 or more characters may be matched. Pattern expressions also allow alternatives, written in the form:

{ alternative 1 | alternative 2 | ... }

Moreover, patterns use non-terminals to reuse other pattern expressions by writing the name of the non-terminal surrounded by ‘<’ and ‘>’. Non-terminals are often declared as a list of alternatives followed by ;; For example:

where each response is given in turn every time the pattern is matched and the sequence repeats when the last response is output. Alternatives are written surrounded by braces:

{ response 1 | response 2 | ... | another response },

in which any response may be chosen randomly for user output.
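To make the pattern side concrete, the sketch below translates a FrameScript-like pattern into a Python regular expression. It is an illustration under assumptions, not FrameScript’s matcher: it supports only the word wildcard * and brace alternatives, matching is case-sensitive, and the names `compile_pattern` and `match` are hypothetical.

```python
import re
from typing import List, Optional


def compile_pattern(pattern: str) -> re.Pattern:
    """Translate '*' and '{a | b}' alternatives into a regex over space-terminated words."""
    pieces = []
    for tok in re.findall(r"\{[^}]*\}|\S+", pattern):
        if tok == "*":
            # Zero or more whole words, captured as one component.
            pieces.append(r"((?:\S+ )*)")
        elif tok.startswith("{") and tok.endswith("}"):
            alts = [re.escape(" ".join(a.split())) for a in tok[1:-1].split("|")]
            pieces.append("((?:" + "|".join(alts) + ") )")
        else:
            # A literal word; it is matched but not captured.
            pieces.append(re.escape(tok) + " ")
    return re.compile("".join(pieces))


def match(pattern: str, text: str) -> Optional[List[str]]:
    """Return the captured components if the whole input matches, else None."""
    normalized = "".join(word + " " for word in text.split())
    m = compile_pattern(pattern).fullmatch(normalized)
    return None if m is None else [g.strip() for g in m.groups()]
```

For instance, `match("{My name is | I'm} *", "My name is X")` yields two components, the chosen alternative and the words matched by the wildcard, which is the kind of numbered component a response can refer back to.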

In addition, responses use ‘#’ to perform an action such as changing the current context. For example, #goto(a_script) moves a conversation or interaction from one context to another. Similarly, ‘^’ is used to perform actions, except that when the following expression is evaluated its result is inserted into the response rather than thrown away. Some response expressions may also depend on a condition holding true, as in the form below:

{My name is | I’m} * ==>

[ Hello ^2 How old are you? ]

I am <Number> years old ==>

[^(^1 <= 20) > Are you a student? |

How do you do? ]

The following dialogue transcript illustrates the above example:

User: My name is X

CA: Hello X How old are you?

User: I am 19 years old

CA: Are you a student?
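The behaviour of the two rules in this example can be imitated in Python. This is a rough sketch and not FrameScript itself: the non-terminal <Number> is assumed to match a sequence of digits, and `respond` is a hypothetical name.

```python
import re
from typing import Optional


def respond(user_input: str) -> Optional[str]:
    """Imitate the two example rules: component insertion and a conditional response."""
    # Rule 1: {My name is | I'm} * ==> [ Hello ^2 How old are you? ]
    m = re.fullmatch(r"(My name is|I'm) (.+)", user_input)
    if m:
        # ^2 refers to the second matched component (the wildcard part).
        return f"Hello {m.group(2)} How old are you?"
    # Rule 2: I am <Number> years old ==> conditional response on ^1 <= 20
    m = re.fullmatch(r"I am (\d+) years old", user_input)
    if m:
        return "Are you a student?" if int(m.group(1)) <= 20 else "How do you do?"
    return None
```

Calling `respond` on the two user turns of the transcript reproduces the two CA turns, and an age above 20 falls through to the alternative response.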

An input received from the user is given to a domain in order to ensure that the input is matched against the correct scripts. A script can be registered as a topic in a domain so that it can become the current script and process the input. When a script is registered as a topic, the domain uses the script’s trigger to determine whether or not an input activates that topic. If a topic does not have a trigger, any input will activate it. When a topic’s trigger matches the input, it becomes the current context and the current topic.

Example ::

domain example

trigger{* {Hi | hi | Hello | hello} *}

* {Hi | hi | Hello | hello} * ==> [Hi there!]

When writing complex scripts in which scripts have similar behaviours, FrameScript makes it possible to use inheritance so that rules can be shared between scripts. Moreover, FrameScript allows failsafes to be defined for scripts. A failsafe is another script whose rules are used if an input fails to match any of the rules of a script.

The order in which a domain tries to find rules that match the input is:

1. triggers of the topics

2. the current context

3. the failsafe of the current context

4. the current topic

5. the failsafe of the current topic

6. the failsafe for the domain

When an input is compared to the rules of a script, the input is first compared to the rules specifically defined by the script. If none of these rules match, the input is matched against the rules of the script’s parents. The rules of the scripts are tried in top to bottom order.
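The matching order and the inheritance lookup described above can be sketched as follows. This is an illustrative model only: the classes `Script` and `Domain` are hypothetical, rules are reduced to (predicate, response) pairs, and it is an assumption that a matching trigger simply switches the current context and topic before the remaining steps are tried.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Script:
    name: str
    rules: List[Tuple[Callable[[str], bool], str]] = field(default_factory=list)
    parent: Optional["Script"] = None
    failsafe: Optional["Script"] = None
    trigger: Optional[Callable[[str], bool]] = None

    def match(self, text: str) -> Optional[str]:
        """Try this script's own rules top to bottom, then its parents' rules."""
        script = self
        while script is not None:
            for predicate, response in script.rules:
                if predicate(text):
                    return response
            script = script.parent
        return None


@dataclass
class Domain:
    topics: List[Script]
    failsafe: Optional[Script] = None
    current_context: Optional[Script] = None
    current_topic: Optional[Script] = None

    def respond(self, text: str) -> Optional[str]:
        # 1. Triggers of the topics; a topic without a trigger is activated by any input.
        for topic in self.topics:
            if topic.trigger is None or topic.trigger(text):
                self.current_context = self.current_topic = topic
                break
        context, topic = self.current_context, self.current_topic
        candidates = [
            context,                                          # 2. current context
            context.failsafe if context is not None else None,  # 3. its failsafe
            topic,                                            # 4. current topic
            topic.failsafe if topic is not None else None,    # 5. its failsafe
            self.failsafe,                                    # 6. domain failsafe
        ]
        for script in candidates:
            if script is not None:
                reply = script.match(text)
                if reply is not None:
                    return reply
        return None


base = Script("base", rules=[(lambda t: "bye" in t.lower(), "Goodbye!")])
greeting = Script(
    "greeting",
    rules=[(lambda t: "hello" in t.lower(), "Hi there!")],
    parent=base,
    trigger=lambda t: "hello" in t.lower(),
)
fallback = Script("fallback", rules=[(lambda t: True, "Sorry, I do not understand.")])
domain = Domain(topics=[greeting], failsafe=fallback)
```

In this toy setup a greeting activates the `greeting` topic, a later farewell is answered by the inherited rule from `base`, and anything else falls through to the domain failsafe.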

2.3 Question answering systems

Question answering systems range from closed-domain systems (aiming to answer questions in a specific domain) to open-domain systems (aiming to answer any question asked). In our experience, open-domain systems focus on retrieving and ranking related documents corresponding to the input, while closed-domain systems focus on analyzing natural language questions to extract reliable terms.

Additionally, the natural language question analysis component is the first component in any question answering system. This component creates an intermediate representation of the input question, which is expressed in natural language, to be utilized in the rest of the system. The basis of the question parser is question classification, which can be defined as the task of mapping a given question to one of k classes based on the possible types of the answers (Li and Roth, 2002b). Subsequently, natural language question analysis techniques are used to identify keywords and semantic relations in input questions.

Therefore, our review of related work examines question answering systems in terms of their question analysis approaches, focusing on domain-specific systems.

Pattern-matching based systems

Closed-domain question answering systems are usually linked to relational databases and are called natural language interfaces to databases. A natural language interface to a database (NLIDB) is a system that allows users to access information stored in a database by typing questions using natural language expressions (Androutsopoulos et al., 1995).

Early NLIDB systems used pattern-matching techniques to process users’ questions and generate corresponding answers. (Sneiders, 2002) presented an NLIDB system using question patterns that cover the conceptual model of the database. The input is converted into an SQL query by using defined templates that contain entity slots, i.e. free space for data instances representing the primary concepts of the question. Some other open-domain systems presented in (Wu et al., 2003; Saxena et al., 2007) used pattern-matching techniques to respond to users’ requests. The main advantage of the pattern-matching approach is its simplicity, and such a system can perform well in certain applications. However, its shallowness often leads to bad results.
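The idea of question templates with entity slots can be sketched in Python with an in-memory SQLite database. Everything specific here is hypothetical and chosen only for illustration: the `courses` table, the two question patterns, and the function name `to_sql`. It is not the design of any of the cited systems.

```python
import re
import sqlite3
from typing import Optional, Tuple

# Hypothetical question patterns, each with one entity slot, paired with SQL templates.
TEMPLATES = [
    (re.compile(r"who teaches (?P<slot>.+?)\??", re.IGNORECASE),
     "SELECT lecturer FROM courses WHERE name = ?"),
    (re.compile(r"how many credits is (?P<slot>.+?)\??", re.IGNORECASE),
     "SELECT credits FROM courses WHERE name = ?"),
]


def to_sql(question: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Return (sql, parameters) for the first template matching the whole question."""
    for pattern, sql in TEMPLATES:
        m = pattern.fullmatch(question.strip())
        if m:
            return sql, (m.group("slot"),)
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (name TEXT, lecturer TEXT, credits INTEGER)")
conn.execute("INSERT INTO courses VALUES ('Databases', 'Dr A', 3)")

sql, params = to_sql("Who teaches Databases?")
rows = conn.execute(sql, params).fetchall()
```

The slot value is passed as a query parameter rather than pasted into the SQL string. A question that fits no template returns None, which illustrates the shallowness noted above.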


Semantic-based systems

Later NLIDBs respond to a user’s question by using a semantic grammar to parse the input into a syntax tree and mapping the tree to a database query. In semantic-based systems, the grammar’s categories (i.e. the non-leaf nodes appearing in the parse tree) do not have to correspond to syntactic concepts (Androutsopoulos et al., 1995). Semantic constraints are usually enforced by the choice of semantic grammar categories, and the grammar’s categories can also be chosen to ease the mapping from the syntax tree to database objects.

Nguyen and Le (Nguyen and Le, 2008) introduced an NLIDB question answering system in Vietnamese employing semantic grammars. Their system includes two main modules: QTRAN and TGEN. QTRAN (Query Translator) maps a natural language question to an SQL query, while TGEN (Text Generator) generates answers based on the query result tables. QTRAN uses limited context-free grammars to parse the user’s question into a syntax tree via the CYK algorithm. The syntax tree is then converted into an SQL query by using a mapping dictionary to determine names of attributes in Vietnamese, names of attributes in the database and names of individuals stored in these attributes.
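For readers unfamiliar with CYK, a minimal recognizer is sketched below. It assumes a grammar in Chomsky normal form; the toy English grammar and the name `cyk_recognize` are hypothetical and unrelated to the grammars used by QTRAN.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def cyk_recognize(
    words: List[str],
    lexical: Dict[str, Set[str]],
    binary: Dict[Tuple[str, str], Set[str]],
    start: str = "S",
) -> bool:
    """Return True if the words can be derived from `start` (grammar in CNF)."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(lexical.get(word, set()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two halves
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary.get((left, right), set())
    return start in table[0][n - 1]


# Toy grammar: S -> V NP, NP -> Det N
LEXICAL = {"list": {"V"}, "all": {"Det"}, "students": {"N"}}
BINARY = {("V", "NP"): {"S"}, ("Det", "N"): {"NP"}}
```

With this toy grammar the word sequence "list all students" is accepted, while "all list students" is rejected. A full parser would additionally store back-pointers in the table to recover the syntax tree.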

The PRECISE system (Popescu et al., 2003) maps the natural language question to a unique semantic interpretation by analyzing some lexicons and semantic constraints. (Stratica et al., 2003) described a template-based system to translate an English question into an SQL query by matching the syntactic parse of the question to a set of fixed semantic templates. Some other systems are based on semantic grammar rules, such as Planes (Waltz, 1978) and Eufid (Templeton and Burger, 1983). Semantic grammar-based approaches were considered an engineering methodology, which allows semantic knowledge to be easily included in the system.

Annotation-based systems

Recently, some question answering systems that used semantic annotations produced strong results in natural language question analysis. A well known annotation-based framework is GATE (General Architecture for Text Engineering) (Cunningham et al., 2002), which is used in the ontology-based AquaLog (Lopez et al., 2007) and QuestIO (Damljanovic et al., 2008) systems, and in Galea’s open-domain system (Galea, 2003), especially for the natural language question analysis component.

Aqualog (Lopez et al., 2007), shown in figure 2.2, is an ontology-based question answering system for English and is the basis for the development of our system. A natural language question is mapped to an intermediate triple-based representation, called a Query-Triple, by the Linguistic Component using Java Annotation Patterns Engine (JAPE) grammars in GATE (Cunningham et al., 2002). The Relation Similarity Service takes a Query-Triple and processes it to produce queries with respect to the input ontology, called Onto-Triples. Aqualog then uses the Onto-Triple to return an answer to the user.

Figure 2.2: Aqualog’s architecture

In our previous work, we reported an approach to convert Vietnamese natural language questions into an intermediate representation of query-tuples (Question-structure, Question-class, Term1, Relation, Term2, Term3) based on semantic annotations via JAPE grammars (Nguyen et al., 2009). The selected query-tuple type is more complex, aiming to cover a wider variety of question types in different languages. In addition, we proposed a language-independent approach to acquire JAPE rules in a systematic manner which avoids unintended interaction among rules (Nguyen et al., 2011). (Phan and Nguyen, 2010) presented an approach to syntactically and semantically map Vietnamese questions into triple-like structures of Subject, Verb and Object, also utilizing JAPE grammars.

The START (Katz, 1997; Katz et al., 2006) question answering system also used natural language annotations (Katz, 1997) without utilizing GATE. The lexical database WordNet (Fellbaum, 1998) is an important resource for natural language applications. After the appearance of WordNet, almost all question answering systems used it to provide information for analyzing questions.

Chapter 3

Our Question Answering System Architecture

Furthermore, we focus on describing a rule-based approach to directly extract the intermediate representation elements of a question via the FrameScript scripting language (McGill et al., 2003) (in section 3.2).

3.1 Vietnamese Question Answering System

The architecture of our question answering system is shown in figure 3.1. It includes two components: the natural language question analysis component and the answer retrieval component. The question analysis component takes the user’s question as input and returns a query-tuple representing the question in a compact form. The role of this intermediate representation is to provide structured information about the input question for later processing such as retrieving answers.

The answer retrieval component includes two main modules: Ontology mapping and Answer extraction. It takes an intermediate representation produced by the question analysis component and an ontology as its input to generate semantic answers.

Figure 3.1: Architecture of our question answering system

3.1.1 Natural language question analysis component

3.1.1.1 Intermediate representation of an input question

The intermediate representation used in our approach aims to cover a wider variety of question types. It consists of a question-structure and one or more query-tuples in the following format:

( question-structure, question-class, Term1, Relation, Term2, Term3 )

where Term1 represents a concept (object class); Term2 and Term3, if they exist, represent entities (objects); and Relation (property) is a semantic constraint between terms in the question. This representation is meant to capture the semantics of the question.

Simple questions corresponding to basic constructions have only one query-tuple, and the question-structure is that query-tuple’s question-structure. More complex questions, such as composite questions, are constructed from several sub-questions; each sub-question is described by a separate query-tuple with its own question-structure, and the overall question-structure captures this composition attribute. This representation is chosen so that it can represent a richer set of question types. Therefore, some terms or the relation in a query-tuple can be missing. A composite question such as:

list all students in the Faculty of Information Technology whose hometown is Hanoi?

has a question-structure of type And with two query-tuples, where ? represents a missing element: ( UnknRel , List , students , ? , Faculty of Information Technology , ? ) and ( Normal , List , students , hometown , Hanoi , ? ).
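As a data structure, this representation can be sketched with Python dataclasses. The class and field names below are hypothetical and only mirror the tuple format described in this section; a missing element (written ? above) is represented by None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QueryTuple:
    question_structure: str
    question_class: str
    term1: Optional[str] = None     # concept (object class)
    relation: Optional[str] = None  # semantic constraint between terms
    term2: Optional[str] = None     # entity (object)
    term3: Optional[str] = None     # entity (object)


@dataclass(frozen=True)
class QuestionRepresentation:
    question_structure: str         # overall structure, e.g. "And"
    query_tuples: List[QueryTuple]


# The composite example question from the text.
example = QuestionRepresentation(
    question_structure="And",
    query_tuples=[
        QueryTuple("UnknRel", "List", term1="students",
                   term2="Faculty of Information Technology"),
        QueryTuple("Normal", "List", term1="students",
                   relation="hometown", term2="Hanoi"),
    ],
)
```

Later components can then distinguish a known relation ("hometown") from a missing one (None) without parsing placeholder strings.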

The definitions of the question categories HowWhy, YesNo, What, When, Where, Who, Many, ManyClass, List and Entity, and of the question-structures Normal, UnknTerm, UnknRel, Definition, Compare, ThreeTerm, Clause, Combine, And, Or, Affirm, Affirm_3Term and Affirm_MoreTuples can be found in appendices B and C respectively.

3.1.1.2 Question analysis

We wrapped existing linguistic processing modules for Vietnamese, such as Word Segmentation and the Part-of-speech tagger (Pham et al., 2009), as GATE plug-ins. The results of the modules are annotations capturing information such as sentences, words, nouns and verbs. Each annotation has a set of feature-value pairs. For example, a word has a feature category storing its part-of-speech tag. This information can then be reused for further processing in subsequent modules. New modules are specifically designed to handle Vietnamese questions using JAPE grammars over existing linguistic annotations.

There are three modules that we use to obtain an intermediate representation of the user’s question: preprocessing, syntactic analysis and semantic analysis.

The preprocessing module generates TokenVn annotations, each representing a Vietnamese word with features such as part-of-speech, and identifies question-words, comparing-phrases and special-words by using JAPE rules.

The syntactic module is responsible for identifying noun phrases, question phrases and relation phrases between noun phrases or between noun phrases and question phrases. The different modules communicate through the annotations, for example,
