A Vietnamese Text-based Conversational Agent
Nguyen Quoc Dai
Faculty of Information Technology University of Engineering and Technology Vietnam National University, Hanoi
Supervised by
Dr Pham Bao Son
A thesis submitted in fulfillment of the requirements
for the degree of Master of Science in Computer Science
November 2011
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’
Hanoi, November 23rd, 2011
Signed
The first step that a question answering system must perform is to transform an input question into an intermediate representation. Most published works so far use rule-based approaches to realize this transformation in question answering systems. Nevertheless, in existing rule-based approaches, manually creating the rules is error-prone and expensive in time and effort. In this thesis, we focus on introducing a rule-based approach that offers an intuitive way to create compact rules for extracting the intermediate representation of input questions. Experimental results are promising: our system achieves reasonable performance, and the results demonstrate that it is straightforward to adapt to new domains and languages.
More importantly, this thesis introduces a Vietnamese text-based conversational agent architecture for a specific knowledge domain, which is integrated into a question answering system. When the question answering system fails to provide answers to user input, our conversational agent can step in and interact with users to provide answers. Experimental results are promising: our Vietnamese text-based conversational agent received positive feedback in a study conducted in the university academic regulation domain.
Publications:
• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Vietnamese Text-based Conversational Agent. In Proc. of the 25th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2012), Springer-Verlag LNAI, pp. 699–708.
• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Semantic Approach for Question Analysis. In Proc. of the 25th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2012), Springer-Verlag LNAI, pp. 156–165.
• Dat Quoc Nguyen, Dai Quoc Nguyen and Son Bao Pham. Systematic Knowledge Acquisition for Question Analysis. In Proc. of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), ACL Anthology, pp. 406–412.
• Dai Quoc Nguyen, Dat Quoc Nguyen, Khoi Trong Ma and Son Bao Pham. Automatic Ontology Construction from Vietnamese text. In Proceedings of the 7th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE’11), IEEE, pp. 485–488.
• Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham and Dang Duc Pham. Ripple Down Rules for Part-Of-Speech Tagging. In Proc. of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2011), Springer-Verlag LNCS, part I, pp. 190–201.
• Dai Quoc Nguyen, Dat Quoc Nguyen and Son Bao Pham. A Vietnamese question answering system. In Proceedings of the 2009 International Conference on Knowledge and Systems Engineering (KSE 2009), IEEE CS, pp. 26–32.
First and foremost, I would like to express my deepest gratitude to my supervisor, Dr Pham Bao Son, for his patient guidance and continuous support throughout the years. He is always there when I need help, and responds to queries helpfully and promptly.
I would like to give my sincere appreciation to my younger brother, Nguyen Quoc Dat, for his great support.
I would especially like to thank Prof Bui The Duy and my colleagues for their help during my time at the Human Machine Interaction Laboratory, UET/Coltech.
I sincerely acknowledge the Vietnam National University, Hanoi, the Toshiba Foundation Scholarship, and especially Dr Pham Bao Son for financially supporting my master’s study.
Finally, this thesis would not have been possible without the support and love of my mother and my father. Thank you!
To my family ♥
Table of Contents
1 Introduction
    1.1 A Semantic Approach for Question Analysis
    1.2 A Vietnamese Text-based Conversational Agent
    1.3 Thesis Organisation
2 Literature review
    2.1 Text-based conversational agents
        2.1.1 Using keywords for pattern matching
        2.1.2 Using the sentence similarity measure for pattern matching
    2.2 FrameScript Scripting Language
    2.3 Question answering systems
3 Our Question Answering System Architecture
    3.1 Vietnamese Question Answering System
        3.1.1 Natural language question analysis component
            3.1.1.1 Intermediate representation of an input question
            3.1.1.2 Question analysis
        3.1.2 Answer retrieval component
    3.2 Using FrameScript for question analysis
        3.2.1 Preprocessing module
        3.2.2 Syntactic analysis module
        3.2.3 Semantic analysis module
4 Text-based Conversational Agent for Vietnamese
    4.1 Overview of architecture
    4.2 Determining separate contexts
    4.3 Identifying hierarchical contexts
    5.1 Experimental results for Vietnamese text-based conversational agent
    5.2 Question Analysis for English
    5.3 Discussion
A Scripting patterns
List of Figures
2.1 O’Shea et al.’s conversational agent framework
2.2 Aqualog’s architecture
3.1 Architecture of our question answering system
3.2 Architecture of the natural language question analysis component using FrameScript
4.1 Architecture of our Vietnamese text-based conversational agent
List of Tables
4.1 Script examples of “subjects”
4.2 Transformations between contexts
4.3 Order of transformation rules
4.4 Ordered transformation between contexts
5.1 List of transformations among contexts
5.2 Unsatisfying analysis
5.3 The satisfied degree of students
5.4 Number of rules corresponding with each question-structure type
5.5 Number of rules with conditional responses
5.6 Number of questions corresponding with each question-structure type
5.7 Error results
NLP Natural Language Processing
GUI Graphical User Interface
Chapter 1
Introduction
The goal of question answering systems is to give answers to the user’s questions instead of the ranked lists of related documents returned by most current search engines (Hirschman and Gaizauskas, 2001). The natural language question analysis component is the first component in any question answering system. This component creates an intermediate representation of the input question, which is expressed in natural language, to be utilized in the rest of the system.
To the best of our knowledge, most published works so far use rule-based approaches for the task of translating a natural language question into an explicit intermediate representation in question answering systems. Some question answering systems, such as (Lopez et al., 2007; Phan and Nguyen, 2010), manually define a list of sequence rule structures to analyze questions. However, in these rule-based approaches, manually creating the rules is error-prone and expensive in time and effort.
In this thesis, we present an approach that returns an intermediate representation of a question via the FrameScript scripting language (McGill et al., 2003). Natural language questions are transformed into intermediate representation elements, which include the construction type of the question, the question class, the keywords in the question, and the semantic constraints between them. FrameScript allows users to intuitively write rules that directly extract the output tuple.
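To make the shape of the output tuple concrete, the following is a minimal Python sketch of a container for the four elements named above. It is only an illustration: the class names, field names, and all example values are hypothetical and are not taken from the thesis or from FrameScript.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SemanticConstraint:
    """A relation between two keywords of the question (hypothetical layout)."""
    left: str
    relation: str
    right: str


@dataclass
class IntermediateRepresentation:
    """The four elements named in the text; field names are assumptions."""
    construction_type: str                 # construction type of the question
    question_class: str                    # question class
    keywords: List[str] = field(default_factory=list)
    constraints: List[SemanticConstraint] = field(default_factory=list)


def as_tuple(rep: IntermediateRepresentation) -> Tuple:
    """Flatten the representation into a plain, hashable output tuple."""
    return (
        rep.construction_type,
        rep.question_class,
        tuple(rep.keywords),
        tuple((c.left, c.relation, c.right) for c in rep.constraints),
    )


# Hypothetical values, chosen only to show the structure.
example = IntermediateRepresentation(
    construction_type="Normal",
    question_class="Entity",
    keywords=["student", "course"],
    constraints=[SemanticConstraint("student", "enrolled_in", "course")],
)
print(as_tuple(example))
```

Keeping the constraints as explicit triples between keywords is one possible design; the actual element layout used by the system is described in Chapter 3.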
A text-based conversational agent is a program that allows conversational interactions between human and machine using natural language through text. The text-based conversational agent uses scripts organized into contexts comprising hierarchically constructed rules. The rules consist of patterns and associated responses, where the input is matched against the patterns and the corresponding responses are sent to the user as output.
We focus on the analysis of input text in building a conversational agent. Recently, the analysis of the user’s input statements has been developed following two main approaches: using keywords (ELIZA (Weizenbaum, 1983), ALICE (Wallace, 2001), ProBot (Sammut, 2001)) and using similarity measures (O’Shea et al., 2010; Graesser et al., 2004; Traum, 2006) for pattern matching. The approaches using keywords usually utilize a scripting language to match the input statements, while the other approaches measure the similarity between the statements and the patterns from the agent’s scripts.
In this thesis, we introduce a Vietnamese text-based conversational agent architecture for a specific knowledge domain. Our system aims to direct the user’s statement into an appropriate context. The contexts are structured in a hierarchy of scripts consisting of rules in the FrameScript language (McGill et al., 2003). In addition, our text-based conversational agent was constructed to be integrated into a Vietnamese question answering system. Our conversational agent provides not only information related to the user’s statement but also the knowledge needed to support our question answering system when it is unable to find an answer.
The knowledge domain we used to build our text-based conversational agent is the academic regulation at Vietnam National University, Hanoi (VNU). The academic regulation book helps students to learn about the course programs, the examination regulations, and the discipline rules at VNU. However, most students prefer not to read the academic regulation book. Therefore, our contribution creates an interaction channel that offers the necessary information to students. Once students state what they are interested in within the academic regulation, our text-based conversational agent responds to these statements by providing the related information in detail. Furthermore, our conversational agent also interacts with students by asking whether they want to know other information.
Chapter 2
Literature review
In this chapter, we review related work on text-based approaches for conversational agents (CA). Section 2.1 describes approaches that construct rules to match the user’s natural language utterances, either by using keywords (Section 2.1.1) or by using a sentence similarity measure (Section 2.1.2). In addition, Section 2.2 covers the background on the FrameScript scripting language that we have been working with, while Section 2.3 reviews question answering systems for specific domains.
2.1 Text-based conversational agents
2.1.1 Using keywords for pattern matching
ELIZA (Weizenbaum, 1983) was one of the earliest text-based conversational agents. It is based on simple pattern matching that identifies keywords in the user’s statement. ELIZA then applies an appropriate transformation rule to the user’s statement and generates an output response. The procedure by which ELIZA responds to a user input consists of five steps:
• Identify the important keywords appearing in the user’s statement
• Define some minimal context within which the selected keyword occurs
• Determine an appropriate transformation rule
• Generate responses when the input text contains no keywords
• Provide an editing facility for scripts at the script-writing level
Transformation rules serve to decompose a data string according to certain criteria and to reassemble the decomposed string according to certain assembly specifications. The input is therefore analyzed using the decomposition rules triggered by keywords, and responses are generated using the reassembly rules associated with the selected decomposition rules. For example, given the input sentence:
“It seems that you like me”
the sentence is decomposed into four parts:
(1) It seems that (2) you (3) like (4) me
by using the decomposition rule:
(0 YOU 1 ME)
The associated response might then be:
“What makes you think I like you”
by using the reassembly rule:
(WHAT MAKES YOU THINK I 3 YOU)
An integer 0 in a decomposition rule matches an indefinite number of words, and a non-zero integer “n” indicates that exactly “n” words are matched, while the integer 3 in the above reassembly rule indicates that the third part of the decomposed sentence is inserted in its place to form the reply. The input sentence is scanned from left to right and each word is looked up in a dictionary of keywords; when a word is found to be a keyword, only the decomposition rules containing that keyword need to be tried.
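The decomposition and reassembly steps above can be sketched in a few lines of Python. This is a simplified illustration rather than Weizenbaum’s implementation: rules are written as Python lists, the function names are our own, and the 0 wildcard is assumed to take the shortest span that lets the rest of the rule match.

```python
def decompose(rule, words):
    """Split `words` into parts according to a decomposition rule.

    Rule items: 0 matches an indefinite number of words, a positive
    integer n matches exactly n words, a string matches that literal
    word (case-insensitively). Returns a list of parts, or None.
    """
    if not rule:
        return [] if not words else None
    head, rest = rule[0], rule[1:]
    if head == 0:
        # Try the shortest span first (an assumption of this sketch).
        for n in range(len(words) + 1):
            tail = decompose(rest, words[n:])
            if tail is not None:
                return [words[:n]] + tail
        return None
    if isinstance(head, int):
        if len(words) < head:
            return None
        tail = decompose(rest, words[head:])
        return None if tail is None else [words[:head]] + tail
    if words and words[0].upper() == head:
        tail = decompose(rest, words[1:])
        return None if tail is None else [words[:1]] + tail
    return None


def reassemble(rule, parts):
    """Build a reply: an integer k inserts the k-th decomposed part."""
    out = []
    for item in rule:
        if isinstance(item, int):
            out.extend(parts[item - 1])
        else:
            out.append(item)
    return " ".join(out)


parts = decompose([0, "YOU", 1, "ME"], "It seems that you like me".split())
reply = reassemble(["WHAT", "MAKES", "YOU", "THINK", "I", 3, "YOU"], parts)
print(reply)
```

On the example sentence, `parts` holds the four segments listed above and `reply` is "WHAT MAKES YOU THINK I like YOU". The sketch omits keyword ranking and the substitution of pronouns that a full ELIZA script would also specify.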
An ELIZA script consists mainly of a set of list structures.
ALICE (Wallace, 2001) is a text-based conversational agent, or chat robot, utilizing an XML language called Artificial Intelligence Markup Language (AIML). AIML files consist of category tags representing rules; each category tag contains a pattern tag and a template tag. All categories are stored in a tree. The system searches the tree depth-first for the pattern matching a user input, and produces the associated template as a response. For example, consider the categories below:
<topic name="MOVIES">
  <category>
    <pattern>YES</pattern>
    <that>DO YOU LIKE ROMANTIC MOVIES</that>
    <template>What is your favourite romantic movie?</template>
  </category>
  <category>
    <pattern>YES</pattern>
    <that>DO YOU LIKE ACTION MOVIES</that>
    <template>What is your favourite action movie?</template>
  </category>
</topic>
When the client says yes, the program must discover the robot’s previous utterance. If the robot asked “Do you like romantic movies?”, the response sent in reply is “What is your favourite romantic movie?”.
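The role of the previous utterance can be illustrated with a small Python sketch. It is not how ALICE is implemented: the categories are held in a flat list and scanned linearly instead of being stored in a tree, only exact matches are supported (no wildcards), and the names `CATEGORIES`, `normalize` and `respond` are hypothetical.

```python
import re

# The two categories from the AIML example, as plain dictionaries.
CATEGORIES = [
    {
        "topic": "MOVIES",
        "pattern": "YES",
        "that": "DO YOU LIKE ROMANTIC MOVIES",
        "template": "What is your favourite romantic movie?",
    },
    {
        "topic": "MOVIES",
        "pattern": "YES",
        "that": "DO YOU LIKE ACTION MOVIES",
        "template": "What is your favourite action movie?",
    },
]


def normalize(text):
    """Upper-case the text and drop punctuation, as AIML patterns expect."""
    return " ".join(re.sub(r"[^A-Za-z0-9 ]", " ", text).upper().split())


def respond(user_input, previous_bot_utterance, topic):
    """Return the template whose pattern, <that> and topic all match."""
    wanted_pattern = normalize(user_input)
    wanted_that = normalize(previous_bot_utterance)
    for category in CATEGORIES:
        if (category["topic"] == topic
                and category["pattern"] == wanted_pattern
                and category["that"] == wanted_that):
            return category["template"]
    return None  # no category applies


print(respond("Yes", "Do you like romantic movies?", "MOVIES"))
```

The same input "Yes" selects a different template depending on what the robot said last, which is the behaviour the `<that>` tag provides.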
AIML is clever, simple, easy to implement, and a good starting point for beginners writing simple bots. However, it is difficult to write and debug more discriminating patterns, and it is very hard to know all the available transformations because AIML depends on self-modifying the input.
Sammut (2001) presented a text-based CA called ProBot that is able to extract data from users. ProBot’s scripts are typically organized into hierarchical contexts consisting of a number of organized rules to handle unexpected inputs. Building on ProBot’s scripts (Sammut, 2001), McGill et al. (2003) built the rule system of the FrameScript scripting language (see Section 2.2). FrameScript (McGill et al., 2003) provides for the rapid prototyping of conversational interfaces and simplifies the writing of scripts.
2.1.2 Using the sentence similarity measure for pattern matching
O’Shea et al. (2008, 2010) proposed a text-based conversational agent framework (shown in Figure 2.1) using semantic analysis. All patterns in the scripts are natural language sentences. Pattern matching uses a sentence similarity measure (Li et al., 2006) to calculate the similarity between the sentences from the scripts and the user input. The highest-ranked sentence is selected and its associated response is sent as output.
Figure 2.1: O’Shea et al.’s conversational agent framework
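The selection step of such a framework can be sketched as follows. The sketch does not implement the measure of Li et al. (2006); as a clearly labeled stand-in it uses Jaccard overlap between word sets, and the rules, responses and function names are hypothetical examples.

```python
import re

# Hypothetical rules: (pattern sentence, associated response).
RULES = [
    ("how do i register for a course", "Registration is done online."),
    ("when is the final exam", "The exam schedule is posted each term."),
]


def tokens(sentence):
    """Lower-case the sentence and return its set of words."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Jaccard word overlap; a stand-in for a semantic similarity measure."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_response(user_input):
    """Rank every pattern against the input and return the top match."""
    pattern, response = max(RULES, key=lambda rule: similarity(user_input, rule[0]))
    return response, similarity(user_input, pattern)


response, score = best_response("When is the exam?")
print(response, round(score, 2))
```

Unlike keyword matching, every pattern receives a score, so a practical agent would likely also need a minimum threshold below which no rule fires; that threshold is not part of this sketch.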
Scripts used in the framework consist of contexts relating to a specific topic of conversation. Each context contains one or more rules, and each rule uses “s” to represent