ELVA: AN EMBODIED CONVERSATIONAL AGENT IN AN INTERACTIVE VIRTUAL WORLD
ACKNOWLEDGEMENTS

First of all, I wish to thank my supervisor A/P Chee Yam San for his guidance, encouragement, and patience over the years. The many discussions we had, in which he showed his enthusiasm towards the topic, kept me on the right track.

I would like to show my gratitude to the head of NUS Museums, Ms Angela Sim, for granting us permission to use the Ng Eng Teng Gallery as our subject matter and making the project possible.

I would like to thank Dr Sabapathy for reviewing the virtual tour and providing his insights on Ng Eng Teng's art. Thanks also to Suhaimi, curator of the Ng Eng Teng Gallery, who contributed his expert knowledge about how to guide a tour. Special thanks to those who lent their hands during the gallery-scanning sessions: Mr Chong, Beng Huat, Chao Chun, Ting, and Donald.

I am also grateful to the members of the LELS Lab: Chao Chun, Jonathan, Lai Kuan, Liu Yi, and Si Chao. It was an enjoyable and memorable experience studying in this lab.

Finally, it is time to thank my family. Their support and patience have accompanied me throughout the whole period.
Yuan Xiang
TABLE OF CONTENTS

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY

CHAPTER 1 INTRODUCTION
1.1 EMBODIED CONVERSATIONAL AGENTS: A CHALLENGE FOR VIRTUAL INTERACTION
1.2 RESEARCH OBJECTIVES
1.3 ELVA, AN EMBODIED TOUR GUIDE IN AN INTERACTIVE ART GALLERY
1.4 STRUCTURE OF THESIS

CHAPTER 2 RESEARCH BACKGROUND
2.1 ECA AS A MULTI-DISCIPLINARY RESEARCH TOPIC
2.2 ARCHITECTURAL REQUIREMENT
2.3 DISCOURSE MODEL
2.4 MULTIMODAL COMMUNICATIONS
2.5 SUMMARY

CHAPTER 3 AGENT ARCHITECTURE
3.1 OVERVIEW
3.2 PERCEPTION MODULE
3.3 ACTUATION MODULE
3.4 INTERPRETATION MODULE
3.4.1 Reflexive Layer
3.4.2 Reactive Layer
3.4.3 Deliberative Layer
3.4.4 Behavior Generation
3.5 KNOWLEDGE BASE

CHAPTER 4 AGENT'S VERBAL BEHAVIORS
4.1 SCHEMA-BASED DISCOURSE FRAMEWORK
4.2 DISCOURSE MODELING
4.2.1 Narrative Modeling
4.2.2 Dialog Modeling
4.3 REASONING
4.3.1 Analysis Phase
4.3.2 Retrieval Phase
4.4 PLANNING
4.5 DIALOG COORDINATION
4.6 SUMMARY

CHAPTER 5 AGENT'S NONVERBAL BEHAVIORS
5.1 TYPES OF NONVERBAL BEHAVIORS IN OUR DESIGN
5.1.1 Interactional Behaviors
5.1.2 Deictic Behaviors
5.2 NONVERBAL BEHAVIOR GENERATION
5.3 NONVERBAL BEHAVIOR EXPRESSION
5.3.1 Building Blocks of Multimodality
5.3.2 Multimodality Instantiation
5.3.3 Multimodality Regulation
5.4 SUMMARY

CHAPTER 6 ILLUSTRATED AGENT-HUMAN INTERACTION
6.1 AN INTERACTIVE ART GALLERY
6.2 VERBAL COMMUNICATION
6.2.1 Elva presents information
6.2.2 Elva probes the user in a dialogue
6.2.3 Elva answers questions
6.3 NONVERBAL COMMUNICATION
6.3.1 Deictic Gestures
6.3.2 Facial Display
6.3.3 Locomotion

CHAPTER 7 EVALUATION
7.1 METHODOLOGIES
7.1.1 Qualitative Analysis
7.1.2 Quantitative Analysis
7.2 SUBJECTS, TASK AND PROCEDURE
7.3 EVALUATION RESULTS
7.3.1 User Experience
7.3.2 Agent Believability
7.3.2.1 Evaluation results on Elva's Verbal Behaviors
7.3.2.2 Evaluation results on Elva's Nonverbal Behaviors
7.4 DISCUSSION
7.5 SUMMARY

CHAPTER 8 CONCLUSION
8.1 SUMMARY
8.2 CONTRIBUTIONS OF THE THESIS
8.3 FUTURE WORK

REFERENCES
LIST OF TABLES

Table 1: List of speech acts
Table 2: Layout of a sample tour plan
Table 3: Elva's interactional behaviors (portion only)
Table 4: Elva's multimodal behavior library
Table 5: Evaluation results related to user satisfaction in interaction with Elva
Table 6: Users' responses to Elva's personality
Table 7: Evaluation results related to Elva's verbal behaviors
Table 8: Evaluation results related to nonverbal behaviors
LIST OF FIGURES

Figure 1: Examples of embodied conversational agents
Figure 2: Steve's deictic gestures
Figure 3: Greta's facial display: neutral, sorry-for, and surprise
Figure 4: Three-layered architecture for the agent's mental processing
Figure 5: Schema-based discourse space (portion only)
Figure 6: Narrative modeling using a schema
Figure 7: Dialog modeling using a schema
Figure 8: Transform raw user input to utterance meaning
Figure 9: State transition graph for QUESTION_MATERIAL
Figure 10: Speech act classification using STG
Figure 11: Elva's tour plan
Figure 12: Conversational modalities
Figure 13: Synchronized modalities in an animation timeline
Figure 14: Elva invites the user to start the tour
Figure 15: Elva's basic facial expressions
Figure 16: Elva's locomotion
Figure 17: Procedure of user study on Elva
SUMMARY

The technology of Embodied Conversational Agents (ECAs) offers great promise for natural and realistic human-computer interaction. However, interaction design in this domain needs to be handled sensitively in order for the verbal and nonverbal signals conveyed by the agent to be understood by human users.

This thesis describes an integrative approach for building an ECA that is able to engage conversationally with human users, and that is capable of behaving according to social norms in terms of facial display and gestures. The research focuses on the attributes of the agent's autonomy and believability, not its truthfulness. To achieve autonomy, we present a three-layered architectural design to ensure appropriate coupling between the agent's perception and action, and to hide the internal mechanism from users. In regard to agent believability, we utilize the notion of "schema" to support structured and coherent verbal behaviors. We also present a layered approach to generate and coordinate the agent's nonverbal behaviors, so as to establish social interactions within a virtual world.

Using the above approach, we developed an ECA called Elva. Elva is an embodied tour guide that inhabits an interactive art gallery to offer guidance to human users. A user study was performed to assess user satisfaction and agent believability when interacting with Elva. The user feedback was generally positive. Most users interacted successfully with Elva and enjoyed the tour. A majority agreed that Elva's behaviors, both verbal and nonverbal, were comprehensible and appropriate. The user study also revealed that emotive behaviors should be integrated into Elva to achieve a higher level of believability. Future work will address affective computing, multiparty interaction support, and user modeling.
CHAPTER 1 INTRODUCTION

1.1 Embodied Conversational Agents: A Challenge for Virtual Interaction

One path towards achieving realistic conversational interfaces has been the creation of embodied conversational agents, or ECAs. An ECA is a life-like computer character that can engage in a conversation with humans using natural spoken dialogue. It also has a "body" and knows how to use it for effective communication.

In recent years, systems containing ECAs have been deployed in a number of domains. There are pedagogical agents that educate and motivate students in e-learning systems (see Figure 1: A, B), virtual actors or actresses developed for entertainment or therapy (see Figure 1: C), sales agents that demonstrate products in e-commerce applications (see Figure 1: D), and web agents that help the user surf the web pages of a company. In the future, ECAs are likely to become long-term companions to many people and share much of their daily activity [20].
(A) Steve, (B) Cosmo, (C) Greta, (D) Rea

Figure 1: Examples of embodied conversational agents
While embodied conversational agents hold the promise of effective and comfortable conversation, the related research is still in its infancy. Conversation with ECAs presents particular theoretical and practical problems that warrant further investigation. Some of the most intensively pursued research questions include:

• What are the appropriate computational models for building autonomous and intelligent ECAs?

• How can we draw on existing methods in research and effectively put the pieces together to design characters that are compelling and helpful?

• What is the set of communicative behaviors that are meaningful and useful for humans, and practical to implement?

It is evident that the field of embodied conversational agents remains open and requires systematic research.
1.2 Research Objectives
This research project aims to develop an integrative approach for building an ECA that is able to engage conversationally with human users, and capable of behaving according to social norms in terms of facial display and gestures.

Our research into embodied conversational agents focuses on the attributes of autonomy and believability, not truthfulness. To achieve autonomy, the agent we develop shall be capable of relating itself to the virtual world in a sensible manner. In view of this, the architectural design aims to couple the agent's perception and action appropriately, and to hide its internal mechanism from the system users. To achieve believability, the agent must demonstrate sufficient social skills in terms of speech and animation. Towards this goal, we pay careful attention to the design of verbal and nonverbal interaction and to the attainment of flows of behavior that help to establish social facts within the virtual world.

In practice, we work towards assembling a coherent set of ECA technologies that help to attain the attributes mentioned above, and that are feasible to implement given the constraints in project timeline and resources. Our work emphasizes the development of the ECA technologies described below:
• Layered architecture: In our design, the agent is divided into three layers to support its autonomous behaviors. The agent is able to sense the environment and react to an actually detected situation. The agent can also act deliberately, i.e., plan a course of action over time in pursuit of its own goal.

• Discourse model: We aim to provide a unified discourse framework in which conversation about individual topics can be modeled. Under this framework, the agent will be able to narrate about a specific topic, as well as to respond to user queries in an appropriate manner.

• Generation and coordination of multimodal behaviors: Multimodal behaviors such as deictic gestures, turn taking, attention, and nonverbal feedback can be instantiated and expressed. We also need to devise an approach to coordinate the modalities that are generated.
We adopt this integrative approach in the design and development of an agent called Elva. Elva is an embodied tour guide that inhabits an interactive art gallery to offer guidance to human users. A user study has been performed to measure user satisfaction and agent believability. Human-computer interaction evaluation methods are exploited to cover this novel embodied conversational setting.
1.3 Elva, an Embodied Tour Guide in an Interactive Art Gallery
Elva is an embodied conversational agent developed in the Learning Science and Learning Environment Lab, National University of Singapore, for this project. Her job is to guide a user through an interactive art gallery and engage conversationally with the user about gallery artifacts.

In conventional tours, the guide decides where to take people and the exact timing of the itinerary. However, under the circumstances of a virtual tour, we expect the system users to behave more actively. In the interest of greater user freedom and participation, the design especially caters to mixed-initiative interaction, where both parties can direct the navigation and conversation. The following paragraphs describe Elva's three categories of behaviors: navigation, speech, and nonverbal behaviors.
• Navigation: A tour can be divided into a sequence of locations. Through planning of the itinerary, the guide is able to intentionally direct the user from one artifact to the next. When the user navigates independently, Elva can track the user.

• Speech: Elva's verbal communication can be classified into two modes: narrative and dialog. In the narrative mode, the guide presents information centering on the artifacts, artists, or the gallery being visited. In the dialog mode, the two parties engage in conversation about a specific topic.

• Nonverbal behaviors: The presence of body language and facial display helps to convey communicative functions: to manage turn taking and feedback, to refer to an artifact, and to create interesting effects.

We will use Elva as an example to illustrate various aspects of our approach throughout this thesis.
1.4 Structure of Thesis
This thesis is divided into eight chapters, each detailing a specific area of focus:
Chapter 1, Introduction, gives an overview of the motivations and objectives of this thesis.

Chapter 2, Research Background, discusses the various research areas that ground this multi-disciplinary project. Existing methodologies are reviewed from the perspectives of architectural requirements, discourse models, and multimodality.

Chapter 3, Agent Architecture, presents a generic three-layered architecture that forms the backbone of our ECA. We then describe the construction of this architecture and the interaction among the system components.

Chapter 4, Agent's Verbal Behaviors, introduces the notion of schema, a discourse template from which the narrative or dialog about a specific topic can be dynamically generated. It then describes how structured and coherent verbal behaviors are supported under the schema-based knowledge framework.

Chapter 5, Agent's Nonverbal Behaviors, begins by highlighting the importance of interactional behaviors and deictic gestures in enhancing agent-user communication. It then presents our methods to generate and express multimodal behaviors.

Chapter 6, Illustrated Agent-Human Interaction, illustrates how Elva establishes verbal and nonverbal communication with a system user and guides the user in a gallery tour.

Chapter 7, Evaluation, describes the evaluation methodology and observed results of the user study performed on the virtual tour with Elva.

Chapter 8, Conclusion, summarizes the thesis and documents our achievements and contributions. Possible future work is also discussed.
CHAPTER 2 RESEARCH BACKGROUND
2.1 ECA as a Multi-disciplinary Research Topic
Research on embodied conversational agents embraces multiple dimensions: ECAs inherit research problems from their supertype, the autonomous agent, a field of Artificial Intelligence; the verbal communication aspect may require technologies from Natural Language Processing (NLP) and Information Retrieval (IR); and the embodiment and communicative behavior aspects are intrinsically social, forming a dimension of Human-Computer Interaction (HCI) research.

We identified the key aspects of ECA technology through a breadth-wise study of the field. We then concentrated our research on the development of ECAs from the aspects of agent architecture, conversation model, and the generation and coordination of multimodal behaviors. The following sections discuss previous research in these areas.
2.2 Architectural Requirement
Architectural requirements have been widely discussed in the research literature [9,21,23]. Sloman described architectural layers that comprise mechanisms for reactive, deliberative, and meta-management controls [23]:

• The reactive mechanism acts in response to an actually detected internal or external event. Reactive systems may be very fast because they use highly parallel implementations. However, such systems may produce uncoordinated behaviors due to their inability to contemplate, evaluate, and select possible future courses of action.

• The deliberative mechanism provides more sophisticated planning and decision-making mechanisms. Therefore, it allows for more coordination and global control.

• The meta-management mechanism allows the deliberative processes to be monitored and improved. It is an ambitious step towards a human-like architecture. However, due to its extraordinary complexity, it is not yet practically viable for implementation [23].

Most agent architectures conform to the action selection paradigm [1,10,11,18]. FSA (Finite State Automaton) is a simple architecture based on the reactive mechanism described above; the behavior state in FSA is controlled by events generated from the values of internal mood variables [18]. BLUMBERG is another reactive architecture, which employs a hierarchical behavioral representation where behaviors compete among themselves according to their "levels of interest" [18]. JAM is a traditional BDI (Belief-Desire-Intention) architecture operating with explicit goals and plans; JAM agents can deliberately trigger the search for the best plan from a plan library [11].
2.3 Discourse Model

ALICE's conversational capabilities rely on case-based reasoning and pattern matching. Another chatterbot, JULIA, employs an activation network to model possible conversations on a specific topic [10]. When an input pattern is matched, the node containing the pattern has its activation level raised, and the node with the highest level is then selected. This allows more structured and flowing conversations than simple pattern matching. It is interesting to note that NLP is seen as less important in chatterbots; simple implementations are often sufficient for creating believable conversation in a limited context.
More recently, researchers have begun to exploit the notion of speech acts to analyze the intentions of speakers [8,15]. Based on Searle's speech act theory [3], an utterance can be classified into a speech act, such as stating, questioning, commanding, or promising. The categories of speech acts can be domain-independent, as in the TRAINS and DAMSL annotation schemes [8], or they can be defined so as to be relevant to a domain of conversation. When the intention of the speaker is captured with a speech act, the response given by the agent is likely to be more accurate, compared to matching on simple keywords alone.
2.4 Multimodal Communications
Embodiment allows multimodality, thereby making interaction more natural and robust. Research on multimodal communication has concentrated on the question of generating understandable nonverbal modalities. In what follows, we review some previous research in this field.

Steve was developed by the Center for Advanced Research in Technology for Education (CARTE) [16]. As shown in Figure 2, Steve can demonstrate actions, use gaze and deictic gestures to direct the students' attention, and guide the students around with locomotion. To do so, Steve exploits its knowledge of the position of objects in the world, its location relative to these objects, and its prior explanations to create deictic gestures, motions, and utterances that are both natural and unambiguous.
Figure 2: Steve's deictic gestures
Greta, embodied in a 3D talking head (see Figure 3), shows rich expressiveness during natural conversation with the user. Greta manifests emotion by employing a belief network that links facial communicative functions to facial signals consistent with the discourse content [19,20]. The facial communicative functions are those typically used in human-human dialogs, for instance: syntactic, dialogic, meta-cognitive, performative, deictic, adjectival, and belief relation functions.
Figure 3: Greta's facial display: neutral, sorry-for, and surprise
REA [6] is an embodied real-estate agent that is able to describe the features of a house using a combination of speech utterances and gestures. Rea's speech and gesture output is generated in real time from the knowledge base and the description of communicative goals, using the SPUD ("Sentence Planning Using Description") engine.
2.5 Summary
We have reviewed some previous research in the areas of architectural requirements, discourse models, and multimodality. In the following chapters, we proceed to present our integrative approach.
CHAPTER 3 AGENT ARCHITECTURE

In this chapter, we present a generic three-layered architecture that forms the backbone of our ECA. We then describe the construction of this architecture and the interaction among the system components.
3.1 Overview
In our design, an embodied conversational agent interfaces with the virtual world via a perception module and an actuation module. The perception module provides the agent with high-level sensory information. The actuation module enables the ECA to walk and to perform gestures and facial expressions. The gap between perception and action is bridged by mental processing in the Interpretation Module, a knowledge-based inference engine. We model this module using a combination of reactive and deliberative mechanisms in order to cope with mixed-initiative situations. The architecture (see Figure 4) comprises three layers: a reflexive layer (section 3.4.1), a reactive layer (section 3.4.2), and a deliberative layer (section 3.4.3).

Figure 4: Three-layered architecture for the agent's mental processing
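To make the control flow concrete, the sketch below shows in Python how percepts might be routed through the three layers. It is a minimal illustration of the architecture just described, not Elva's actual implementation; all class and method names are our own assumptions.

# Minimal sketch of the perceive-interpret-actuate flow; names are
# illustrative assumptions, not taken from Elva's code.

class InterpretationModule:
    def __init__(self, reflexive, reactive, deliberative, actuation):
        self.reflexive = reflexive        # quick, inflexible responses (3.4.1)
        self.reactive = reactive          # appraises detected events (3.4.2)
        self.deliberative = deliberative  # planner and scheduler (3.4.3)
        self.actuation = actuation        # plays back behavior scripts (3.3)

    def on_raw_percept(self, data):
        # Raw perceptual data (e.g., user coordinates) feeds the reflexive
        # layer directly for quick processing.
        behavior = self.reflexive.respond(data)
        if behavior:
            self.actuation.enqueue(behavior)

    def on_event(self, event):
        # Abstracted events (e.g., "user arrives at artifact2") are appraised
        # reactively, while the deliberative layer revises goals and plans.
        behavior = self.reactive.appraise(event)
        self.deliberative.update_goal(event)
        if behavior:
            self.actuation.enqueue(behavior)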
3.2 Perception Module
The Perception Module endows an ECA with the ability to "see" and "hear" in the virtual world. It supplies sensory information about the agent itself, the system users (as avatars), the environment, and the relationships among them, for example: "user arrives at artifact2," "user is facing artifact2," "I am at artifact2," "artifact2 is on my left-hand side," "user clicks on artifact2," and "user said <How do you do?> to me."

When a state change in the world is detected, the Perception Module abstracts the information and propagates this event to the reactive layer for appraisal. Simultaneously, raw perceptual data, such as the coordinates of the user's current position, are fed to the reflexive layer for quick processing.
3.3 Actuation Module
The Actuation Module drives the character animation and generates synthesized speech. The Interpretation Module produces behavior scripts, which specify multimodal behaviors over a span of time, and sends them to the event queue of the Actuation Module. The Actuation Module sequences and coordinates the multimodal signals in different channels (described in section 5.3) and plays them back. If necessary, the Actuation Module updates the agent's goal state and conversational state upon the completion of actuation.
3.4 Interpretation Module

The Interpretation Module is an inference engine that bridges the agent's perception and actions. It comprises three layers: a reflexive layer, a reactive layer, and a deliberative layer.
3.4.1 Reflexive Layer
The reflexive layer implements a series of reflex behaviors that are quick and inflexible (Q&I). For example, the agent is able to track the user with head movement in response to the user's subtle location changes, and the agent glances at an artifact when the user clicks on it.
3.4.3 Deliberative Layer

The deliberative layer provides planning and decision-making mechanisms. The ECA in our environment is equipped with information-delivery goals and plans that accomplish those goals. The Planner selects an appropriate plan from the plan library based on the agent's goal. During the interaction, the Planner may adopt new plans as the goal changes. The Scheduler instantiates and executes the selected plan; with its help, the agent is able to advance the conversation at regular time intervals. Chapter 4 goes through the details.
3.4.4 Behavior Generation
The utterances produced by the reactive layer and the deliberative layer flow into a Dialog Coordinator, where turn taking is regulated. Finally, the Enricher generates appropriate nonverbal behaviors to accompany the speech before the behavior is actuated. Details are described in chapter 5.
3.5 Knowledge Base
The Knowledge Base (KB) at the heart of the system consists of schema nodes that encode the domain knowledge as condition-action pairs. FOCUS refers to the schema node that constitutes the agent's current focus of attention during the conversation. FOCUS serves as a reference point for both reactive and deliberative processes, and it is updated every time a new behavior has been actuated. The details are covered in section 4.1.
CHAPTER 4 AGENT'S VERBAL BEHAVIORS

This chapter introduces the notion of schema, a discourse template from which the narrative and dialog about a specific topic can be dynamically generated. It then describes how structured and coherent verbal behaviors are supported under the schema-based knowledge framework.
4.1 Schema-based Discourse Framework
A schema defines a template from which narrative or dialog about a specific topic can be dynamically generated. Each schema is related to a discourse type and a topic. For example, the schema named ELABORATE_BIODATA_OF_ARTIST1 indicates that the topic BIODATA_OF_ARTIST1 is elaborated using this schema. A schema is modeled as a network of utterances that contribute to its information-delivery goal. When a schema, e.g., ELABORATE_BIODATA_OF_ARTIST1, is instantiated, the agent has to fulfill a goal ELABORATE(ARTIST, BIODATA) by producing a sequence of utterances from the template.

A pictorial view of the discourse space is shown in Figure 5: a network of utterances (black dots) forms a schema (dark grey ovals). In turn, a few schemas are grouped into a domain entity (light grey oval), e.g., an artist, an artifact, or an art-related concept. Transiting from one schema to another simulates shifting the topic or referencing another concept in a conversation.
Figure 5: Schema-based discourse space (portion only), with schemas such as INTRO_PARTICULAR_OF_ARTIST1, ELABORATE_BIODATA_OF_ARTIST1, ELABORATE_GENRE_OF_ARTIST1, ELABORATE_TECHNIQUE_OF_SCULPTART, and DEFINE_CONCEPT_OF_SCULPTART
An utterance is encapsulated as a schema node in the knowledge base. Each schema node can have multiple conditions, an action, and several links. A condition describes a pattern of user input, in terms of a speech act [3] and a keyword list, which activates the node. The action contains a list of interchangeable responses. A link specifies one of three types of relationships between two adjacent nodes: (1) a sequential link defines a pair of nodes in the order of narration; (2) a dialog link connects the agent's adjacent turns in a dialog, and contains a pattern that is expected from the user's turn; (3) a reference link bridges two schemas, analogous to a hyperlink.

We employ an XML tree structure to represent the schema nodes. The definition is shown below:
<!-- ELVA's schema DTD -->
<!ELEMENT schema (node+)>
<!ELEMENT node (condition+, action, link+)>
<!ELEMENT condition (speechAct?, keywordList?)>
<!ELEMENT speechAct (#PCDATA)>
<!ELEMENT keywordList (#PCDATA)>
<!ELEMENT action (utterance+)>
<!ELEMENT utterance (#PCDATA)>
<!ELEMENT link (#PCDATA)>
<!ATTLIST schema schema_id ID #REQUIRED>
<!ATTLIST schema goal CDATA #REQUIRED>
<!ATTLIST condition type (dependant|independent) "independent">
<!ATTLIST node id ID #REQUIRED>
<!ATTLIST node schema_id IDREF #REQUIRED>
<!ATTLIST action affective_type CDATA #IMPLIED>
<!ATTLIST link relationship (sequential|dialog|reference) "sequential">
<!ATTLIST link node_id IDREF #REQUIRED>
Using this definition, a schema node can be encoded as follows:
<node id = "INTER.15.1" schema_id = "INTER.15">
<condition type = "independent">
</utterance>
</action>
<link relationship = “sequential” node_id = “INTER.15.2”>
<link relationship = “dialog” node_id = “INTER.15.3”>
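As an illustration of how such a definition might be consumed, the following sketch parses schema nodes of this format into plain dictionaries using Python's standard library. The helper is ours and assumes only the structure declared in the DTD above, not Elva's actual loader.

# Sketch: load schema nodes from the XML format above into dictionaries.
import xml.etree.ElementTree as ET

def load_schema(path):
    nodes = {}
    schema = ET.parse(path).getroot()  # the <schema> element
    for node_el in schema.findall("node"):
        nodes[node_el.get("id")] = {
            "conditions": [
                {
                    "speech_act": cond.findtext("speechAct"),
                    "keywords": (cond.findtext("keywordList") or "").split(),
                    "type": cond.get("type", "independent"),
                }
                for cond in node_el.findall("condition")
            ],
            # interchangeable responses
            "utterances": [u.text for u in node_el.findall("action/utterance")],
            # (relationship, target node id) pairs
            "links": [
                (link.get("relationship", "sequential"), link.get("node_id"))
                for link in node_el.findall("link")
            ],
        }
    return nodes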
4.2 Discourse Modeling

4.2.1 Narrative Modeling

In the narrative mode, the schema nodes are usually organized in a linear form. As depicted in Figure 6, each schema node encapsulates an utterance. The nodes must be well ordered so as to generate a clear narrative about a specific topic:

"Here you looked at the sculpture called Love and Hate. If you look carefully, you probably see two persons bound together. It suggests love because they are very close together. It also suggests hate because the two faces are pulled away from each other. This sculpture deals with the 'love and hate' relationship."

Figure 6: Narrative modeling using a schema
4.2.2 Dialog Modeling
In the dialog mode, the agent and the user engage in the discussion of a topic, and turns are frequently taken. A sample schema for dialog modeling is shown in Figure 7: Elva probes with "Can you guess what this sculpture looks like?" and branches on the user's reply. Clearly, the organization of schema nodes in a dialog is more dynamic and complicated than in a narrative; some schema nodes branch out into several dialog links, which describe the user's possible responses.

Figure 7: Dialog modeling using a schema
4.3 Reasoning
This section introduces the agent's reasoning process. We utilize the concepts of case-based reasoning and pattern matching. As a key area of improvement, the user's query patterns are represented using the extracted meaning of the utterance, instead of simple keywords.
4.3.1 Analysis Phase

Figure 8: Transform raw user input to utterance meaning (the pipeline tokenizes the raw utterance into a linked word list, corrects ill-formed input, resolves references, removes stop words, replaces synonyms, transforms the words into a stem list using the Porter stemmer, and classifies the result into a speech act)
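Read as a pipeline, Figure 8 suggests a processing chain along the following lines. This sketch is our reconstruction from the figure's step labels; the exact ordering and the helper objects (stemmer, lexicon, classifier) are assumptions rather than Elva's actual code.

# Sketch of the analysis pipeline recovered from Figure 8.

STOP_WORDS = {"the", "a", "an", "is", "this"}  # illustrative subset only

def analyze(raw_utterance, lexicon, stemmer, classifier):
    words = raw_utterance.lower().split()        # tokenize into a word list
    words = [lexicon.get(w, w) for w in words]   # replace synonyms (e.g., via WordNet)
    words = [w for w in words if w not in STOP_WORDS]  # remove stop words
    stems = [stemmer.stem(w) for w in words]     # stem with the Porter stemmer
    speech_act = classifier.classify(stems)      # speech act classification (below)
    return speech_act, stems                     # together: the utterance meaning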
• Speech Act Classification

The next crucial step is to determine the user's intention via speech act classification. Inspired by Searle's speech act theory [3], the system defines over thirty speech acts (see Table 1) to cover the user's common queries and statements. Some speech acts are domain-independent, while the rest are related to our application, i.e., a virtual tour guide in an art gallery.
Illocution      Matters
QUESTION        WHY, WHEN, WHO, WHERE, QUANTITY, CONSTRUCT, MATERIAL, COMMENT, EXAMPLE, TYPE, DESCRIPTION, COMPARISON, MORE
REQUEST         MOVE, TURN, POINT, EXIT, REPEAT
STATE           FACT, REASON, EXAMPLE, PREFERENCE, COMPARE, LOOKLIKE, POSITIVECOMMENT, NEGATIVECOMMENT
COMMUNICATE     GREET, THANK, INTERRUPT, BYE, ACKNOWLEDGE, YES, NO, REPAIR

Table 1: List of speech acts
A speech act is described by its illocutionary force and its object. For example, QUESTION_MATERIAL relates to an inquiry about the material:

<User> Can you tell me what is this sculpture made of?
<Elva> It is made of Ciment-fondu, an industrial material.

STATE_POSITIVECOMMENT reveals a user's positive comment:

<User> I like this sculpture very much.
<Elva> I am glad to hear that.
Our early classification approach was based on a list of acceptable patterns for each speech act. For example, "do you like *" is an acceptable pattern for QUESTION_COMMENT. However, this approach results in relatively low accuracy. Consider the following two utterances:

A: <User> do you like this sculpture?
B: <User> do you like to go for supper?

Utterance B is classified wrongly. This problem was resolved by adding a list of rejected patterns for each speech act. In this case, "do you like to *" is added to the rejected patterns for QUESTION_COMMENT.
We also encountered an overlapping problem in speech act classification. As it is impossible to enumerate all possible patterns for a speech act, the classification is not entirely cut-and-dried: in some cases, one utterance can elicit more than one speech act. For example, the following utterance can be classified into both QUESTION_CONSTRUCT and QUESTION_DESCRIBE:

<User> Can you describe how to make ceramics?

Further investigation reveals that some speech acts, e.g., QUESTION_CONSTRUCT, are in fact specialized instances of another speech act (QUESTION_DESCRIBE); therefore, their patterns may overlap. To get around this problem, we assess the level of specialization for each speech act. Priority is given to the specialized speech act when several speech acts are co-present.
We developed a speech act classifier based on a data model called the state transition graph (STG), as shown in Figure 9. An STG encapsulates the acceptable and rejected patterns for a speech act: a path from the start state to a terminal state indicates an acceptable or a rejected pattern.

Figure 9: State transition graph for QUESTION_MATERIAL

During classification, an utterance is validated against every STG defined for the agent's set of speech acts. For each STG, the utterance performs a word-by-word traversal through the graph. If the utterance runs into a state from which there is no escape, or terminates in the REJECT state, validation has failed. Otherwise, the utterance terminates in the ACCEPT state, which means it matches the corresponding speech act. For example, for the utterance "Do you know what this sculpture is made of," there is a valid path "do → you → know → what → make → of" in QUESTION_MATERIAL (see Figure 10).

Figure 10: Speech act classification using STG
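A word-by-word STG traversal of this kind can be sketched as follows. The dictionary encoding, the skipping of words that have no outgoing edge (which lets fillers such as "this sculpture" pass through), and the sample graph fragment are our assumptions about the mechanism, not the thesis implementation.

# Sketch of validating an utterance against one STG.

ACCEPT, REJECT = "ACCEPT", "REJECT"

# Illustrative fragment of an STG for QUESTION_MATERIAL:
QUESTION_MATERIAL = {
    "START": {"do": "q1", "what": "q3"},
    "q1": {"you": "q2"},
    "q2": {"know": "q2", "tell": "q2", "what": "q3"},
    "q3": {"make": "q4"},
    "q4": {"of": ACCEPT, "to": REJECT},
}

def validate(stg, words):
    """Return True iff the word sequence reaches the ACCEPT state."""
    state = "START"
    for word in words:
        state = stg.get(state, {}).get(word, state)  # unmatched words are skipped
        if state == REJECT:
            return False  # terminated in the REJECT state
    return state == ACCEPT

# "do -> you -> know -> what -> make -> of" reaches ACCEPT:
assert validate(QUESTION_MATERIAL, "do you know what this sculpture make of".split())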
Synonym replacement is a desired feature for embodied conversational agents. It allows a single set of keywords to be scripted in the Knowledge Base, and a list of its synonyms to be specified in a lexical database (e.g., WordNet [24]). The feature helps reduce the number of similar patterns we have to script to cover the various synonyms of a word, and it allows us to adapt the agent to new vocabulary easily.

After the keywords are extracted, they are combined with the speech act to form the utterance meaning. The utterance meaning is then escalated to the next level of processing.
4.3.2 Retrieval Phase
In the retrieval phase, the Response Retriever (refer to Figure 4) searches for the most appropriate schema node that matches the user's utterance meaning.

We have developed a heuristic search approach, named locality-prioritized search, which is sensitive to the locality of the schema nodes. The heuristic rules are based on the characteristics of the schema-based discourse framework we have proposed. Under this framework, the basic assumption is that conversation around a topic is modeled as a schema; in other words, the schema captures the relevant and coherent information about the topic. Given this assumption, the idea of locality-prioritized search is to use locality as a clue about information relevance, so as to perform an effective search. The steps are as follows:

1. The search starts from the successive nodes of FOCUS (recall that FOCUS is the agent's attentional node during the conversation, i.e., "what we were talking about in this topic"). A locality weight w1 is assigned to a successive node connected via a dialog link. If a node is connected to FOCUS via a sequential link, a relatively lower weight w2 is assigned.

2. Scan the nodes within the current schema, assigning a weight w3 that is lower than w2.

3. If necessary, the search space is expanded to the whole knowledge base, and the lowest weight w4 is assigned.
For each schema node, the similarity between its i-th condition and the user's utterance meaning is measured as the sum of the hits in their speech acts, S_i, and the hits in their keyword lists, K_i. We then employ a matching function f to compute the matching score by multiplying the maximum similarity value by the locality weight w of the node. The node with the highest matching score is accepted if the score exceeds a predefined threshold:

f = \max_{i=1}^{n} (\alpha S_i + \beta K_i) \cdot w

The formulation of the matching function reveals our design rationale: first, speech acts are integrated as part of the pattern so as to enhance the robustness of pattern matching; second, to avoid flattened retrieval, i.e., treating every node equally, we introduce the locality weight, which gives priority to the potentially more related schema nodes.
4.4 Planning
The implemented planner relies on a repository of plans that are manually scripted by content experts. There are two levels of plans: an activity plan constitutes a series of activities to be carried out by the agent; a discourse plan lays out several topics to be covered for an activity. In our context, Elva's activity plan outlines the skeleton of a tour. It includes: (1) welcome remarks; (2) an "appetizer," i.e., a short and sweet introduction before the tour; (3) the sequence of "must-sees," i.e., artifacts to be visited along the itinerary; and (4) a "dessert," i.e., a summarization. A sample plan is shown in Figure 11 and laid out in Table 2. An itinerary is planned at the level of domain entities (light grey ovals), whereas discourse planning is done at the schema level (dark grey ovals). A path through the discourse space indicates the sequence of schemas, i.e., the topics, to be instantiated.
Figure 11: Elva's tour plan
Activity          Topics (schemas)
welcome remarks   WELCOME, INTRO_PARTICULAR_OF_MYSELF, INTRO_THEME_OF_EXHIBITION
appetizer         INTRO_BACKGROUND_OF_GALLERY, INTRO_PARTICULAR_OF_ARTIST, BRIEF_PLAN_OF_TOUR
artifact#1        PROBE_METAPHOR_OF_ARTIFACT1, DESCRIBE_METAPHOR_OF_ARTIFACT1, DESCRIBE_MATERIAL_OF_ARTIFACT1, IDENTITY_ARTISTINTENTION_OF_ARTIFACT1
artifact#3        DESCRIBE_METAPHOR_OF_ARTIFACT3, DESCRIBE_TEXTURE_OF_ARTIFACT3, IDENTITY_ARTISTINTENTION_OF_ARTIFACT3
dessert           SUMMARIZE_TOUR, BYE

Table 2: Layout of a sample tour plan
The plan library serves as a repository of activity plans and discourse plans. Each plan specifies the goal it intends to achieve. At run time, the Planner picks a plan to fulfill the agent's goal. The goal can be modified by a user's navigation or query. For instance, when a user takes the navigational initiative and stops at an artifact ARTIFACT1, the Planner is notified of the new goal NARRATE(ARTIFACT1). A discourse plan can fulfill this goal by decomposing it into sub-goals. Say a discourse plan, DESCRIBE_METAPHOR_OF_ARTIFACT1 → DESCRIBE_TEXTURE_OF_ARTIFACT1 → DESCRIBE_MATERIAL_OF_ARTIFACT1, is selected. The goal NARRATE(ARTIFACT1) is then achieved through the accomplishment of the sub-goals DESCRIBE(ARTIFACT1, METAPHOR), DESCRIBE(ARTIFACT1, TEXTURE), and DESCRIBE(ARTIFACT1, MATERIAL).
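A minimal sketch of this goal decomposition, with the plan library keyed by goal, might look as follows. The data layout is our assumption; only the plan contents follow the example above.

# Sketch of plan selection and goal decomposition.

PLAN_LIBRARY = {
    ("NARRATE", "ARTIFACT1"): [
        ("DESCRIBE", "ARTIFACT1", "METAPHOR"),
        ("DESCRIBE", "ARTIFACT1", "TEXTURE"),
        ("DESCRIBE", "ARTIFACT1", "MATERIAL"),
    ],
}

def adopt_plan(goal):
    # The goal NARRATE(ARTIFACT1) is achieved by accomplishing each sub-goal
    # in sequence; each sub-goal instantiates one schema.
    return PLAN_LIBRARY.get(goal, [])

# e.g., when the user stops at ARTIFACT1:
sub_goals = adopt_plan(("NARRATE", "ARTIFACT1"))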
In the present development, the Planner provides Elva with a means to generate a random tour plan and to localize the discourse content for individual sculptures when they are visited. The choice of plans is limited to the available plans in the plan library, so to some extent the Planner functions in a deterministic manner. While this seems adequate for the domain of the intended application, a museum where a guide plans a tour of "must-sees" in a pre-defined sequence, future development of the planner should favor a more flexible way to generate plans that can cater to a dynamic world situation. For example, it is desirable if the guide could customize a tour plan in accordance with the user's preferences.
The Scheduler parses and executes a plan. Before the execution of a plan, the agent's attentional state, i.e., FOCUS, is positioned at the starting node of the targeted schema. The scheduling task is performed at regular intervals: at each interval, the Scheduler has to decide "what to say next" by selecting one of the successive nodes of FOCUS (the selected node becomes the next FOCUS). During the conversation, FOCUS keeps advancing, and the Scheduler thereby establishes a dynamic sequence of utterances that spans all the schemas in the plan.
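A single scheduling step might be sketched as follows; the preference for sequential links when no user turn is pending is our assumption, and the node layout follows the loader sketch in section 4.1.

# Sketch of one scheduling step: advance FOCUS to one of its successive nodes.

def schedule_step(nodes, focus_id, user_turn_pending=False):
    preferred = "dialog" if user_turn_pending else "sequential"
    for relationship, target_id in nodes[focus_id]["links"]:
        if relationship == preferred:
            return target_id   # this node becomes the next FOCUS
    return None                # schema exhausted; move to the next plan step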
4.5 Dialog Coordination
The Dialog Coordinator (recall Figure 4) is responsible for turn taking management. It keeps a conversation state diagram that contains the possible states: NARRATE, DIALOG, MAKE_REFERENCE, SHIFT_TOPIC, and WITNESS. The Dialog Coordinator examines the inputs from the reactive layer, the deliberative layer, and the agent's perceptual data to determine the next appropriate state. For example, the agent transits the conversation state to DIALOG if dialog links are frequently traversed, and it changes to the WITNESS state when the user starts to manipulate an artifact or suddenly moves away. The treatment of turn taking is designed carefully for each conversation state. For example, in the DIALOG state, the agent temporarily suspends the execution of the Scheduler to wait for a user turn; in the WITNESS state, the waiting period can be even longer.

Another function of the Dialog Coordinator is to manage conversation topics using a topic stack. Consider when Elva makes a reference to a concept: the related topic (represented by its schema ID) is pushed onto the topic stack. When the reference is completed, the topic is popped from the top of the stack, so that the agent can proceed with its original topic.
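The state handling and topic stack described above might be sketched as follows; the transition triggers are the examples given in the text, while the class and method names are ours.

# Sketch of the Dialog Coordinator's conversation states and topic stack.

class DialogCoordinator:
    STATES = {"NARRATE", "DIALOG", "MAKE_REFERENCE", "SHIFT_TOPIC", "WITNESS"}

    def __init__(self):
        self.state = "NARRATE"
        self.topic_stack = []  # schema IDs of suspended topics

    def on_event(self, event):
        if event == "dialog_links_frequently_traversed":
            self.state = "DIALOG"    # suspend the Scheduler; wait for a user turn
        elif event in ("user_manipulates_artifact", "user_moves_away"):
            self.state = "WITNESS"   # wait even longer before resuming

    def begin_reference(self, schema_id):
        self.topic_stack.append(schema_id)  # push the referenced topic

    def end_reference(self):
        if self.topic_stack:
            self.topic_stack.pop()  # clear it so the original topic resumes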
4.6 Summary
We utilized the notion of "schema" to support structured and coherent verbal behaviors. Under a unified discourse framework, the ECA is able to respond appropriately to user enquiries, as well as to generate plans that fulfill its information-delivery goals. In addition, we briefly described how turn taking is coordinated based on the state of the conversation.