ELVA: AN EMBODIED CONVERSATIONAL AGENT IN AN INTERACTIVE VIRTUAL WORLD
ACKNOWLEDGEMENTS

First of all, I wish to thank my supervisor A/P Chee Yam San for his guidance, encouragement, and patience over the years. The many discussions we had, in which he showed his enthusiasm towards the topic, kept me on the right track.

I would like to show my gratitude to the head of NUS Museums, Ms Angela Sim, for granting us permission to use the Ng Eng Teng Gallery as our subject matter and making the project possible.

I would like to thank Dr Sabapathy for reviewing the virtual tour and providing his insights on Ng Eng Teng's art. Thanks also to Suhaimi, curator of the Ng Eng Teng Gallery, who contributed his expert knowledge about how to guide a tour. Special thanks to those who lent their hands during the gallery-scanning sessions: Mr Chong, Beng Huat, Chao Chun, Ting, and Donald.

I am also grateful to the members of the LELS Lab: Chao Chun, Jonathan, Lai Kuan, Liu Yi, and Si Chao. It was an enjoyable and memorable experience studying in this lab.

Finally, it is time to thank my family. Their support and patience have accompanied me throughout the whole period.
Yuan Xiang
TABLE OF CONTENTS

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY

CHAPTER 1 INTRODUCTION
1.1 EMBODIED CONVERSATIONAL AGENTS: A CHALLENGE FOR VIRTUAL INTERACTION
1.2 RESEARCH OBJECTIVES
1.3 ELVA, AN EMBODIED TOUR GUIDE IN AN INTERACTIVE ART GALLERY
1.4 STRUCTURE OF THESIS

CHAPTER 2 RESEARCH BACKGROUND
2.1 ECA AS A MULTI-DISCIPLINARY RESEARCH TOPIC
2.2 ARCHITECTURAL REQUIREMENT
2.3 DISCOURSE MODEL
2.4 MULTIMODAL COMMUNICATIONS
2.5 SUMMARY

CHAPTER 3 AGENT ARCHITECTURE
3.1 OVERVIEW
3.2 PERCEPTION MODULE
3.3 ACTUATION MODULE
3.4 INTERPRETATION MODULE
3.4.1 Reflexive Layer
3.4.2 Reactive Layer
3.4.3 Deliberative Layer
3.4.4 Behavior Generation
3.5 KNOWLEDGE BASE

CHAPTER 4 AGENT'S VERBAL BEHAVIORS
4.1 SCHEMA-BASED DISCOURSE FRAMEWORK
4.2 DISCOURSE MODELING
4.2.1 Narrative Modeling
4.2.2 Dialog Modeling
4.3 REASONING
4.3.1 Analysis Phase
4.3.2 Retrieval Phase
4.4 PLANNING
4.5 DIALOG COORDINATION
4.6 SUMMARY

CHAPTER 5 AGENT'S NONVERBAL BEHAVIORS
5.1 TYPES OF NONVERBAL BEHAVIORS IN OUR DESIGN
5.1.1 Interactional Behaviors
5.1.2 Deictic Behaviors
5.2 NONVERBAL BEHAVIOR GENERATION
5.3 NONVERBAL BEHAVIOR EXPRESSION
5.3.1 Building Blocks of Multimodality
5.3.2 Multimodality Instantiation
5.3.3 Multimodality Regulation
5.4 SUMMARY

CHAPTER 6 ILLUSTRATED AGENT-HUMAN INTERACTION
6.1 AN INTERACTIVE ART GALLERY
6.2 VERBAL COMMUNICATION
6.2.1 Elva presents information
6.2.2 Elva probes the user in a dialogue
6.2.3 Elva answers questions
6.3 NONVERBAL COMMUNICATION
6.3.1 Deictic Gestures
6.3.2 Facial Display
6.3.3 Locomotion

CHAPTER 7 EVALUATION
7.1 METHODOLOGIES
7.1.1 Qualitative Analysis
7.1.2 Quantitative Analysis
7.2 SUBJECTS, TASK AND PROCEDURE
7.3 EVALUATION RESULTS
7.3.1 User Experience
7.3.2 Agent Believability
7.3.2.1 Evaluation results on Elva's Verbal Behaviors
7.3.2.2 Evaluation results on Elva's Nonverbal Behaviors
7.4 DISCUSSION
7.5 SUMMARY

CHAPTER 8 CONCLUSION
8.1 SUMMARY
8.2 CONTRIBUTIONS OF THE THESIS
8.3 FUTURE WORK

REFERENCES
LIST OF TABLES

Table 1: List of speech acts
Table 2: Layout of a sample tour plan
Table 3: Elva's interactional behaviors (portion only)
Table 4: Elva's multimodal behavior library
Table 5: Evaluation results related to user satisfaction in interaction with Elva
Table 6: Users' responses to Elva's personality
Table 7: Evaluation results related to Elva's verbal behaviors
Table 8: Evaluation results related to nonverbal behaviors
LIST OF FIGURES

Figure 1: Examples of embodied conversational agents
Figure 2: Steve's deictic gestures
Figure 3: Greta's facial display: neutral, sorry-for, and surprise
Figure 4: Three-layered architecture for the agent's mental processing
Figure 5: Schema-based discourse space (portion only)
Figure 6: Narrative modeling using a schema
Figure 7: Dialog modeling using a schema
Figure 8: Transform raw user input to utterance meaning
Figure 9: State transition graph for QUESTION_MATERIAL
Figure 10: Speech act classification using STG
Figure 11: Elva's tour plan
Figure 12: Conversational modalities
Figure 13: Synchronized modalities in an animation timeline
Figure 14: Elva invites the user to start the tour
Figure 15: Elva's basic facial expressions
Figure 16: Elva's locomotion
Figure 17: Procedure of user study on Elva
SUMMARY

The technology of Embodied Conversational Agents (ECAs) offers great promise for natural and realistic human-computer interaction. However, interaction design in this domain needs to be handled sensitively in order for the verbal and nonverbal signals conveyed by the agent to be understood by human users.

This thesis describes an integrative approach for building an ECA that is able to engage conversationally with human users, and that is capable of behaving according to social norms in terms of facial display and gestures. The research focuses on the attributes of the agent's autonomy and believability, not its truthfulness. To achieve autonomy, we present a three-layered architectural design to ensure appropriate coupling between the agent's perception and action, and to hide the internal mechanism from users. In regard to agent believability, we utilize the notion of "schema" to support structured and coherent verbal behaviors. We also present a layered approach to generate and coordinate the agent's nonverbal behaviors, so as to establish social interactions within a virtual world.

Using the above approach, we developed an ECA called Elva. Elva is an embodied tour guide that inhabits an interactive art gallery to offer guidance to human users. A user study was performed to assess user satisfaction and agent believability when interacting with Elva. The user feedback was generally positive. Most users interacted successfully with Elva and enjoyed the tour. A majority agreed that Elva's behaviors, both verbal and nonverbal, were comprehensible and appropriate. The user study also revealed that emotive behaviors should be integrated into Elva to achieve a higher level of believability. Future work will address affective computing, multiparty interaction support, and user modeling.
CHAPTER 1 INTRODUCTION

1.1 Embodied Conversational Agents: A Challenge for Virtual Interaction

One path towards achieving realistic conversational interfaces has been the creation of embodied conversational agents, or ECAs. An ECA is a life-like computer character that can engage in a conversation with humans using natural spoken dialogue. It also has a "body" and knows how to use it for effective communication.

In recent years, systems containing ECAs have been deployed in a number of domains. There are pedagogical agents that educate and motivate students in e-learning systems (see Figure 1: A, B), virtual actors or actresses developed for entertainment or therapy (see Figure 1: C), sales agents that demonstrate products in e-commerce applications (see Figure 1: D), and web agents that help the user surf the web pages of a company. In the future, ECAs are likely to become long-term companions to many people and share much of their daily activity [20].
(A) Steve, (B) Cosmo, (C) Greta, (D) Rea

Figure 1: Examples of embodied conversational agents
While embodied conversational agents hold the promise of effective and comfortable conversation, the related research is still in its infancy. Conversation with ECAs presents particular theoretical and practical problems that warrant further investigation. Some of the most intensively pursued research questions include:

• What are the appropriate computational models for building autonomous and intelligent ECAs?

• How can we draw on existing methods in research and effectively put the pieces together to design characters that are compelling and helpful?

• What is the set of communicative behaviors that are meaningful and useful for humans, and practical to implement?

It is evident that the field of embodied conversational agents remains open and requires systematic research.
1.2 Research Objectives
This research project aims to develop an integrative approach for building an ECA that is able to engage conversationally with human users, and capable of behaving according to social norms in terms of facial display and gestures.

Our research into embodied conversational agents focuses on the attributes of autonomy and believability, not truthfulness. To achieve autonomy, the agent we develop shall be capable of relating itself to the virtual world in a sensible manner. In view of this, the architectural design aims to couple the agent's perception and action appropriately, and to hide its internal mechanism from the system users. To achieve believability, the agent must demonstrate sufficient social skills in terms of speech and animation. Towards this goal, we pay careful attention to the design of verbal and nonverbal interaction and to the attainment of flows of behavior that help to establish social facts within the virtual world.

In practice, we work towards assembling a coherent set of ECA technologies that help to attain the attributes mentioned above, and that are feasible to implement given the constraints in project timeline and resources. Our work emphasizes the development of the ECA technologies described below:
• Layered architecture: In our design, the agent is divided into three layers to support its autonomous behaviors. The agent is able to sense the environment and react to an actually detected situation. The agent can also act deliberately, i.e., plan a course of action over time in pursuit of its own goal.

• Discourse model: We aim to provide a unified discourse framework in which conversation about individual topics can be modeled. Under this framework, the agent will be able to narrate about a specific topic, as well as to respond to user queries in an appropriate manner.

• Generation and coordination of multimodal behaviors: Multimodal behaviors such as deictic gestures, turn taking, attention, and nonverbal feedback can be instantiated and expressed. We also need to devise an approach to coordinate the modalities that are generated.
We adopt this integrative approach in the design and development of an agent called Elva. Elva is an embodied tour guide that inhabits an interactive art gallery to offer guidance to human users. A user study has been performed to measure user satisfaction and agent believability. Human-computer interaction evaluation methods are exploited to cover this novel embodied conversational setting.
1.3 Elva, an Embodied Tour Guide in an Interactive Art Gallery
Elva is an embodied conversational agent developed in the Learning Science and Learning Environment Lab, National University of Singapore, for this project. Her job is to guide a user through an interactive art gallery and engage conversationally with the user about gallery artifacts.

In conventional tours, the guide decides where to take people and the exact timing of the itinerary. However, under the circumstances of a virtual tour, we expect the system users to behave more actively. In the interest of greater user freedom and participation, the design especially caters to mixed-initiative interaction, where both parties can direct the navigation and conversation. The following paragraphs describe Elva's three categories of behaviors: navigation, speech, and nonverbal behaviors.
• Navigation: A tour can be divided into a sequence of locations. Through planning of the itinerary, the guide is able to intentionally direct the user from one artifact to the next. When the user navigates independently, Elva can track the user.

• Speech: Elva's verbal communication can be classified into two modes: narrative and dialog. In the narrative mode, the guide presents information centering on the artifacts, artists, or the gallery being visited. In the dialog mode, the two parties engage in conversation about a specific topic.

• Nonverbal behaviors: The presence of body language and facial display helps to convey communicative functions: to manage turn taking and feedback, to refer to an artifact, and to create interesting effects.

We will use Elva as an example to illustrate various aspects of our approach throughout this thesis.
1.4 Structure of Thesis
This thesis is divided into eight chapters, each detailing a specific area of focus:
Chapter 1, Introduction, gives an overview of the motivations and objectives of this thesis.

Chapter 2, Research Background, discusses the various research areas that ground this multi-disciplinary project. Existing methodologies are reviewed from the perspectives of architectural requirements, discourse models, and multimodality.

Chapter 3, Agent Architecture, presents a generic three-layered architecture that forms the backbone of our ECA. We then describe the construction of this architecture and the interaction among the system components.

Chapter 4, Agent's Verbal Behaviors, introduces the notion of schema, a discourse template from which the narrative or dialog about a specific topic can be dynamically generated. It then describes how structured and coherent verbal behaviors are supported under the schema-based knowledge framework.

Chapter 5, Agent's Nonverbal Behaviors, begins by highlighting the importance of interactional behaviors and deictic gestures in enhancing agent-user communication. It then presents our methods to generate and express multimodal behaviors.

Chapter 6, Illustrated Agent-Human Interaction, illustrates how Elva establishes verbal and nonverbal communication with a system user and guides the user in a gallery tour.

Chapter 7, Evaluation, describes the evaluation methodology and observed results of the user study performed on the virtual tour with Elva.

Chapter 8, Conclusion, summarizes the thesis and documents our achievements and contributions. Possible future work is also discussed.
CHAPTER 2 RESEARCH BACKGROUND
2.1 ECA as a Multi-disciplinary Research Topic
Research on embodied conversational agents embraces multiple dimensions: ECAs inherit research problems from their supertype, the autonomous agent, a field of Artificial Intelligence; the verbal communication aspect may require technologies from Natural Language Processing (NLP) and Information Retrieval (IR); and the embodiment and communicative behavior aspects are intrinsically social, forming a dimension of Human-Computer Interaction (HCI) research.

We identified the key aspects of ECA technology through a breadth-wise study of the field. We then concentrated our research on the development of ECAs from the aspects of agent architecture, conversation model, and the generation and coordination of multimodal behaviors. The following sections discuss previous research in these areas.
2.2 Architectural Requirement
Architectural requirements have been widely discussed in the research literature [9,21,23]. Sloman described architectural layers that comprise mechanisms for reactive, deliberative, and meta-management controls [23]:

• The reactive mechanism acts in response to an actually detected internal or external event. Reactive systems may be very fast because they use highly parallel implementations. However, such systems may produce uncoordinated behaviors due to their inability to contemplate, evaluate, and select possible future courses of action.

• The deliberative mechanism provides more sophisticated planning and decision-making mechanisms. Therefore, it allows for more coordination and global control.

• The meta-management mechanism allows the deliberative processes to be monitored and improved. It is an ambitious step towards a human-like architecture. However, due to its extraordinary complexity, it is not yet practically viable for implementation [23].

Most agent architectures conform to the action selection paradigm [1,10,11,18]. FSA (Finite State Automaton) is a simple architecture based on the reactive mechanism described above; the behavior state in FSA is controlled by events generated from the values of internal mood variables [18]. BLUMBERG is another reactive architecture, which employs a hierarchical behavioral representation where behaviors compete among themselves according to their "levels of interest" [18]. JAM is a traditional BDI (Belief-Desire-Intention) architecture operating with explicit goals and plans; JAM agents can deliberately trigger the search for the best plan from a plan library [11].
2.3 Discourse Model

ALICE's conversational capabilities rely on case-based reasoning and pattern matching. Another chatterbot, JULIA, employs an activation network to model possible conversations on a specific topic [10]. When an input pattern is matched, the node containing the pattern has its activation level raised, and the node with the highest level is then selected. This allows more structured and flowing conversations than simple pattern matching. It is interesting to note that NLP is seen as less important in chatterbots; simple implementations are often sufficient for creating believable conversation in a limited context.
More recently, researchers have begun to exploit the notion of speech acts to analyze the intentions of speakers [8,15]. Based on Searle's speech act theory [3], an utterance can be classified into a speech act, such as stating, questioning, commanding, or promising. The categories of speech acts can be domain-independent, as in the TRAINS and DAMSL annotation schemes [8], or they can be defined so as to be relevant to a domain of conversation. When the intention of the speaker is captured with a speech act, the response given by the agent is likely to be more accurate, compared to matching on simple keywords alone.
2.4 Multimodal Communications
Embodiment allows multimodality, thereby making interaction more natural and robust. Research on multimodal communication has concentrated on the question of generating understandable nonverbal modalities. In what follows, we review some previous research in this field.

Steve was developed by the Center for Advanced Research in Technology for Education (CARTE) [16]. As shown in Figure 2, Steve can demonstrate actions, use gaze and deictic gestures to direct the students' attention, and guide the students around with locomotion. To do so, Steve exploits its knowledge of the position of objects in the world, its location relative to these objects, and its prior explanations to create deictic gestures, motions, and utterances that are both natural and unambiguous.
Figure 2: Steve's deictic gestures
Greta, embodied in a 3D talking head (see Figure 3), shows rich expressiveness during natural conversation with the user. Greta manifests emotion by employing a belief network that links facial communicative functions to facial signals consistent with the discourse content [19,20]. The facial communicative functions are those typically used in human-human dialogs, for instance: syntactic, dialogic, meta-cognitive, performative, deictic, adjectival, and belief relation functions.
Figure 3: Greta's facial display: neutral, sorry-for, and surprise
REA [6] is an embodied real-estate agent that is able to describe the features of a house using a combination of speech utterances and gestures. Rea's speech and gesture output is generated in real time from the knowledge base and the description of communicative goals, using the SPUD ("Sentence Planning Using Description") engine.
2.5 Summary
We have reviewed some previous research in the areas of architectural requirements, discourse models, and multimodality. In the following chapters, we proceed to present our integrative approach.
CHAPTER 3 AGENT ARCHITECTURE

In this chapter, we present a generic three-layered architecture that forms the backbone of our ECA. We then describe the construction of this architecture and the interaction among the system components.
3.1 Overview
In our design, an embodied conversational agent interfaces with the virtual world via a perception module and an actuation module. The perception module provides the agent with high-level sensory information. The actuation module enables the ECA to walk and to perform gestures and facial expressions. The gap between perception and action is bridged by mental processing in the Interpretation Module, a knowledge-based inference engine. We model this module using a combination of reactive and deliberative mechanisms in order to cope with mixed-initiative situations. The architecture (see Figure 4) comprises three layers: a reflexive layer (section 3.4.1), a reactive layer (section 3.4.2), and a deliberative layer (section 3.4.3).

Figure 4: Three-layered architecture for the agent's mental processing
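To make the control flow concrete, the sketch below shows in Python how percepts might be routed through the three layers. It is a minimal illustration of the architecture just described, not Elva's actual implementation; all class and method names are our own assumptions.

# Minimal sketch of the perceive-interpret-actuate flow; names are
# illustrative assumptions, not taken from Elva's code.

class InterpretationModule:
    def __init__(self, reflexive, reactive, deliberative, actuation):
        self.reflexive = reflexive        # quick, inflexible responses (3.4.1)
        self.reactive = reactive          # appraises detected events (3.4.2)
        self.deliberative = deliberative  # planner and scheduler (3.4.3)
        self.actuation = actuation        # plays back behavior scripts (3.3)

    def on_raw_percept(self, data):
        # Raw perceptual data (e.g., user coordinates) feeds the reflexive
        # layer directly for quick processing.
        behavior = self.reflexive.respond(data)
        if behavior:
            self.actuation.enqueue(behavior)

    def on_event(self, event):
        # Abstracted events (e.g., "user arrives at artifact2") are appraised
        # reactively, while the deliberative layer revises goals and plans.
        behavior = self.reactive.appraise(event)
        self.deliberative.update_goal(event)
        if behavior:
            self.actuation.enqueue(behavior)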
3.2 Perception Module
The Perception Module endows an ECA with the ability to "see" and "hear" in the virtual world. It supplies sensory information about the agent itself, the system users (as avatars), the environment, and the relationships among them, for example: "user arrives at artifact2," "user is facing artifact2," "I am at artifact2," "artifact2 is on my left-hand side," "user clicks on artifact2," and "user said <How do you do?> to me."

When a state change in the world is detected, the Perception Module abstracts the information and propagates this event to the reactive layer for appraisal. Simultaneously, raw perceptual data, such as the coordinates of the user's current position, are fed to the reflexive layer for quick processing.
3.3 Actuation Module
The Actuation Module drives the character animation and generates synthesized speech. The Interpretation Module produces behavior scripts, which specify multimodal behaviors over a span of time, and sends them to the event queue of the Actuation Module. The Actuation Module sequences and coordinates the multimodal signals in different channels (described in section 5.3) and plays them back. If necessary, the Actuation Module updates the agent's goal state and conversational state upon the completion of actuation.
3.4 Interpretation Module

The Interpretation Module is an inference engine that bridges the agent's perception and actions. It comprises three layers: a reflexive layer, a reactive layer, and a deliberative layer.
3.4.1 Reflexive Layer
The reflexive layer implements a series of reflex behaviors that are quick and inflexible (Q&I). For example, the agent is able to track the user with head movement in response to the user's subtle location changes, and the agent glances at an artifact when the user clicks on it.
3.4.3 Deliberative Layer

The deliberative layer provides planning and decision-making mechanisms. The ECA in our environment is equipped with information-delivery goals and plans that accomplish those goals. The Planner selects an appropriate plan from the plan library based on the agent's goal. During the interaction, the Planner may adopt new plans as the goal changes. The Scheduler instantiates and executes the selected plan; with its help, the agent is able to advance the conversation at regular time intervals. Chapter 4 goes through the details.
3.4.4 Behavior Generation
The utterances produced by the reactive layer and the deliberative layer flow into a Dialog Coordinator, where turn taking is regulated. Finally, the Enricher generates appropriate nonverbal behaviors to accompany the speech before the behavior is actuated. Details are described in chapter 5.
3.5 Knowledge Base
The Knowledge Base (KB) at the heart of the system consists of schema nodes that encode the domain knowledge as condition-action pairs. FOCUS refers to the schema node that constitutes the agent's current focus of attention during the conversation. FOCUS serves as a reference point for both reactive and deliberative processes, and it is updated every time a new behavior has been actuated. The details are covered in section 4.1.
CHAPTER 4 AGENT'S VERBAL BEHAVIORS

This chapter introduces the notion of schema, a discourse template from which the narrative and dialog about a specific topic can be dynamically generated. It then describes how structured and coherent verbal behaviors are supported under the schema-based knowledge framework.
4.1 Schema-based Discourse Framework
A schema defines a template from which narrative or dialog about a specific topic can be dynamically generated. Each schema is related to a discourse type and a topic. For example, the schema named ELABORATE_BIODATA_OF_ARTIST1 indicates that the topic BIODATA_OF_ARTIST1 is elaborated using this schema. A schema is modeled as a network of utterances that contribute to its information-delivery goal. When a schema, e.g., ELABORATE_BIODATA_OF_ARTIST1, is instantiated, the agent has to fulfill a goal ELABORATE(ARTIST, BIODATA) by producing a sequence of utterances from the template.

A pictorial view of the discourse space is shown in Figure 5: a network of utterances (black dots) forms a schema (dark grey ovals). In turn, a few schemas are grouped into a domain entity (light grey oval), e.g., an artist, an artifact, or an art-related concept. Transiting from one schema to another simulates shifting the topic or referencing another concept in a conversation.
Figure 5: Schema-based discourse space (portion only), with schemas such as INTRO_PARTICULAR_OF_ARTIST1, ELABORATE_BIODATA_OF_ARTIST1, ELABORATE_GENRE_OF_ARTIST1, ELABORATE_TECHNIQUE_OF_SCULPTART, and DEFINE_CONCEPT_OF_SCULPTART
An utterance is encapsulated as a schema node in the knowledge base. Each schema node can have multiple conditions, an action, and several links. A condition describes a pattern of user input, in terms of a speech act [3] and a keyword list, which activates the node. The action contains a list of interchangeable responses. A link specifies one of three types of relationships between two adjacent nodes: (1) a sequential link defines a pair of nodes in the order of narration; (2) a dialog link connects the agent's adjacent turns in a dialog, and contains a pattern that is expected from the user's turn; (3) a reference link bridges two schemas, analogous to a hyperlink.

We employ an XML tree structure to represent the schema nodes. The definition is shown below:
<!-- ELVA's schema DTD -->
<!ELEMENT schema (node+)>
<!ELEMENT node (condition+, action, link+)>
<!ELEMENT condition (speechAct?, keywordList?)>
<!ELEMENT speechAct (#PCDATA)>
<!ELEMENT keywordList (#PCDATA)>
<!ELEMENT action (utterance+)>
<!ELEMENT utterance (#PCDATA)>
<!ELEMENT link (#PCDATA)>
<!ATTLIST schema schema_id ID #REQUIRED>
<!ATTLIST schema goal CDATA #REQUIRED>
<!ATTLIST condition type (dependant|independent) "independent">
<!ATTLIST node id ID #REQUIRED>
<!ATTLIST node schema_id IDREF #REQUIRED>
<!ATTLIST action affective_type CDATA #IMPLIED>
<!ATTLIST link relationship (sequential|dialog|reference) "sequential">
<!ATTLIST link node_id IDREF #REQUIRED>
Using this definition, a schema node can be encoded as follows:
<node id = "INTER.15.1" schema_id = "INTER.15">
<condition type = "independent">
</utterance>
</action>
<link relationship = “sequential” node_id = “INTER.15.2”>
<link relationship = “dialog” node_id = “INTER.15.3”>
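As an illustration of how such a definition might be consumed, the following sketch parses schema nodes of this format into plain dictionaries using Python's standard library. The helper is ours and assumes only the structure declared in the DTD above, not Elva's actual loader.

# Sketch: load schema nodes from the XML format above into dictionaries.
import xml.etree.ElementTree as ET

def load_schema(path):
    nodes = {}
    schema = ET.parse(path).getroot()  # the <schema> element
    for node_el in schema.findall("node"):
        nodes[node_el.get("id")] = {
            "conditions": [
                {
                    "speech_act": cond.findtext("speechAct"),
                    "keywords": (cond.findtext("keywordList") or "").split(),
                    "type": cond.get("type", "independent"),
                }
                for cond in node_el.findall("condition")
            ],
            # interchangeable responses
            "utterances": [u.text for u in node_el.findall("action/utterance")],
            # (relationship, target node id) pairs
            "links": [
                (link.get("relationship", "sequential"), link.get("node_id"))
                for link in node_el.findall("link")
            ],
        }
    return nodes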
4.2 Discourse Modeling

4.2.1 Narrative Modeling

In the narrative mode, the schema nodes are usually organized in a linear form. As depicted in Figure 6, each schema node encapsulates an utterance. The nodes must be well ordered so as to generate a clear narrative about a specific topic:

"Here you looked at the sculpture called Love and Hate. If you look carefully, you probably see two persons bound together. It suggests love because they are very close together. It also suggests hate because the two faces are pulled away from each other. This sculpture deals with the 'love and hate' relationship."

Figure 6: Narrative modeling using a schema
4.2.2 Dialog Modeling
In the dialog mode, the agent and the user engage in the discussion of a topic, and turns are frequently taken. A sample schema for dialog modeling is shown in Figure 7: Elva probes with "Can you guess what this sculpture looks like?" and branches on the user's reply. Clearly, the organization of schema nodes in a dialog is more dynamic and complicated than in a narrative; some schema nodes branch out into several dialog links, which describe the user's possible responses.

Figure 7: Dialog modeling using a schema
4.3 Reasoning
This section introduces the agent's reasoning process. We utilize the concepts of case-based reasoning and pattern matching. As a key area of improvement, the user's query patterns are represented using the extracted meaning of the utterance, instead of simple keywords.
4.3.1 Analysis Phase

Figure 8: Transform raw user input to utterance meaning (the pipeline tokenizes the raw utterance into a linked word list, corrects ill-formed input, resolves references, removes stop words, replaces synonyms, transforms the words into a stem list using the Porter stemmer, and classifies the result into a speech act)
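Read as a pipeline, Figure 8 suggests a processing chain along the following lines. This sketch is our reconstruction from the figure's step labels; the exact ordering and the helper objects (stemmer, lexicon, classifier) are assumptions rather than Elva's actual code.

# Sketch of the analysis pipeline recovered from Figure 8.

STOP_WORDS = {"the", "a", "an", "is", "this"}  # illustrative subset only

def analyze(raw_utterance, lexicon, stemmer, classifier):
    words = raw_utterance.lower().split()        # tokenize into a word list
    words = [lexicon.get(w, w) for w in words]   # replace synonyms (e.g., via WordNet)
    words = [w for w in words if w not in STOP_WORDS]  # remove stop words
    stems = [stemmer.stem(w) for w in words]     # stem with the Porter stemmer
    speech_act = classifier.classify(stems)      # speech act classification (below)
    return speech_act, stems                     # together: the utterance meaning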
• Speech Act Classification

The next crucial step is to determine the user's intention via speech act classification. Inspired by Searle's speech act theory [3], the system defines over thirty speech acts (see Table 1) to cover the user's common queries and statements. Some speech acts are domain-independent, while the rest are related to our application, i.e., a virtual tour guide in an art gallery.
Illocution      Matters
QUESTION        WHY, WHEN, WHO, WHERE, QUANTITY, CONSTRUCT, MATERIAL, COMMENT, EXAMPLE, TYPE, DESCRIPTION, COMPARISON, MORE
REQUEST         MOVE, TURN, POINT, EXIT, REPEAT
STATE           FACT, REASON, EXAMPLE, PREFERENCE, COMPARE, LOOKLIKE, POSITIVECOMMENT, NEGATIVECOMMENT
COMMUNICATE     GREET, THANK, INTERRUPT, BYE, ACKNOWLEDGE, YES, NO, REPAIR

Table 1: List of speech acts
A speech act is described by its illocutionary force and its object. For example, QUESTION_MATERIAL relates to an inquiry about the material:

<User> Can you tell me what is this sculpture made of?
<Elva> It is made of Ciment-fondu, an industrial material.

STATE_POSITIVECOMMENT reveals a user's positive comment:

<User> I like this sculpture very much.
<Elva> I am glad to hear that.
Our early classification approach was based on a list of acceptable patterns for each speech act. For example, "do you like *" is an acceptable pattern for QUESTION_COMMENT. However, this approach results in relatively low accuracy. Consider the following two utterances:

A: <User> do you like this sculpture?
B: <User> do you like to go for supper?

Utterance B is classified wrongly. This problem was resolved by adding a list of rejected patterns for each speech act. In this case, "do you like to *" is added to the rejected patterns for QUESTION_COMMENT.
We also encountered an overlapping problem in speech act classification. As it is impossible to enumerate all possible patterns for a speech act, the classification is not entirely cut-and-dried: in some cases, one utterance can elicit more than one speech act. For example, the following utterance can be classified into both QUESTION_CONSTRUCT and QUESTION_DESCRIBE:

<User> Can you describe how to make ceramics?

Further investigation reveals that some speech acts, e.g., QUESTION_CONSTRUCT, are in fact specialized instances of another speech act (QUESTION_DESCRIBE); therefore, their patterns may overlap. To get around this problem, we assess the level of specialization for each speech act. Priority is given to the specialized speech act when several speech acts are co-present.
We developed a speech act classifier based on a data model called the state transition graph (STG), as shown in Figure 9. An STG encapsulates the acceptable and rejected patterns for a speech act: a path from the start state to a terminal state indicates an acceptable or a rejected pattern.

Figure 9: State transition graph for QUESTION_MATERIAL

During classification, an utterance is validated against every STG defined for the agent's set of speech acts. For each STG, the utterance performs a word-by-word traversal through the graph. If the utterance runs into a state from which there is no escape, or terminates in the REJECT state, validation has failed. Otherwise, the utterance terminates in the ACCEPT state, which means it matches the corresponding speech act. For example, for the utterance "Do you know what this sculpture is made of," there is a valid path "do → you → know → what → make → of" in QUESTION_MATERIAL (see Figure 10).

Figure 10: Speech act classification using STG
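A word-by-word STG traversal of this kind can be sketched as follows. The dictionary encoding, the skipping of words that have no outgoing edge (which lets fillers such as "this sculpture" pass through), and the sample graph fragment are our assumptions about the mechanism, not the thesis implementation.

# Sketch of validating an utterance against one STG.

ACCEPT, REJECT = "ACCEPT", "REJECT"

# Illustrative fragment of an STG for QUESTION_MATERIAL:
QUESTION_MATERIAL = {
    "START": {"do": "q1", "what": "q3"},
    "q1": {"you": "q2"},
    "q2": {"know": "q2", "tell": "q2", "what": "q3"},
    "q3": {"make": "q4"},
    "q4": {"of": ACCEPT, "to": REJECT},
}

def validate(stg, words):
    """Return True iff the word sequence reaches the ACCEPT state."""
    state = "START"
    for word in words:
        state = stg.get(state, {}).get(word, state)  # unmatched words are skipped
        if state == REJECT:
            return False  # terminated in the REJECT state
    return state == ACCEPT

# "do -> you -> know -> what -> make -> of" reaches ACCEPT:
assert validate(QUESTION_MATERIAL, "do you know what this sculpture make of".split())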
Synonym replacement is a desired feature for embodied conversational agents. It allows a single set of keywords to be scripted in the Knowledge Base, and a list of its synonyms to be specified in a lexical database (e.g., WordNet [24]). The feature helps reduce the number of similar patterns we have to script to cover the various synonyms of a word, and it allows us to adapt the agent to new vocabulary easily.

After the keywords are extracted, they are combined with the speech act to form the utterance meaning. The utterance meaning is then escalated to the next level of processing.
4.3.2 Retrieval Phase
In the retrieval phase, the Response Retriever (refer to Figure 4) searches for the most appropriate schema node that matches the user's utterance meaning.

We have developed a heuristic search approach, named locality-prioritized search, which is sensitive to the locality of the schema nodes. The heuristic rules are based on the characteristics of the schema-based discourse framework we have proposed. Under this framework, the basic assumption is that conversation around a topic is modeled as a schema; in other words, the schema captures the relevant and coherent information about the topic. Given this assumption, the idea of locality-prioritized search is to use locality as a clue about information relevance, so as to perform an effective search. The steps are as follows:

1. The search starts from the successive nodes of FOCUS (recall that FOCUS is the agent's attentional node during the conversation, i.e., "what we were talking about in this topic"). A locality weight w1 is assigned to a successive node connected via a dialog link. If a node is connected to FOCUS via a sequential link, a relatively lower weight w2 is assigned.

2. Scan the nodes within the current schema, assigning a weight w3 that is lower than w2.

3. If necessary, the search space is expanded to the whole knowledge base, and the lowest weight w4 is assigned.
For each schema node, the similarity between its i-th condition and the user's utterance meaning is measured as the sum of the hits in their speech acts, S_i, and the hits in their keyword lists, K_i. We then employ a matching function f to compute the matching score by multiplying the maximum similarity value by the locality weight w of the node. The node with the highest matching score is accepted if the score exceeds a predefined threshold:

f = \max_{i=1}^{n} (\alpha S_i + \beta K_i) \cdot w

The formulation of the matching function reveals our design rationale: first, speech acts are integrated as part of the pattern so as to enhance the robustness of pattern matching; second, to avoid flattened retrieval, i.e., treating every node equally, we introduce the locality weight, which gives priority to the potentially more related schema nodes.
4.4 Planning
The implemented planner relies on a repository of plans that are manually scripted by content experts. There are two levels of plans: an activity plan constitutes a series of activities to be carried out by the agent; a discourse plan lays out several topics to be covered for an activity. In our context, Elva's activity plan outlines the skeleton of a tour. It includes: (1) welcome remarks; (2) an "appetizer," i.e., a short and sweet introduction before the tour; (3) the sequence of "must-sees," i.e., artifacts to be visited along the itinerary; and (4) a "dessert," i.e., a summarization. A sample plan is shown in Figure 11 and laid out in Table 2. An itinerary is planned at the level of domain entities (light grey ovals), whereas discourse planning is done at the schema level (dark grey ovals). A path through the discourse space indicates the sequence of schemas, i.e., the topics, to be instantiated.
Figure 11: Elva's tour plan
Activity          Topics (schemas)
welcome remarks   WELCOME, INTRO_PARTICULAR_OF_MYSELF, INTRO_THEME_OF_EXHIBITION
appetizer         INTRO_BACKGROUND_OF_GALLERY, INTRO_PARTICULAR_OF_ARTIST, BRIEF_PLAN_OF_TOUR
artifact#1        PROBE_METAPHOR_OF_ARTIFACT1, DESCRIBE_METAPHOR_OF_ARTIFACT1, DESCRIBE_MATERIAL_OF_ARTIFACT1, IDENTITY_ARTISTINTENTION_OF_ARTIFACT1
artifact#3        DESCRIBE_METAPHOR_OF_ARTIFACT3, DESCRIBE_TEXTURE_OF_ARTIFACT3, IDENTITY_ARTISTINTENTION_OF_ARTIFACT3
dessert           SUMMARIZE_TOUR, BYE

Table 2: Layout of a sample tour plan
The plan library serves as a repository of activity plans and discourse plans. Each plan specifies the goal it intends to achieve. At run time, the Planner picks a plan to fulfill the agent's goal. The goal can be modified by a user's navigation or query. For instance, when a user takes the navigational initiative and stops at an artifact ARTIFACT1, the Planner is notified of the new goal NARRATE(ARTIFACT1). A discourse plan can fulfill this goal by decomposing it into sub-goals. Say a discourse plan, DESCRIBE_METAPHOR_OF_ARTIFACT1 → DESCRIBE_TEXTURE_OF_ARTIFACT1 → DESCRIBE_MATERIAL_OF_ARTIFACT1, is selected. The goal NARRATE(ARTIFACT1) is then achieved through the accomplishment of the sub-goals DESCRIBE(ARTIFACT1, METAPHOR), DESCRIBE(ARTIFACT1, TEXTURE), and DESCRIBE(ARTIFACT1, MATERIAL).
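A minimal sketch of this goal decomposition, with the plan library keyed by goal, might look as follows. The data layout is our assumption; only the plan contents follow the example above.

# Sketch of plan selection and goal decomposition.

PLAN_LIBRARY = {
    ("NARRATE", "ARTIFACT1"): [
        ("DESCRIBE", "ARTIFACT1", "METAPHOR"),
        ("DESCRIBE", "ARTIFACT1", "TEXTURE"),
        ("DESCRIBE", "ARTIFACT1", "MATERIAL"),
    ],
}

def adopt_plan(goal):
    # The goal NARRATE(ARTIFACT1) is achieved by accomplishing each sub-goal
    # in sequence; each sub-goal instantiates one schema.
    return PLAN_LIBRARY.get(goal, [])

# e.g., when the user stops at ARTIFACT1:
sub_goals = adopt_plan(("NARRATE", "ARTIFACT1"))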
In the present development, the Planner provides Elva with a means to generate a random tour plan and to localize the discourse content for individual sculptures when they are visited. The choice of plans is limited to the available plans in the plan library, so to some extent the Planner functions in a deterministic manner. While this seems adequate for the domain of the intended application, a museum where a guide plans a tour of "must-sees" in a pre-defined sequence, future development of the planner should favor a more flexible way to generate plans that can cater to a dynamic world situation. For example, it is desirable if the guide could customize a tour plan in accordance with the user's preferences.
The Scheduler parses and executes a plan. Before the execution of a plan, the agent's attentional state, i.e., FOCUS, is positioned at the starting node of the targeted schema. The scheduling task is performed at regular intervals: at each interval, the Scheduler has to decide "what to say next" by selecting one of the successive nodes of FOCUS (the selected node becomes the next FOCUS). During the conversation, FOCUS keeps advancing, and the Scheduler thereby establishes a dynamic sequence of utterances that spans all the schemas in the plan.
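A single scheduling step might be sketched as follows; the preference for sequential links when no user turn is pending is our assumption, and the node layout follows the loader sketch in section 4.1.

# Sketch of one scheduling step: advance FOCUS to one of its successive nodes.

def schedule_step(nodes, focus_id, user_turn_pending=False):
    preferred = "dialog" if user_turn_pending else "sequential"
    for relationship, target_id in nodes[focus_id]["links"]:
        if relationship == preferred:
            return target_id   # this node becomes the next FOCUS
    return None                # schema exhausted; move to the next plan step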
4.5 Dialog Coordination
The Dialog Coordinator (recall Figure 4) is responsible for turn taking management. It keeps a conversation state diagram that contains the possible states: NARRATE, DIALOG, MAKE_REFERENCE, SHIFT_TOPIC, and WITNESS. The Dialog Coordinator examines the inputs from the reactive layer, the deliberative layer, and the agent's perceptual data to determine the next appropriate state. For example, the agent transits the conversation state to DIALOG if dialog links are frequently traversed, and it changes to the WITNESS state when the user starts to manipulate an artifact or suddenly moves away. The treatment of turn taking is designed carefully for each conversation state. For example, in the DIALOG state, the agent temporarily suspends the execution of the Scheduler to wait for a user turn; in the WITNESS state, the waiting period can be even longer.

Another function of the Dialog Coordinator is to manage conversation topics using a topic stack. Consider when Elva makes a reference to a concept: the related topic (represented by its schema ID) is pushed onto the topic stack. When the reference is completed, the topic is popped from the top of the stack, so that the agent can proceed with its original topic.
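The state handling and topic stack described above might be sketched as follows; the transition triggers are the examples given in the text, while the class and method names are ours.

# Sketch of the Dialog Coordinator's conversation states and topic stack.

class DialogCoordinator:
    STATES = {"NARRATE", "DIALOG", "MAKE_REFERENCE", "SHIFT_TOPIC", "WITNESS"}

    def __init__(self):
        self.state = "NARRATE"
        self.topic_stack = []  # schema IDs of suspended topics

    def on_event(self, event):
        if event == "dialog_links_frequently_traversed":
            self.state = "DIALOG"    # suspend the Scheduler; wait for a user turn
        elif event in ("user_manipulates_artifact", "user_moves_away"):
            self.state = "WITNESS"   # wait even longer before resuming

    def begin_reference(self, schema_id):
        self.topic_stack.append(schema_id)  # push the referenced topic

    def end_reference(self):
        if self.topic_stack:
            self.topic_stack.pop()  # clear it so the original topic resumes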
4.6 Summary
We utilized the notion of "schema" to support structured and coherent verbal behaviors. Under a unified discourse framework, the ECA is able to respond appropriately to user enquiries, as well as to generate plans that fulfill its information-delivery goals. In addition, we briefly described how turn taking is coordinated based on the state of the conversation.