Signals and Communication Technology
Situated Dialog in Speech-Based Human-Computer Interaction
More information about this series at http://www.springer.com/series/4748
Alexander Rudnicky • Antoine Raux
Editors

Alexander Rudnicky
School of Computer Science
Carnegie Mellon University
Moffett Field, CA, USA

Teruhisa Misu
Mountain View, CA, USA
Signals and Communication Technology
ISBN 978-3-319-21833-5 ISBN 978-3-319-21834-2 (eBook)
DOI 10.1007/978-3-319-21834-2
Library of Congress Control Number: 2015949507
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)
Contents

Part I Dialog Management and Spoken Language Processing
Evaluation of Statistical POMDP-Based Dialogue Systems in Noisy Environments
Steve Young, Catherine Breslin, Milica Gašić, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis and Eli Tzirkel Hancock
Syntactic Filtering and Content-Based Retrieval of Twitter Sentences for the Generation of System Utterances in Dialogue Systems
Ryuichiro Higashinaka, Nozomi Kobayashi, Toru Hirano, Chiaki Miyazaki, Toyomi Meguro, Toshiro Makino and Yoshihiro Matsuo
Knowledge-Guided Interpretation and Generation of Task-Oriented Dialogue
Alfredo Gabaldon, Pat Langley, Ben Meadows and Ted Selker
Justification and Transparency Explanations in Dialogue Systems to Maintain Human-Computer Trust
Florian Nothdurft and Wolfgang Minker
Dialogue Management for User-Centered Adaptive Dialogue
Stefan Ultes, Hüseyin Dikme and Wolfgang Minker
Chat-Like Conversational System Based on Selection of Reply Generating Module with Reinforcement Learning
Tomohide Shibata, Yusuke Egashira and Sadao Kurohashi
Investigating Critical Speech Recognition Errors in Spoken Short Messages
Aasish Pappu, Teruhisa Misu and Rakesh Gupta
Part II Human Interaction with Dialog Systems
The HRI-CMU Corpus of Situated In-Car Interactions
David Cohen, Akshay Chandrashekaran, Ian Lane and Antoine Raux
Detecting 'Request Alternatives' User Dialog Acts from Dialog Context
Yi Ma and Eric Fosler-Lussier
Emotion and Its Triggers in Human Spoken Dialogue: Recognition and Analysis
Nurul Lubis, Sakriani Sakti, Graham Neubig, Tomoki Toda, Ayu Purwarianti and Satoshi Nakamura
Evaluation of In-Car SDS Notification Concepts for Incoming Proactive Events
Hansjörg Hofmann, Mario Hermanutz, Vanessa Tobisch, Ute Ehrlich, André Berton and Wolfgang Minker
Construction and Analysis of a Persuasive Dialogue Corpus
Takuya Hiraoka, Graham Neubig, Sakriani Sakti, Tomoki Toda and Satoshi Nakamura
Evaluating Model that Predicts When People Will Speak to a Humanoid Robot and Handling Variations of Individuals and Instructions
Takaaki Sugiyama, Kazunori Komatani and Satoshi Sato
Entrainment in Pedestrian Direction Giving: How Many Kinds of Entrainment?
Zhichao Hu, Gabrielle Halberg, Carolynn R. Jimenez and Marilyn A. Walker
Situated Interaction in a Multilingual Spoken Information Access Framework
Niklas Laxström, Kristiina Jokinen and Graham Wilcock
Part III Speech Recognition and Core Technologies
A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition
Simon Receveur, David Scheler and Tim Fingscheidt
Engine-Independent ASR Error Management for Dialog Systems
Junhwi Choi, Donghyeon Lee, Seounghan Ryu, Kyusong Lee, Kyungduk Kim, Hyungjong Noh and Gary Geunbae Lee
Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses
Kazunori Komatani, Naoki Hotta and Satoshi Sato
A Semi-automated Evaluation Metric for Dialogue Model Coherence
Sudeep Gandhe and David Traum
Part I
Dialog Management and Spoken Language Processing
Evaluation of Statistical POMDP-Based
Dialogue Systems in Noisy Environments
Steve Young, Catherine Breslin, Milica Gašić, Matthew Henderson,
Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis
and Eli Tzirkel Hancock
Abstract Compared to conventional hand-crafted rule-based dialogue management systems, statistical POMDP-based dialogue managers offer the promise of increased robustness, reduced development and maintenance costs, and scalability to large open domains. As a consequence, there has been considerable research activity in approaches to statistical spoken dialogue systems over recent years. However, building and deploying a real-time spoken dialogue system is expensive, and even when operational, it is hard to recruit sufficient users to get statistically significant results. Instead, researchers have tended to evaluate using user simulators or by reprocessing existing corpora, both of which are unconvincing predictors of actual real world performance. This paper describes the deployment of a real-world restaurant information system and its evaluation in a motor car using subjects recruited locally and by remote users recruited using Amazon Mechanical Turk. The paper explores three key questions: are statistical dialogue systems more robust than conventional hand-crafted systems; how does the performance of a system evaluated on a user simulator compare to performance with real users; and can performance of a system tested over the telephone network be used to predict performance in more hostile environments such as a motor car? The results show that the statistical approach is indeed more robust, but results from a simulator significantly over-estimate performance, both absolute and relative. Finally, by matching WER rates, performance results obtained over the telephone can provide useful predictors of performance in noisier environments such as the motor car, but again they tend to over-estimate performance.
S. Young (B) · C. Breslin · M. Gašić · M. Henderson · D. Kim · M. Szummer · B. Thomson · P. Tsiakoulis
Cambridge University, Cambridge, UK
A spoken dialogue system (SDS) allows a user to access information and enact transactions using voice as the primary input-output medium. Unlike so-called voice search applications, the tasks undertaken by an SDS are typically too complex to be achieved by a single voice command. Instead they require a conversation to be held with the user consisting of a number of dialogue turns. Interpreting each user input and deciding how to respond lies at the core of effective SDS design.
In a traditional SDS as shown in Fig. 1, the symbolic components of Fig. 1 are implemented using rules and flowcharts. The semantic decoder uses rule-based surface parsing techniques to extract the most likely user dialogue act and estimate the most likely dialogue state. The choice of system action in response is then determined by if-then-else rules applied to the dialogue state or by following a flowchart. These systems are tuned by trial deployment, inspection of performance and iterative refinement of the rules. They can work well in reasonably quiet operating environments when the user knows exactly what to say at each turn. However, they are not robust to speech recognition errors or user confusions, they are expensive to produce and maintain, and they do not scale well as task complexity increases. The latter will be particularly significant as technology moves from limited to open domain systems.
To mitigate against the deficiencies of hand-crafted rule-based systems, statistical approaches to dialogue management have received considerable attention over recent years [1–3]. The statistical approach is based on the framework of partially observable Markov decision processes (POMDPs) [4]. As shown in Fig. 2, in the statistical approach the dialogue manager is split into two components: a belief tracker which maintains a distribution over all possible dialogue states b(s), and a policy which takes decisions based not on the most likely state but on the whole distribution. The semantic decoder is extended to output a distribution over all possible user dialogue acts and the belief tracker updates its estimate of b every turn using this distribution as evidence. The policy is optimised by defining a reward function for each dialogue turn and then using reinforcement learning to maximise the total (possibly discounted) cumulative reward.
Fig. 1 Block diagram of a conventional SDS. Input speech y is mapped first into words w and then into a user dialogue act v. A dialogue manager tracks the state of the dialogue s and based on this generates a system action a which is converted to a text message m and then into speech x
Fig. 2 Block diagram of a statistical SDS. The semantic decoder generates a distribution over possible user dialogue acts v given user input y. A dialogue manager tracks the probability of all possible dialogue states b(s) using p(v|y) as evidence. This distribution b is called the belief state. A policy maps b into a distribution over possible system actions a which is converted back into natural language and sampled to provide spoken response x
One of the difficulties that researchers face when developing an SDS is training and evaluation. Statistical SDS often require a large number of dialogues (∼10^5–10^6) to estimate the parameters of the models, and optimise the policy using reinforcement learning. As a consequence, user simulators are commonly used, operating directly at the dialogue act level [5–7]. These simulators attempt to model real user behaviour. They also include an error model to simulate the effects of speech recognition and semantic decoding errors [8, 9]. A user simulator also provides a convenient tool for testing since it can be run many times and the error rate can be varied over a wide range to test robustness.
The use of simulators obviates the need to build a real system, thereby avoiding all of the engineering complexities involved in integrating telephony interfaces, voice activity detection, recognition and synthesis. However, evaluation using the same user simulator as for training constitutes training and testing under perfectly matched conditions, and it is not clear how well this approach can predict system performance with real users.
Even when a real live spoken dialogue system is available for evaluation, there remains the significant problem of recruiting and managing subjects through the tests in sufficient numbers to obtain statistical significance. For example, previous experience (e.g. [10]) has shown that direct testing in a motor car is a major undertaking. To provide statistically significant results, a system contrast may require 500 dialogues or more. Recruiting subjects and managing them through in-car tests is slow and expensive. Safety considerations prevent direct testing by the driver, hence testing can only be done by a passenger sitting next to the driver with the microphone system redirected accordingly. Typically, we have found that a team of three assistants plus a driver can process around 6–8 subjects per day with each subject completing around 12–20 dialogues. Adding the time taken in preparation to recruit and timetable subjects means that each contrast will typically take about 10 man-days of resource. For large scale development and testing, this is barely practicable.
Provided that the system is accessible via telephone, one route to mitigating this problem is to use crowd-sourcing web sites such as Amazon Mechanical Turk (MTurk) [11]. This allows subjects to be recruited in large numbers, and it also automates the process of distributing task scenarios and checking whether the dialogues were successful.
This paper describes an experimental study designed to explore these issues. The primary question addressed is whether or not a statistical SDS is more robust than a conventional hand-crafted SDS in a motor car, and this was answered by the traditional route of recruiting subjects to perform tasks in a car whilst being driven around a busy town. However, in parallel a phone-based system was configured in which the recogniser's acoustic models were designed to give similar performance to that anticipated in the motor car. This parallel system was tested using MTurk subjects. The results were also compared with those obtained using a user simulator. The remainder of this paper is organised as follows. Section 2 describes the Bayesian Update of Dialogue State (BUDS) POMDP-based restaurant information system used in the study and the conventional system used in the baseline. Section 3 then describes the experimental set-up in more detail and Sect. 4 reports the results. Finally, Sect. 5 offers conclusions.
Both the conventional baseline and the statistical dialogue system share a common architecture and a common set of understanding and generation components. The recogniser is a real-time implementation of the HTK system [12]. The front-end uses PLP features with energy, 1st, 2nd and 3rd order derivatives mapped into 39 dimensions using a heteroscedastic linear discriminant analysis (HLDA) transform. The acoustic models use conventional HTK tied-state Gaussians and the trigram language model was trained on previously collected dialogue transcriptions with attribute values such as food types, place names, etc. mapped to class names. The semantic decoder extracts n-grams from the confusion networks output by the recogniser and uses a bank of support vector machine (SVM) classifiers to construct a ranked list of dialogue act hypotheses where each dialogue act consists of an act type and a set of attribute value pairs [13, 14]. Some example dialogue acts are shown in Table 1 and a full description is given in [15].
The statistical dialogue manager is derived from the BUDS system [16]. In this system the belief state is represented by a dynamic Bayesian network in which the goal, user input and history are factored into conditionally independent attributes (or slots), where each slot represents a property of a database entity. An example is shown in Fig. 3 for the restaurant domain, which shows slots for the type of food (French, Chinese, snacks, etc.), the price-range (cheap, moderate, expensive) and area (central, north, east, etc.). Each time step (i.e. turn), the observation is instantiated with the output of the semantic decoder, and the marginal probabilities of all of the hidden variables (unshaded nodes) are updated using a form of belief propagation called
Table 1 Example dialogue acts

inform(area=centre)          I want something in the centre of town
confirm(pricerange=cheap)    And it is cheap isn't it?
affirm(food=chinese)         Yes, I want chinese food.
Fig. 3 Example BUDS dynamic Bayesian network structure. Shaded variables are observed, all others are hidden. Each slot is represented by 3 random variables corresponding to the user's goal (g), last user input (u) and history (h). The network shown represents just one time slice. All variable nodes are conditioned by the last action. Goal and history nodes are also conditioned on the previous time slice
expectation propagation [17]. The complete set of marginal probabilities encoded in the network constitute the belief state b.

The initial parameters of the Bayesian network are estimated from annotated corpus data. Since expectation propagation can deal with continuous as well as discrete variables, it is also possible to extend the network to include the parameters of the multinomial distributions along with their conjugate Dirichlet priors. The network parameters can then be updated on-line during interaction with real users, although that was not done in this trial [18].
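The expectation-propagation computation is beyond a short example, but a heavily simplified per-slot update conveys the idea of belief tracking; the goal-change probability, the likelihood floor and the full independence of slots are assumptions of this sketch, not properties of BUDS.

```python
# Simplified stand-in for the BUDS belief update: one slot, exact Bayes rule.
CHANGE = 0.05  # hypothetical probability that the user's goal changes per turn

def update_slot_belief(belief, act_probs):
    """belief: {value: prob}; act_probs: decoder mass for each mentioned value."""
    values = list(belief)
    new_belief = {}
    for g in values:
        # Transition: the goal persists, or is re-drawn uniformly at random.
        prior = (1 - CHANGE) * belief[g] + CHANGE / len(values)
        # Evidence: decoder mass on this value, floored so that unmentioned
        # values are discounted rather than eliminated.
        likelihood = act_probs.get(g, 0.0) + 0.1
        new_belief[g] = prior * likelihood
    z = sum(new_belief.values())
    return {g: p / z for g, p in new_belief.items()}

food = {"french": 1 / 3, "chinese": 1 / 3, "snacks": 1 / 3}
# Noisy decoder output for "I want Chinese food":
food = update_slot_belief(food, {"chinese": 0.7, "french": 0.2})
print(food)  # mass shifts toward 'chinese' but alternatives survive
```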
The belief state b can be viewed as a vector with dimensionality equal to the cardinality of the state space, i.e. b ∈ R^|S|, where |S| is equal to the total number of discrete values distributed across all of the nodes in the network. Since this is large, it is compressed to form a set of features appropriate for each action, φ_a(b). A stochastic policy with parameters θ is then constructed using a softmax function:

    π(a|b; θ) = exp(θ · φ_a(b)) / Σ_{a'} exp(θ · φ_{a'}(b))    (1)
which represents the probability of taking action a in belief state b. At the end of every turn, the probability of every possible action is sampled using (1), and the most probable action is selected.
Since the policy defined by (1) is smoothly differentiable in θ, gradient ascent can be used to adjust the parameter vector θ to maximise the reward [19]. This is done by letting the dialogue system interact with a user simulator [20]. Typically around 10^5 training dialogues are required to fully train the policy.
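The sketch below renders the stochastic policy of (1) and one REINFORCE-style gradient step; the feature map, action set and reward are placeholders, and the real system optimises θ against the user simulator over roughly 10^5 dialogues.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS, DIM = 4, 8           # hypothetical action set and feature dimension
theta = np.zeros(DIM)

def phi(b, a):
    # Placeholder action-conditioned features of the belief state.
    return np.tanh(b + a)

def policy(b):
    # Softmax of Eq. (1) over all actions.
    logits = np.array([theta @ phi(b, a) for a in range(ACTIONS)])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_step(episode, lr=0.01):
    """episode: list of (belief, action, return) triples from one dialogue."""
    global theta
    for b, a, ret in episode:
        probs = policy(b)
        # grad log pi(a|b) = phi(b, a) - E_{a'}[phi(b, a')]
        grad = phi(b, a) - sum(p * phi(b, ap) for ap, p in enumerate(probs))
        theta = theta + lr * ret * grad

b = rng.random(DIM)
a = rng.choice(ACTIONS, p=policy(b))     # sample an action from the policy
reinforce_step([(b, a, 1.0)])            # e.g. +1 reward for a successful turn
print(policy(b))
```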
The baseline dialogue manager consists of a conventional state estimator which maintains a record for each possible slot consisting of the slot status (filled or unfilled), the slot value, and the confidence derived directly from the confidence of the most likely semantic decoder output. Based on the current state of the slots, a set of if-then rules determine which of the possible actions to invoke at the end of each turn. The baseline was developed and tested over a long period and was itself subject to several rounds of iterative refinement, using the same user simulator as was used to train the POMDP system.
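A toy rendering of such a rule-based manager is given below; the slot set, confidence threshold and rule ordering are illustrative only, not those of the deployed baseline.

```python
# Hand-crafted baseline sketch: per-slot records plus if-then action rules.
slots = {
    "food":  {"value": None, "conf": 0.0},
    "area":  {"value": None, "conf": 0.0},
    "price": {"value": None, "conf": 0.0},
}

CONFIRM_THRESHOLD = 0.6   # hypothetical confidence cut-off

def choose_action(slots):
    for name, s in slots.items():
        if s["value"] is None:
            return ("request", name)              # ask for an unfilled slot
        if s["conf"] < CONFIRM_THRESHOLD:
            return ("confirm", name, s["value"])  # low confidence: confirm it
    return ("inform", "matching venue")           # all slots filled: offer venue

slots["food"] = {"value": "chinese", "conf": 0.45}
print(choose_action(slots))   # -> ('confirm', 'food', 'chinese')
```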
The output of the dialogue manager in both systems is a system dialogue act following exactly the same schema as for the input. These system acts are converted first to text using a template matching scheme, and then into speech using an HTS-based HMM synthesiser [21]. A fully statistical method of text generation is also available but was not used in this trial to ensure consistency of output across systems [22].
As noted in the introduction, the aims of this evaluation were to firstly establish whether or not a fully statistical dialogue system is more robust in a noisy environment such as a motor car, and to investigate the extent to which performance in a specific environment can be predicted by proxy environments which afford testing with higher throughput and lower cost.
The overall system architecture used for the in-car evaluation is shown in Fig. 4. The same system was used for the phone-based MTurk evaluation except that users spoke directly into the phone via a US toll-free number, rather than via the On-Star mirror.
Fig. 4 Block diagram of the overall system architecture used for the in-car evaluation. The On-Star mirror [23] includes a microphone and signal-processing for far-field voice capture in a motor car. The speech input to the mirror is transported via Bluetooth to an Android phone and then over the mobile network to a commercial SIP server (IPComms). The signal is then channeled to an Asterisk virtual PABX in order to allow multiple channels to be supported. The PBX routes the call through to an available VOIP server which interfaces directly to the Spoken Dialogue System. At the backend, task related information (in this case restaurant information) is extracted from an on-line database and locally cached
Some tasks had no solution in the database, and in that case the participant was advised to ask for something else, e.g. find an Italian restaurant instead of French. Also sometimes the user was asked to find more than one venue that matched the constraints.
To perform the test, each participant was seated in the front passenger seat of a saloon car fitted with the On-Star mirror system, and a supervisor sat in the rear seat in order to instruct the subject and monitor the test. The On-Star mirror was affixed to the passenger seat visor to make it useable by the passenger rather than the driver. Power for this assembly was taken from the car's lighter socket. A digital recorder with an external microphone was used to provide a second recording.
The subject received only limited instructions, consisting of a brief explanation of what the experiment involved and an example dialogue. For each dialogue the subject informed the supervisor if they thought the dialogue was successful. After the experiment the subjects were asked to fill in a questionnaire.
3.2 Proxy Phone-Based Evaluation
By providing a toll-free access number to the system shown in Fig. 4, large numbers of subjects can be recruited quickly and cheaply using crowd sourcing services such as Amazon Mechanical Turk. In order to simulate the effect of a noisy environment, the technique usually used for off-line speech recognition evaluation is to add randomly aligned segments of pre-recorded background noise to the clean acoustic source. However, in the architecture shown in Fig. 4, this is difficult to achieve for a variety of reasons, including ensuring that the user hears an appropriate noise level, avoiding disrupting the voice activity detection, and compensating for the effects of the various non-linear signal processing stages buried in the user's phone, the PABX and the VOIP conversion. As an alternative, a simpler approach is to reduce the discrimination of the acoustic models in the recogniser so that the recognition performance over the phone was similar to that achieved in the car. This was achieved by reducing the number of Gaussian mixture components to 1 and controlling the decision tree clustering thresholds to fine tune the recogniser using development data from previous phone and in-car evaluations.

Given this change to the recogniser, the experimental protocol for the phone-based evaluation was identical to that used in the car except that the presentation of the tasks and the elicitation of feedback was done automatically using a web-based interface integrated with Amazon Mechanical Turk.
The results of the evaluation are summarised in Table 2. The in-car results refer to the supervised tests in a real motor car travelling around the centre of Cambridge, UK, and the phone proxy results refer to the phone-based evaluation with MTurk subjects where the speech recogniser's acoustic models were detuned to give similar performance to that obtained in a motor car. Also shown in this table for comparison are results for a regular phone-based MTurk evaluation using fully trained acoustic models. As can be seen, the average word error rate (WER) obtained in the car driving around town was around 30 % compared to the 20 % obtained over the telephone. The average WER for the proxy phone system is also around 30 %, showing that the detuned models performed as required.
Three metrics are reported for each test. Prior to each dialogue, each user was given a task consisting of a set of constraints and an information need, such as find the phone number and address of a cheap restaurant selling Chinese food. The objective success rate measures the percentage of dialogues for which the system provided the subject with a restaurant matching the task constraints. If the system provided the correct restaurant and the required information needed such as phone number and address, then this is a full success. If a valid restaurant was provided, but the user did not obtain the required information (perhaps because they forgot to ask for it), then a
Table 2 Summary of results for in-car and proxy-phone evaluation

Test         System    Num dialogs  Partial success  Full success  Perceived success  Average turns  WER
In-car       Baseline  118          78.8 ± 3.7*      67.8 ± 4.3*   77.1 ± 3.8*        7.9 ± 3.1      29.7
             POMDP     120          85.0 ± 3.2       75.8 ± 3.9    83.3 ± 3.4         9.7 ± 3.7      26.9
Phone proxy  Baseline  387          80.1 ± 2.0*      75.2 ± 2.2*   91.2 ± 1.4         6.9 ± 3.6      29.4
             POMDP     548          87.0 ± 1.4       81.2 ± 1.7    89.8 ± 1.3         9.3 ± 4.8      30.3
Phone        Baseline  589          88.8 ± 1.3       84.6 ± 1.5    94.4 ± 1.0         6.5 ± 2.9      21.4
             POMDP     578          91.0 ± 1.2       86.9 ± 1.4    94.5 ± 1.0         8.3 ± 3.8      21.2

Also shown is performance of the phone-based system using fully trained acoustic models. Contrasts marked * are statistically significant (p < 0.05) using a Kruskal-Wallis rank sum test
partial success is recorded. The user's perceived success rate is measured by asking the subjects if they thought the system had given them all of the information they need. The partial success rate is always higher than the full success rate. Note that the tasks vary in complexity; in some cases the constraints were not immediately achievable, in which case the subjects were instructed to relax one of them and try again.
As can be seen in Table 2, the in-car performance of the statistical POMDP-based dialogue manager was better than the conventional baseline on all three measures. The proxy phone test showed the same trend for the objective measures but not for the subjective measures. In fact, there is little correlation between the subjective measures and the objective measures in all the MTurk phone tests. A possible explanation is that the subjects in the in-car test were supervised throughout and were therefore more likely to give accurate assessments of the system's performance. The Turkers used in the phone tests were not supervised and many might have felt it was safest to say they were satisfied just to make sure they were paid.
The objective proxy phone performance overestimated the actual in-car performance by around 2 % on partial success and by around 10 % on full success. This may be due to the fact that the subjects in the car found it harder to remember all of the venue details they were required to find. Nevertheless, the proxy phone test provides a reasonable indicator of in-car performance.
To gain more insight into the results, Fig. 5 shows regression plots of predicted full objective success rate as a function of WER, computed by pooling all of the trial data. As can be seen, the statistical dialogue system (POMDP-trial) consistently outperforms the conventional baseline system (Baseline-trial). Figure 5 also plots the success rate of both systems using the user simulator used to train the POMDP system (xxx-sim). It can be seen that the general trend is similar to the user trial data but the simulator success rates significantly overestimate performance, especially for the statistical system. This is probably due to a combination of two effects.
Fig. 5 Comparison of system performance obtained using a user simulator compared to the actual performance achieved in a trial
Firstly, the user simulator presents perfectly matched data to both systems.² Secondly, the simulation of errors will differ from the errors encountered in the real system. In particular, the errors will be largely uncorrelated, allowing the belief tracking to gain maximum advantage. When errors are correlated, belief tracking is less accurate because it tends to over-estimate alternatives in the N-best list [24].
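A pooled regression of the kind behind Fig. 5 can be sketched as follows, using simulated dialogues in place of the trial data; the logistic form and the coefficients of the simulated success curve are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
wer = rng.uniform(0, 60, size=500)                   # per-dialogue WER in percent
p_success = 1 / (1 + np.exp(0.08 * (wer - 35)))      # invented ground truth
success = (rng.random(500) < p_success).astype(int)  # 0/1 full-success outcomes

model = LogisticRegression().fit(wer.reshape(-1, 1), success)
for w in (10, 20, 30, 40):
    p = model.predict_proba([[w]])[0, 1]
    print(f"WER {w:2d}%: predicted full success {p:.2f}")
```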
The widespread adoption of end-to-end statistical dialogue systems offers the potential to develop systems which are more robust to noise, and which can be automatically trained to adapt to new and extended domains [25]. However, testing such systems is problematic, requiring considerable resource not only to build and deploy working real-time implementations but also to run the large scale experiments needed to properly evaluate them.
The results presented in this paper show that fully statistical systems are not only viable, they also outperform conventional systems, especially in challenging environments. The results also suggest that by matching word error rate, crowd sourced phone-based testing can be a useful and economic surrogate for specific environments such as the motor car. This is in contrast to the use of user simulators acting at the dialogue level, which grossly exaggerate expected performance. A corollary of this result is that using user simulators to train statistical dialogue systems is equally undesirable, and this observation is supported by recent results which show that when a statistical dialogue system is trained directly by real users, success rates further improve relative to conventional systems [26].
² As well as being used to train the POMDP-based system, the user simulator was used to tune the rules in the conventional hand-crafted system.
References
1 Roy N, Pineau J, Thrun S (2000) Spoken dialogue management using probabilistic reasoning. In: Proceedings of ACL
2 Young S (2002) Talking to machines (statistically speaking). In: Proceedings of ICSLP
3 Williams J, Young S (2007) Partially observable Markov decision processes for spoken dialog systems. Comput Speech Lang 21(2):393–422
4 Young S, Gasic M, Thomson B, Williams J (2013) POMDP-based statistical spoken dialogue systems: a review. Proc IEEE 101(5):1160–1179
5 Scheffler K, Young S (2000) Probabilistic simulation of human-machine dialogues. In: ICASSP
6 Pietquin O, Dutoit T (2006) A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Trans Speech Audio Process, Spec Issue Data Min Speech, Audio Dialog 14(2):589–599
7 Schatzmann J, Weilhammer K, Stuttle M, Young S (2006) A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. KER 21(2):97–126
8 Pietquin O, Renals S (2002) ASR system modelling for automatic evaluation and optimisation of dialogue systems. In: International conference on acoustics speech and signal processing, Florida
9 Thomson B, Henderson M, Gasic M, Tsiakoulis P, Young S (2012) N-best error simulation for training spoken dialogue systems. In: IEEE SLT 2012, Miami
10 Tsiakoulis P, Gašić M, Henderson M, Planells-Lerma J, Prombonas J, Thomson B, Yu K, Young S, Tzirkel E (2012) Statistical methods for building robust spoken dialogue systems in an automobile. In: Proceedings of the 4th applied human factors and ergonomics
11 Jurčíček F, Keizer S, Gašić M, Mairesse F, Thomson B, Yu K, Young S (2011) Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk. In: Proceedings of Interspeech
12 Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2006) The HTK book version 3.4. Cambridge University, Cambridge
13 Mairesse F, Gašić M, Jurčíček F, Keizer S, Thomson B, Yu K, Young S (2009) Spoken language understanding from unaligned data using discriminative classification models. In: Proceedings of ICASSP
17 Minka T (2001) Expectation propagation for approximate Bayesian inference. In: Proceedings of the 17th conference in uncertainty in artificial intelligence (Seattle). Morgan-Kaufmann, pp 362–369
18 Thomson B, Jurcicek F, Gasic M, Keizer S, Mairesse F, Yu K, Young S (2010) Parameter learning for POMDP spoken dialogue models. In: IEEE workshop on spoken language technology (SLT 2010), Berkeley
19 Jurcicek F, Thomson B, Young S (2011) Natural actor and belief critic: reinforcement algorithm for learning parameters of dialogue systems modelled as POMDPs. ACM Trans Speech Lang Process 7(3)
20 Schatzmann J, Thomson B, Weilhammer K, Ye H, Young S (2007) Agenda-based user simulation for bootstrapping a POMDP dialogue system. In: Proceedings of HLT
21 Yu K, Young S (2011) Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Audio, Speech Lang Process 19(5):1071–1079
22 Mairesse F, Gašić M, Jurčíček F, Keizer S, Thomson B, Yu K, Young S (2010) Phrase-based statistical language generation using graphical models and active learning. In: Proceedings of ACL
23 OnStar (2013) OnStar FMV mirror. http://www.onstarconnections.com/
24 Williams J (2012) A critical analysis of two statistical spoken dialog systems in public use. In: Spoken language technology workshop (SLT), Miami
25 Gasic M, Breslin C, Henderson M, Kim D, Szummer M, Thomson B, Tsiakoulis P, Young S (2013) POMDP-based dialogue manager adaptation to extended domains. In: SigDial 13, Metz
26 Gasic M, Breslin C, Henderson M, Kim D, Szummer M, Thomson B, Tsiakoulis P, Young S (2013) On-line policy optimisation of Bayesian spoken dialogue systems via human interaction. In: ICASSP 2013, Vancouver
Syntactic Filtering and Content-Based
Retrieval of Twitter Sentences
for the Generation of System Utterances
in Dialogue Systems
Ryuichiro Higashinaka, Nozomi Kobayashi, Toru Hirano,
Chiaki Miyazaki, Toyomi Meguro, Toshiro Makino and Yoshihiro Matsuo
Abstract Sentences extracted from Twitter have been seen as a valuable resource for response generation in dialogue systems. However, selecting appropriate ones is difficult due to their noise. This paper proposes tackling such noise by syntactic filtering and content-based retrieval. Syntactic filtering ascertains the valid sentence structure as system utterances, and content-based retrieval ascertains that the content has the relevant information related to user utterances. Experimental results show that our proposed method can appropriately select high-quality Twitter sentences, significantly outperforming the baseline.
In addition to performing tasks [19], dialogue systems should be able to perform open-domain conversation or chat in order for them to look affective and to build social relationships with users [2]. Chat capability also leverages the usability of task-oriented dialogue systems because real users do not necessarily utter only task-related (in-domain) utterances but also chatty utterances [17]; such utterances, if not handled correctly, can cause misunderstandings.
R. Higashinaka (B) · N. Kobayashi · T. Hirano · C. Miyazaki · T. Meguro · T. Makino · Y. Matsuo
NTT Media Intelligence Laboratories, Kanagawa, Japan
One challenge facing an open-domain conversational system is the wide variety of topics in user utterances. Conventional methods have used hand-crafted rules, but the coverage of topics is usually very limited [20]. To increase the coverage, recent studies have exploited the web, typically Twitter, to extract and use sentences for response generation [1, 15]. However, due to the nature of the web, such sentences are likely to be negatively affected by noise.
Heuristic rules have been proposed by Inaba et al. [10] to filter inappropriate Twitter sentences, but since their filtering is performed on the word level, their filtering capability is very limited. To overcome this limitation, this paper proposes syntactic filtering and content-based retrieval of Twitter sentences; syntactic filtering ascertains the validity of sentence structures and content-based retrieval ascertains that the extracted sentences contain information relevant to user utterances.
In what follows, Sect. 2 covers related work. Section 3 explains our proposed method in detail. Section 4 describes the experiment we performed to verify our method. Section 5 summarizes the paper and mentions future work.
Conventional approaches to open-domain conversation have heavily depended on hand-crafted rules. The early systems such as ELIZA [21] and PARRY [3] used heuristic rules derived from psycho-linguistic theories. Recent systems at the Loebner prize (a chat system competition) typically use tens of thousands of hand-crafted rules [20]. Although such rules enable high-quality responses to expected user utterances, they fail to respond appropriately to unexpected ones. In such cases, systems tend to utter innocuous (fall-back) utterances or change topic, which often lowers user satisfaction.
To overcome this problem, recent studies have used the web for response generation. For example, Shibata et al. and Yoshino et al. used sentences in web-search results for response generation [15, 22]. To make utterances more colloquial and suitable for conversation, instead of web-search results, Twitter has become the recent target for sentence extraction [1]. Although extracting sentences from the web can deal with a wide variety of topics in user utterances, due to the web's diversity, the extracted sentences are likely to contain noise.
gener-To suppress this noise, Inaba et al proposed word-based filtering of Twitter tences [10] Their rules filter sentences that contain context-dependent words such
sen-as referring/temporal expressions They also score sentences by using the weights
of words calculated from a reference corpus and remove those with low scores Ourmotivation is similar to Inaba et al.’s in that we want to extract sentences from Twitterthat are appropriate as system utterances, but our work is different in that, in addition
to word-level filters, we also take into account the syntax and the content of Twitter sentences for more accurate sentence extraction.
Although not within the scope of this paper, there are emerging approaches to building knowledge bases for chat systems by using web resources. Higuchi et al. mined the web for associative words (mainly adjectives) to fill in their generation templates [8], and Sugiyama et al. created a database of dependency structures from Twitter to find words for their templates [16]. Statistical machine translation techniques have also been utilized to obtain transformation rules (as a phrase table) from input to output utterances [14]. Although we find it important to create good knowledge bases from the web for generation, since it is still in a preliminary phase and the reported quality of generated utterances is rather low, we currently focus on the selection of sentences.
In this paper, we assume that the input to our method is what we refer to as a topic word. A topic word (represented by noun phrases in this paper) represents the current topic (focus) in dialogue and can be obtained from a user utterance or from the dialogue context. We do not focus on the extraction of topic words in this paper; note that finding appropriate topic words themselves is a difficult problem, requiring the understanding of the context.
Under this assumption, our task is to retrieve appropriate sentences from Twitter given a topic word. Our method comprises four steps: preprocess, word-based filtering, syntactic filtering, and content-based retrieval. Note that, in this paper, we assume the language used is Japanese.
3.1 Preprocess
As a preprocess, input tweets are first stripped of Twitter-dependent expressions (e.g., retweeted content and user names with mention markers). Then, the tweets are split into sentences by sentence-ending punctuation marks. After that, sentences that are too short (less than five characters) or too long (more than 30 characters) are removed because they may not be appropriate as colloquial utterances. We also remove sentences that contain no Japanese characters.
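A rough reimplementation of this step is sketched below; the mention/retweet patterns and the Japanese-character test are our assumptions, while the 5 to 30 character bounds follow the text.

```python
import re

MENTION = re.compile(r"(^|\s)(RT\s+)?@\w+:?")           # assumed Twitter markup
SENT_END = re.compile(r"(?<=[。！？!?])")                 # split after sentence enders
JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # kana and kanji ranges

def preprocess(tweet):
    text = MENTION.sub(" ", tweet).strip()
    sentences = [s.strip() for s in SENT_END.split(text) if s.strip()]
    # Keep sentences of 5-30 characters that contain Japanese script.
    return [s for s in sentences if 5 <= len(s) <= 30 and JAPANESE.search(s)]

print(preprocess("RT @user: ラーメンが食べたい！今日は寒いですね。"))
```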
3.2 Word-Based Filtering
The sentences that pass the preprocess are processed by a morphological analyzer. The sentences together with their analysis results are sent to the word-based filters. There are three filters:
(1) Sentence Fragment Filter: If the sentence starts with sentence-end particles, punctuation marks, or case markers (Japanese case markers do not appear at the beginning of a sentence), it is removed. If the sentence ends with a conjunctive form of verbs/adjectives (meaning that the sentence is not complete), it is removed. This filter is intended to remove sentence fragments caused mainly by sentence splitting errors.
(2) Reference Filter: If the sentence contains pronouns, deixes, or referring expressions such as 'it' and 'that', it is removed. If the sentence has words related to comparisons (such as more/than) or an anteroposterior relation (such as following/next), it is also removed. If the sentence has words representing reason or cause, it is removed. If the sentence contains relation-related words, such as family members (mother, brother, etc.), it is also removed. Such sentences need to be removed because entities and events being referred to may not be present in the sentence or differ depending on the speaker.
(3) Time Filter: If the sentence contains time-related words, such as dates and relative dates, it is removed. If the sentence has verbal suffixes representing past tenses (such as 'mashita' and 'deshita'), it is also removed. Such sentences are associated with certain time points and therefore may not be used independently of the context.
The filters here are similar to those used by Inaba et al. [10] with some extensions, such as the use of tense and relation-related words. The filters are applied to input sentences in a cascading manner. If a sentence passes all the filters, it is sent to syntactic filtering.
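The cascade can be pictured with tiny stand-in word lists as below; the real filters operate on morphological-analyzer output and use far larger lexicons for referring, time and relation expressions.

```python
# Illustrative word-based filter cascade; lexicons are toy placeholders.
REFERRING = {"it", "that", "this", "more", "than", "next", "mother", "brother"}
TIME_WORDS = {"today", "yesterday", "tomorrow", "mashita", "deshita"}
CASE_MARKERS = {"wa", "ga", "wo"}

def fragment_filter(tokens):
    # Reject sentences that start with a case marker (a fragment symptom).
    return bool(tokens) and tokens[0] not in CASE_MARKERS

def reference_filter(tokens):
    return not (set(tokens) & REFERRING)

def time_filter(tokens):
    return not (set(tokens) & TIME_WORDS)

def passes_word_filters(sentence):
    tokens = sentence.lower().split()   # stand-in for morphological analysis
    return all(f(tokens) for f in (fragment_filter, reference_filter, time_filter))

print(passes_word_filters("ramen wa delicious"))        # True
print(passes_word_filters("it was delicious deshita"))  # False
```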
3.3 Syntactic Filtering
The sentences are checked with regard to their syntactic structures. This process is intended to ascertain if the sentence is structurally valid as an independent utterance; that is, the sentence is grammatical and has necessary arguments for predicates. For example, "watashi wa iku (I go)" does not have a destination for the predicate "go", making it a non-understandable utterance on its own.
However, such checks are actually difficult to perform. This is because Twitter sentences are mostly in colloquial Japanese with many omissions of particles and case markers, making it hard to use the rigid grammar of written Japanese for validation. In addition, missing arguments do not necessarily mean an invalid structure because Japanese contains many zero-predicate and zero-pronoun structures. For example, "eiga ni ikitai (want to go to the movies)" does not have a subject for a predicate, but since the sentence is in the desiderative mood, we can assume that the subject is "watashi (I)" and the sentence is thus understandable. The checks need to take into account the types of predicates as well as mood, aspect, and voice, making it difficult to enumerate by hand all the conditions when a sentence can be valid. Therefore, to automatically find conditions when a sentence is valid, we turn to a machine
Fig. 1 A word dependency tree for "Ichiro wa eiga ni iku (Ichiro goes to the movies)". The nodes of base forms and end forms are omitted from illustration because they are exactly the same as word surfaces in this example
learning based approach and use a binary classifier that has been trained from data to determine whether a sentence is valid or invalid on the basis of its structure. Note that the aim of this filtering is NOT to guarantee the "syntactic well-formedness" of sentences, since responses need not be syntactically well-formed in "chit-chat" type interactions; here we simply want to remove sentences that are considered invalid from their structures. Below we show how we created the classifier.
3.3.1 Machine Learning Based Classifier
To create the classifier, we first collected Twitter sentences and labeled them as valid (i.e., positive examples) and invalid (i.e., negative examples). Then, we converted the sentences into word dependency trees by using a dependency analyzer in a manner similar to Higashinaka and Isozaki [7]. The trees have part-of-speech tags as main nodes with word surfaces, base forms, and end forms as their daughters (see Fig. 1 for an example). Finally, the trees of negative and positive examples were input to BACT [11], a boosting based algorithm for classifying trees, to train a binary classifier. BACT enumerates subtrees in the input data and uses the existence of the subtrees as features for boosting-based classification. Since subtrees are used as features, syntactic structures are taken into account for classification.
For creating the training data, we sampled 164 words as topic words from our dialogue corpus [13]. Then, for each topic word, we retrieved up to 100 Twitter sentences by using a text search engine that has an index similar to (d) in Table 1 with a content-based retrieval method we describe later (see Sect. 3.4). For the retrieved sentences, an annotator, who is not the author, labeled validity scores on a five-point Likert scale where 1 indicates completely invalid and 5 completely valid. We treated sentences scored 1 and 2 as negative examples and those scored 4 and 5 as positive examples. We did not use sentences scored 3. In total, we created 3880 positive and 1304 negative examples. By using these data, a classifier was learned by BACT. The evaluation was done by using a twofold cross validation, with each fold having examples regarding 82 topic words. Figure 2 shows the recall-precision curves for the
Table 1 Statistics of our Twitter data

                                                                     Number        Retained ratio
(c) Number of sentences retained by word-based filtering             103,655,452   11.9 %
(e) Number of unique sentences retained by the syntactic filtering   7,907,888     0.9 %

Retained ratio is the ratio of retained sentences over (b)
Fig. 2 Recall-precision curves for N-gram based and syntactic filtering. The graph shows the result for one of the folds in twofold cross validation. The other fold has the same tendency
trained syntactic classifier (Syntax) with a comparison to an N-gram based baseline (N-gram). Here, the baseline uses word and part-of-speech N-grams (unigrams to 5-grams) as features with logistic regression as a training algorithm [4]. The curves show that our trained syntactic filter classifies sentences with good precision. It is also visible that the syntactic filter consistently outperforms the baseline. As a requirement for a filter, low false acceptance is desirable. By a statistical test (a sign-test that compares the number of times the syntactic filter outperforms the N-gram based filter and vice versa), we confirmed that the syntactic filter has significantly lower false acceptance than the baseline (p < 0.001), verifying the usefulness of syntactic information.
3.3.2 Filtering by the Trained Classifier to Create an Index
On the basis of the evaluation result, we decided to use the syntactic classifier (trained with all the examples) to filter input sentences. The sentences that pass this filter are indexed by a text search engine (we use Lucene, see Sect. 4.1) that allows for efficient searching.
3.4 Content-Based Retrieval
Content-based retrieval can retrieve sentences that contain information related to an input topic word. For this, we use a dictionary of related words. Related words are the words strongly associated with a topic word. We collect such words from the web and use them to expand the search query so that the retrieved sentences contain such words.
The idea here is inspired by the work of Louis and Newman [12] that uses related words for tweet retrieval, but our work is different in that we allow arbitrary words as input (not just a named-entity type) and use a high-quality dictionary of related words built by strict lexico-syntactic patterns, not just simple word collocation.
3.4.1 Creating a Dictionary of Related Words
We use lexico-syntactic patterns to extract related words. Lexico-syntactic patterns have been successfully used to obtain related words such as hyponyms [6] and attributes [18].
For a given word W, we collect noun phrases (NP), adjectives (ADJ), and verbs (V) as related words. For noun phrases, we use a lexico-syntactic pattern similar to that used by Tokunaga [18] and collect attributes of W. More specifically, we use the pattern "W no NP (ga|wo|ni|de|kara|yori|e|made)", corresponding to "NP of W Verb" in English. We collect attributes because they form part of a topic word and therefore are likely to be closely related. For adjectives, we use the pattern "W (ga|wa) ADJ", corresponding to "W is ADJ" in English. This pattern retrieves adjectival properties of W. For verbs, we use "W (ga|wo|ni|de) V", where W appears in the important argument positions (nominative, accusative, dative, and locative positions) of V.
By using the weblogs of 180M articles we crawled, we used the above patterns to extract the related words for all noun phrases in the data. Then, we distilled the results by filtering words that do not correlate well with the entry word (i.e., W). We used the log likelihood ratio test (G-test) to determine whether a related word appears significantly more than chance. We retained only the related words that have a G value of over 10.83 (i.e., p < 0.001). Finally, the retained words comprise our related word dictionary. The dictionary contains about 2.2M entries. To give a brief example, an entry of "Ramen" (a type of noodle dish) includes noodles, soup, restaurant as NP, delicious, tasty, longing as ADJ, and eat, order, sip for V.
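A G-test filter of this kind can be implemented directly from a 2×2 table of pattern-match counts; the counts in the sketch below are invented for illustration, while the 10.83 cutoff follows the text.

```python
import math

def g_statistic(k11, k12, k21, k22):
    """G = 2 * sum(O * ln(O/E)) over a 2x2 contingency table.
    k11: entry word with candidate, k12: entry without candidate,
    k21: candidate without entry, k22: neither."""
    total = k11 + k12 + k21 + k22
    g = 0.0
    for obs, row, col in ((k11, k11 + k12, k11 + k21),
                          (k12, k11 + k12, k12 + k22),
                          (k21, k21 + k22, k11 + k21),
                          (k22, k21 + k22, k12 + k22)):
        expected = row * col / total
        if obs > 0:
            g += obs * math.log(obs / expected)
    return 2 * g

g = g_statistic(k11=40, k12=960, k21=200, k22=178800)
print(g, g > 10.83)   # keep the pair only if G exceeds the p < 0.001 cutoff
```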
3.4.2 Retrieval Method
Given a topic word T, we search for the top-N sentences from the index. Here, we score a sentence S by a formula that rewards the presence of T and its related words in S and relatively lowers the rank of sentences that contain irrelevant content.
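The paper's exact scoring equation is not reproduced here, so the scorer below is purely hypothetical: consistent with the surrounding description, it credits sentences containing the topic word and its related words and lowers the rank of sentences dominated by unrelated content words, but it is not the published formula.

```python
# Hypothetical content-based scorer; RELATED stands in for the dictionary.
RELATED = {"ramen": {"noodles", "soup", "restaurant", "delicious", "eat"}}

def score(sentence, topic):
    tokens = set(sentence.lower().split())
    related = RELATED.get(topic, set()) | {topic}
    hits = len(tokens & related)          # topic word and related words found
    unrelated = len(tokens - related)     # remaining, possibly irrelevant words
    return hits / (1 + unrelated)         # illustrative trade-off only

candidates = ["ramen soup is delicious", "ramen shops near the station yesterday"]
print(sorted(candidates, key=lambda s: -score(s, "ramen")))
```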
In an attempt to make our syntactic filter more sensitive to false acceptance, we used 0.005 as a cut-off threshold (default 0.00). We created two indices from the data: one created with (d) and the other with (e). The aim of this is to compare the effectiveness of the syntactic filter in the experiment we describe later. We call the former the whole index and the latter the filtered index. We used Lucene, which is an open source text search engine, to create the indices.
4.2 Experimental Procedure
We made four systems for comparison: one is the baseline that only uses word-based filtering, and the others are variations of the proposed method. The systems are as follows:

Baseline: The whole index is used for sentence retrieval. In ranking the sentences, a vector space model using TF-IDF weighted word vectors is used. This is the default search mechanism in Lucene. This is the condition where there is no syntactic filter or content-based retrieval.
amazon, Minatomirai, Iraq, Cocos, Smart-phone, Disney Sea, news, Hashed Beef, Hello Work, FamilyMart, Fuji Television, horror, Pocari Sweat, Mister Donut, mosquito, weather, Kinkakuji temple, accident, Hatsushima, Shinsengumi, fortune-telling, region, local area, Tokyo Bay, pan, Yatsugatake, damage, Kitasenju, Meguro, baseball club, courage

Fig. 3 Topic words used for the experiment. The words were originally in Japanese and translated into English
We chose not to evaluate only the top-1 sentence; dialogue systems usually continue on the same topic for a certain number of turns, making it necessary for the systems to be able to create multiple sentences for a given topic. In addition, it is common practice in chat systems that sentences be randomly chosen from a pool of sentences for making variation in utterances. We believe evaluating randomly selected utterances from top-ranked retrieved sentences is appropriate in terms of actual system deployment. By this procedure, we created 93 utterances for each system, for a total of 372 utterances.
We had two judges, who are not any of the authors, subjectively evaluate the quality of the generated utterances (shown with topic words and presented in a randomized order) in terms of (i) understandability (if the utterance is understandable as a response to a topic word) and (ii) continuity (if the utterance makes you willing to continue the conversation on the topic) on a five-point Likert scale, where 1 is the worst and 5 the best. We use averaged understandability and continuity scores to evaluate the systems. In addition to these metrics, we also use a metric that we call (iii) non-understanding rate, which is the rate of lowly rated utterances (scores 1 and 2) in the understandability score over the number of total utterances. Since even a single non-understandable utterance can lead to a sudden breakdown in conversation, we consider this figure to be an important indicator of robustness to keep the conversation on track. Each utterance was judged independently.
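Computing the three metrics from per-utterance judgments is then straightforward, as in this sketch with fabricated scores:

```python
# Fabricated understandability judgments, one Likert score per utterance.
scores = [5, 4, 4, 2, 5, 3, 1, 4]

avg_understandability = sum(scores) / len(scores)
non_understanding_rate = sum(s <= 2 for s in scores) / len(scores)
print(f"avg understandability = {avg_understandability:.2f}")
print(f"non-understanding rate = {non_understanding_rate:.0%}")
```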
Table 2 Averaged understandability scores, continuity scores, and non-understanding rates
When we look at the non-understanding rates, we find that Syntax+Content achieves a very low figure of 6 %, suggesting that in most cases the utterances do not lead to a sudden breakdown of dialogue.
Within the utterances that Syntax+Content created, only one utterance scored 1 for understandability:
(1) aiteru-yoo kitasenju-ni ii yakinikuya-*kara ikou-zee
    open-SEP Kitasenju-at good BBQ-restaurant-from go-SEP
    'It's open. Why don't we go *from the good BBQ restaurant at Kitasenju'

Here, SEP denotes a sentence-end particle and an asterisk means ungrammatical. This sentence contains two sentences without any punctuation mark in between; the first sentence has a missing argument and the second sentence has an incorrect predicate-argument structure. The trained syntactic classifier probably failed to detect it as invalid because such a complex combination of errors was not seen in the training data. An increase in the training data could solve the problem.
This paper proposed syntactic filtering and content-based retrieval of Twitter sentences so that the retrieved sentences can be safely used for response generation in dialogue systems. Experimental results showed that our proposed method can appropriately select high-quality Twitter sentences, significantly outperforming the word-based baseline. Our contribution lies in discovering the usefulness of syntactic information in filtering Twitter sentences and in validating the effectiveness of related words in retrieving sentences. For future work, we plan to investigate how to extract topic words from the context and also to create a workable conversational system with speech recognition and speech synthesis.
Acknowledgments We thank Prof. Kohji Dohsaka of Akita Prefectural University for his helpful advice on statistical tests. We also thank Tomoko Izumi for her suggestions on how to write linguistic examples.
References

… Asian Lang Inf Process 7(2)
8 Higuchi S, Rzepka R, Araki K (2008) A casual conversation system using modality and word associations retrieved from the web. In: Proceedings of the EMNLP, pp 382–390
9 Imamura K, Kikui G, Yasuda N (2007) Japanese dependency parsing using sequential labeling for semi-spoken language. In: Proceedings of the ACL, pp 225–228
10 Inaba M, Kamizono S, Takahashi K (2013) Utterance generation for non-task-oriented dialogue systems using Twitter. In: Proceedings of the 27th annual conference of the Japanese society for artificial intelligence, 1K4-OS-17b-4 (in Japanese)
11 Kudo T, Matsumoto Y (2004) A boosting algorithm for classification of semi-structured text. In: Proceedings of the EMNLP, pp 301–308
12 Louis A, Newman T (2012) Summarization of business-related tweets: a concept-based approach. In: Proceedings of the COLING 2012 (Posters), pp 765–774
13 Meguro T, Higashinaka R, Minami Y, Dohsaka K (2010) Controlling listening-oriented dialogue using partially observable Markov decision processes. In: Proceedings of the COLING
17 Takeuchi S, Cincarek T, Kawanami H, Saruwatari H, Shikano K (2007) Construction and optimization of a question and answer database for a real-environment speech-oriented guidance system. In: Proceedings of the Oriental COCOSDA
18 Tokunaga K, Kazama J, Torisawa K (2005) Automatic discovery of attribute words from web documents. In: Proceedings of the IJCNLP, pp 106–118
19 Walker MA, Passonneau R, Boland JE (2001) Quantitative and qualitative evaluation of DARPA communicator spoken dialogue systems. In: Proceedings of the ACL, pp 515–522
20 Wallace RS (2004) The anatomy of A.L.I.C.E. A.L.I.C.E. artificial intelligence foundation, Inc
21 Weizenbaum J (1966) ELIZA—a computer program for the study of natural language communication between man and machine. Commun ACM 9(1):36–45
22 Yoshino K, Mori S, Kawahara T (2011) Spoken dialogue system based on information extraction using similarity of predicate argument structures. In: Proceedings of the SIGDIAL, pp 59–66
Knowledge-Guided Interpretation
and Generation of Task-Oriented Dialogue
Alfredo Gabaldon, Pat Langley, Ben Meadows and Ted Selker
Abstract In this paper, we present an architecture for task-oriented dialogue that
integrates the processes of interpretation and generation. We analyze implemented systems based on this architecture—one for meeting support and another for assisting military medics—and discuss results obtained with the first. In closing, we review some related dialogue architectures and outline plans for future research.
Systems that use natural language to assist a user in carrying out some task must interact with that user as execution of the task progresses. The system in turn must interpret the user’s utterances and other environmental input to build a model of what
both it and the user believe and intend—in regard to each other and the environment.
The system also requires knowledge to use the model it constructs to participate in
a dialogue with the user and support him in achieving his goals.
Our architecture relies on abstract meta-level knowledge that generalizes across different domains. In particular, we are interested in high-level aspects of dialogue: knowledge and strategies relevant to dialogue processing that are independent of the actual content of the conversation. The architecture separates domain-level from meta-level content, using both during interpretation and generation. The work we report is informed by cognitive systems research, a key feature of which is arguably integration and processing of knowledge at different levels of abstraction [7].
Another feature of our architecture is the incremental nature of its processes. We assume that dialogues occur within a changing environment and that the tasks to be accomplished are not predetermined but discerned as the dialogue proceeds. Our architecture incrementally expands its understanding of the situation and the user’s goals, acts according to this understanding, and adapts to changes in the situation, sometimes choosing to pursue different goals. In other words, the architecture supports situated systems that carry out goal-directed dialogues to aid their users. In the next section we discuss two implemented prototypes that demonstrate this key functionality. We follow this with a detailed description in Sect. 3 of the underlying architecture and a discussion of results in Sect. 4. We conclude with comments on related work and plans for future research.
In this section, we discuss two prototypes that incorporate our architecture as their dialogue engine. The first system facilitates cyber-physical meetings by interacting with humans and equipment; the second is an advisory system that collaborates with a military medic to address the mutual goal of treating a patient. In each case, we discuss the setting, the knowledge that drives behavior, and a sample dialogue that illustrates operation. Neither system interacts with users in spoken language, but each could be adapted to this end in a straightforward manner.
2.1 A System for Meeting Support
We have developed a system that supports ‘cyber-physical’ meetings in which a number of human users interact with each other and with a number of devices to pursue some joint activity. The specific scenario we examine here involves a session between a physical therapist, a patient, and a relative of the patient who have the shared aim of completing a physical therapy session. During the meeting, the users carry out a dialogue that includes the system as a participant. The users and system also interact with distributed sensors that measure leg motion, heart rate, and ambient light and temperature, as well as distributed actuators that include a ‘smart’ television and a text-to-speech generator.
During a physical therapy session, users communicate by entering sentences through a menu-based smart-phone interface, although we intend to replace this with a spoken-language interface in future work. The current interface accepts sentences in a controlled version of English, translates them into a logic-like representation, and sends the result to the dialogue system for processing. At the same time, the sensing devices send messages to the dialogue system via HTTP. Each user can see the utterances of other users and those of the system via the phone interface.

The system has knowledge about activities that arise during a therapy session, including physical activities the patient can carry out, such as repeatedly lifting a leg, and therapist procedures, such as obtaining relevant patient information (from a database or the patient), measuring range of movement, and tracking heart rate and blood pressure. In a typical session, the system asks the patient standard questions and the therapist asks additional ones. Then the therapist decides on the task for the session and suggests it to the patient, relative, and system. If the patient or family member accepts the proposed task, the system updates its model of the current situation and proceeds accordingly. The system supports the patient’s execution of a task by tracking his progress, instructing the television to show a tutorial video, and providing feedback. For instance, once sensor input reveals the patient has started doing an exercise, it might encourage him by saying “Great start!”

Table 1 Sample dialogue involving a physical therapist (Doctor), a patient (John), and the meeting support system

  Doctor   John, please do leg lifts.
  Patient  Okay. I will do leg lifts.
  TV       [starts playing tutorial video about leg lifting]
  Sensor   [sends movement signal to system]
  System   Great start John!
  Sensor   [sends movement signal to system]
           [after patient does a few more leg lifts …]
  System   You are doing great, John!
           [after a few more, the patient stops …]
  Patient  I can’t do any more leg lifts.
  Doctor   It’s okay, John. That was good.
  System   John, you did eight repetitions.
  Patient  How many did I do in the previous session?
  System   In the previous session you did five repetitions.
Specific components of the meeting support system include a menu-based interface on a smart phone to input English sentences, a phone application that serves as a motion detector, a television for displaying tutorials and other support videos, a heart-rate monitor, environmental sensors for temperature and lighting, an HTTP client/server module for component communication, and the dialogue system. Table 1 shows a sample dialogue for one of the physical therapy scenarios. In this case, the patient John participates in a session in which he partially completes a leg exercise under supervision of a therapist at a remote location. We will return to this case study in Sect. 4, where we examine it in more detail.
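To make this concrete, the sketch below shows how one exchange from Table 1 might look after translation into the logic-like representation, written here as Prolog facts and using the speech-act vocabulary introduced in Sect. 3. The utterance_translation/2 predicate, the activity identifier a1, and the triple attributes are our assumptions for illustration, not the system's actual output format.

  % Hypothetical translations of two menu-entered utterances from Table 1.
  % The therapist's request becomes a propose act over an activity a1:
  utterance_translation('John, please do leg lifts.',
                        propose(doctor, john, [a1, type, leg_lifts])).
  % The patient's reply becomes an accept act over the same content:
  utterance_translation('Okay. I will do leg lifts.',
                        accept(john, doctor, [a1, type, leg_lifts])).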
2.2 A Medic Assistant
Our second prototype involves scenarios in which a military medic on the battlefield helps an injured teammate. Because the medic has limited training, he interacts with the dialogue system to get advice on treating the person; the system plays the role of a mentor with medical expertise. The medic and system collaborate towards achieving the shared goal of stabilizing the patient’s medical condition. The system does not know the specific task in advance. Only after the conversation starts, and the medic provides relevant information, does the system act on this content and respond in ways that are appropriate to achieving the goal. The system does not effect change on the environment directly; the medic provides both sensors and effectors, with the system influencing him by giving instructions.
During an interaction, the system asks an initial sequence of questions that lead the medic to provide details about the nature of the injury. This sequence is not predetermined, in that later questions are influenced by the medic’s responses to earlier ones. Table 2 shows a sample dialogue in which the medic-system team attempts to stabilize a person with a bleeding injury. The system possesses domain knowledge about how to treat different types of injuries, taking into account their location, severity, and other characteristics. The program can also adapt the treatment according to the medic’s situation. For instance, it may try a different treatment for a wound if the medic claims that he cannot apply a particular treatment because he lacks the supplies necessary for that purpose.
This system uses a Web interface similar to a text-messaging application, although again we plan to replace this with a spoken dialogue module in the future. The medic types English sentences into a form element within the interface, which it sends to the dialogue system via an HTTP request. The system in turn sends the content to a natural language processor that translates it into a logical form our system can interpret.
We have used Skyphrase (http://www.skyphrase.com), a proprietary, semantically-driven language interpreter, for this purpose. Skyphrase sends back the translation, again via HTTP, and the dialogue system uses the information to update its model of the current situation. This drives the system’s continuing behavior in pursuing the joint task. Lack of space constrains our discussion of the system here, but we have provided further details elsewhere [6].
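As a rough sketch of this update step, the Prolog fragment below shows how a translated utterance from Table 2 might enter the system's model of the situation. The logical form shown and the update_model/1 rule are assumptions made for exposition; the paper does not specify Skyphrase's output format.

  :- dynamic belief/2.

  % Hypothetical logical form for the medic's "He's bleeding." (patient p1):
  %   inform(medic, advisor, [p1, status, bleeding])
  % A minimal update rule: an inform act records the speaker's belief and
  % folds the content into the system's own model of the situation.
  update_model(inform(Speaker, _Listener, Content)) :-
      assertz(belief(Speaker, Content)),
      assertz(belief(system, Content)).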
2.3 Discussion
Although the two systems we have just described are limited in many respects, developing them raised a number of new challenges that have not been fully addressed by existing dialogue systems, at least in combination:
• As the dialogue and the users’ situation evolve, the system must update its account of the situation by interpreting new information as it arrives, including dialogue utterances and sensor inputs.
• The system’s interaction with users is goal directed and involves joint activity over time; this activity includes not only actions carried out by the users, but also communicative actions and commands to device actuators.
• The system must establish a common ground [4] with its users, that is, it must develop a shared model of the environmental situation and others’ mental states; it must also update this joint model as circumstances change and information flows among participants.
• The beliefs and goals of each participant are not stated explicitly, but the system must infer enough of them to be effective; this involves using not only domain-specific knowledge, but also more abstract knowledge that relates mental states to communication events.
• The overall process is highly dynamic, as the system continuously draws inferences from users’ utterances and other input to expand its understanding of the evolving situation, and as it carries out activities to achieve goals as they arise.

Table 2 Sample dialogue between a human medic and a more knowledgeable computer advisor

  Medic    We have a man injured!
  Advisor  Okay. What type of injury?
  Medic    He’s bleeding.
  Advisor  How bad is the bleeding?
  Medic    Pretty bad. I think it is an artery.
  Advisor  Okay. Where is the injury?
  Medic    It’s on the left leg.
  Advisor  Apply pressure on the leg’s pressure point.
  Medic    Roger that.
  Advisor  Has the bleeding stopped?
  Medic    No. He’s still bleeding.
  Advisor  Okay. Apply a tourniquet.
  Medic    Where do I put the tourniquet?
  Advisor  Just below the joint above the wound.
  Medic    Okay. The bleeding has stopped.

Our application systems and architecture represent first steps towards addressing these challenges. In the next section we describe the integrated architecture, an implementation of which serves as the main component of the two systems above.
Now we can turn to our framework for task-oriented dialogue. We have focused on supporting goal-directed behavior that is physically situated in dynamic contexts. The architecture depends on a knowledge base that lets it generate inferences, introduce goals, and execute actions. Input is multi-modal in that it might come from speech, text, visual cues, or external sensors. We have implemented the architecture in Prolog, making use of its support for embedded structures and pattern matching, but its representation and control mechanisms diverge substantially from the default Prolog inference engine, as we will see shortly.
3.1 Representation and Content
As in research on cognitive architectures [9], we distinguish between a dynamic short-term or working memory, which stores external inputs and inferences based upon this information, and a more stable long-term memory, which serves as a store of knowledge that is used to make inferences and organize activities.
Working memory is a rapidly changing set of ground literals that contains the system’s beliefs and goals as it models the evolving situation. Literals for domain-level content, which do not appear as top-level elements in working memory, are stored as relational triples, as in [i1, type, injury] or [i1, severity, major]. This reification lets the system examine and refer separately to different aspects of a single complex concept, including its predicate.
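A brief Prolog sketch may make the reification concrete. The facts below, and the severe_injury/2 rule that reads one aspect at a time, are our own illustration built from the triples above; the clause syntax in the implemented system may differ.

  belief(medic, [i1, type, injury]).        % the concept i1 is an injury
  belief(medic, [i1, severity, major]).     % one aspect, stored separately
  belief(medic, [i1, location, left_leg]).  % another aspect

  % Because each aspect is its own triple, a rule can test severity alone,
  % without matching the whole complex concept:
  severe_injury(Agent, I) :-
      belief(Agent, [I, type, injury]),
      belief(Agent, [I, severity, major]).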
Our representation also incorporates meta-level predicates, divorced entirely from the domain level, to denote speech acts [1, 13]. The literature contains many alternative taxonomies for speech acts; we have adopted a reduced set of six types that has been sufficient for our current purposes. These include:

  inform(S, L, C): speaker S asks L to believe content C;
  acknowledge(S, L, C): S tells L it has received and now believes content C;
  question(S, L, C): S asks L a question C;
  propose(S, L, C): S asks L to adopt goal C;
  accept(S, L, C): S tells L it has adopted goal C;
  reject(S, L, C): S tells L it has rejected goal C.

All domain-level and meta-level concepts in working memory are embedded within one of two predicates that denote aspects of mental states: belief(A, C) or goal(A, C) for some agent A and content C, as in belief(medic, [i1, type, injury]). A mental state’s content may be a triple, [i, r, x], a belief or goal term (nested mental states), an agent’s belief that some attribute has a value, as in belief_wh(A, [i, r]), a belief about whether some propositional content is true, as in belief_if(A, C), or a meta-level literal, such as the description of a speech act.
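The following working-memory snapshot illustrates this embedding; it is our own example rather than one drawn from the implemented systems, and the choice of agents and attributes is hypothetical.

  belief(medic, [i1, type, injury]).                     % domain-level triple
  belief(system, belief(medic, [i1, severity, major])).  % nested mental state
  goal(system, belief_wh(system, [i1, location])).       % wanting to know a value
  belief(system, inform(medic, system, [i1, location, left_leg])).
                                                         % a speech act as content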
Long-term memory contains generic knowledge in the form of rules. Each rule encodes a situation or activity by associating a set of triples in its head with a pattern of concepts in its body. High-level predicates are defined by decomposition into other structures, imposing an organization similar to that in hierarchical task networks [11]. Structures in long-term memory include conceptual knowledge, skills, and goal-generating rules.
Conceptual knowledge comprises a set of rules which describe classes of situations that can arise relative to a single agent’s beliefs or goals. These typically occur at the domain level and involve relations among states of the world. Conceptual rules define complex categories in terms of simpler ones and organize these relational predicates into taxonomies.
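Since the architecture interprets its own rule structures rather than relying on the native Prolog engine, a conceptual rule might be stored as a term like the one below. The conceptual_rule/2 wrapper and the severe_leg_wound category are our guesses at the flavor of such rules, not the authors' actual encoding.

  % A derived triple in the head, a pattern of simpler triples in the body:
  conceptual_rule([I, class, severe_leg_wound],
                  [[I, type, injury],
                   [I, severity, major],
                   [I, location, left_leg]]).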
Skills encode the activities that agents can execute to achieve their goals. Each skill describes the effects of some action or high-level activity under specified conditions. The body of a skill includes a set of preconditions, a set of effects, and a set of invariants, along with a sequence of subtasks that are either executable actions, in the case of primitive skills, or other skills, in the case of nonprimitive skills.
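A nonprimitive skill from the medic scenario might then be encoded as in the sketch below; the skill/2 structure and every field name are assumptions chosen to mirror the preconditions, effects, invariants, and subtasks just described.

  skill(stop_bleeding(Medic, Patient),
        [preconditions([[I, type, injury], [I, of, Patient],
                        [I, status, bleeding]]),
         subtasks([apply_pressure(Medic, Patient),      % decomposes into other
                   apply_tourniquet(Medic, Patient)]),  % skills or actions
         effects([[I, status, stabilized]]),
         invariants([[Patient, status, conscious]])]).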
Goal-generating rules specify domain-level knowledge about the circumstances under which an agent should establish new goals. For example, an agent might have a rule stating that, when a teammate is injured, it should adopt a goal for him to be stabilized. These are similar to conceptual rules, but they support the generation of goals rather than inference of beliefs.
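The injured-teammate example might be rendered as follows, with the goal_rule/2 wrapper and predicate names again hypothetical, chosen to parallel the conceptual-rule sketch above.

  % When the agent believes some I is an injury of teammate P,
  % adopt the goal that P be stabilized:
  goal_rule(goal(self, [P, status, stabilized]),
            [belief(self, [I, type, injury]),
             belief(self, [I, of, P])]).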
The architecture also includes more abstract, domain-independent knowledge at the meta-level. This typically involves skills, but it can also specify conceptual relations (e.g., about transitivity). The most important structures of this type are speech act rules that explain dialogue actions in terms of patterns of agents’ beliefs and goals, without making reference to domain-level concepts. However, the content of a speech act is instantiated as in any other concept. For example, the rule for an inform act is:
  inform(S, L, C) ← belief(S, C),
                    goal(S, belief(L, C)),
                    belief(S, belief(L, C)).
Here S refers to the speaker, L to the listener, and C to the content of the speech act. Rules for other speech acts take a similar abstract form.
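For instance, by analogy with the inform rule, a propose act might be explained by the pattern below, written in the same notation. This is our reconstruction of what such a rule could look like, not one quoted from the paper:

  propose(S, L, C) ← goal(S, C),
                     goal(S, goal(L, C)),
                     belief(S, goal(L, C)).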