Signals and Communication Technology
Situated Dialog in Speech-Based Human-Computer Interaction
More information about this series at http://www.springer.com/series/4748
Alexander Rudnicky • Antoine Raux
Editors

Alexander Rudnicky
School of Computer Science
Carnegie Mellon University
Moffett Field, CA, USA

Teruhisa Misu
Mountain View, CA, USA
Signals and Communication Technology
ISBN 978-3-319-21833-5 ISBN 978-3-319-21834-2 (eBook)
DOI 10.1007/978-3-319-21834-2
Library of Congress Control Number: 2015949507
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)
Contents

Part I Dialog Management and Spoken Language Processing
Evaluation of Statistical POMDP-Based Dialogue Systems in Noisy Environments
Steve Young, Catherine Breslin, Milica Gašić, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis and Eli Tzirkel Hancock
Syntactic Filtering and Content-Based Retrieval of Twitter Sentences for the Generation of System Utterances in Dialogue Systems
Ryuichiro Higashinaka, Nozomi Kobayashi, Toru Hirano, Chiaki Miyazaki, Toyomi Meguro, Toshiro Makino and Yoshihiro Matsuo
Knowledge-Guided Interpretation and Generation of Task-Oriented Dialogue
Alfredo Gabaldon, Pat Langley, Ben Meadows and Ted Selker
Justification and Transparency Explanations in Dialogue Systems to Maintain Human-Computer Trust
Florian Nothdurft and Wolfgang Minker
Dialogue Management for User-Centered Adaptive Dialogue
Stefan Ultes, Hüseyin Dikme and Wolfgang Minker
Chat-Like Conversational System Based on Selection of Reply Generating Module with Reinforcement Learning
Tomohide Shibata, Yusuke Egashira and Sadao Kurohashi
Investigating Critical Speech Recognition Errors in Spoken Short Messages
Aasish Pappu, Teruhisa Misu and Rakesh Gupta
Part II Human Interaction with Dialog Systems
The HRI-CMU Corpus of Situated In-Car Interactions
David Cohen, Akshay Chandrashekaran, Ian Lane and Antoine Raux
Detecting 'Request Alternatives' User Dialog Acts from Dialog Context
Yi Ma and Eric Fosler-Lussier
Emotion and Its Triggers in Human Spoken Dialogue: Recognition and Analysis
Nurul Lubis, Sakriani Sakti, Graham Neubig, Tomoki Toda, Ayu Purwarianti and Satoshi Nakamura
Evaluation of In-Car SDS Notification Concepts for Incoming Proactive Events
Hansjörg Hofmann, Mario Hermanutz, Vanessa Tobisch, Ute Ehrlich, André Berton and Wolfgang Minker
Construction and Analysis of a Persuasive Dialogue Corpus
Takuya Hiraoka, Graham Neubig, Sakriani Sakti, Tomoki Toda and Satoshi Nakamura
Evaluating Model that Predicts When People Will Speak to a Humanoid Robot and Handling Variations of Individuals and Instructions
Takaaki Sugiyama, Kazunori Komatani and Satoshi Sato
Entrainment in Pedestrian Direction Giving: How Many Kinds of Entrainment?
Zhichao Hu, Gabrielle Halberg, Carolynn R. Jimenez and Marilyn A. Walker
Situated Interaction in a Multilingual Spoken Information Access Framework
Niklas Laxström, Kristiina Jokinen and Graham Wilcock
Part III Speech Recognition and Core Technologies
A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition
Simon Receveur, David Scheler and Tim Fingscheidt
Engine-Independent ASR Error Management for Dialog Systems
Junhwi Choi, Donghyeon Lee, Seounghan Ryu, Kyusong Lee, Kyungduk Kim, Hyungjong Noh and Gary Geunbae Lee
Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses
Kazunori Komatani, Naoki Hotta and Satoshi Sato
A Semi-automated Evaluation Metric for Dialogue Model Coherence
Sudeep Gandhe and David Traum
Part I
Dialog Management and Spoken Language Processing
Evaluation of Statistical POMDP-Based
Dialogue Systems in Noisy Environments
Steve Young, Catherine Breslin, Milica Gašić, Matthew Henderson,
Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis
and Eli Tzirkel Hancock
Abstract Compared to conventional hand-crafted rule-based dialogue management systems, statistical POMDP-based dialogue managers offer the promise of increased robustness, reduced development and maintenance costs, and scalability to large open domains. As a consequence, there has been considerable research activity in approaches to statistical spoken dialogue systems over recent years. However, building and deploying a real-time spoken dialogue system is expensive, and even when operational, it is hard to recruit sufficient users to get statistically significant results. Instead, researchers have tended to evaluate using user simulators or by reprocessing existing corpora, both of which are unconvincing predictors of actual real world performance. This paper describes the deployment of a real-world restaurant information system and its evaluation in a motor car using subjects recruited locally and by remote users recruited using Amazon Mechanical Turk. The paper explores three key questions: are statistical dialogue systems more robust than conventional hand-crafted systems; how does the performance of a system evaluated on a user simulator compare to performance with real users; and can performance of a system tested over the telephone network be used to predict performance in more hostile environments such as a motor car? The results show that the statistical approach is indeed more robust, but results from a simulator significantly over-estimate performance, both absolute and relative. Finally, by matching WER rates, performance results obtained over the telephone can provide useful predictors of performance in noisier environments such as the motor car, but again they tend to over-estimate performance.
S. Young (B) · C. Breslin · M. Gašić · M. Henderson · D. Kim · M. Szummer · B. Thomson · P. Tsiakoulis
Cambridge University, Cambridge, UK
A spoken dialogue system (SDS) allows a user to access information and enact transactions using voice as the primary input-output medium. Unlike so-called voice search applications, the tasks undertaken by an SDS are typically too complex to be achieved by a single voice command. Instead they require a conversation to be held with the user consisting of a number of dialogue turns. Interpreting each user input and deciding how to respond lies at the core of effective SDS design.
In a traditional SDS as shown in Fig. 1, the symbolic components of Fig. 1 are implemented using rules and flowcharts. The semantic decoder uses rule-based surface parsing techniques to extract the most likely user dialogue act and estimate the most likely dialogue state. The choice of system action in response is then determined by if-then-else rules applied to the dialogue state or by following a flowchart. These systems are tuned by trial deployment, inspection of performance and iterative refinement of the rules. They can work well in reasonably quiet operating environments when the user knows exactly what to say at each turn. However, they are not robust to speech recognition errors or user confusions, they are expensive to produce and maintain, and they do not scale well as task complexity increases. The latter will be particularly significant as technology moves from limited to open domain systems.
To mitigate against the deficiencies of hand-crafted rule-based systems, statistical approaches to dialogue management have received considerable attention over recent years [1–3]. The statistical approach is based on the framework of partially observable Markov decision processes (POMDPs) [4]. As shown in Fig. 2, in the statistical approach the dialogue manager is split into two components: a belief tracker which maintains a distribution over all possible dialogue states b(s), and a policy which takes decisions based not on the most likely state but on the whole distribution. The semantic decoder is extended to output a distribution over all possible user dialogue acts and the belief tracker updates its estimate of b every turn using this distribution as evidence. The policy is optimised by defining a reward function for each dialogue turn and then using reinforcement learning to maximise the total (possibly discounted) cumulative reward.
Fig. 1 Block diagram of a conventional SDS. Input speech y is mapped first into words w and then into a user dialogue act v. A dialogue manager tracks the state of the dialogue s and based on this generates a system action a which is converted to a text message m and then into speech x
Fig. 2 Block diagram of a statistical SDS. The semantic decoder generates a distribution over possible user dialogue acts v given user input y. A dialogue manager tracks the probability of all possible dialogue states b(s) using p(v|y) as evidence. This distribution b is called the belief state. A policy maps b into a distribution over possible system actions a which is converted back into natural language and sampled to provide spoken response x
One of the difficulties that researchers face when developing an SDS is training and evaluation. Statistical SDS often require a large number of dialogues (∼10^5–10^6) to estimate the parameters of the models, and optimise the policy using reinforcement learning. As a consequence, user simulators are commonly used, operating directly at the dialogue act level [5–7]. These simulators attempt to model real user behaviour. They also include an error model to simulate the effects of speech recognition and semantic decoding errors [8, 9]. A user simulator also provides a convenient tool for testing since it can be run many times and the error rate can be varied over a wide range to test robustness.
The use of simulators obviates the need to build a real system, thereby avoiding all of the engineering complexities involved in integrating telephony interfaces, voice activity detection, recognition and synthesis. However, evaluation using the same user simulator as for training constitutes training and testing under perfectly matched conditions, and it is not clear how well this approach can predict system performance with real users.
Even when a real live spoken dialogue system is available for evaluation, there remains the significant problem of recruiting and managing subjects through the tests in sufficient numbers to obtain statistical significance. For example, previous experience (e.g. [10]) has shown that direct testing in a motor car is a major undertaking. To provide statistically significant results, a system contrast may require 500 dialogues or more. Recruiting subjects and managing them through in-car tests is slow and expensive. Safety considerations prevent direct testing by the driver, hence testing can only be done by a passenger sitting next to the driver with the microphone system redirected accordingly. Typically, we have found that a team of three assistants plus a driver can process around 6–8 subjects per day with each subject completing around 12–20 dialogues. Adding the time taken in preparation to recruit and timetable subjects means that each contrast will typically take about 10 man-days of resource. For large scale development and testing, this is barely practicable.
Provided that the system is accessible via telephone, one route to mitigating this problem is to use crowd-sourcing web sites such as Amazon Mechanical Turk (MTurk) [11]. This allows subjects to be recruited in large numbers, and it also automates the process of distributing task scenarios and checking whether the dialogues were successful.
This paper describes an experimental study designed to explore these issues. The primary question addressed is whether or not a statistical SDS is more robust than a conventional hand-crafted SDS in a motor car, and this was answered by the traditional route of recruiting subjects to perform tasks in a car whilst being driven around a busy town. However, in parallel a phone-based system was configured in which the recogniser's acoustic models were designed to give similar performance to that anticipated in the motor car. This parallel system was tested using MTurk subjects. The results were also compared with those obtained using a user simulator. The remainder of this paper is organised as follows. Section 2 describes the Bayesian Update of Dialogue State (BUDS) POMDP-based restaurant information system used in the study and the conventional system used in the baseline. Section 3 then describes the experimental set-up in more detail and Sect. 4 reports the results. Finally, Sect. 5 offers conclusions.
Both the conventional baseline and the statistical dialogue system share a common architecture and a common set of understanding and generation components. The recogniser is a real-time implementation of the HTK system [12]. The front-end uses PLP features with energy, 1st, 2nd and 3rd order derivatives mapped into 39 dimensions using a heteroscedastic linear discriminant analysis (HLDA) transform. The acoustic models use conventional HTK tied-state Gaussians and the trigram language model was trained on previously collected dialogue transcriptions with attribute values such as food types, place names, etc. mapped to class names. The semantic decoder extracts n-grams from the confusion networks output by the recogniser and uses a bank of support vector machine (SVM) classifiers to construct a ranked list of dialogue act hypotheses where each dialogue act consists of an act type and a set of attribute value pairs [13, 14]. Some example dialogue acts are shown in Table 1 and a full description is given in [15].
The statistical dialogue manager is derived from the BUDS system [16]. In this system the belief state is represented by a dynamic Bayesian network in which the goal, user input and history are factored into conditionally independent attributes (or slots), where each slot represents a property of a database entity. An example is shown in Fig. 3 for the restaurant domain, which shows slots for the type of food (French, Chinese, snacks, etc.), the price-range (cheap, moderate, expensive) and area (central, north, east, etc.). Each time step (i.e. turn), the observation is instantiated with the output of the semantic decoder, and the marginal probabilities of all of the hidden variables (unshaded nodes) are updated using a form of belief propagation called
Table 1 Example dialogue acts

inform(area=centre)          I want something in the centre of town
confirm(pricerange=cheap)    And it is cheap isn't it?
affirm(food=chinese)         Yes, I want chinese food.
Fig. 3 Example BUDS dynamic Bayesian network structure. Shaded variables are observed, all others are hidden. Each slot is represented by 3 random variables corresponding to the user's goal (g), last user input (u) and history (h). The network shown represents just one time slice. All variable nodes are conditioned by the last action. Goal and history nodes are also conditioned on the previous time slice
expectation propagation [17]. The complete set of marginal probabilities encoded in the network constitute the belief state b.

The initial parameters of the Bayesian network are estimated from annotated corpus data. Since expectation propagation can deal with continuous as well as discrete variables, it is also possible to extend the network to include the parameters of the multinomial distributions along with their conjugate Dirichlet priors. The network parameters can then be updated on-line during interaction with real users, although that was not done in this trial [18].
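The expectation-propagation computation is beyond a short example, but a heavily simplified per-slot update conveys the idea of belief tracking; the goal-change probability, the likelihood floor and the full independence of slots are assumptions of this sketch, not properties of BUDS.

```python
# Simplified stand-in for the BUDS belief update: one slot, exact Bayes rule.
CHANGE = 0.05  # hypothetical probability that the user's goal changes per turn

def update_slot_belief(belief, act_probs):
    """belief: {value: prob}; act_probs: decoder mass for each mentioned value."""
    values = list(belief)
    new_belief = {}
    for g in values:
        # Transition: the goal persists, or is re-drawn uniformly at random.
        prior = (1 - CHANGE) * belief[g] + CHANGE / len(values)
        # Evidence: decoder mass on this value, floored so that unmentioned
        # values are discounted rather than eliminated.
        likelihood = act_probs.get(g, 0.0) + 0.1
        new_belief[g] = prior * likelihood
    z = sum(new_belief.values())
    return {g: p / z for g, p in new_belief.items()}

food = {"french": 1 / 3, "chinese": 1 / 3, "snacks": 1 / 3}
# Noisy decoder output for "I want Chinese food":
food = update_slot_belief(food, {"chinese": 0.7, "french": 0.2})
print(food)  # mass shifts toward 'chinese' but alternatives survive
```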
The belief state b can be viewed as a vector with dimensionality equal to the cardinality of the state space, i.e. b ∈ R^|S|, where |S| is equal to the total number of discrete values distributed across all of the nodes in the network. Since this is large, it is compressed to form a set of features appropriate for each action, φ_a(b). A stochastic policy with parameters θ is then constructed using a softmax function:

    π(a|b; θ) = exp(θ · φ_a(b)) / Σ_{a'} exp(θ · φ_{a'}(b))    (1)
which represents the probability of taking action a in belief state b. At the end of every turn, the probability of every possible action is sampled using (1), and the most probable action is selected.
Since the policy defined by (1) is smoothly differentiable in θ, gradient ascent can be used to adjust the parameter vector θ to maximise the reward [19]. This is done by letting the dialogue system interact with a user simulator [20]. Typically around 10^5 training dialogues are required to fully train the policy.
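The sketch below renders the stochastic policy of (1) and one REINFORCE-style gradient step; the feature map, action set and reward are placeholders, and the real system optimises θ against the user simulator over roughly 10^5 dialogues.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS, DIM = 4, 8           # hypothetical action set and feature dimension
theta = np.zeros(DIM)

def phi(b, a):
    # Placeholder action-conditioned features of the belief state.
    return np.tanh(b + a)

def policy(b):
    # Softmax of Eq. (1) over all actions.
    logits = np.array([theta @ phi(b, a) for a in range(ACTIONS)])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_step(episode, lr=0.01):
    """episode: list of (belief, action, return) triples from one dialogue."""
    global theta
    for b, a, ret in episode:
        probs = policy(b)
        # grad log pi(a|b) = phi(b, a) - E_{a'}[phi(b, a')]
        grad = phi(b, a) - sum(p * phi(b, ap) for ap, p in enumerate(probs))
        theta = theta + lr * ret * grad

b = rng.random(DIM)
a = rng.choice(ACTIONS, p=policy(b))     # sample an action from the policy
reinforce_step([(b, a, 1.0)])            # e.g. +1 reward for a successful turn
print(policy(b))
```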
The baseline dialogue manager consists of a conventional state estimator which maintains a record for each possible slot consisting of the slot status (filled or unfilled), the slot value, and the confidence derived directly from the confidence of the most likely semantic decoder output. Based on the current state of the slots, a set of if-then rules determine which of the possible actions to invoke at the end of each turn. The baseline was developed and tested over a long period and was itself subject to several rounds of iterative refinement, using the same user simulator as was used to train the POMDP system.
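A toy rendering of such a rule-based manager is given below; the slot set, confidence threshold and rule ordering are illustrative only, not those of the deployed baseline.

```python
# Hand-crafted baseline sketch: per-slot records plus if-then action rules.
slots = {
    "food":  {"value": None, "conf": 0.0},
    "area":  {"value": None, "conf": 0.0},
    "price": {"value": None, "conf": 0.0},
}

CONFIRM_THRESHOLD = 0.6   # hypothetical confidence cut-off

def choose_action(slots):
    for name, s in slots.items():
        if s["value"] is None:
            return ("request", name)              # ask for an unfilled slot
        if s["conf"] < CONFIRM_THRESHOLD:
            return ("confirm", name, s["value"])  # low confidence: confirm it
    return ("inform", "matching venue")           # all slots filled: offer venue

slots["food"] = {"value": "chinese", "conf": 0.45}
print(choose_action(slots))   # -> ('confirm', 'food', 'chinese')
```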
The output of the dialogue manager in both systems is a system dialogue act following exactly the same schema as for the input. These system acts are converted first to text using a template matching scheme, and then into speech using an HTS-based HMM synthesiser [21]. A fully statistical method of text generation is also available but was not used in this trial to ensure consistency of output across systems [22].
As noted in the introduction, the aims of this evaluation were to firstly establish whether or not a fully statistical dialogue system is more robust in a noisy environment such as a motor car, and to investigate the extent to which performance in a specific environment can be predicted by proxy environments which afford testing with higher throughput and lower cost.
The overall system architecture used for the in-car evaluation is shown in Fig. 4. The same system was used for the phone-based MTurk evaluation except that users spoke directly into the phone via a US toll-free number, rather than via the On-Star mirror.
Fig. 4 Block diagram of the overall system architecture used for the in-car evaluation. The On-Star mirror [23] includes a microphone and signal-processing for far-field voice capture in a motor car. The speech input to the mirror is transported via Bluetooth to an Android phone and then over the mobile network to a commercial SIP server (IPComms). The signal is then channeled to an Asterisk virtual PABX in order to allow multiple channels to be supported. The PBX routes the call through to an available VOIP server which interfaces directly to the Spoken Dialogue System. At the backend, task related information (in this case restaurant information) is extracted from an on-line database and locally cached
Some tasks had no solution in the database, and in that case the participant was advised to ask for something else, e.g. find an Italian restaurant instead of French. Also sometimes the user was asked to find more than one venue that matched the constraints.
To perform the test, each participant was seated in the front passenger seat of a saloon car fitted with the On-Star mirror system, and a supervisor sat in the rear seat in order to instruct the subject and monitor the test. The On-Star mirror was affixed to the passenger seat visor to make it useable by the passenger rather than the driver. Power for this assembly was taken from the car's lighter socket. A digital recorder with an external microphone was used to provide a second recording.
The subject received only limited instructions, consisting of a brief explanation of what the experiment involved and an example dialogue. For each dialogue the subject informed the supervisor if they thought the dialogue was successful. After the experiment the subjects were asked to fill in a questionnaire.
3.2 Proxy Phone-Based Evaluation
By providing a toll-free access number to the system shown in Fig. 4, large numbers of subjects can be recruited quickly and cheaply using crowd sourcing services such as Amazon Mechanical Turk. In order to simulate the effect of a noisy environment, the technique usually used for off-line speech recognition evaluation is to add randomly aligned segments of pre-recorded background noise to the clean acoustic source. However, in the architecture shown in Fig. 4, this is difficult to achieve for a variety of reasons, including ensuring that the user hears an appropriate noise level, avoiding disrupting the voice activity detection, and compensating for the effects of the various non-linear signal processing stages buried in the user's phone, the PABX and the VOIP conversion. As an alternative, a simpler approach is to reduce the discrimination of the acoustic models in the recogniser so that the recognition performance over the phone was similar to that achieved in the car. This was achieved by reducing the number of Gaussian mixture components to 1 and controlling the decision tree clustering thresholds to fine tune the recogniser using development data from previous phone and in-car evaluations.

Given this change to the recogniser, the experimental protocol for the phone-based evaluation was identical to that used in the car except that the presentation of the tasks and the elicitation of feedback was done automatically using a web-based interface integrated with Amazon Mechanical Turk.
The results of the evaluation are summarised in Table 2. The in-car results refer to the supervised tests in a real motor car travelling around the centre of Cambridge, UK, and the phone proxy results refer to the phone-based evaluation with MTurk subjects where the speech recogniser's acoustic models were detuned to give similar performance to that obtained in a motor car. Also shown in this table for comparison are results for a regular phone-based MTurk evaluation using fully trained acoustic models. As can be seen, the average word error rate (WER) obtained in the car driving around town was around 30 % compared to the 20 % obtained over the telephone. The average WER for the proxy phone system is also around 30 %, showing that the detuned models performed as required.
Three metrics are reported for each test. Prior to each dialogue, each user was given a task consisting of a set of constraints and an information need, such as find the phone number and address of a cheap restaurant selling Chinese food. The objective success rate measures the percentage of dialogues for which the system provided the subject with a restaurant matching the task constraints. If the system provided the correct restaurant and the required information needed such as phone number and address, then this is a full success. If a valid restaurant was provided, but the user did not obtain the required information (perhaps because they forgot to ask for it), then a
Table 2 Summary of results for in-car and proxy-phone evaluation

Test         System    Num dialogs  Partial success  Full success  Perceived success  Average turns  WER
In-car       Baseline  118          78.8 ± 3.7*      67.8 ± 4.3*   77.1 ± 3.8*        7.9 ± 3.1      29.7
             POMDP     120          85.0 ± 3.2       75.8 ± 3.9    83.3 ± 3.4         9.7 ± 3.7      26.9
Phone proxy  Baseline  387          80.1 ± 2.0*      75.2 ± 2.2*   91.2 ± 1.4         6.9 ± 3.6      29.4
             POMDP     548          87.0 ± 1.4       81.2 ± 1.7    89.8 ± 1.3         9.3 ± 4.8      30.3
Phone        Baseline  589          88.8 ± 1.3       84.6 ± 1.5    94.4 ± 1.0         6.5 ± 2.9      21.4
             POMDP     578          91.0 ± 1.2       86.9 ± 1.4    94.5 ± 1.0         8.3 ± 3.8      21.2

Also shown is performance of the phone-based system using fully trained acoustic models. Contrasts marked * are statistically significant (p < 0.05) using a Kruskal-Wallis rank sum test
partial success is recorded. The user's perceived success rate is measured by asking the subjects if they thought the system had given them all of the information they need. The partial success rate is always higher than the full success rate. Note that the tasks vary in complexity; in some cases the constraints were not immediately achievable, in which case the subjects were instructed to relax one of them and try again.
As can be seen in Table 2, the in-car performance of the statistical POMDP-based dialogue manager was better than the conventional baseline on all three measures. The proxy phone test showed the same trend for the objective measures but not for the subjective measures. In fact, there is little correlation between the subjective measures and the objective measures in all the MTurk phone tests. A possible explanation is that the subjects in the in-car test were supervised throughout and were therefore more likely to give accurate assessments of the system's performance. The Turkers used in the phone tests were not supervised and many might have felt it was safest to say they were satisfied just to make sure they were paid.
The objective proxy phone performance overestimated the actual in-car performance by around 2 % on partial success and by around 10 % on full success. This may be due to the fact that the subjects in the car found it harder to remember all of the venue details they were required to find. Nevertheless, the proxy phone test provides a reasonable indicator of in-car performance.
To gain more insight into the results, Fig. 5 shows regression plots of predicted full objective success rate as a function of WER, computed by pooling all of the trial data. As can be seen, the statistical dialogue system (POMDP-trial) consistently outperforms the conventional baseline system (Baseline-trial). Figure 5 also plots the success rate of both systems using the user simulator used to train the POMDP system (xxx-sim). It can be seen that the general trend is similar to the user trial data but the simulator success rates significantly overestimate performance, especially for the statistical system. This is probably due to a combination of two effects.
Fig. 5 Comparison of system performance obtained using a user simulator compared to the actual performance achieved in a trial
Firstly, the user simulator presents perfectly matched data to both systems.² Secondly, the simulation of errors will differ from the errors encountered in the real system. In particular, the errors will be largely uncorrelated, allowing the belief tracking to gain maximum advantage. When errors are correlated, belief tracking is less accurate because it tends to over-estimate alternatives in the N-best list [24].
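A pooled regression of the kind behind Fig. 5 can be sketched as follows, using simulated dialogues in place of the trial data; the logistic form and the coefficients of the simulated success curve are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
wer = rng.uniform(0, 60, size=500)                   # per-dialogue WER in percent
p_success = 1 / (1 + np.exp(0.08 * (wer - 35)))      # invented ground truth
success = (rng.random(500) < p_success).astype(int)  # 0/1 full-success outcomes

model = LogisticRegression().fit(wer.reshape(-1, 1), success)
for w in (10, 20, 30, 40):
    p = model.predict_proba([[w]])[0, 1]
    print(f"WER {w:2d}%: predicted full success {p:.2f}")
```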
The widespread adoption of end-to-end statistical dialogue systems offers the potential to develop systems which are more robust to noise, and which can be automatically trained to adapt to new and extended domains [25]. However, testing such systems is problematic, requiring considerable resource not only to build and deploy working real-time implementations but also to run the large scale experiments needed to properly evaluate them.
The results presented in this paper show that fully statistical systems are not only viable, they also outperform conventional systems, especially in challenging environments. The results also suggest that by matching word error rate, crowd sourced phone-based testing can be a useful and economic surrogate for specific environments such as the motor car. This is in contrast to the use of user simulators acting at the dialogue level, which grossly exaggerate expected performance. A corollary of this result is that using user simulators to train statistical dialogue systems is equally undesirable, and this observation is supported by recent results which show that when a statistical dialogue system is trained directly by real users, success rates further improve relative to conventional systems [26].
² As well as being used to train the POMDP-based system, the user simulator was used to tune the rules in the conventional hand-crafted system.
References
1 Roy N, Pineau J, Thrun S (2000) Spoken dialogue management using probabilistic reasoning. In: Proceedings of ACL
2 Young S (2002) Talking to machines (statistically speaking). In: Proceedings of ICSLP
3 Williams J, Young S (2007) Partially observable Markov decision processes for spoken dialog systems. Comput Speech Lang 21(2):393–422
4 Young S, Gasic M, Thomson B, Williams J (2013) POMDP-based statistical spoken dialogue systems: a review. Proc IEEE 101(5):1160–1179
5 Scheffler K, Young S (2000) Probabilistic simulation of human-machine dialogues. In: ICASSP
6 Pietquin O, Dutoit T (2006) A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Trans Speech Audio Process, Spec Issue Data Min Speech, Audio Dialog 14(2):589–599
7 Schatzmann J, Weilhammer K, Stuttle M, Young S (2006) A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. KER 21(2):97–126
8 Pietquin O, Renals S (2002) ASR system modelling for automatic evaluation and optimisation of dialogue systems. In: International conference on acoustics speech and signal processing, Florida
9 Thomson B, Henderson M, Gasic M, Tsiakoulis P, Young S (2012) N-best error simulation for training spoken dialogue systems. In: IEEE SLT 2012, Miami
10 Tsiakoulis P, Gašić M, Henderson M, Planells-Lerma J, Prombonas J, Thomson B, Yu K, Young S, Tzirkel E (2012) Statistical methods for building robust spoken dialogue systems in an automobile. In: Proceedings of the 4th applied human factors and ergonomics
11 Jurčíček F, Keizer S, Gašić M, Mairesse F, Thomson B, Yu K, Young S (2011) Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk. In: Proceedings of Interspeech
12 Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2006) The HTK book version 3.4. Cambridge University, Cambridge
13 Mairesse F, Gašić M, Jurčíček F, Keizer S, Thomson B, Yu K, Young S (2009) Spoken language understanding from unaligned data using discriminative classification models. In: Proceedings of ICASSP
17 Minka T (2001) Expectation propagation for approximate Bayesian inference. In: Proceedings of the 17th conference in uncertainty in artificial intelligence (Seattle). Morgan-Kaufmann, pp 362–369
18 Thomson B, Jurcicek F, Gasic M, Keizer S, Mairesse F, Yu K, Young S (2010) Parameter learning for POMDP spoken dialogue models. In: IEEE workshop on spoken language technology (SLT 2010), Berkeley
19 Jurcicek F, Thomson B, Young S (2011) Natural actor and belief critic: reinforcement algorithm for learning parameters of dialogue systems modelled as POMDPs. ACM Trans Speech Lang Process 7(3)
20 Schatzmann J, Thomson B, Weilhammer K, Ye H, Young S (2007) Agenda-based user simulation for bootstrapping a POMDP dialogue system. In: Proceedings of HLT
21 Yu K, Young S (2011) Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Audio, Speech Lang Process 19(5):1071–1079
22 Mairesse F, Gašić M, Jurčíček F, Keizer S, Thomson B, Yu K, Young S (2010) Phrase-based statistical language generation using graphical models and active learning. In: Proceedings of ACL
23 OnStar (2013) OnStar FMV mirror. http://www.onstarconnections.com/
24 Williams J (2012) A critical analysis of two statistical spoken dialog systems in public use. In: Spoken language technology workshop (SLT), Miami
25 Gasic M, Breslin C, Henderson M, Kim D, Szummer M, Thomson B, Tsiakoulis P, Young S (2013) POMDP-based dialogue manager adaptation to extended domains. In: SigDial 13, Metz
26 Gasic M, Breslin C, Henderson M, Kim D, Szummer M, Thomson B, Tsiakoulis P, Young S (2013) On-line policy optimisation of Bayesian spoken dialogue systems via human interaction. In: ICASSP 2013, Vancouver
Syntactic Filtering and Content-Based
Retrieval of Twitter Sentences
for the Generation of System Utterances
in Dialogue Systems
Ryuichiro Higashinaka, Nozomi Kobayashi, Toru Hirano,
Chiaki Miyazaki, Toyomi Meguro, Toshiro Makino and Yoshihiro Matsuo
Abstract Sentences extracted from Twitter have been seen as a valuable resource for response generation in dialogue systems. However, selecting appropriate ones is difficult due to their noise. This paper proposes tackling such noise by syntactic filtering and content-based retrieval. Syntactic filtering ascertains the valid sentence structure as system utterances, and content-based retrieval ascertains that the content has the relevant information related to user utterances. Experimental results show that our proposed method can appropriately select high-quality Twitter sentences, significantly outperforming the baseline.
In addition to performing tasks [19], dialogue systems should be able to perform open-domain conversation or chat in order for them to look affective and to build social relationships with users [2]. Chat capability also leverages the usability of task-oriented dialogue systems because real users do not necessarily utter only task-related (in-domain) utterances but also chatty utterances [17]; such utterances, if not handled correctly, can cause misunderstandings.
R. Higashinaka (B) · N. Kobayashi · T. Hirano · C. Miyazaki · T. Meguro · T. Makino · Y. Matsuo
NTT Media Intelligence Laboratories, Kanagawa, Japan
One challenge facing an open-domain conversational system is the wide variety of topics in user utterances. Conventional methods have used hand-crafted rules, but the coverage of topics is usually very limited [20]. To increase the coverage, recent studies have exploited the web, typically Twitter, to extract and use sentences for response generation [1, 15]. However, due to the nature of the web, such sentences are likely to be negatively affected by noise.
Heuristic rules have been proposed by Inaba et al. [10] to filter inappropriate Twitter sentences, but since their filtering is performed on the word level, their filtering capability is very limited. To overcome this limitation, this paper proposes syntactic filtering and content-based retrieval of Twitter sentences; syntactic filtering ascertains the validity of sentence structures and content-based retrieval ascertains that the extracted sentences contain information relevant to user utterances.
In what follows, Sect. 2 covers related work. Section 3 explains our proposed method in detail. Section 4 describes the experiment we performed to verify our method. Section 5 summarizes the paper and mentions future work.
Conventional approaches to open-domain conversation have heavily depended on hand-crafted rules. The early systems such as ELIZA [21] and PARRY [3] used heuristic rules derived from psycho-linguistic theories. Recent systems at the Loebner prize (a chat system competition) typically use tens of thousands of hand-crafted rules [20]. Although such rules enable high-quality responses to expected user utterances, they fail to respond appropriately to unexpected ones. In such cases, systems tend to utter innocuous (fall-back) utterances or change topic, which often lowers user satisfaction.
To overcome this problem, recent studies have used the web for response generation. For example, Shibata et al. and Yoshino et al. used sentences in web-search results for response generation [15, 22]. To make utterances more colloquial and suitable for conversation, instead of web-search results, Twitter has become the recent target for sentence extraction [1]. Although extracting sentences from the web can deal with a wide variety of topics in user utterances, due to the web's diversity, the extracted sentences are likely to contain noise.
gener-To suppress this noise, Inaba et al proposed word-based filtering of Twitter tences [10] Their rules filter sentences that contain context-dependent words such
sen-as referring/temporal expressions They also score sentences by using the weights
of words calculated from a reference corpus and remove those with low scores Ourmotivation is similar to Inaba et al.’s in that we want to extract sentences from Twitterthat are appropriate as system utterances, but our work is different in that, in addition
to word-level filters, we also take into account the syntax and the content of Twitter sentences for more accurate sentence extraction.
Although not within the scope of this paper, there are emerging approaches to building knowledge bases for chat systems by using web resources. Higuchi et al. mined the web for associative words (mainly adjectives) to fill in their generation templates [8], and Sugiyama et al. created a database of dependency structures from Twitter to find words for their templates [16]. Statistical machine translation techniques have also been utilized to obtain transformation rules (as a phrase table) from input to output utterances [14]. Although we find it important to create good knowledge bases from the web for generation, since it is still in a preliminary phase and the reported quality of generated utterances is rather low, we currently focus on the selection of sentences.
In this paper, we assume that the input to our method is what we refer to as a topic word. A topic word (represented by noun phrases in this paper) represents the current topic (focus) in dialogue and can be obtained from a user utterance or from the dialogue context. We do not focus on the extraction of topic words in this paper; note that finding appropriate topic words themselves is a difficult problem, requiring the understanding of the context.
Under this assumption, our task is to retrieve appropriate sentences from Twitter given a topic word. Our method comprises four steps: preprocess, word-based filtering, syntactic filtering, and content-based retrieval. Note that, in this paper, we assume the language used is Japanese.
3.1 Preprocess
As a preprocess, input tweets are first stripped of Twitter-dependent expressions (e.g., retweeted content and user names with mention markers). Then, the tweets are split into sentences by sentence-ending punctuation marks. After that, sentences that are too short (less than five characters) or too long (more than 30 characters) are removed because they may not be appropriate as colloquial utterances. We also remove sentences that contain no Japanese characters.
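A rough reimplementation of this step is sketched below; the mention/retweet patterns and the Japanese-character test are our assumptions, while the 5 to 30 character bounds follow the text.

```python
import re

MENTION = re.compile(r"(^|\s)(RT\s+)?@\w+:?")           # assumed Twitter markup
SENT_END = re.compile(r"(?<=[。！？!?])")                 # split after sentence enders
JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # kana and kanji ranges

def preprocess(tweet):
    text = MENTION.sub(" ", tweet).strip()
    sentences = [s.strip() for s in SENT_END.split(text) if s.strip()]
    # Keep sentences of 5-30 characters that contain Japanese script.
    return [s for s in sentences if 5 <= len(s) <= 30 and JAPANESE.search(s)]

print(preprocess("RT @user: ラーメンが食べたい！今日は寒いですね。"))
```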
3.2 Word-Based Filtering
The sentences that pass the preprocess are processed by a morphological analyzer. The sentences together with their analysis results are sent to the word-based filters. There are three filters:
(1) Sentence Fragment Filter: If the sentence starts with sentence-end particles, punctuation marks, or case markers (Japanese case markers do not appear at the beginning of a sentence), it is removed. If the sentence ends with a conjunctive form of verbs/adjectives (meaning that the sentence is not complete), it is removed. This filter is intended to remove sentence fragments caused mainly by sentence splitting errors.
(2) Reference Filter: If the sentence contains pronouns, deixes, or referring expressions such as 'it' and 'that', it is removed. If the sentence has words related to comparisons (such as more/than) or an anteroposterior relation (such as following/next), it is also removed. If the sentence has words representing reason or cause, it is removed. If the sentence contains relation-related words, such as family members (mother, brother, etc.), it is also removed. Such sentences need to be removed because entities and events being referred to may not be present in the sentence or differ depending on the speaker.
(3) Time Filter: If the sentence contains time-related words, such as dates and relative dates, it is removed. If the sentence has verbal suffixes representing past tenses (such as 'mashita' and 'deshita'), it is also removed. Such sentences are associated with certain time points and therefore may not be used independently of the context.
The filters here are similar to those used by Inaba et al. [10] with some extensions, such as the use of tense and relation-related words. The filters are applied to input sentences in a cascading manner. If a sentence passes all the filters, it is sent to syntactic filtering.
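The cascade can be pictured with tiny stand-in word lists as below; the real filters operate on morphological-analyzer output and use far larger lexicons for referring, time and relation expressions.

```python
# Illustrative word-based filter cascade; lexicons are toy placeholders.
REFERRING = {"it", "that", "this", "more", "than", "next", "mother", "brother"}
TIME_WORDS = {"today", "yesterday", "tomorrow", "mashita", "deshita"}
CASE_MARKERS = {"wa", "ga", "wo"}

def fragment_filter(tokens):
    # Reject sentences that start with a case marker (a fragment symptom).
    return bool(tokens) and tokens[0] not in CASE_MARKERS

def reference_filter(tokens):
    return not (set(tokens) & REFERRING)

def time_filter(tokens):
    return not (set(tokens) & TIME_WORDS)

def passes_word_filters(sentence):
    tokens = sentence.lower().split()   # stand-in for morphological analysis
    return all(f(tokens) for f in (fragment_filter, reference_filter, time_filter))

print(passes_word_filters("ramen wa delicious"))        # True
print(passes_word_filters("it was delicious deshita"))  # False
```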
3.3 Syntactic Filtering
The sentences are checked with regard to their syntactic structures. This process is intended to ascertain if the sentence is structurally valid as an independent utterance; that is, the sentence is grammatical and has necessary arguments for predicates. For example, "watashi wa iku (I go)" does not have a destination for the predicate "go", making it a non-understandable utterance on its own.
However, such checks are actually difficult to perform. This is because Twitter sentences are mostly in colloquial Japanese with many omissions of particles and case markers, making it hard to use the rigid grammar of written Japanese for validation. In addition, missing arguments do not necessarily mean an invalid structure because Japanese contains many zero-predicate and zero-pronoun structures. For example, "eiga ni ikitai (want to go to the movies)" does not have a subject for a predicate, but since the sentence is in the desiderative mood, we can assume that the subject is "watashi (I)" and the sentence is thus understandable. The checks need to take into account the types of predicates as well as mood, aspect, and voice, making it difficult to enumerate by hand all the conditions when a sentence can be valid. Therefore, to automatically find conditions when a sentence is valid, we turn to a machine
Fig. 1 A word dependency tree for "Ichiro wa eiga ni iku (Ichiro goes to the movies)". The nodes of base forms and end forms are omitted from illustration because they are exactly the same as word surfaces in this example
learning based approach and use a binary classifier that has been trained from data to determine whether a sentence is valid or invalid on the basis of its structure. Note that the aim of this filtering is NOT to guarantee the "syntactic well-formedness" of sentences, since responses need not be syntactically well-formed in "chit-chat" type interactions; here we simply want to remove sentences that are considered invalid from their structures. Below we show how we created the classifier.
3.3.1 Machine Learning Based Classifier
To create the classifier, we first collected Twitter sentences and labeled them as valid (i.e., positive examples) and invalid (i.e., negative examples). Then, we converted the sentences into word dependency trees by using a dependency analyzer in a manner similar to Higashinaka and Isozaki [7]. The trees have part-of-speech tags as main nodes with word surfaces, base forms, and end forms as their daughters (see Fig. 1 for an example). Finally, the trees of negative and positive examples were input to BACT [11], a boosting based algorithm for classifying trees, to train a binary classifier. BACT enumerates subtrees in the input data and uses the existence of the subtrees as features for boosting-based classification. Since subtrees are used as features, syntactic structures are taken into account for classification.
For creating the training data, we sampled 164 words as topic words from our dialogue corpus [13]. Then, for each topic word, we retrieved up to 100 Twitter sentences by using a text search engine that has an index similar to (d) in Table 1 with a content-based retrieval method we describe later (see Sect. 3.4). For the retrieved sentences, an annotator, who is not the author, labeled validity scores on a five-point Likert scale where 1 indicates completely invalid and 5 completely valid. We treated sentences scored 1 and 2 as negative examples and those scored 4 and 5 as positive examples. We did not use sentences scored 3. In total, we created 3880 positive and 1304 negative examples. By using these data, a classifier was learned by BACT. The evaluation was done by using a twofold cross validation, with each fold having examples regarding 82 topic words. Figure 2 shows the recall-precision curves for the
Table 1 Statistics of our Twitter data

                                                                     Number        Retained ratio
(c) Number of sentences retained by word-based filtering             103,655,452   11.9 %
(e) Number of unique sentences retained by the syntactic filtering   7,907,888     0.9 %

Retained ratio is the ratio of retained sentences over (b)
Fig. 2 Recall-precision curves for N-gram based and syntactic filtering. The graph shows the result for one of the folds in twofold cross validation. The other fold has the same tendency
trained syntactic classifier (Syntax) with a comparison to an N-gram based baseline (N-gram). Here, the baseline uses word and part-of-speech N-grams (unigrams to 5-grams) as features with logistic regression as a training algorithm [4]. The curves show that our trained syntactic filter classifies sentences with good precision. It is also visible that the syntactic filter consistently outperforms the baseline. As a requirement for a filter, low false acceptance is desirable. By a statistical test (a sign-test that compares the number of times the syntactic filter outperforms the N-gram based filter and vice versa), we confirmed that the syntactic filter has significantly lower false acceptance than the baseline (p < 0.001), verifying the usefulness of syntactic information.
3.3.2 Filtering by the Trained Classifier to Create an Index
On the basis of the evaluation result, we decided to use the syntactic classifier (trained with all the examples) to filter input sentences. The sentences that pass this filter are indexed by a text search engine (we use Lucene, see Sect. 4.1) that allows for efficient searching.
3.4 Content-Based Retrieval
Content-based retrieval can retrieve sentences that contain information related to an input topic word. For this, we use a dictionary of related words. Related words are the words strongly associated with a topic word. We collect such words from the web and use them to expand the search query so that the retrieved sentences contain such words.
The idea here is inspired by the work of Louis and Newman [12] that uses related words for tweet retrieval, but our work is different in that we allow arbitrary words as input (not just a named-entity type) and use a high-quality dictionary of related words built by strict lexico-syntactic patterns, not just simple word collocation.
3.4.1 Creating a Dictionary of Related Words
We use lexico-syntactic patterns to extract related words. Lexico-syntactic patterns have been successfully used to obtain related words such as hyponyms [6] and attributes [18].
For a given word W, we collect noun phrases (NP), adjectives (ADJ), and verbs (V) as related words. For noun phrases, we use a lexico-syntactic pattern similar to that used by Tokunaga [18] and collect attributes of W. More specifically, we use the pattern "W no NP (ga|wo|ni|de|kara|yori|e|made)", corresponding to "NP of W Verb" in English. We collect attributes because they form part of a topic word and therefore are likely to be closely related. For adjectives, we use the pattern "W (ga|wa) ADJ", corresponding to "W is ADJ" in English. This pattern retrieves adjectival properties of W. For verbs, we use "W (ga|wo|ni|de) V", where W appears in the important argument positions (nominative, accusative, dative, and locative positions) of V.
By using the weblogs of 180M articles we crawled, we used the above patterns to extract the related words for all noun phrases in the data. Then, we distilled the results by filtering words that do not correlate well with the entry word (i.e., W). We used the log likelihood ratio test (G-test) to determine whether a related word appears significantly more than chance. We retained only the related words that have a G value of over 10.83 (i.e., p < 0.001). Finally, the retained words comprise our related word dictionary. The dictionary contains about 2.2M entries. To give a brief example, an entry of "Ramen" (a type of noodle dish) includes noodles, soup, restaurant as NP, delicious, tasty, longing as ADJ, and eat, order, sip for V.
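A G-test filter of this kind can be implemented directly from a 2×2 table of pattern-match counts; the counts in the sketch below are invented for illustration, while the 10.83 cutoff follows the text.

```python
import math

def g_statistic(k11, k12, k21, k22):
    """G = 2 * sum(O * ln(O/E)) over a 2x2 contingency table.
    k11: entry word with candidate, k12: entry without candidate,
    k21: candidate without entry, k22: neither."""
    total = k11 + k12 + k21 + k22
    g = 0.0
    for obs, row, col in ((k11, k11 + k12, k11 + k21),
                          (k12, k11 + k12, k12 + k22),
                          (k21, k21 + k22, k11 + k21),
                          (k22, k21 + k22, k12 + k22)):
        expected = row * col / total
        if obs > 0:
            g += obs * math.log(obs / expected)
    return 2 * g

g = g_statistic(k11=40, k12=960, k21=200, k22=178800)
print(g, g > 10.83)   # keep the pair only if G exceeds the p < 0.001 cutoff
```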
3.4.2 Retrieval Method
Given a topic word T, we search for the top-N sentences from the index. Here, we score a sentence S by a formula that rewards the presence of T and its related words in S and relatively lowers the rank of sentences that contain irrelevant content.
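The paper's exact scoring equation is not reproduced here, so the scorer below is purely hypothetical: consistent with the surrounding description, it credits sentences containing the topic word and its related words and lowers the rank of sentences dominated by unrelated content words, but it is not the published formula.

```python
# Hypothetical content-based scorer; RELATED stands in for the dictionary.
RELATED = {"ramen": {"noodles", "soup", "restaurant", "delicious", "eat"}}

def score(sentence, topic):
    tokens = set(sentence.lower().split())
    related = RELATED.get(topic, set()) | {topic}
    hits = len(tokens & related)          # topic word and related words found
    unrelated = len(tokens - related)     # remaining, possibly irrelevant words
    return hits / (1 + unrelated)         # illustrative trade-off only

candidates = ["ramen soup is delicious", "ramen shops near the station yesterday"]
print(sorted(candidates, key=lambda s: -score(s, "ramen")))
```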
In an attempt to make our syntactic filter more sensitive to false acceptance, we used 0.005 as a cut-off threshold (default 0.00). We created two indices from the data: one created with (d) and the other with (e). The aim of this is to compare the effectiveness of the syntactic filter in the experiment we describe later. We call the former the whole index and the latter the filtered index. We used Lucene, which is an open source text search engine, to create the indices.
4.2 Experimental Procedure
We made four systems for comparison: one is the baseline that only uses word-based filtering, and the others are variations of the proposed method. The systems are as follows:

Baseline: The whole index is used for sentence retrieval. In ranking the sentences, a vector space model using TF-IDF weighted word vectors is used. This is the default search mechanism in Lucene. This is the condition where there is no syntactic filter or content-based retrieval.
amazon, Minatomirai, Iraq, Cocos, Smart-phone, Disney Sea, news, Hashed Beef, Hello Work, FamilyMart, Fuji Television, horror, Pocari Sweat, Mister Donut, mosquito, weather, Kinkakuji temple, accident, Hatsushima, Shinsengumi, fortune-telling, region, local area, Tokyo Bay, pan, Yatsugatake, damage, Kitasenju, Meguro, baseball club, courage

Fig. 3 Topic words used for the experiment. The words were originally in Japanese and translated into English
We chose not to evaluate only the top-1 sentence; dialogue systems usually continue on the same topic for a certain number of turns, making it necessary for the systems to be able to create multiple sentences for a given topic. In addition, it is common practice in chat systems that sentences be randomly chosen from a pool of sentences for making variation in utterances. We believe evaluating randomly selected utterances from top-ranked retrieved sentences is appropriate in terms of actual system deployment. By this procedure, we created 93 utterances for each system, for a total of 372 utterances.
We had two judges, who are not any of the authors, subjectively evaluate the quality of the generated utterances (shown with topic words and presented in a randomized order) in terms of (i) understandability (if the utterance is understandable as a response to a topic word) and (ii) continuity (if the utterance makes you willing to continue the conversation on the topic) on a five-point Likert scale, where 1 is the worst and 5 the best. We use averaged understandability and continuity scores to evaluate the systems. In addition to these metrics, we also use a metric that we call (iii) non-understanding rate, which is the rate of lowly rated utterances (scores 1 and 2) in the understandability score over the number of total utterances. Since even a single non-understandable utterance can lead to a sudden breakdown in conversation, we consider this figure to be an important indicator of robustness to keep the conversation on track. Each utterance was judged independently.
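Computing the three metrics from per-utterance judgments is then straightforward, as in this sketch with fabricated scores:

```python
# Fabricated understandability judgments, one Likert score per utterance.
scores = [5, 4, 4, 2, 5, 3, 1, 4]

avg_understandability = sum(scores) / len(scores)
non_understanding_rate = sum(s <= 2 for s in scores) / len(scores)
print(f"avg understandability = {avg_understandability:.2f}")
print(f"non-understanding rate = {non_understanding_rate:.0%}")
```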
Table 2 Averaged understandability scores, continuity scores, and non-understanding rates
When we look at the non-understanding rates, we find that Syntax+Content achieves a very low figure of 6 %, suggesting that in most cases the utterances do not lead to a sudden breakdown of dialogue.
Within the utterances that Syntax+Content created, only one utterance scored 1 for understandability:
(1) aiteru-yoo kitasenju-ni ii yakinikuya-*kara ikou-zee
    open-SEP Kitasenju-at good BBQ-restaurant-from go-SEP
    'It's open. Why don't we go *from the good BBQ restaurant at Kitasenju'

Here, SEP denotes a sentence-end particle and an asterisk means ungrammatical. This sentence contains two sentences without any punctuation mark in between; the first sentence has a missing argument and the second sentence has an incorrect predicate-argument structure. The trained syntactic classifier probably failed to detect it as invalid because such a complex combination of errors was not seen in the training data. An increase in the training data could solve the problem.
This paper proposed syntactic filtering and content-based retrieval of Twitter sentences so that the retrieved sentences can be safely used for response generation in dialogue systems. Experimental results showed that our proposed method can appropriately select high-quality Twitter sentences, significantly outperforming the word-based baseline. Our contribution lies in discovering the usefulness of syntactic information in filtering Twitter sentences and in validating the effectiveness of related words in retrieving sentences. For future work, we plan to investigate how to extract topic words from the context and also to create a workable conversational system with speech recognition and speech synthesis.
Acknowledgments We thank Prof. Kohji Dohsaka of Akita Prefectural University for his helpful advice on statistical tests. We also thank Tomoko Izumi for her suggestions on how to write linguistic examples.
References

… Asian Lang Inf Process 7(2)
8 Higuchi S, Rzepka R, Araki K (2008) A casual conversation system using modality and word associations retrieved from the web. In: Proceedings of the EMNLP, pp 382–390
9 Imamura K, Kikui G, Yasuda N (2007) Japanese dependency parsing using sequential labeling for semi-spoken language. In: Proceedings of the ACL, pp 225–228
10 Inaba M, Kamizono S, Takahashi K (2013) Utterance generation for non-task-oriented dialogue systems using Twitter. In: Proceedings of the 27th annual conference of the Japanese society for artificial intelligence, 1K4-OS-17b-4 (in Japanese)
11 Kudo T, Matsumoto Y (2004) A boosting algorithm for classification of semi-structured text. In: Proceedings of the EMNLP, pp 301–308
12 Louis A, Newman T (2012) Summarization of business-related tweets: a concept-based approach. In: Proceedings of the COLING 2012 (Posters), pp 765–774
13 Meguro T, Higashinaka R, Minami Y, Dohsaka K (2010) Controlling listening-oriented dialogue using partially observable Markov decision processes. In: Proceedings of the COLING
17 Takeuchi S, Cincarek T, Kawanami H, Saruwatari H, Shikano K (2007) Construction and optimization of a question and answer database for a real-environment speech-oriented guidance system. In: Proceedings of the Oriental COCOSDA
18 Tokunaga K, Kazama J, Torisawa K (2005) Automatic discovery of attribute words from web documents. In: Proceedings of the IJCNLP, pp 106–118
19 Walker MA, Passonneau R, Boland JE (2001) Quantitative and qualitative evaluation of DARPA communicator spoken dialogue systems. In: Proceedings of the ACL, pp 515–522
20 Wallace RS (2004) The anatomy of A.L.I.C.E. A.L.I.C.E. artificial intelligence foundation, Inc
21 Weizenbaum J (1966) ELIZA—a computer program for the study of natural language communication between man and machine. Commun ACM 9(1):36–45
22 Yoshino K, Mori S, Kawahara T (2011) Spoken dialogue system based on information extraction using similarity of predicate argument structures. In: Proceedings of the SIGDIAL, pp 59–66
Knowledge-Guided Interpretation
and Generation of Task-Oriented Dialogue
Alfredo Gabaldon, Pat Langley, Ben Meadows and Ted Selker
Abstract In this paper, we present an architecture for task-oriented dialogue that
integrates the processes of interpretation and generation. We analyze implemented systems based on this architecture—one for meeting support and another for assisting military medics—and discuss results obtained with the first. In closing, we review some related dialogue architectures and outline plans for future research.
Systems that use natural language to assist a user in carrying out some task must interact with that user as execution of the task progresses. The system in turn must interpret the user’s utterances and other environmental input to build a model of what
both it and the user believe and intend—in regard to each other and the environment.
The system also requires knowledge to use the model it constructs to participate in
a dialogue with the user and support him in achieving his goals.
Our architecture relies on abstract meta-level knowledge that generalizes across different domains. In particular, we are interested in high-level aspects of dialogue: knowledge and strategies relevant to dialogue processing that are independent of the actual content of the conversation. The architecture separates domain-level from meta-level content, using both during interpretation and generation. The work we report is informed by cognitive systems research, a key feature of which is arguably integration and processing of knowledge at different levels of abstraction [7].
Another feature of our architecture is the incremental nature of its processes. We assume that dialogues occur within a changing environment and that the tasks to be accomplished are not predetermined but discerned as the dialogue proceeds. Our architecture incrementally expands its understanding of the situation and the user’s goals, acts according to this understanding, and adapts to changes in the situation, sometimes choosing to pursue different goals. In other words, the architecture supports situated systems that carry out goal-directed dialogues to aid their users. In the next section we discuss two implemented prototypes that demonstrate this key functionality. We follow this with a detailed description in Sect. 3 of the underlying architecture and a discussion of results in Sect. 4. We conclude with comments on related work and plans for future research.
In this section, we discuss two prototypes that incorporate our architecture as their dialogue engine. The first system facilitates cyber-physical meetings by interacting with humans and equipment; the second is an advisory system that collaborates with a military medic to address the mutual goal of treating a patient. In each case, we discuss the setting, the knowledge that drives behavior, and a sample dialogue that illustrates operation. Neither system interacts with users in spoken language, but each could be adapted to this end in a straightforward manner.
2.1 A System for Meeting Support
We have developed a system that supports ‘cyber-physical’ meetings in which a number of human users interact with each other and with a number of devices to pursue some joint activity. The specific scenario we examine here involves a session between a physical therapist, a patient, and a relative of the patient who have the shared aim of completing a physical therapy session. During the meeting, the users carry out a dialogue that includes the system as a participant. The users and system also interact with distributed sensors that measure leg motion, heart rate, and ambient light and temperature, as well as distributed actuators that include a ‘smart’ television and a text-to-speech generator.
During a physical therapy session, users communicate by entering sentences through a menu-based smart-phone interface, although we intend to replace this with a spoken-language interface in future work. The current interface accepts sentences in a controlled version of English, translates them into a logic-like representation, and sends the result to the dialogue system for processing. At the same time, the sensing devices send messages to the dialogue system via HTTP. Each user can see the utterances of other users and those of the system via the phone interface.

The system has knowledge about activities that arise during a therapy session, including physical activities the patient can carry out, such as repeatedly lifting a leg, and therapist procedures, such as obtaining relevant patient information (from a database or the patient), measuring range of movement, and tracking heart rate and blood pressure. In a typical session, the system asks the patient standard questions and the therapist asks additional ones. Then the therapist decides on the task for the session and suggests it to the patient, relative, and system. If the patient or family member accepts the proposed task, the system updates its model of the current situation and proceeds accordingly. The system supports the patient’s execution of a task by tracking his progress, instructing the television to show a tutorial video, and providing feedback. For instance, once sensor input reveals the patient has started doing an exercise, it might encourage him by saying “Great start!”

Table 1 Sample dialogue involving a physical therapist (Doctor), a patient (John), and the meeting support system

  Doctor   John, please do leg lifts.
  Patient  Okay. I will do leg lifts.
  TV       [starts playing tutorial video about leg lifting]
  Sensor   [sends movement signal to system]
  System   Great start John!
  Sensor   [sends movement signal to system]
           [after patient does a few more leg lifts …]
  System   You are doing great, John!
           [after a few more, the patient stops …]
  Patient  I can’t do any more leg lifts.
  Doctor   It’s okay, John. That was good.
  System   John, you did eight repetitions.
  Patient  How many did I do in the previous session?
  System   In the previous session you did five repetitions.
Specific components of the meeting support system include a menu-based interface on a smart phone to input English sentences, a phone application that serves as a motion detector, a television for displaying tutorials and other support videos, a heart-rate monitor, environmental sensors for temperature and lighting, an HTTP client/server module for component communication, and the dialogue system. Table 1 shows a sample dialogue for one of the physical therapy scenarios. In this case, the patient John participates in a session in which he partially completes a leg exercise under supervision of a therapist at a remote location. We will return to this case study in Sect. 4, where we examine it in more detail.
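To make this concrete, the sketch below shows how one exchange from Table 1 might look after translation into the logic-like representation, written here as Prolog facts and using the speech-act vocabulary introduced in Sect. 3. The utterance_translation/2 predicate, the activity identifier a1, and the triple attributes are our assumptions for illustration, not the system's actual output format.

  % Hypothetical translations of two menu-entered utterances from Table 1.
  % The therapist's request becomes a propose act over an activity a1:
  utterance_translation('John, please do leg lifts.',
                        propose(doctor, john, [a1, type, leg_lifts])).
  % The patient's reply becomes an accept act over the same content:
  utterance_translation('Okay. I will do leg lifts.',
                        accept(john, doctor, [a1, type, leg_lifts])).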
2.2 A Medic Assistant
Our second prototype involves scenarios in which a military medic on the battlefield helps an injured teammate. Because the medic has limited training, he interacts with the dialogue system to get advice on treating the person; the system plays the role of a mentor with medical expertise. The medic and system collaborate towards achieving the shared goal of stabilizing the patient’s medical condition. The system does not know the specific task in advance. Only after the conversation starts, and the medic provides relevant information, does the system act on this content and respond in ways that are appropriate to achieving the goal. The system does not effect change on the environment directly; the medic provides both sensors and effectors, with the system influencing him by giving instructions.
During an interaction, the system asks an initial sequence of questions that lead the medic to provide details about the nature of the injury. This sequence is not predetermined, in that later questions are influenced by the medic’s responses to earlier ones. Table 2 shows a sample dialogue in which the medic-system team attempts to stabilize a person with a bleeding injury. The system possesses domain knowledge about how to treat different types of injuries, taking into account their location, severity, and other characteristics. The program can also adapt the treatment according to the medic’s situation. For instance, it may try a different treatment for a wound if the medic claims that he cannot apply a particular treatment because he lacks the supplies necessary for that purpose.
This system uses a Web interface similar to a text-messaging application, although again we plan to replace this with a spoken dialogue module in the future. The medic types English sentences into a form element within the interface, which it sends to the dialogue system via an HTTP request. The system in turn sends the content to a natural language processor that translates it into a logical form our system can interpret.
We have used Skyphrase (http://www.skyphrase.com), a proprietary, semantically-driven language interpreter, for this purpose. Skyphrase sends back the translation, again via HTTP, and the dialogue system uses the information to update its model of the current situation. This drives the system’s continuing behavior in pursuing the joint task. Lack of space constrains our discussion of the system here, but we have provided further details elsewhere [6].
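As a rough sketch of this update step, the Prolog fragment below shows how a translated utterance from Table 2 might enter the system's model of the situation. The logical form shown and the update_model/1 rule are assumptions made for exposition; the paper does not specify Skyphrase's output format.

  :- dynamic belief/2.

  % Hypothetical logical form for the medic's "He's bleeding." (patient p1):
  %   inform(medic, advisor, [p1, status, bleeding])
  % A minimal update rule: an inform act records the speaker's belief and
  % folds the content into the system's own model of the situation.
  update_model(inform(Speaker, _Listener, Content)) :-
      assertz(belief(Speaker, Content)),
      assertz(belief(system, Content)).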
2.3 Discussion
Although the two systems we have just described are limited in many respects, developing them raised a number of new challenges that have not been fully addressed by existing dialogue systems, at least in combination:
• As the dialogue and the users’ situation evolve, the system must update its account of the situation by interpreting new information as it arrives, including dialogue utterances and sensor inputs.
• The system’s interaction with users is goal directed and involves joint activity over time; this activity includes not only actions carried out by the users, but also communicative actions and commands to device actuators.
• The system must establish a common ground [4] with its users, that is, it must develop a shared model of the environmental situation and others’ mental states; it must also update this joint model as circumstances change and information flows among participants.
• The beliefs and goals of each participant are not stated explicitly, but the system must infer enough of them to be effective; this involves using not only domain-specific knowledge, but also more abstract knowledge that relates mental states to communication events.
• The overall process is highly dynamic, as the system continuously draws inferences from users’ utterances and other input to expand its understanding of the evolving situation, and as it carries out activities to achieve goals as they arise.

Table 2 Sample dialogue between a human medic and a more knowledgeable computer advisor

  Medic    We have a man injured!
  Advisor  Okay. What type of injury?
  Medic    He’s bleeding.
  Advisor  How bad is the bleeding?
  Medic    Pretty bad. I think it is an artery.
  Advisor  Okay. Where is the injury?
  Medic    It’s on the left leg.
  Advisor  Apply pressure on the leg’s pressure point.
  Medic    Roger that.
  Advisor  Has the bleeding stopped?
  Medic    No. He’s still bleeding.
  Advisor  Okay. Apply a tourniquet.
  Medic    Where do I put the tourniquet?
  Advisor  Just below the joint above the wound.
  Medic    Okay. The bleeding has stopped.

Our application systems and architecture represent first steps towards addressing these challenges. In the next section we describe the integrated architecture, an implementation of which serves as the main component of the two systems above.
Now we can turn to our framework for task-oriented dialogue. We have focused on supporting goal-directed behavior that is physically situated in dynamic contexts. The architecture depends on a knowledge base that lets it generate inferences, introduce goals, and execute actions. Input is multi-modal in that it might come from speech, text, visual cues, or external sensors. We have implemented the architecture in Prolog, making use of its support for embedded structures and pattern matching, but its representation and control mechanisms diverge substantially from the default Prolog inference engine, as we will see shortly.
3.1 Representation and Content
As in research on cognitive architectures [9], we distinguish between a dynamic short-term or working memory, which stores external inputs and inferences based upon this information, and a more stable long-term memory, which serves as a store of knowledge that is used to make inferences and organize activities.
Working memory is a rapidly changing set of ground literals that contains the system’s beliefs and goals as it models the evolving situation. Literals for domain-level content, which do not appear as top-level elements in working memory, are stored as relational triples, as in [i1, type, injury] or [i1, severity, major]. This reification lets the system examine and refer separately to different aspects of a single complex concept, including its predicate.
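A brief Prolog sketch may make the reification concrete. The facts below, and the severe_injury/2 rule that reads one aspect at a time, are our own illustration built from the triples above; the clause syntax in the implemented system may differ.

  belief(medic, [i1, type, injury]).        % the concept i1 is an injury
  belief(medic, [i1, severity, major]).     % one aspect, stored separately
  belief(medic, [i1, location, left_leg]).  % another aspect

  % Because each aspect is its own triple, a rule can test severity alone,
  % without matching the whole complex concept:
  severe_injury(Agent, I) :-
      belief(Agent, [I, type, injury]),
      belief(Agent, [I, severity, major]).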
Our representation also incorporates meta-level predicates, divorced entirely from the domain level, to denote speech acts [1, 13]. The literature contains many alternative taxonomies for speech acts; we have adopted a reduced set of six types that has been sufficient for our current purposes. These include:

  inform(S, L, C): speaker S asks L to believe content C;
  acknowledge(S, L, C): S tells L it has received and now believes content C;
  question(S, L, C): S asks L a question C;
  propose(S, L, C): S asks L to adopt goal C;
  accept(S, L, C): S tells L it has adopted goal C;
  reject(S, L, C): S tells L it has rejected goal C.

All domain-level and meta-level concepts in working memory are embedded within one of two predicates that denote aspects of mental states: belief(A, C) or goal(A, C) for some agent A and content C, as in belief(medic, [i1, type, injury]). A mental state’s content may be a triple, [i, r, x], a belief or goal term (nested mental states), an agent’s belief that some attribute has a value, as in belief_wh(A, [i, r]), a belief about whether some propositional content is true, as in belief_if(A, C), or a meta-level literal, such as the description of a speech act.
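The following working-memory snapshot illustrates this embedding; it is our own example rather than one drawn from the implemented systems, and the choice of agents and attributes is hypothetical.

  belief(medic, [i1, type, injury]).                     % domain-level triple
  belief(system, belief(medic, [i1, severity, major])).  % nested mental state
  goal(system, belief_wh(system, [i1, location])).       % wanting to know a value
  belief(system, inform(medic, system, [i1, location, left_leg])).
                                                         % a speech act as content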
Long-term memory contains generic knowledge in the form of rules. Each rule encodes a situation or activity by associating a set of triples in its head with a pattern of concepts in its body. High-level predicates are defined by decomposition into other structures, imposing an organization similar to that in hierarchical task networks [11]. Structures in long-term memory include conceptual knowledge, skills, and goal-generating rules.
Conceptual knowledge comprises a set of rules which describe classes of situations that can arise relative to a single agent’s beliefs or goals. These typically occur at the domain level and involve relations among states of the world. Conceptual rules define complex categories in terms of simpler ones and organize these relational predicates into taxonomies.
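Since the architecture interprets its own rule structures rather than relying on the native Prolog engine, a conceptual rule might be stored as a term like the one below. The conceptual_rule/2 wrapper and the severe_leg_wound category are our guesses at the flavor of such rules, not the authors' actual encoding.

  % A derived triple in the head, a pattern of simpler triples in the body:
  conceptual_rule([I, class, severe_leg_wound],
                  [[I, type, injury],
                   [I, severity, major],
                   [I, location, left_leg]]).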
Skills encode the activities that agents can execute to achieve their goals. Each skill describes the effects of some action or high-level activity under specified conditions. The body of a skill includes a set of preconditions, a set of effects, and a set of invariants, along with a sequence of subtasks that are either executable actions, in the case of primitive skills, or other skills, in the case of nonprimitive skills.
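A nonprimitive skill from the medic scenario might then be encoded as in the sketch below; the skill/2 structure and every field name are assumptions chosen to mirror the preconditions, effects, invariants, and subtasks just described.

  skill(stop_bleeding(Medic, Patient),
        [preconditions([[I, type, injury], [I, of, Patient],
                        [I, status, bleeding]]),
         subtasks([apply_pressure(Medic, Patient),      % decomposes into other
                   apply_tourniquet(Medic, Patient)]),  % skills or actions
         effects([[I, status, stabilized]]),
         invariants([[Patient, status, conscious]])]).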
Goal-generating rules specify domain-level knowledge about the circumstances under which an agent should establish new goals. For example, an agent might have a rule stating that, when a teammate is injured, it should adopt a goal for him to be stabilized. These are similar to conceptual rules, but they support the generation of goals rather than inference of beliefs.
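The injured-teammate example might be rendered as follows, with the goal_rule/2 wrapper and predicate names again hypothetical, chosen to parallel the conceptual-rule sketch above.

  % When the agent believes some I is an injury of teammate P,
  % adopt the goal that P be stabilized:
  goal_rule(goal(self, [P, status, stabilized]),
            [belief(self, [I, type, injury]),
             belief(self, [I, of, P])]).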
The architecture also includes more abstract, domain-independent knowledge at the meta-level. This typically involves skills, but it can also specify conceptual relations (e.g., about transitivity). The most important structures of this type are speech act rules that explain dialogue actions in terms of patterns of agents’ beliefs and goals, without making reference to domain-level concepts. However, the content of a speech act is instantiated as in any other concept. For example, the rule for an inform act is:
  inform(S, L, C) ← belief(S, C),
                    goal(S, belief(L, C)),
                    belief(S, belief(L, C)).
Here S refers to the speaker, L to the listener, and C to the content of the speech act. Rules for other speech acts take a similar abstract form.
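For instance, by analogy with the inform rule, a propose act might be explained by the pattern below, written in the same notation. This is our reconstruction of what such a rule could look like, not one quoted from the paper:

  propose(S, L, C) ← goal(S, C),
                     goal(S, goal(L, C)),
                     belief(S, goal(L, C)).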