Proceedings of EACL '99
Robust and Flexible Mixed-Initiative Dialogue
for Telephone Services

Relaño Gil, José¹ and Tapias, Daniel and Gancedo, María C.
Charfuelán, Marcela¹ and Hernández, Luis A.
Speech Technology Group, Telefónica Investigación y Desarrollo, S.A.
C/ Emilio Vargas, 6. 28043 - Madrid (Spain). Tel: 34.1.549500 Fax: 34.1.3367350. e-mail: jretanio@gaps.ssr.upm.es
Abstract
In this work, we present an experimental analysis of a Dialogue System for the automation of simple telephone services. Starting from the evaluation of a preliminary version of the system, we conclude the necessity of designing a robust and flexible system that supports different dialogue control strategies depending on the characteristics of the user and the performance of the speech recognition module. Experimental results following the PARADISE framework show an important improvement both in terms of task success and dialogue cost for the proposed system.
1 INTRODUCTION
In this contribution we present some improvements on the design of a Dialogue Management System for the automation of simple telephone tasks in a PABX environment (automatic name dialing, voice messaging, ...). From the point of view of its functionality, our system is a very simple one, because there is no need for advanced Plan Recognition strategies or General Problem Solving methods. However, we think that even for this kind of dialogue system there is still a long way to go to demonstrate usability in real situations by the "general public".

In our work we will concentrate on systems designed for the telephone line and for a wide range of potential users. Therefore our evaluations will be done taking into account different levels of speech recognition performance and user behaviours. In particular, we will propose and evaluate strategies directed at increasing robustness against recognition errors and flexibility to deal with a wide range of users. We will use the PARADISE evaluation framework (Walker et al., 1998) to analyze both task success and agent dialogue behaviour related to subjective user satisfaction.
¹ Dep. SSR, ETSIT-UPM, Spain.
2 ROBUST AND FLEXIBLE SYSTEM

Following the classification of Dialogue Systems proposed by Allen (Allen, 1997), our baseline dialogue system could be described as a system with topic-based performance capabilities, an adaptive single task, a minimal pair clarification/correction dialogue manager and fixed mixed-initiative. One of the most important objectives of our dialogue manager has been the implementation of a collaborative dialogue model, so the system has to be able to understand all the user actions, in whatever order they appear, and even if the focus of the dialogue has been changed by the user. In order to achieve this, we organize the information in an information tree, controlled by a task knowledge interpreter, and we let the data participate in driving the dialogue. However, to control a mixed-initiative strategy we use three separate sources of information: the user data, the world knowledge embedded in the task structure and the general dialogue acts.
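The data-driven organization described above can be sketched as a small slot tree whose unfilled slots drive the next system prompt. This is a hypothetical minimal sketch under our own naming assumptions; the node and task names below are illustrative and not the system's actual task structure:

```python
# Minimal sketch of an information tree: the user may fill slots in any
# order (mixed initiative), and the system only prompts for what is
# still missing. Node names are illustrative, not the paper's.

class SlotNode:
    def __init__(self, name, children=None):
        self.name = name
        self.value = None
        self.children = children or []

    def fill(self, name, value):
        """Fill the first matching slot, wherever it sits in the tree."""
        if self.name == name:
            self.value = value
            return True
        return any(child.fill(name, value) for child in self.children)

    def next_missing(self):
        """Depth-first search for the next slot the system must ask about."""
        if not self.children and self.value is None:
            return self.name
        for child in self.children:
            missing = child.next_missing()
            if missing:
                return missing
        return None

# Toy task: sending a voice message needs a recipient and a message body.
task = SlotNode("send_message",
                [SlotNode("recipient"), SlotNode("message")])
task.fill("message", "call me back")   # user takes the initiative
print(task.next_missing())             # system still needs: recipient
```

A data-driven loop of this shape is what lets the system accept "yes, I would like to send a message to John Smith" and still know which slots remain open.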
From this preliminary evaluation of the system we found that, in order to increase its performance, two major points should be addressed: a) robustness against recognition and parser errors, and b) more flexibility to deal with different user models. We designed four complementary strategies to improve its performance:
1. To estimate the performance of the speech recognition module. This was done from a count of the number of corrections during previous interactions with the same user.
2. To classify each user as belonging to Group A or B, which will be described later in the Experimental Results section. This was done by combining a normalized average number of utterances per task and the amount of information in each utterance, especially at some particular dialogue points (for example, when answering the question of our previous example).
3. To include a control module that, from the results of steps 1 and 2, defines two different kinds of control management allowing a flexible mixed-initiative strategy: more user initiative for Group A users and high recognition rates, and more restrictive strategies for Group B users and/or low recognition performance.
All of these strategies have been included in our system, as depicted in Figure 1.
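The control logic of steps 1-3 can be sketched roughly as follows. The threshold and the way the correction rate maps to an ASR estimate are illustrative assumptions of this sketch, not values from the system:

```python
# Rough sketch of the control module: estimate ASR performance from the
# correction count of previous interactions with the same user, combine
# it with the user's group, and pick a dialogue control strategy.
# The 15% threshold is an illustrative placeholder.

def estimated_asr_ok(n_corrections, n_utterances, max_rate=0.15):
    """Step 1: treat ASR as 'high performance' when the fraction of
    corrected utterances in earlier interactions stays below max_rate."""
    return (n_corrections / n_utterances) < max_rate

def select_strategy(user_group, n_corrections, n_utterances):
    """Step 3: more user initiative only for Group A users under good
    recognition; otherwise fall back to a more restrictive strategy."""
    if user_group == "A" and estimated_asr_ok(n_corrections, n_utterances):
        return "user-initiative"
    return "restrictive"

print(select_strategy("A", n_corrections=2, n_utterances=40))   # user-initiative
print(select_strategy("B", n_corrections=2, n_utterances=40))   # restrictive
```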
3 EXPERIMENTAL RESULTS
In order to test the improvements over our original system (described in (Alvarez et al., 1996)), we designed a simulated evaluation environment where the performance of the Speech Recognition Module (recognition rate) was artificially controlled. A Wizard of Oz simulation environment was designed to obtain different levels of recognition performance for a vocabulary of 1170 words: 96.4% word recognition rate for high performance and 80% for low performance. A pre-defined single fixed mixed-initiative strategy was used in all the cases.
We used an annotated database composed of 50 dialogues with 50 different novice users and 6 different simple telephone tasks in each dialogue: 25 dialogues were simulated using the 96.4% recognition rate and 25 with 80%. Performance results were obtained using the PARADISE evaluation framework (Walker et al., 1998), determining the contributions of task success and dialogue cost to user satisfaction. As a task success measure we obtained the Kappa coefficient, while dialogue cost measures were based on the number of user turns. In this case it is important to point out that, as each tested dialogue is composed of a set of six different tasks which have quite different numbers of turns, the number of turns for each task was normalized to its Z score, N(x) = (x − x̄)/σ_x.
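The two measures above can be sketched in a few lines. This is a generic sketch of the Z-score normalization and of the Kappa formula (observed agreement corrected for chance agreement), not the paper's evaluation code; the example numbers are invented:

```python
import statistics

def z_normalize(values):
    """PARADISE-style normalization of a cost measure to its Z score,
    N(x) = (x - mean) / stdev, so tasks with different typical turn
    counts become comparable."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(x - mean) / stdev for x in values]

def kappa(p_agreement, p_chance):
    """Kappa coefficient used as the task-success measure:
    observed agreement corrected for agreement expected by chance."""
    return (p_agreement - p_chance) / (1 - p_chance)

turns = [4, 6, 8, 10]   # turns for one task across several dialogues
print([round(z, 2) for z in z_normalize(turns)])   # -> [-1.34, -0.45, 0.45, 1.34]
print(round(kappa(0.85, 0.25), 2))                 # -> 0.8
```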
              Both Groups          Group A    Group B
              Low ASR   High ASR   High ASR   High ASR
Satisfaction  26.4      30.1       35.4       25.2

Table 1: Mean user satisfaction for both groups in the low and high ASR situations, and separately for Groups A and B in the high ASR situation.
User satisfaction in Table 1 was obtained as a cumulative satisfaction score for each dialogue by summing the scores of a set of questions similar to those proposed in (Walker et al., 1998). The ANOVA for Kappa, the cost measure and user satisfaction demonstrated a significant effect of ASR performance. As could be predicted, we found that in all cases a low recognition rate corresponds to a dramatic decrease in the absolute number of successfully completed tasks and an important increase in the average number of utterances. However, we also found that in the high ASR situation the task success measure of Kappa was surprisingly low.

A closer inspection of the dialogues in Table 1 revealed that this low performance under high ASR situations was due to the presence of two groups of users. A first group, Group A, showed a "fluent" interaction with the system, similar to the one assumed by the mixed-initiative strategy (for example, as an answer to the system question "do you want to do any other task?", these users could answer something like "yes, I would like to send a message to John Smith"). The other group of users, Group B, exhibited a very restrictive interaction with the system (for example, a short answer "yes" to the same question).
As a conclusion of this first evaluation, we found that in order to increase the performance of our baseline system two major points should be addressed: a) robustness against recognition and parser errors, and b) more flexibility to deal with different user models.
Therefore we designed an adaptive strategy to adapt our dialogue manager to Groups A and B of users and to high and low ASR situations. The adaptation was done based on linear discrimination, as illustrated in Figure 2, using both the average number of turns and the recognition errors from the first two tasks in each dialogue.
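A linear discrimination over these two features amounts to a simple boundary rule. The weights and bias below are illustrative placeholders, not the coefficients fitted for Figure 2:

```python
# Sketch of the adaptation decision: a linear boundary over two features
# measured on the first two tasks of each dialogue (average turns and
# recognition error rate). Weights and bias are illustrative only.

def classify(avg_turns, error_rate, w_turns=1.0, w_err=0.5, bias=-10.0):
    """Return 'restrictive' above the linear boundary (many turns and/or
    many recognition errors), 'user-initiative' below it."""
    score = w_turns * avg_turns + w_err * error_rate + bias
    return "restrictive" if score > 0 else "user-initiative"

print(classify(avg_turns=5.3, error_rate=3.6))    # fluent user, good ASR
print(classify(avg_turns=7.2, error_rate=20.0))   # low ASR situation
```

In a real deployment the weights would be fitted from annotated dialogues, exactly as the discrimination line in Figure 2 was derived from the evaluation data.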
              Low ASR       High ASR
              Both Groups   Group A   Group B
Kappa         0.71          -         -
User Turns    7.2           5.3       6.1
Satisfaction  26.9          32.1      29.4

Table 2: Mean results for each group in the high ASR situation and for both groups in the low ASR situation.
Table 2 shows mean results for each Group A and B of users for high ASR performance, and for all users in low ASR situations. These results show a more stable behaviour of the system, that is, less difference in performance between users of Group A and Group B and, although to a lesser extent, between high and low recognition rates.
4 CONCLUSIONS

The main conclusion of this work is the necessity of designing adaptive dialogue management strategies to make the system robust against recognition performance and different user behaviours.
References

James Allen. 1997. Tutorial: Dialogue Modeling. In ACL/EACL Workshop on Spoken Dialogue Systems, Madrid, Spain.

D. Tapias. 1996. The Natural Language Processing Module for a Voice Assisted Operator at Telefonica I+D. In ICSLP '96, Philadelphia, USA.

M. Walker, D. Litman, C. Kamm, and A. Abella. 1998. Evaluating spoken dialog agents with PARADISE: Two case studies. Computer Speech and Language.
Figure 1: Modules of the Robust and Flexible Mixed-Initiative Dialogue system. [Architecture diagram: parser, users group selector, dialogue strategy selector, co-reference processor, semantic processor, correction detector, user acts knowledge interpreter, task acts interpreter, dialogue acts, and the application interface.]
Figure 2: User classification. [Scatter plot of user data against recognition error rate (%), with a linear discrimination boundary.]