be informative in the data pre-analysis. This set includes DD, STD, UTD, SRD, URD, # TURNS, WPST, WPUT, # BARGE-INS, # SYSTEM ERROR MESSAGES, # SYSTEM QUESTIONS, # USER QUESTIONS, AN:CO, AN:PA, AN:FA, PA:CO, PA:PA, PA:FA, SCR, UCR, CA:AP, CA:IA, IR, IC, UA, WA, WER,
In both sets, the turn-related parameters have been normalized to the overall number of turns (or, for the AN parameters, to the number of user questions), as is described in Section 6.2.1.
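The normalization described above can be illustrated with a minimal sketch. The field names (`n_turns`, `n_user_questions`, `n_barge_ins`, `an_co`) and the sample values are hypothetical and only serve the illustration; the sketch assumes that AN counts are divided by the number of user questions and all other turn-related counts by the number of turns.

```python
def normalize_turn_counts(dialogue, per_turn_keys, per_question_keys):
    """Return a copy of `dialogue` in which count parameters become rates."""
    normalized = dict(dialogue)
    n_turns = dialogue["n_turns"]
    n_user_questions = dialogue["n_user_questions"]
    for key in per_turn_keys:
        # Turn-related counts are expressed per turn.
        normalized[key] = dialogue[key] / n_turns if n_turns else 0.0
    for key in per_question_keys:
        # AN counts are expressed per user question.
        normalized[key] = (
            dialogue[key] / n_user_questions if n_user_questions else 0.0
        )
    return normalized


# Hypothetical dialogue: 20 turns, 4 user questions.
example = {"n_turns": 20, "n_user_questions": 4, "n_barge_ins": 2, "an_co": 3}
rates = normalize_turn_counts(example, ["n_barge_ins"], ["an_co"])
```

Working with rates rather than raw counts keeps long and short dialogues comparable when they enter the same regression.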
Input parameters reflecting task success: Although the basic formula of the PARADISE model contains as a mandatory input parameter, it has often been replaced by the user judgment on task completion, COMP, in the practical application of the model. This COMP parameter roughly corresponds to a binary version of the judgment on question B1. For the analysis of the experiment 6.3 data, the following options for describing task success have been chosen:
Task success calculated on the basis of the AVM, either on a per-dialogue level or on a per-configuration level.
Task success measures based on the overall solution.
User judgment on question B1.
A binary version of B1, calculated by assigning a value of 1 for a rating B1 > 3.0 and a value of 0 otherwise.
It should be noted that B1 and are calculated on the basis of user judgments. Thus, using one of these parameters as an input to a prediction model is not in line with the general idea of quality prediction, namely to become independent of direct user ratings.
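The binary version of the B1 judgment amounts to a simple thresholding step. The following sketch uses the threshold of 3.0 given in the text; the function name and the sample ratings are hypothetical.

```python
def binarize_b1(rating, threshold=3.0):
    """Map a B1 rating to 1 if it exceeds the threshold, otherwise to 0."""
    return 1 if rating > threshold else 0


# Hypothetical B1 ratings from four dialogues.
ratings = [1.5, 3.0, 3.2, 4.8]
binary = [binarize_b1(r) for r in ratings]
```

Note that a rating of exactly 3.0 falls into the lower class, because the condition is a strict inequality.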
Apart from the input and output parameters, the choice of the regression approach carries an influence on the resulting model. A linear multivariate analysis like the one used in PARADISE has been chosen here. The choice of parameters which are included in the regression function depends on the amount of available parameters. For set 1, a forced inclusion of all four parameters has been chosen. For set 2, a stepwise inclusion method is more appropriate, because of the large number of input parameters. The stepwise method sequentially includes the variables with the highest partial correlation values with the target variable (forward step), and then excludes variables with the lowest partial correlation (backward step). In case of missing values, the corresponding cases have been excluded from the analysis for the set 1 data (listwise exclusion). For set 2, such an exclusion would lead to a relatively low
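The forward step of such a stepwise procedure, together with the two ways of handling missing values that appear in this chapter (listwise exclusion and replacement by the mean), can be sketched as follows. This is a simplified illustration, not the statistics package used for the analysis: the backward step is omitted, the stopping threshold `min_gain` is an assumption, and candidates are ranked by the gain in covered variance, which selects the same variable as the highest absolute partial correlation. The tiny solver does not guard against collinear inputs, and all data are invented.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(rows, y, cols):
    """Fit y on an intercept plus the chosen columns; return covered variance."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(rows, y, min_gain=0.01):
    """Add one variable at a time while the covered variance still improves."""
    selected, best = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining:
        scores = {c: r_squared(rows, y, selected + [c]) for c in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best < min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best


def listwise_exclusion(rows, y):
    """Drop every case that has at least one missing (None) input value."""
    kept = [(r, t) for r, t in zip(rows, y) if None not in r]
    return [r for r, _ in kept], [t for _, t in kept]


def mean_replacement(rows):
    """Replace missing (None) values by the mean of the respective column."""
    means = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        means.append(sum(known) / len(known))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


# Invented data: the target depends almost only on column 0.
rows = [[1, 5, 2], [2, 3, 1], [3, 8, 4], [4, 1, 3],
        [5, 9, 5], [6, 2, 7], [7, 7, 6], [8, 4, 8]]
y = [9.1, 7.8, 7.1, 6.0, 4.9, 4.2, 2.9, 2.0]
selected, covered = forward_stepwise(rows, y)
```

Listwise exclusion keeps the data untouched but shrinks the sample, whereas mean replacement keeps all cases at the price of slightly flattened variance in the affected columns.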
coefficient. Instead, their models rely on the COMP measure. Making use of the parameter (which is similar to COMP) increases to 0.24–0.45, which is in the range of values given in the literature. The model’s performance can be further increased by using the non-simplified judgment on question B1 for describing task success. In this case, reaches 0.52, a value which is amongst the best of Table 6.34. Task success measures which are based on the overall solution and the modified version provide slightly better estimations than and but they are not competitive with the subject-derived measures B1 and . Apparently, the PARADISE model performs best when it is not completely relying on interaction parameters, but when subjective estimations of task success are included in the estimation function. This finding is in line with comparative experiments described by Bonneau-Maynard et al. (2000). When using subjective judgments of task success instead of the coefficient, the amount of predicted variance rose from 0.41 to 0.48.
Comparing the performance for the different target variables, seems to be least predictable. The amount of covered variance is significantly lower than in the experiments described by Walker. The relatively low number of input parameters in set 1 may be a major reason for this finding. Prediction accuracy rises significantly when B0 or B23 are taken as the target parameter, and with B1 or describing task success. A further improvement is observed when the target parameter is calculated as a mean over several ratings, namely as MEAN(B0, B23) or MEAN(B). The model’s performance is equally high in these cases. Apparently, the smoothing of individual judgments which is inherent to the calculation of the mean has a positive effect on the model’s prediction accuracy.
Table 6.36 shows the significant predictors for different models determined using the set 1 “dialogue cost” parameters and different task-success-related parameters as the input. Target variables are either the or the MEAN(B) parameter. For most significant dialogue cost contributions come from # TURNS (with a negative sign), and partly also from the # BARGE-INS parameter (negative sign). DD and IC only play a subordinate role in predicting . For the task-success-related parameters, a clear order can be observed: B1 and have a dominant effect on (both with a positive sign), and only a moderate one (the first with a positive and the second with a negative sign), and and are nearly irrelevant in predicting . For MEAN(B) as the target, the situation is very similar. Once again, # TURNS is a persistent predictor (always with a negative sign), and DD, IC and # BARGE-INS only have minor importance. The task-success-related input parameters show the same significance order in predicting MEAN(B): B1 and have a strong effect (positive sign), and a moderate one (also positive sign), and and are not important predictors. Apparently, the PARADISE model is strongly dependent on the type of the input parameter describing task success.
The prediction results for different target variables are depicted in Table 6.37, both for the expert-derived parameter and for the user-derived parameter describing task success. The most important contributors for the prediction of are # TURNS (negative sign) and the task-success-related parameter. For predicting B0, also DD and # BARGE-INS (both negative sign) play a certain role. B23 and MEAN(B0, B23) seem to be better predicted from DD and the task-success-related parameter; here, the # TURNS parameter is relatively irrelevant. For predicting MEAN(B), the most significant contributions come from # TURNS and . As may be expected, the different target parameters related to user satisfaction require different input parameters for an adequate prediction. Thus, the models established by the multivariate regression analysis are only capable of predicting different indicators of user satisfaction to a limited extent.
The number of input parameters in set 1 is very restricted (four “dialogue cost” parameters and one task-success-related parameter). Taking the set 2 parameters as an input, it can be expected that more general aspects of quality are covered by the resulting models. An overview of the achievable variance coverage is given in Table 6.38. In general, the coverage is much better than was observed for set 1. Using the interaction parameters or for describing task success, rises to 0.28–0.47 depending on the target parameter. With B1 or an even better coverage can be reached.
As was observed for the set 1 data, it seems to be important to include user-derived estimations of task success in the prediction function. Expert-derived
indicators of user satisfaction. Interestingly, the and parameters are never selected by the stepwise inclusion algorithm. Thus, the low importance of these parameters in the prediction function (see Table 6.36) is confirmed for the augmented set of input parameters. Overall, the prediction functions include a relatively large number of input parameters. However, the amount of variance covered by the function does not seem to be strictly related to the number of input parameters, as the results in the final row or column of Table 6.38 show.
MEAN(B0,B23)) can be observed. In summary, the augmented data set leads to far better prediction results, with a wider coverage of the resulting prediction functions.
Table 6.39 shows the resulting prediction functions for different task-success-related input parameters. The following parameters seem to be stable contributors to the respective targets:
Measures of communication efficiency: Most models include either the WPST and SRD parameters (positive sign), STD (negative sign), or # TURNS (negative sign). The latter two parameters seem to indicate a preference for shorter interactions, whereas the positive sign for the WPST parameter indicates the opposite, namely that a talkative system would be preferred. A higher value for SRD is in principle linked to longer user utterances which require an increased processing time from the system/wizard. No conclusive explanation can be drawn with respect to the communication efficiency measures.
Measures of appropriateness of system utterances: All prediction functions contain the CA:AP parameter with a positive sign. Two models of Table 6.39 also contain CA:IA (positive sign), which seems to rule out a part of the high effect of CA:AP in these functions. In any case, dialogue cooperativity proves to be a significant contributor to user satisfaction.
Measures of task success: The task-success-related parameters do not always provide an important contribution to the target parameter, except for B1 which is in both cases a significant contributor. In the model estimated from the first four input parameter sets (identical model), task success is completely omitted.
Measures of initiative: Most models contain the # SYSTEM QUESTIONS parameter, with a positive sign. Apparently, the user likes systems which take a considerable part of the initiative. Only one model contains the # USER QUESTIONS parameter.
Measures of meta-communication: Two parameters are frequently selected in the models. The PA:PA parameter (positive sign) indicates that partial system understanding seems to be a relevant factor for user satisfaction. The SCR parameter is an indicator for corrected misunderstandings. It is always used with a positive sign.
The prediction functions differ for the mentioned target parameters, see Table 6.40. Apart from the parameters listed above, new contributors are the dialogue duration (negative sign), the # BARGE-INS parameter (negative sign), and in two cases the word accuracy as well. Whereas the first parameter underlines the significant influence of communication efficiency, the latter introduces speech input quality as a new quality aspect in the prediction function. Two models differ significantly from the others, namely the ones for predicting B23 and MEAN(B0, B23) on the basis of and the set 2 input parameters. The models are very simple (only two input parameters), but reach a relatively high amount of covered variance. The relatively high correlation between B1 and B23 may be responsible for this result.
The values given so far reflect the amount of variance in the training data covered by the respective model. However, the aim of a model is to allow for predictions of new, unseen data. Experiments have been carried out to train a model on 90% of the available data, and to test it on the remaining 10% of data. The sets of training and test data can be chosen either in a purely randomized way, i.e. selecting a randomized 10% of the dialogues for testing (comparable to the results reported in Table 6.34), or in a per-subject way, i.e. selecting a randomized set of 4 of the 40 test subjects for testing. The latter way is slightly more independent, as it prevents within-subject extrapolation. Both analyses have been applied ten times, and the amount of variance covered by the training and test data sets ( values) is reported in Tables 6.41 and 6.42.
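The two ways of choosing test data can be sketched as follows. The sketch only covers the splitting and the computation of covered variance on given predictions; the record layout (a `subject` field per dialogue), the number of five dialogues per subject, and the fixed seeds are assumptions made for the illustration.

```python
import random


def random_split(dialogues, test_fraction=0.1, seed=0):
    """Purely randomized split: hold out a random fraction of all dialogues."""
    rng = random.Random(seed)
    shuffled = dialogues[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def per_subject_split(dialogues, n_test_subjects=4, seed=0):
    """Per-subject split: hold out all dialogues of a few random subjects."""
    rng = random.Random(seed)
    subjects = sorted({d["subject"] for d in dialogues})
    held_out = set(rng.sample(subjects, n_test_subjects))
    train = [d for d in dialogues if d["subject"] not in held_out]
    test = [d for d in dialogues if d["subject"] in held_out]
    return train, test


def covered_variance(targets, predictions):
    """Share of target variance explained by the predictions."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical corpus: 40 subjects with 5 dialogues each.
dialogues = [{"subject": s, "dialogue": d} for s in range(40) for d in range(5)]
# Ten repetitions of the per-subject split, one per seed.
splits = [per_subject_split(dialogues, seed=i) for i in range(10)]
```

Because the per-subject split never places dialogues of the same person on both sides, the test score is not helped by a subject's individual rating habits seen during training.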
It turns out that the models show a significantly lower predictive power for the test data than for the training data. The performance on the training data is comparable to the one observed in Table 6.40, namely using and using as the input parameter related to task success. For a purely randomized set of unseen test data, the mean amount of covered variance drops to 0.263 with and to 0.305 with . The situation is similar when within-subject extrapolation is excluded: Here, the mean drops to 0.198 with and to 0.360 with . In contrast to what has been reported by Walker et al. (see Table 6.34), the model predictions are more limited to the training data. Several reasons may be responsible for this finding. Firstly, the differences between system versions seem to be larger in experiment 6.3 than in Walker et al. (2000a). Although different functionalities are offered by the systems at AT&T, it is to be expected that the components for speech input and output were identical for all systems. Secondly, the amount of available training data is considerably lower for each system version of experiment 6.3. Walker et al. showed saturation from about 200 dialogues onwards, but these 200 dialogues only reflected three instead of ten different system versions. Finally, several of the parameters used in the original PARADISE version only have limited predictive power for experiment 6.3, e.g. the # BARGE-INS, # ASR REJECTIONS and # HELP REQUESTS parameters, see Section 6.2.1. It can be expected that a linear regression analysis on parameters which are only different from zero in a few cases will not lead to an optimally fitting curve.
The interaction parameters and user judgments which form the model input have been collected with different system versions. In order to capture the resulting differences in perceived quality, it is possible to build separate prediction models for each system configuration. In this way, model functions for different system versions can be compared, as well as the amount of variance which is covered in each case. Table 6.43 shows models derived for each of the ten system versions of experiment 6.3, as well as the overall model derived for all system versions, using set 1 and as input parameters. Except for configurations 6 and 7, where the # BARGE-INS parameter is constantly zero, all models include the same input parameters. It turns out that the individual models attribute different degrees of importance (coefficient values) to each input parameter. Unfortunately, the coefficient values cannot easily be interpreted with respect to the specific system configuration. The speech-input-related parameter IC does not show a stronger effect if ASR performance decreases (configurations 6 to 10), nor does the extensive use of TTS have an interpretable effect on the prediction function. The amount of variance covered by the models also differs significantly between the system configurations. Apparently, the system configuration has a strong influence on how and how well a prediction model is able to estimate parameters related to user satisfaction.
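The idea of fitting one model per system configuration and comparing coefficients and covered variance can be sketched in a strongly simplified form. The sketch uses a single predictor (a turn count) instead of the full set 1 parameters, and the record fields `config`, `turns` and `rating` as well as all values are invented.

```python
from collections import defaultdict


def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus covered variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot


def models_per_configuration(records):
    """Group the dialogues by configuration and fit one model per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["config"]].append(record)
    return {
        config: fit_line([r["turns"] for r in group],
                         [r["rating"] for r in group])
        for config, group in groups.items()
    }


# Invented data for two configurations with the same turn counts.
records = (
    [{"config": 1, "turns": t, "rating": r}
     for t, r in zip([10, 12, 14, 16], [4.0, 3.5, 3.0, 2.5])]
    + [{"config": 2, "turns": t, "rating": r}
       for t, r in zip([10, 12, 14, 16], [3.0, 3.2, 2.8, 3.0])]
)
models = models_per_configuration(records)
```

In this invented example the same predictor explains nearly all of the rating variance in one configuration and very little in the other, which mirrors the kind of configuration dependence discussed above.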
The same analysis has been carried out for the augmented set of input parameters (set 2 and ). The results are given in Table 6.44. Once again, the amount of covered variance differs significantly between the system configurations. Some of the configurations for which set 1 fails to provide an adequate model basis (e.g. configuration 2) can be well covered by the augmented set 2. Input parameters which are frequently included in the prediction function are those related to dialogue cooperativity (CA:AP with a positive sign, CA:IA with a negative sign), communication efficiency (STD with a positive sign, # TURNS with a negative sign), task success ( with a positive sign), and meta-communication handling (SCR with a positive sign). The contradicting tendencies for the communication-efficiency-related parameters have already been discussed above. Interestingly, speech-input-related parameters are also included in the performance functions, but partly in an opposite sense: UA with a positive sign, with a positive sign, PA:CO with a negative sign, and PA:PA with a positive sign. No explanation for this finding can be given so far. In conclusion, the regression model functions proved to be highly dependent on the system configuration under test. Thus, generalizability of model estimations – as reported in Section 6.3.1.2 – seems to be very limited for the described experiment. The large differences in the system configurations of experiment 6.3 may be responsible for this finding. Although the systems described by Walker et al. (2000a) differ with respect to their functionality, it is possible that the underlying components and their performance are very similar. Further cross-laboratory experiments are necessary to thoroughly test how generic quality prediction models are.
In the case that system characteristics are known beforehand (which is normally true for system developers), this information can be included in the input parameter set. Because the regression analysis is not able to handle nominally scaled variables with more than two distinct values, the system information has to be coded beforehand. Five coding variables were used for this purpose:
conf_type: 0 for no confirmation, 1 for explicit confirmation.
rec_rate: Target recognition rate in percent (already given on an ordinal scale).
voc_m: 1 for natural male voice uttering the fixed system turns, 0 otherwise.
voc_s: 1 for synthetic male voice uttering the fixed and variable system
is only increased by the system information; other parameters of set 2 remain unchanged.
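Coding nominal system characteristics into numeric regression inputs can be sketched as below. Only the four coding variables whose definitions are given above are reproduced. The layout of the configuration record (`confirmation`, `target_recognition_rate`, `voice`) and its value labels are hypothetical, and the single `voice` field simplifies the distinction between fixed and variable system turns.

```python
def code_configuration(config):
    """Translate a nominal system description into numeric coding variables."""
    return {
        # 0 for no confirmation, 1 for explicit confirmation.
        "conf_type": 1 if config["confirmation"] == "explicit" else 0,
        # Target recognition rate in percent, already on an ordinal scale.
        "rec_rate": config["target_recognition_rate"],
        # Binary indicators for the voice (hypothetical value labels).
        "voc_m": 1 if config["voice"] == "natural_male" else 0,
        "voc_s": 1 if config["voice"] == "synthetic_male" else 0,
    }


# Hypothetical system configuration.
coded = code_configuration({
    "confirmation": "explicit",
    "target_recognition_rate": 90,
    "voice": "synthetic_male",
})
```

Splitting a multi-valued nominal variable into several binary indicators is what allows a linear regression to assign each value its own coefficient.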
The influence of individual parameters coding the system-specific information is depicted in Table 6.46, for different target parameters. In all cases, the most important system information seems to be coded in the voc_s and voc_f parameters. As has been observed in the analyses of Section 6.2.5.2, the speech output component seems to be the one with the highest impact on overall system quality and user satisfaction. However, speech-output-related information is not covered in any of the interaction parameters. Thus, the increase in overall model coverage can be explained by the new aspect which is introduced with the additional input parameters. In most cases, the voc_s parameter carries a negative coefficient, showing that synthetic speech leads to lower user satisfaction scores. In only a few cases the rec_rate parameter has a coefficient with a value higher than 0.1 (always with a positive sign). Apparently, the recognition rate does not have a direct impact on user satisfaction. This finding is congruent with the ones made in Section 6.2.5.1. The conf_type parameter shows coefficients with positive and negative signs, indicating that there is no clear preference with respect to the confirmation strategy.
6.3.3 Hierarchical Quality Prediction Models
Following the idea of the PARADISE model, the regression analyses carried out so far aim at predicting high-level quality aspects like overall user satisfaction. The target values for these aspects were either chosen according to the classification given by the QoS taxonomy (B0 and B23), or calculated as a simple arithmetic mean over different quality aspects. In this way, no distinction is made between the quality aspects and categories of the QoS taxonomy, and their interrelationships are not taken into account. Even worse, different aspects like perceived system understanding, TTS intelligibility, dialogue conciseness, or acceptability are explicitly mixed in the variable
In order to better incorporate knowledge about quality aspects, related interaction parameters as well as interrelationships between aspects, new modelling approaches are presented in the following which are based on the QoS taxonomy. In a first step, the taxonomy serves to define target variables for individual quality aspects and categories. The targets are the arithmetic mean values over all judgments belonging to the respective aspect or category (see Figure 6.1), namely the judgments on part B and C questions obtained in experiment 6.3. Tables 6.47 and 6.48 show the definitions of target variables (noted for the target for each quality aspect and category. Input parameters to the following models consist of the set 2 interaction parameters, augmented by the four interaction parameters (not user judgments!) on task success, namely and . This augmented set will be called set 3 in the following discussion.
In a first approach, the different quality categories listed in Table 6.48 are modelled on the basis of the complete set 3 data. A standard multivariate regression analysis with stepwise inclusion of parameters and replacement by the mean for missing values is used for this purpose. The resulting models are shown in Table 6.49. It can be seen that for several quality categories the amount of covered variance is similar to or even exceeds the one observed for the global quality predictors, see the first four rows of Table 6.38 (results for pure interaction parameters as the input). The best prediction results are obtained for communication efficiency, dialogue cooperativity, comfort, and task efficiency. Usability, service efficiency, utility and acceptability resist a meaningful prediction, probably because they are only addressed by the judgments on part C questions, which do not reflect the characteristics of the individual system configurations.
The predictors chosen by the algorithm give an indication on the interaction parameters which are relevant for each quality category. Independent of the parameter definition, dialogue cooperativity receives the strongest contribution from the CA:AP parameter. This shows that indeed contextual appropriateness is the dominating dimension of cooperativity. Other relevant predictors are the system’s meta-communication capability (SCR) and task success. Dialogue symmetry also seems to be dominated by the appropriateness of system utterances. The significant predictors are very similar to the ones observed for cooperativity. Apparently, there is a close relationship between these two categories, which can partly be explained by the considerable overlap of questions in both categories, see Table 6.47. The speech input/output quality category cannot be well predicted. This is mainly due to the absence of speech-output-related interaction parameters. Only the speech input aspect of the category is covered by the interaction parameters of set 3. However, these parameters were not identified as relevant predictors by the algorithm. This finding underlines the fact that information may be lost when different quality aspects are mixed as a target variable of the regression algorithm.
Communication efficiency is the category which can be predicted best from the experimental data. As may be expected, the most important predictors are WPST (positive sign), # TURNS (negative sign), STD (negative sign), and DD (positive sign). The apparent contradiction in the signs has already been observed above. It seems that the users prefer to have few turns, but that the system turns should be as informative as possible (high number of words), even if this increases the overall dialogue duration. The comfort experienced by the user seems to be largely dominated by the STD parameter. However, part of this effect is ruled out by the WPST parameter which influences predicted comfort in the opposite direction, with a positive sign. Further influencing factors on comfort are SRD, which is correlated to long user utterances (the more the user is able to speak, the higher the comfort), as well as CA:AP (appropriate system utterances increase comfort). Task efficiency can be predicted to a similar degree as comfort. The most important contributors are UCR, CA:AP, and . Interestingly, the parameter gives a negative contribution. As observed in the last section, coefficients do not seem to be reliable indicators of perceived task success. Apart from the user satisfaction category, which can be predicted to a similar degree and with similar parameters as observed in Table 6.40, all other target variables do not allow for satisfactory predictions.
In the literature, only few examples of predicting individual quality aspects are documented. In the frame of the EURESCOM project MIVA, Johnston (2000) described a simple regression analysis for predicting single quality dimensions from instrumentally measurable interaction parameters. He found relatively good simple predictors for ease of use, learnability, pleasantness, effort required to use the service, correctness of the provided information, and perceived duration. However, no values have been calculated, and the number of input interaction parameters is very low. Thus, it has to be expected that the derived models are relatively specific to the system they have been developed for.
The models in Table 6.49 show that the interaction parameters assigned beforehand to a specific quality aspect or category (see Figure 6.1) are not always the most relevant predictors. Nevertheless, an approach will be presented in the following discussion to include some of the knowledge contained in the QoS taxonomy in a regression model. A 3-layer hierarchical structure, reflecting the quality aspects, quality categories, and the global target variables, is used in an initial approach. This structure is depicted in Figure 6.17. On the first layer, quality aspect targets (see Table 6.47) are predicted on the basis of the previously assigned interaction parameters (see Tables 3.1 and 3.2). On the second layer, quality category targets (see Table 6.48) are predicted on the basis of the predictions from layer 1 (indicated for category and in one case (contextual appropriateness) amended by additional interaction parameters which have been directly assigned to this quality category. On the third layer, the 5 target variables used in the last section are predicted on the basis of the predictions for layer 2. All regression models are determined by forced inclusion of all mentioned input parameters, and by replacing missing values by the respective means. Figure 6.17 shows the input and output parameters of each layer and for each target, and the resulting amount of covered variance, for each prediction. It should be noted that only those quality aspects and categories for which interaction parameters have been assigned can be modelled in this way.
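The layered structure described above can be sketched as a chain of forced-inclusion regressions in which each layer takes the fitted values of the previous layer as its inputs. The sketch is a toy illustration under several assumptions: the parameter, aspect and category names (`p1`, `a1`, `c1`, `g`) and all values are invented, the additional interaction parameters that amend one category on layer 2 are left out, and the small solver does not guard against collinear inputs.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_predict(columns, target):
    """Regress target on all given columns; return fitted values and R^2."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_t = sum(target) / len(target)
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return fitted, 1.0 - ss_res / ss_tot


def fill_mean(column):
    """Replace missing (None) values by the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


def run_layer(assignment, sources, targets):
    """Predict every target of one layer from its assigned inputs."""
    predictions, covered = {}, {}
    for name, input_names in assignment.items():
        inputs = [sources[n] for n in input_names]
        predictions[name], covered[name] = fit_predict(inputs, targets[name])
    return predictions, covered


# Invented interaction parameters (one missing value) and judgment targets.
params = {
    "p1": fill_mean([1, 2, 3, 4, None, 6]),
    "p2": fill_mean([2, 1, 4, 3, 6, 5]),
    "p3": fill_mean([5, 3, 6, 2, 4, 1]),
}
targets = {
    "a1": [1.2, 1.9, 3.1, 3.8, 5.2, 5.9],  # quality aspect targets
    "a2": [4.1, 2.8, 5.2, 2.1, 3.9, 1.2],
    "c1": [2.5, 2.4, 4.0, 3.1, 4.6, 3.4],  # quality category target
    "g": [2.6, 2.2, 4.1, 3.0, 4.4, 3.6],   # global target
}
layer1, r2_layer1 = run_layer({"a1": ["p1", "p2"], "a2": ["p3"]}, params, targets)
layer2, r2_layer2 = run_layer({"c1": ["a1", "a2"]}, layer1, targets)
layer3, r2_layer3 = run_layer({"g": ["c1"]}, layer2, targets)
```

Since every layer only sees the predictions of the layer below, any variance that is lost on an early layer cannot be recovered later, which is one way to read the lower accuracy of hierarchical models compared with direct modelling.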
It turns out that a meaningful prediction of individual quality aspects is only possible in rare cases. Reasonable values are observed for speech input
cannot be predicted on the basis of the assigned interaction parameters. One reason will be the limited number of parameters which are attributed in some cases. However, the amount of covered variance is not strictly related to the number of input parameters, as the predictions for conciseness and smoothness show. When the predicted values of the first layer are taken as an input to predict quality categories on layer 2, the prediction accuracy is not completely satisfactory. All values are far below a direct prediction on the basis of all set 3 parameters, cf. Table 6.49. Only communication efficiency can be predicted with an value of 0.323. The reason for the comparatively low amount of covered variance will be linked to the restricted number of input parameters for each category.
On the highest level (layer 3), prediction accuracy on the basis of layer 2 predictions turns out to be lower than for the direct modelling in most cases, compare Figure 6.17 and Table 6.38. It should be noted that the hierarchical model includes all input parameters by force, according to the hierarchical structure. If a comparable forced-inclusion approach is chosen for the models of Table 6.40, the amount of covered variance increases to for , for B0, for B23, for MEAN(B0,B23), and for MEAN(B). These values show that the amount of variance which can be covered by a regression model strongly depends on the choice of available input parameters. It also shows that a simple hierarchical model structure, as was used here, does not lead to better results for predicting global quality aspects.
Figure 6.17. 3-layer hierarchical multivariate regression model for experiment 6.3 data. Input parameters are indicated in the black boxes. Missing cases are replaced by the mean value, forced inclusion of all input parameters. # BI: # BARGE-INS; # SEM: # SYSTEM ERROR MESSAGES; # SQ: # SYSTEM QUESTIONS; # UQ: # USER QUESTIONS.
As an alternative, a 2-layer hierarchical structure has been chosen, see Figure 6.18. In this structure, the first layer for predicting quality aspects is skipped, due to the low prediction accuracy (low amount of covered variance) which has been observed for most quality aspect targets. For predicting communication efficiency, comfort and task efficiency, the predictions for cooperativity, dialogue symmetry and speech input/output quality are taken as input variables, together with additional interaction parameters which have been assigned to these categories. In this way, the interdependence of quality categories displayed in the QoS taxonomy is reflected in the model structure. On the basis of the predictions for all six quality categories, estimations of global quality aspects are calculated, as in the previous example.
A comparison between the prediction results of Figures 6.17 and 6.18 shows that the amount of variance which is covered increases for all six predicted quality categories. The increase is most remarkable for the categories in the lower part of the QoS taxonomy, namely communication efficiency, comfort, and task efficiency. Apparently, the interrelations indicated in the taxonomy have to be taken into account when perceptive quality dimensions are to be predicted. Still, the overall amount of covered variance is lower than the one obtained for direct estimation on the basis of all set 3 parameters, see Table 6.49. It is also slightly lower when predicting global quality aspects like user satisfaction, e.g. in comparison to Table 6.40 (except for MEAN(B)).
The reasons for this finding may be threefold: (1) Either incorrect target values (here: mean over all questions related to a quality aspect or category) were chosen; or (2) incorrect input parameters for predicting the target value were chosen; or (3) the aspects or categories used in the taxonomy are not adequate for quality prediction. Indeed, the choice of input parameters has proven to carry a significant impact on quality prediction results. It is difficult to decide whether the quality categories defined in the taxonomy are adequate for a prediction, and whether the respective target variables are adequate representatives for each category. The example of speech output quality shows that quality aspects which are not at all covered by instrumentally or expert-derived interaction parameters may be nevertheless very important for the user’s quality perception. Further investigations will be necessary to choose optimum target variables. Such variables will have to represent a compromise between the informative value for the system developer, the types of questions which can be answered by the user, and the interaction parameters available for model input.
Figure 6.18. 2-layer hierarchical multivariate regression model for experiment 6.3 data. Input parameters are indicated in the black boxes. Missing cases are replaced by the mean value, forced inclusion of all input parameters. # BI: # BARGE-INS; # SEM: # SYSTEM ERROR MESSAGES; # SQ: # SYSTEM QUESTIONS; # UQ: # USER QUESTIONS.
For the models calculated in Section 6.3.2, the amount of covered variance was highly dependent on the system configuration. As an example, the 2-layer hierarchical model has been calculated separately for configurations 1 and 2 of experiment 6.3, see Figures 6.19 and 6.20. It can be seen that the values still differ considerably between the two configurations, depending on the prediction target. In both cases, good variance coverage is reached for communication efficiency, task efficiency, and MEAN(B). Communication efficiency in particular can be predicted in a nearly ideal way. It should however be noted that the number of input parameters for this category is very high, and the amount of target data is very restricted (20 dialogues for each connection). Thus, the optimization problem may be an easy one, even for linear regression models.
6.3.4 Conclusion of Modelling Approaches
The described modelling approaches perform a simple transformation of instrumentally or expert-derived interaction parameters into mean user judgments with respect to specific quality dimensions, or into global quality aspects like user satisfaction. The amount of variance which can be covered in most cases does not exceed 50%. Consequently, there seems to be a significant number of contributors to perceived quality which are not covered by the interaction parameters. For some quality aspects – like speech output quality – this fact is obvious. However, other aspects which seem to be well captured by the respective interaction parameters – like perceived system understanding – are still quite difficult to predict. Thus, there is strong evidence that direct judgments from the users are still the only reliable way for collecting information about perceived quality. A description via interaction parameters can only be an additional source of information, e.g. in the system optimization phase.
Because the traditional modelling approaches like PARADISE do not distinguish between different quality dimensions, it was hoped that the incorporation of knowledge about quality aspects into the model structure would lead to better or more generic results. At least the first target could not be reached by the proposed – admittedly simple – hierarchical structures. Although the 2-layer model which reflects the interrelationships between quality categories shows some improvements with respect to the 3-layer model, both approaches still do not provide any advantage in prediction accuracy with respect to a simple straight-forward approach. An increase in genericness is difficult to estimate, namely on the basis of experimental data which has been collected with a single system. All models – hierarchical as well as straight-forward, PARADISE-style ones – proved to be highly influenced by the system configuration. This will be a limiting factor of model usability: In order to estimate which level of quality can be reached with an improved system version, quality prediction models should at least be able to extrapolate to higher recognition rates, other speech
Figure 6.19. 2-layer hierarchical multivariate regression model for experiment 6.3 data, system configuration 1 of Table 6.2. Input parameters are indicated in the black boxes. Missing cases are replaced by the mean value, forced inclusion of all input parameters. # BI: # BARGE-INS; # SEM: # SYSTEM ERROR MESSAGES; # SQ: # SYSTEM QUESTIONS; # UQ: # USER QUESTIONS.