
Quality of Telephone-Based Spoken Dialogue Systems, part 8





be informative in the data pre-analysis. This set includes DD, STD, UTD, SRD, URD, # TURNS, WPST, WPUT, # BARGE-INS, # SYSTEM ERROR MESSAGES, # SYSTEM QUESTIONS, # USER QUESTIONS, AN:CO, AN:PA, AN:FA, PA:CO, PA:PA, PA:FA, SCR, UCR, CA:AP, CA:IA, IR, IC, UA, WA, WER, […]

In both sets, the turn-related parameters have been normalized to the overall number of turns (or, for the AN parameters, to the number of user questions), as described in Section 6.2.1.
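The normalization described above can be sketched as follows. This is a minimal illustration, not the procedure of Section 6.2.1 itself: the function name, the dictionary layout, and the toy counts are assumptions made for the example.

```python
def normalize_turn_parameters(counts, n_turns, n_user_questions):
    """Return per-turn rates for turn-related counts.

    Counts whose name starts with "AN:" are divided by the number of user
    questions; every other count is divided by the overall number of turns.
    """
    normalized = {}
    for name, count in counts.items():
        denominator = n_user_questions if name.startswith("AN:") else n_turns
        # Leave the value undefined when there is nothing to normalize by.
        normalized[name] = count / denominator if denominator else None
    return normalized


# Toy dialogue with made-up counts (not data from the experiment).
dialogue = {"# BARGE-INS": 2, "# SYSTEM QUESTIONS": 6, "AN:CO": 3}
rates = normalize_turn_parameters(dialogue, n_turns=20, n_user_questions=4)
```

Dividing by the turn count makes dialogues of different length comparable before they enter the regression.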

Input parameters reflecting task success: Although the basic formula of the PARADISE model contains κ as a mandatory input parameter, it has often been replaced by the user judgment on task completion, COMP, in the practical application of the model. This COMP parameter roughly corresponds to a binary version of the judgment on question B1. For the analysis of the experiment 6.3 data, the following options for describing task success have been chosen:

Task success calculated on the basis of the AVM, either on a per-dialogue level or on a per-configuration level.

Task success measures based on the overall solution.

User judgment on question B1.

A binary version of B1, calculated by assigning a value of 0 for a rating B1 ≤ 3.0 and a value of 1 for a rating B1 > 3.0.
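The binary version of B1 amounts to a simple threshold. The sketch below assumes that every rating not above 3.0 maps to 0, since the lower condition is not legible in this copy; the function name and the sample ratings are illustrative only.

```python
def binarize_b1(rating, threshold=3.0):
    """Map a B1 rating to 1 (above the threshold) or 0 (otherwise)."""
    # Assumption: ratings equal to the threshold fall into the 0 class.
    return 1 if rating > threshold else 0


# Made-up ratings, only to show the mapping.
ratings = [1.0, 3.0, 3.2, 5.0]
binary = [binarize_b1(r) for r in ratings]
```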

It should be noted that B1 and its binary version are calculated on the basis of user judgments. Thus, using one of these parameters as an input to a prediction model is not in line with the general idea of quality prediction, namely to become independent of direct user ratings.

Apart from the input and output parameters, the choice of the regression approach influences the resulting model. A linear multivariate analysis like the one used in PARADISE has been chosen here. The choice of parameters which are included in the regression function depends on the number of available parameters. For set 1, a forced inclusion of all four parameters has been chosen. For set 2, a stepwise inclusion method is more appropriate because of the large number of input parameters. The stepwise method sequentially includes the variables with the highest partial correlation values with the target variable (forward step), and then excludes variables with the lowest partial correlation (backward step). In case of missing values, the corresponding cases have been excluded from the analysis for the set 1 data (listwise exclusion). For set 2, such an exclusion would lead to a relatively low […]


[…] coefficient. Instead, their models rely on the COMP measure. Making use of the binary version of B1 (which is similar to COMP) increases the covered variance to 0.24 to 0.45, which is in the range of values given in the literature. The model's performance can be further increased by using the non-simplified judgment on question B1 for describing task success. In this case, the covered variance reaches 0.52, a value which is amongst the best of Table 6.34. Task success measures which are based on the overall solution provide slightly better estimations than the two AVM-based measures, but they are not competitive with the subject-derived measures B1 and its binary version. Apparently, the PARADISE model performs best when it does not rely completely on interaction parameters, but when subjective estimations of task success are included in the estimation function. This finding is in line with comparative experiments described by Bonneau-Maynard et al. (2000). When using subjective judgments of task success instead of the κ coefficient, the amount of predicted variance rose from 0.41 to 0.48.

Comparing the performance for the different target variables, […] seems to be the least predictable. The amount of covered variance is significantly lower than in the experiments described by Walker. The relatively low number of input parameters in set 1 may be a major reason for this finding. Prediction accuracy rises significantly when B0 or B23 are taken as the target parameter, and with B1 or its binary version describing task success. A further improvement is observed when the target parameter is calculated as a mean over several ratings, namely as MEAN(B0, B23) or MEAN(B). The model's performance is equally high in these cases. Apparently, the smoothing of individual judgments which is inherent to the calculation of the mean has a positive effect on the model's prediction accuracy.

Table 6.36 shows the significant predictors for different models determined using the set 1 "dialogue cost" parameters and different task-success-related parameters as the input. Target variables are either the […] or the MEAN(B) parameter. For […], the most significant dialogue cost contributions come from # TURNS (with a negative sign), and partly also from the # BARGE-INS parameter (negative sign). DD and IC only play a subordinate role in predicting […]. For the task-success-related parameters, a clear order can be observed: B1 and its binary version have a dominant effect on […] (both with a positive sign), […] and […] only a moderate one (the first with a positive and the second with a negative sign), and […] and […] are nearly irrelevant in predicting […]. For MEAN(B) as the target, the situation is very similar. Once again, # TURNS is a persistent predictor (always with a negative sign), and DD, IC and # BARGE-INS only have minor importance. The task-success-related input parameters show the same significance order in predicting MEAN(B): B1 and its binary version have a strong effect (positive sign), […] and […] a moderate one (also with a positive sign), and […] and […] are not important predictors. Apparently, the PARADISE model is strongly dependent on the type of the input parameter describing task success.

The prediction results for different target variables are depicted in Table 6.37, both for the expert-derived parameter […] and for the user-derived parameter […] describing task success. The most important contributors for the prediction of […] are # TURNS (negative sign) and the task-success-related parameter. For predicting B0, DD and # BARGE-INS (both with a negative sign) also play a certain role. B23 and MEAN(B0, B23) seem to be better predicted from DD and the task-success-related parameter; here, the # TURNS parameter is relatively irrelevant. For predicting MEAN(B), the most significant contributions come from # TURNS and […]. As may be expected, the different target parameters related to user satisfaction require different input parameters for an adequate prediction. Thus, the models established by the multivariate regression analysis are only capable of predicting different indicators of user satisfaction to a limited extent.

The number of input parameters in set 1 is very restricted (four "dialogue cost" parameters and one task-success-related parameter). Taking the set 2 parameters as an input, it can be expected that more general aspects of quality are covered by the resulting models. An overview of the achievable variance coverage is given in Table 6.38. In general, the coverage is much better than was observed for set 1. Using the interaction parameters […] or […] for describing task success, the covered variance rises to 0.28 to 0.47, depending on the target parameter. With B1 or its binary version, an even better coverage can be reached.

As was observed for the set 1 data, it seems to be important to include subject-derived estimations of task success in the prediction function. Expert-derived […] indicators of user satisfaction. Interestingly, the […] and […] parameters are never selected by the stepwise inclusion algorithm. Thus, the low importance of these parameters in the prediction function (see Table 6.36) is confirmed for the augmented set of input parameters. Overall, the prediction functions include a relatively large number of input parameters. However, the amount of variance covered by the function does not seem to be strictly related to the number of input parameters, as the results in the final row or column of Table 6.38 show.

[…] MEAN(B0, B23)) can be observed. In summary, the augmented data set leads to far better prediction results, with a wider coverage of the resulting prediction functions.

Table 6.39 shows the resulting prediction functions for different task-success-related input parameters. The following parameters seem to be stable contributors to the respective targets:

Measures of communication efficiency: Most models include either the WPST and SRD parameters (positive sign), STD (negative sign), or # TURNS (negative sign). The latter two parameters seem to indicate a preference for shorter interactions, whereas the positive sign for the WPST parameter indicates the opposite, namely that a talkative system would be preferred. A higher value for SRD is in principle linked to longer user utterances, which require an increased processing time from the system/wizard. No conclusive explanation can be given with respect to the communication efficiency measures.

Measures of appropriateness of system utterances: All prediction functions contain the CA:AP parameter with a positive sign. Two models of Table 6.39 also contain CA:IA (positive sign), which seems to offset part of the strong effect of CA:AP in these functions. In any case, dialogue cooperativity proves to be a significant contributor to user satisfaction.

Measures of task success: The task-success-related parameters do not always provide an important contribution to the target parameter, except for B1, which is in both cases a significant contributor. In the model estimated from the first four input parameter sets (identical model), task success is completely omitted.

Measures of initiative: Most models contain the # SYSTEM QUESTIONS parameter, with a positive sign. Apparently, the user likes systems which take a considerable part of the initiative. Only one model contains the # USER QUESTIONS parameter.

Measures of meta-communication: Two parameters are frequently selected in the models. The PA:PA parameter (positive sign) indicates that partial system understanding seems to be a relevant factor for user satisfaction. The SCR parameter is an indicator for corrected misunderstandings. It is always used with a positive sign.


The prediction functions differ for the mentioned target parameters, see Table 6.40. Apart from the parameters listed above, new contributors are the dialogue duration (negative sign), the # BARGE-INS parameter (negative sign), and in two cases the word accuracy as well. Whereas the first parameter underlines the significant influence of communication efficiency, the latter introduces speech input quality as a new quality aspect in the prediction function. Two models differ significantly from the others, namely the ones for predicting B23 and MEAN(B0, B23) on the basis of […] and the set 2 input parameters. The models are very simple (only two input parameters), but reach a relatively high amount of covered variance. The relatively high correlation between B1 and B23 may be responsible for this result.

The values given so far reflect the amount of variance in the training data covered by the respective model. However, the aim of a model is to allow for predictions of new, unseen data. Experiments have been carried out to train a model on 90% of the available data, and to test it on the remaining 10% of data. The sets of training and test data can be chosen either in a purely randomized way, i.e. selecting a randomized 10% of the dialogues for testing (comparable to the results reported in Table 6.34), or in a per-subject way, i.e. selecting a randomized set of 4 of the 40 test subjects for testing. The latter way is slightly more independent, as it prevents within-subject extrapolation. Both analyses have been applied ten times, and the amount of variance covered for the training and test data sets is reported in Tables 6.41 and 6.42.
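The two ways of choosing test data can be sketched as follows. The record layout, the function names, and the figure of five dialogues per subject are assumptions for the example; only the 10% share and the 4-of-40 subject split come from the text.

```python
import random


def split_randomly(dialogues, test_fraction=0.1, seed=0):
    """Hold out a random share of the dialogues, ignoring who produced them."""
    rng = random.Random(seed)
    shuffled = dialogues[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_subject(dialogues, n_test_subjects=4, seed=0):
    """Hold out every dialogue of a few randomly chosen subjects."""
    rng = random.Random(seed)
    subjects = sorted({d["subject"] for d in dialogues})
    test_subjects = set(rng.sample(subjects, n_test_subjects))
    train = [d for d in dialogues if d["subject"] not in test_subjects]
    test = [d for d in dialogues if d["subject"] in test_subjects]
    return train, test


# Toy corpus: 40 subjects with 5 dialogues each (the per-subject count is assumed).
dialogues = [{"subject": s, "dialogue": k} for s in range(40) for k in range(5)]
rand_train, rand_test = split_randomly(dialogues)
subj_train, subj_test = split_by_subject(dialogues)
```

In the per-subject split no test subject contributes to the training data, which is what rules out within-subject extrapolation. Repeating either split with ten different seeds would mirror the ten repetitions mentioned in the text.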

It turns out that the models show a significantly lower predictive power for the test data than for the training data. The performance on the training data is comparable to the one observed in Table 6.40, namely […] using […] and […] using […] as the input parameter related to task success. For a purely randomized set of unseen test data, the mean amount of covered variance drops to 0.263 with […] and to 0.305 with […]. The situation is similar when within-subject extrapolation is excluded: Here, the mean drops to 0.198 with […] and to 0.360 with […]. In contrast to what has been reported by Walker et al. (see Table 6.34), the model predictions are more limited to the training data. Several reasons may be responsible for this finding. Firstly, the differences between system versions seem to be larger in experiment 6.3 than in Walker et al. (2000a). Although different functionalities are offered by the systems at AT&T, it is to be expected that the components for speech input and output were identical for all systems. Secondly, the amount of available training data is considerably lower for each system version of experiment 6.3. Walker et al. showed saturation from about 200 dialogues onwards, but these 200 dialogues only reflected three instead of ten different system versions. Finally, several of the parameters used in the original PARADISE version only have limited predictive power for experiment 6.3, e.g. the # BARGE-INS, # ASR REJECTIONS and # HELP REQUESTS parameters, see Section 6.2.1. It can be expected that a linear regression analysis on parameters which are only different from zero in a few cases will not lead to an optimally fitting curve.

The interaction parameters and user judgments which form the model input have been collected with different system versions. In order to capture the resulting differences in perceived quality, it is possible to build separate prediction models for each system configuration. In this way, model functions for different system versions can be compared, as well as the amount of variance which is covered in each case. Table 6.43 shows models derived for each of the ten system versions of experiment 6.3, as well as the overall model derived for all system versions, using set 1 and […] as input parameters. Except for configurations 6 and 7, where the # BARGE-INS parameter is constantly zero, all models include the same input parameters. It turns out that the individual models attribute different degrees of importance (coefficient values) to each input parameter. Unfortunately, the coefficient values cannot easily be interpreted with respect to the specific system configuration. The speech-input-related parameter IC does not show a stronger effect if ASR performance decreases (configurations 6 to 10), nor does the extensive use of TTS have an interpretable effect on the prediction function. The amount of variance covered by the models also differs significantly between the system configurations. Apparently, the system configuration has a strong influence on how and how well a prediction model is able to estimate parameters related to user satisfaction.
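A per-configuration analysis of this kind can be sketched with a single predictor. The field names and all numbers below are invented for illustration, and the one-predictor fit is a simplification of the multivariate models in Table 6.43. The covered variance is computed as the squared Pearson correlation, which equals R² for a least-squares line.

```python
from collections import defaultdict


def r_squared_simple(x, y):
    """R^2 of a least-squares line y = a + b*x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return None  # constant predictor or target: no fit is defined
    return (sxy * sxy) / (sxx * syy)


def r_squared_per_configuration(rows, predictor, target):
    """Group the dialogues by configuration and fit each group separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["config"]].append(row)
    return {
        config: r_squared_simple([r[predictor] for r in group],
                                 [r[target] for r in group])
        for config, group in groups.items()
    }


# Made-up dialogues: barge-ins vary in configuration 1 and are always zero in 6.
rows = [
    {"config": 1, "barge_ins": 0, "satisfaction": 5.0},
    {"config": 1, "barge_ins": 1, "satisfaction": 4.0},
    {"config": 1, "barge_ins": 2, "satisfaction": 4.0},
    {"config": 1, "barge_ins": 3, "satisfaction": 2.0},
    {"config": 6, "barge_ins": 0, "satisfaction": 4.0},
    {"config": 6, "barge_ins": 0, "satisfaction": 2.0},
    {"config": 6, "barge_ins": 0, "satisfaction": 3.0},
]
coverage = r_squared_per_configuration(rows, "barge_ins", "satisfaction")
```

The `None` result for configuration 6 illustrates why a parameter that is constantly zero cannot enter the model for that configuration.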

The same analysis has been carried out for the augmented set of input parameters (set 2 and […]). The results are given in Table 6.44. Once again, the amount of covered variance differs significantly between the system configurations. Some of the configurations for which set 1 fails to provide an adequate model basis (e.g. configuration 2) can be well covered by the augmented set 2. Input parameters which are frequently included in the prediction function are those related to dialogue cooperativity (CA:AP with a positive sign, CA:IA with a negative sign), communication efficiency (STD with a positive sign, # TURNS with a negative sign), task success ([…] with a positive sign), and meta-communication handling (SCR with a positive sign). The contradicting tendencies for the communication-efficiency-related parameters have already been discussed above. Interestingly, speech-input-related parameters are also included in the performance functions, but partly in an opposite sense: UA with a positive sign, […] with a positive sign, PA:CO with a negative sign, and PA:PA with a positive sign. No explanation for this finding can be given so far. In conclusion, the regression model functions proved to be highly dependent on the system configuration under test. Thus, generalizability of model estimations, as reported in Section 6.3.1.2, seems to be very limited for the described experiment. The large differences in the system configurations of experiment 6.3 may be responsible for this finding. Although the systems described by Walker et al. (2000a) differ with respect to their functionality, it is possible that the underlying components and their performance are very similar. Further cross-laboratory experiments are necessary to thoroughly test how generic quality prediction models are.

In the case that system characteristics are known beforehand (which is normally true for system developers), this information can be included in the input parameter set. Because the regression analysis is not able to handle nominally scaled variables with more than two distinct values, the system information has to be coded beforehand. Five coding variables were used for this purpose:

conf_type: 0 for no confirmation, 1 for explicit confirmation.

rec_rate: Target recognition rate in percent (already given on an ordinal scale).

voc_m: 1 for natural male voice uttering the fixed system turns, 0 otherwise.

voc_s: 1 for synthetic male voice uttering the fixed and variable system […]

[…] is only increased by the system information; other parameters of set 2 remain unchanged.
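Coding a nominal variable into binary indicators, as done for the voice variables, can be sketched like this. The voice labels and the extra "ref" level are hypothetical: the labels only mirror the voc_m, voc_s and voc_f variable names, and what each indicator stands for in detail is not fully legible in this copy.

```python
def dummy_code(value, levels, prefix):
    """Return one 0/1 indicator per listed level.

    A value that matches none of the levels (the reference level) is coded
    as all zeros, so k + 1 categories need only k indicator variables.
    """
    return {f"{prefix}_{level}": int(value == level) for level in levels}


# Hypothetical voice labels; "ref" is an invented reference level.
voices = ["m", "s", "f", "ref"]
coded = [dummy_code(v, levels=["m", "s", "f"], prefix="voc") for v in voices]
```

Each indicator has only two distinct values, which is the form a linear regression can handle.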

The influence of individual parameters coding the system-specific information is depicted in Table 6.46, for different target parameters. In all cases, the most important system information seems to be coded in the voc_s and voc_f parameters. As has been observed in the analyses of Section 6.2.5.2, the speech output component seems to be the one with the highest impact on overall system quality and user satisfaction. However, speech-output-related information is not covered in any of the interaction parameters. Thus, the increase in overall model coverage can be explained by the new aspect which is introduced with the additional input parameters. In most cases, the voc_s parameter carries a negative coefficient, showing that synthetic speech leads to lower user satisfaction scores. In only a few cases the rec_rate parameter has a coefficient with a value higher than 0.1 (always with a positive sign). Apparently, the recognition rate does not have a direct impact on user satisfaction. This finding is congruent with the ones made in Section 6.2.5.1. The conf_type parameter shows coefficients with positive and negative signs, indicating that there is no clear preference with respect to the confirmation strategy.

6.3.3 Hierarchical Quality Prediction Models

Following the idea of the PARADISE model, the regression analyses carried out so far aim at predicting high-level quality aspects like overall user satisfaction. The target values for these aspects were either chosen according to the classification given by the QoS taxonomy (B0 and B23), or calculated as a simple arithmetic mean over different quality aspects. In this way, no distinction is made between the quality aspects and categories of the QoS taxonomy, and their interrelationships are not taken into account. Even worse, different aspects like perceived system understanding, TTS intelligibility, dialogue conciseness, or acceptability are explicitly mixed in the […] variable.

In order to better incorporate knowledge about quality aspects, related interaction parameters as well as interrelationships between aspects, new modelling approaches are presented in the following which are based on the QoS taxonomy. In a first step, the taxonomy serves to define target variables for individual quality aspects and categories. The targets are the arithmetic mean values over all judgments belonging to the respective aspect or category (see Figure 6.1), namely the judgments on part B and C questions obtained in experiment 6.3. Tables 6.47 and 6.48 show the definitions of target variables (noted […] for the target) for each quality aspect and category. Input parameters to the following models consist of the set 2 interaction parameters, augmented by the four interaction parameters (not user judgments!) on task success, namely […] and […]. This augmented set will be called set 3 in the following discussion.
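The target definition, an arithmetic mean over all judgments assigned to an aspect or category, can be sketched as follows. The aspect name and the question labels are hypothetical; the real assignment is the one given in Tables 6.47 and 6.48.

```python
def aspect_target(judgments, question_ids):
    """Arithmetic mean over the available judgments assigned to one aspect."""
    values = [judgments[q] for q in question_ids
              if judgments.get(q) is not None]
    return sum(values) / len(values) if values else None


# Hypothetical assignment of questionnaire items to one quality aspect.
assignment = {"aspect_x": ["B5", "B6", "B7"]}
# Made-up judgments of one dialogue; B7 was not answered.
judgments = {"B5": 4.0, "B6": 2.0, "B7": None}
targets = {name: aspect_target(judgments, qs)
           for name, qs in assignment.items()}
```

Skipping unanswered questions is a choice made for this sketch; the text does not say how missing judgments enter the mean.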

In a first approach, the different quality categories listed in Table 6.48 are modelled on the basis of the complete set 3 data. A standard multivariate regression analysis with stepwise inclusion of parameters and replacement by the mean for missing values is used for this purpose. The resulting models are shown in Table 6.49. It can be seen that for several quality categories the amount of covered variance is similar to or even exceeds the one observed for the global quality predictors, see the first four rows of Table 6.38 (results for pure interaction parameters as the input). The best prediction results are obtained for communication efficiency, dialogue cooperativity, comfort, and task efficiency. Usability, service efficiency, utility and acceptability resist a meaningful prediction, probably because they are only addressed by the judgments on part C questions, which do not reflect the characteristics of the individual system configurations.
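The two treatments of missing values used in this chapter, replacement by the mean here and listwise exclusion for the set 1 data, can be contrasted in a few lines. The row layout and the numbers are made up for the example.

```python
def replace_missing_by_mean(column):
    """Fill missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing observed, nothing to fill in
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def listwise_exclusion(rows):
    """Drop every case that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row)]


# Toy cases: (number of turns, some rate, judgment); one rate is missing.
rows = [(10, 0.25, 4.0), (14, None, 3.0), (18, 0.75, 2.0)]
columns = [replace_missing_by_mean(col) for col in zip(*rows)]
imputed_rows = list(zip(*columns))
complete_rows = listwise_exclusion(rows)
```

Mean replacement keeps all three cases, whereas listwise exclusion keeps only two. With many input parameters, each with a few gaps, exclusion can remove a large share of the data.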


The predictors chosen by the algorithm give an indication of the interaction parameters which are relevant for each quality category. Independent of the parameter definition, dialogue cooperativity receives the strongest contribution from the CA:AP parameter. This shows that contextual appropriateness is indeed the dominating dimension of cooperativity. Other relevant predictors are the system's meta-communication capability (SCR) and task success ([…]). Dialogue symmetry also seems to be dominated by the appropriateness of system utterances. The significant predictors are very similar to the ones observed for cooperativity. Apparently, there is a close relationship between these two categories, which can partly be explained by the considerable overlap of questions in both categories, see Table 6.47. The speech input/output quality category cannot be well predicted. This is mainly due to the absence of speech-output-related interaction parameters. Only the speech input aspect of the category is covered by the interaction parameters of set 3. However, these parameters were not identified as relevant predictors by the algorithm. This finding underlines the fact that information may be lost when different quality aspects are mixed as a target variable of the regression algorithm.

Communication efficiency is the category which can be predicted best from the experimental data. As may be expected, the most important predictors are WPST (positive sign), # TURNS (negative sign), STD (negative sign), and DD (positive sign). The apparent contradiction in the signs has already been observed above. It seems that the users prefer to have few turns, but that the system turns should be as informative as possible (high number of words), even if this increases the overall dialogue duration. The comfort experienced by the user seems to be largely dominated by the STD parameter. However, part of this effect is offset by the WPST parameter, which influences predicted comfort in the opposite direction, with a positive sign. Further influencing factors on comfort are SRD, which is correlated to long user utterances (the more the user is able to speak, the higher the comfort), as well as CA:AP (appropriate system utterances increase comfort). Task efficiency can be predicted to a similar degree as comfort. The most important contributors are UCR, CA:AP, and […]. Interestingly, the […] parameter gives a negative contribution. As observed in the last section, κ coefficients do not seem to be reliable indicators of perceived task success. Apart from the user satisfaction category, which can be predicted to a similar degree and with similar parameters as observed in Table 6.40, the other target variables do not allow for satisfactory predictions.


In the literature, only a few examples of predicting individual quality aspects are documented. In the frame of the EURESCOM project MIVA, Johnston (2000) described a simple regression analysis for predicting single quality dimensions from instrumentally measurable interaction parameters. He found relatively good simple predictors for ease of use, learnability, pleasantness, effort required to use the service, correctness of the provided information, and perceived duration. However, no […] values have been calculated, and the number of input interaction parameters is very low. Thus, it has to be expected that the derived models are relatively specific to the system they have been developed for.


The models in Table 6.49 show that the interaction parameters assigned beforehand to a specific quality aspect or category (see Figure 6.1) are not always the most relevant predictors. Nevertheless, an approach will be presented in the following discussion to include some of the knowledge contained in the QoS taxonomy in a regression model. A 3-layer hierarchical structure, reflecting the quality aspects, quality categories, and the global target variables, is used in an initial approach. This structure is depicted in Figure 6.17. On the first layer, quality aspect targets (see Table 6.47) are predicted on the basis of the previously assigned interaction parameters (see Tables 3.1 and 3.2). On the second layer, quality category targets (see Table 6.48) are predicted on the basis of the predictions from layer 1 (indicated […] for category […]), and in one case (contextual appropriateness) amended by additional interaction parameters which have been directly assigned to this quality category. On the third layer, the 5 target variables used in the last section are predicted on the basis of the predictions for layer 2. All regression models are determined by forced inclusion of all mentioned input parameters, and by replacing missing values by the respective means. Figure 6.17 shows the input and output parameters of each layer and for each target, and the resulting amount of covered variance for each prediction. It should be noted that only those quality aspects and categories for which interaction parameters have been assigned can be modelled in this way.
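The layered scheme can be sketched with a small least-squares routine. Everything below is an illustration under assumptions: the two aspects, the single category, the assignment of parameters to aspects, and all numbers are invented, and the solver is a plain normal-equations implementation rather than the statistics package used for the book's analyses.

```python
def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ...] from the normal equations.

    Assumes the predictors are not collinear; otherwise a pivot becomes zero.
    """
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented system (A^T A | A^T y).
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(k):
            if r != col:
                f = m[r][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][k] for i in range(k)]


def predict(coefs, X):
    return [coefs[0] + sum(b * v for b, v in zip(coefs[1:], x)) for x in X]


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def fit_and_predict(X, y):
    """Forced inclusion: every column of X enters the model."""
    return predict(fit_ols(X, y), X)


# Made-up data for six dialogues.
turns = [8, 10, 12, 14, 16, 18]
wpst = [12, 9, 14, 10, 15, 11]
aspect_a = [4.5, 4.0, 3.5, 3.5, 2.5, 2.0]   # aspect assumed to depend on turns
aspect_b = [3.5, 2.5, 4.0, 3.0, 4.5, 3.0]   # aspect assumed to depend on WPST
category = [4.0, 3.0, 4.0, 3.0, 3.5, 2.5]
overall = [4.5, 3.0, 4.0, 3.5, 3.0, 2.0]

# Layer 1: aspects from their assigned interaction parameters.
pred_a = fit_and_predict([[t] for t in turns], aspect_a)
pred_b = fit_and_predict([[w] for w in wpst], aspect_b)
# Layer 2: the category from the layer 1 predictions.
pred_cat = fit_and_predict(list(zip(pred_a, pred_b)), category)
# Layer 3: the global target from the layer 2 prediction.
r2_hier = r_squared(overall, fit_and_predict([[c] for c in pred_cat], overall))
# Direct model on the same interaction parameters, for comparison.
r2_direct = r_squared(overall, fit_and_predict(list(zip(turns, wpst)), overall))
```

Because every layer is linear, the final prediction is itself a linear function of the original interaction parameters. On the training data, such a chain therefore cannot cover more variance than a direct regression on the same inputs, which may help to put the comparison with the direct models into perspective.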

It turns out that a meaningful prediction of individual quality aspects is only possible in rare cases. Reasonable values are observed for speech input […] cannot be predicted on the basis of the assigned interaction parameters. One reason will be the limited number of parameters which are attributed in some cases. However, the amount of covered variance is not strictly related to the number of input parameters, as the predictions for conciseness and smoothness show. When the predicted values of the first layer are taken as an input to predict quality categories on layer 2, the prediction accuracy is not completely satisfactory. All values are far below a direct prediction on the basis of all set 3 parameters, cf. Table 6.49. Only communication efficiency can be predicted with an R² value of 0.323. The reason for the comparatively low amount of covered variance will be linked to the restricted number of input parameters for each category.

On the highest level (layer 3), prediction accuracy on the basis of layer 2 predictions turns out to be lower than for the direct modelling in most cases, compare Figure 6.17 and Table 6.38. It should be noted that the hierarchical model includes all input parameters by force, according to the hierarchical structure. If a comparable forced-inclusion approach is chosen for the models of Table 6.40, the amount of covered variance increases to […] for […], […] for B0, […] for B23, […] for MEAN(B0, B23), and […] for MEAN(B). These values show that the amount of variance which can be covered by a regression model strongly depends on the choice of available input parameters. They also show that a simple hierarchical model structure, as was used here, does not lead to better results for predicting global quality aspects.

Figure 6.17. 3-layer hierarchical multivariate regression model for experiment 6.3 data. Input parameters are indicated in the black boxes. Missing cases are replaced by the mean value; forced inclusion of all input parameters. # UQ: # USER QUESTIONS; # BI: # BARGE-INS; # SEM: # SYSTEM ERROR MESSAGES; # SQ: # SYSTEM QUESTIONS.

As an alternative, a 2-layer hierarchical structure has been chosen, see Figure 6.18. In this structure, the first layer for predicting quality aspects is skipped, due to the low prediction accuracy (low amount of covered variance) which has been observed for most quality aspect targets. For predicting communication efficiency, comfort and task efficiency, the predictions for cooperativity, dialogue symmetry and speech input/output quality are taken as input variables, together with additional interaction parameters which have been assigned to these categories. In this way, the interdependence of quality categories displayed in the QoS taxonomy is reflected in the model structure. On the basis of the predictions for all six quality categories, estimations of global quality aspects are calculated, as in the previous example.

A comparison between the prediction results of Figures 6.17 and 6.18 shows that the amount of variance which is covered increases for all six predicted quality categories. The increase is most remarkable for the categories in the lower part of the QoS taxonomy, namely communication efficiency, comfort, and task efficiency. Apparently, the interrelations indicated in the taxonomy have to be taken into account when perceptive quality dimensions are to be predicted. Still, the overall amount of covered variance is lower than the one obtained for direct estimation on the basis of all set 3 parameters, see Table 6.49. It is also slightly lower when predicting global quality aspects like user satisfaction, e.g. in comparison to Table 6.40 (except for MEAN(B)).

The reasons for this finding may be threefold: (1) either incorrect target values (here: mean over all questions related to a quality aspect or category) were chosen; or (2) incorrect input parameters for predicting the target value were chosen; or (3) the aspects or categories used in the taxonomy are not adequate for quality prediction. Indeed, the choice of input parameters has proven to have a significant impact on quality prediction results. It is difficult to decide whether the quality categories defined in the taxonomy are adequate for a prediction, and whether the respective target variables are adequate representatives for each category. The example of speech output quality shows that quality aspects which are not at all covered by instrumentally or expert-derived interaction parameters may nevertheless be very important for the user's quality perception. Further investigations will be necessary to choose optimum target variables. Such variables will have to represent a compromise between the informative value for the system developer, the types of questions which can be answered by the user, and the interaction parameters available for model input.

Figure 6.18. 2-layer hierarchical multivariate regression model for experiment 6.3 data. Input parameters are indicated in the black boxes. Missing cases are replaced by the mean value; forced inclusion of all input parameters. # UQ: # USER QUESTIONS; # BI: # BARGE-INS; # SEM: # SYSTEM ERROR MESSAGES; # SQ: # SYSTEM QUESTIONS.

For the models calculated in Section 6.3.2, the amount of covered variance was highly dependent on the system configuration. As an example, the 2-layer hierarchical model has been calculated separately for configurations 1 and 2 of experiment 6.3, see Figures 6.19 and 6.20. It can be seen that the values still differ considerably between the two configurations, depending on the prediction target. In both cases, good variance coverage is reached for communication efficiency, task efficiency, and MEAN(B). Communication efficiency in particular can be predicted in a nearly ideal way. It should however be noted that the number of input parameters for this category is very high, and the amount of target data is very restricted (20 dialogues for each connection). Thus, the optimization problem may be an easy one, even for linear regression models.

6.3.4 Conclusion of Modelling Approaches

The described modelling approaches perform a simple transformation of instrumentally or expert-derived interaction parameters into mean user judgments with respect to specific quality dimensions, or into global quality aspects like user satisfaction. The amount of variance which can be covered in most cases does not exceed 50%. Consequently, there seems to be a significant number of contributors to perceived quality which are not covered by the interaction parameters. For some quality aspects, like speech output quality, this fact is obvious. However, other aspects which seem to be well captured by the respective interaction parameters, like perceived system understanding, are still quite difficult to predict. Thus, there is strong evidence that direct judgments from the users are still the only reliable way of collecting information about perceived quality. A description via interaction parameters can only be an additional source of information, e.g. in the system optimization phase.

Because the traditional modelling approaches like PARADISE do not distinguish between different quality dimensions, it was hoped that the incorporation of knowledge about quality aspects into the model structure would lead to better or more generic results. At least the first target could not be reached by the proposed (admittedly simple) hierarchical structures. Although the 2-layer model which reflects the interrelationships between quality categories shows some improvements with respect to the 3-layer model, both approaches still do not provide any advantage in prediction accuracy with respect to a simple straightforward approach. An increase in genericness is difficult to estimate, namely on the basis of experimental data which has been collected with a single system. All models, hierarchical as well as straightforward PARADISE-style ones, proved to be highly influenced by the system configuration. This will be a limiting factor of model usability: In order to estimate which level of quality can be reached with an improved system version, quality prediction models should at least be able to extrapolate to higher recognition rates, other speech […]

Figure 6.19. 2-layer hierarchical multivariate regression model for experiment 6.3 data, system configuration 1 of Table 6.2. Input parameters are indicated in the black boxes. Missing cases are replaced by the mean value; forced inclusion of all input parameters. # UQ: # USER QUESTIONS; # BI: # BARGE-INS; # SEM: # SYSTEM ERROR MESSAGES; # SQ: # SYSTEM QUESTIONS.

Posted: 07/08/2014, 21:20
