What Makes Evaluation Hard?

Harry Tennant
PO Box 225621, M/S 371
Texas Instruments, Inc.
Dallas, Texas 75265

1.0 THE GOAL OF EVALUATION

Ideally, an evaluation technique should describe an algorithm that an evaluator could use that would result in a score or a vector of scores that depict the level of performance of the natural language system under test. The scores should mirror the subjective evaluation of the system that a qualified judge would make. The evaluation technique should yield consistent scores for multiple tests of one system, and the scores for several systems should serve as a means for comparison among systems. Unfortunately, there is no such evaluation technique for natural language understanding systems. In the following sections, I will attempt to highlight some of the difficulties.

2.0 PERSPECTIVE OF THE EVALUATION

The first problem is to determine who the "qualified judge" is whose judgements are to be modeled by the evaluation. One view is that he be an expert in language understanding. As such, his primary interest would be in the linguistic and conceptual coverage of the system. He may attach the greatest weight to the coverage of constructions and concepts which he knows to be difficult to include in a computer program.

Another view of the judge is that he is a user of the system. His primary interest is in whether the system can understand him well enough to satisfy his needs. This judge will put greatest weight on the system's ability to handle his most critical linguistic and conceptual requirements: those used most frequently and those which occur infrequently but must be satisfied. This judge will also want to compare the natural language system to other technologies. Furthermore, he may attach strong weight to systems which can be learned quickly, or whose use may be easily remembered, or which take time to learn but provide the user with considerable power once learned.

The characteristics of the judge are not an impediment to evaluation, but if the characteristics are not clearly understood, the meaning of the results will be confused.

3.0 TESTING WITH USERS

3.1 Who Are The Users?

It is surprising to think that natural language research has existed as long as it has and that the statement of the goals is still as vague as it is. In particular, little commitment is made on what kind of user a natural language understanding system is intended to serve. Likewise, little is specified about what the users know about the domain and the language understanding system. The taxonomy below is presented as an example of user characteristics based on what the user knows about the domain and the system.

Classes of users of database query systems

  V    Familiar with the database and its software
  IV   Familiar with the database and the interaction language
  III  Familiar with the contents of the database
  II   Familiar with the domain of application
  I    Passing knowledge of the domain of application

Of course, as users gain experience with a system, they will continually attempt to adapt to its quirks. If the purpose of the evaluation is to demonstrate that the natural language understanding system is merely useable, adaptation presents no problem. However, if natural language is being used to allow the user to express himself in his accustomed manner, adaptation does become important. Again, the goals of natural language systems have been left vague. Are natural language systems to be 1) immediately useful, 2) easily learned, 3) highly expressive, or 4) readily remembered through periods of disuse? The evaluation should attempt to test for these goals specifically, and must control for factors such as adaptation.

What a user knows (either through instruction or experience) about the domain, the database and the interaction language has a significant effect on how he will express himself. Database query systems usually expect a certain level of use of domain or database specific jargon, and familiarity with constructions that are characteristic of the domain. A system may perform well for class IV users with queries like,

1) What are the NORMU for AAFs in 71 by month?

However, it may fare poorly for class I users with queries like,

2) I need to find the length of time that the attack planes could not be flown in 1971 because they were undergoing maintenance. Exclude all preventative maintenance, and give me totals for each plane for each month.
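One way to keep user class from being confused with system performance is to report the success rate separately for each class. The following is only a minimal sketch: the `UserClass` enum mirrors the taxonomy above, while the record format, the `success_rate_by_class` helper and the sample outcomes are hypothetical and are not data from any real system.

```python
from collections import defaultdict
from enum import IntEnum


class UserClass(IntEnum):
    """User classes from the taxonomy above (I = least familiar, V = most)."""
    I = 1    # passing knowledge of the domain of application
    II = 2   # familiar with the domain of application
    III = 3  # familiar with the contents of the database
    IV = 4   # familiar with the database and the interaction language
    V = 5    # familiar with the database and its software


def success_rate_by_class(records):
    """records: iterable of (UserClass, understood: bool) pairs."""
    totals = defaultdict(lambda: [0, 0])  # class -> [understood, asked]
    for user_class, understood in records:
        totals[user_class][0] += int(understood)
        totals[user_class][1] += 1
    return {cls: ok / asked for cls, (ok, asked) in totals.items()}


# Hypothetical outcomes, for illustration only.
records = [
    (UserClass.IV, True), (UserClass.IV, True),
    (UserClass.IV, True), (UserClass.IV, False),
    (UserClass.I, True), (UserClass.I, False),
    (UserClass.I, False), (UserClass.I, False),
]
rates = success_rate_by_class(records)
print(rates[UserClass.IV], rates[UserClass.I])  # 0.75 0.25
```

A pooled rate over these records would be 0.5, which describes neither group of users.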

3.2 What Does Success Rate Mean?

A common method for generating data against which to test a system is to have users use it, then calculate how successful the system was at satisfying user needs. If the evaluation attempts to calculate the fraction of questions that the system understood, it is important to characterize how difficult the queries were to understand. For example, twelve queries of the form,

3) How many hours of down time did plane 3 have in January, 1971?

4) How many hours of down time did plane 3 have in February, 1971?

will help the success rate more than one query like,

5) How many hours of down time did plane 3 have in each month of 1971?

However, one query like 5 returns as much information as the other twelve. In testing PLANES [Tennant, 1981], the users whose questions were understood with the highest rates of success actually had less success at solving the problems they were trying to solve. They spent much of their time asking many easy, repetitive questions and so did not have time to attempt some of the problems. Other users who asked more compact questions had plenty of time to hammer away at the queries that the system had the greatest difficulty understanding.
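The arithmetic behind this effect can be sketched as follows. The `Query` record, the "answers" count used as a stand-in for information returned, and both users' logs are assumptions made for illustration; they are not measurements from the PLANES tests.

```python
from dataclasses import dataclass


@dataclass
class Query:
    understood: bool  # did the system understand the question?
    answers: int      # data items returned (e.g. monthly totals); hypothetical measure


def success_rate(queries):
    """Fraction of queries that the system understood."""
    return sum(q.understood for q in queries) / len(queries)


def answers_obtained(queries):
    """Total data items returned by the understood queries."""
    return sum(q.answers for q in queries if q.understood)


# User A: twelve queries like 3) and 4), one per month, all understood.
user_a = [Query(understood=True, answers=1) for _ in range(12)]
# User B: one query like 5), understood only on the second phrasing (hypothetical).
user_b = [Query(understood=False, answers=0), Query(understood=True, answers=12)]

for name, qs in (("A", user_a), ("B", user_b)):
    print(name, success_rate(qs), answers_obtained(qs), len(qs))
# A 1.0 12 12
# B 0.5 12 2
```

Under these assumptions user A has twice the success rate of user B, yet both obtain the same twelve monthly totals and user B needs only two queries to do so.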

Another difficulty with success rate measurement is the characteristics of the problems given to users compared to the kind of problems anticipated by the system. I once asked a set of users to write some problems for other users to attempt to solve using PLANES. The problem authors were familiar with the general domain of discourse of PLANES, but did not have any experience using it. The problems they devised were reasonable given the domain, but were largely beyond the scope of PLANES' conceptual coverage. Users had very low success rates when attempting to solve these problems. In contrast, problems that I had devised, fully aware of PLANES' areas of most complete coverage (and devised to be easy for PLANES), yielded much higher success rates. Small wonder. The point is that unless the match between the problems and the system's conceptual coverage can be characterized, success rates mean little.

4.0 TAXONOMY OF CAPABILITIES

Testing a natural language system with users is an engineering approach. Another approach is to compare the elements that are known to be involved in understanding language against the capabilities of the system. This has been called "sharpshooting" by some of the implementers of natural language systems. An evaluator probes the system under test to find conditions under which it fails. The evaluator should base his probes on a taxonomy of phenomena that are relevant to language understanding. A standard taxonomy could be developed for doing evaluations.

Our knowledge of language is incomplete at best. Any taxonomy is bound to generate disagreement. However, it seems that most of the disagreements in describing language are not over what the phenomena of language are, but over how we might best understand and model those phenomena. The taxonomy will become quite large, but this is only representative of the fact that understanding language is a very complex process. The taxonomy approach faces the problem of complexity directly.

The taxonomy approach to evaluation forces examination of the broad range of issues of natural language processing. It provides a relatively objective means for assessing the full range of capabilities of a natural language understanding system. It also avoids the problems listed above inherent in evaluation through user testing.

It does, however, have some unpleasant attributes. First, it does not provide an easy basis for comparison of systems. Ideally an evaluation would produce a metric to allow one to say "system A is better than system B". Appealing as it is, natural language understanding is probably too complex for a simple metric to be meaningful.

Second, the taxonomy approach does not provide a means for comparison of natural language understanding to other technologies. That comparison can be done rather well with user testing, however.
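The first of these points, the lack of a simple ordering between systems, can be illustrated with a small sketch. It assumes, purely for illustration, that the result of a taxonomy-based evaluation is the set of phenomena a system handles; the `compare` function and the phenomenon names are hypothetical.

```python
def compare(caps_a, caps_b):
    """Compare two capability sets; returns 'A', 'B', 'equal' or 'incomparable'."""
    if caps_a == caps_b:
        return "equal"
    if caps_a >= caps_b:   # A handles everything B handles, and more
        return "A"
    if caps_b >= caps_a:   # B handles everything A handles, and more
        return "B"
    return "incomparable"  # each handles something the other does not


# Hypothetical capability sets for two systems.
system_a = {"ellipsis", "conjunction"}
system_b = {"pronoun_reference", "conjunction"}

print(compare(system_a, system_b))             # incomparable
print(compare(system_a | system_b, system_b))  # A
```

Only when one set contains the other does "system A is better than system B" follow directly; otherwise a judgement about the relative importance of the phenomena is needed.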

Third, the taxonomy approach ignores the relative importance of phenomena and the interaction between phenomena and domains of discourse. In response to this difficulty, an evaluation should include the analysis of a simulated natural language system. The simulated system would consist of a human interpreter who acts as an intermediary between users and the programs or data they are trying to use. Dialogs are recorded, then those dialogs are analyzed in light of the taxonomies of features. In this way, the capabilities of the system can be compared to the needs of the users. The relative importance of phenomena can be determined this way. Furthermore, users' language can be studied without them adapting to the system's limitations.
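The dialog analysis described above might be sketched as follows. This assumes each recorded utterance has already been annotated by hand with the taxonomy phenomena it contains; the phenomenon names, the sample dialog and the capability set are all hypothetical.

```python
from collections import Counter

# Hypothetical annotations: each recorded utterance is tagged with taxonomy phenomena.
annotated_dialog = [
    {"pronoun_reference"},
    {"ellipsis", "conjunction"},
    {"pronoun_reference", "comparative"},
    {"ellipsis"},
]
# Hypothetical capabilities of the system under test, e.g. found by probing it.
system_handles = {"pronoun_reference", "conjunction"}

# How often users actually needed each phenomenon.
usage = Counter(p for utterance in annotated_dialog for p in utterance)
total = sum(usage.values())

# Share of observed phenomenon occurrences the system could have handled.
weighted_coverage = sum(n for p, n in usage.items() if p in system_handles) / total
# Phenomena users needed that the system lacks, most frequent first.
unmet = sorted((p for p in usage if p not in system_handles),
               key=lambda p: (-usage[p], p))

print(weighted_coverage)  # 0.5
print(unmet)              # ['ellipsis', 'comparative']
```

Weighting by observed frequency is one simple reading of "relative importance"; it does not capture phenomena that occur rarely but must be handled, which would need a separate weight.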

The taxonomy of phenomena mentioned above is intended to include both linguistic phenomena and concepts. The linguistic phenomena relate to how ideas may be understood. There is an extensive literature on this. The concepts are the ideas which must be understood. This is much more extensive, and much more domain specific. Work in knowledge representation is partially focused on learning what concepts need to be represented, then attempting to represent them. Consequently, there is a taxonomy of representation literature.

Reference

Tennant, Harry. Evaluation of Natural Language Processors. Ph.D. Thesis, University of Illinois, Urbana, Illinois, 1981.
