What Makes Evaluation Hard?

Harry Tennant
PO Box 225621, M/S 371
Texas Instruments, Inc.
Dallas, Texas 75265

1.0 THE GOAL OF EVALUATION

Ideally, an evaluation technique should describe an algorithm that an evaluator could use that would result in a score or a vector of scores that depict the level of performance of the natural language system under test. The scores should mirror the subjective evaluation of the system that a qualified judge would make. The evaluation technique should yield consistent scores for multiple tests of one system, and the scores for several systems should serve as a means for comparison among systems. Unfortunately, there is no such evaluation technique for natural language understanding systems. In the following sections, I will attempt to highlight some of the difficulties.

2.0 PERSPECTIVE OF THE EVALUATION

The first problem is to determine who the "qualified judge" is whose judgements are to be modeled by the evaluation. One view is that he be an expert in language understanding. As such, his primary interest would be in the linguistic and conceptual coverage of the system. He may attach the greatest weight to the coverage of constructions and concepts which he knows to be difficult to include in a computer program.

Another view of the judge is that he is a user of the system. His primary interest is in whether the system can understand him well enough to satisfy his needs. This judge will put greatest weight on the system's ability to handle his most critical linguistic and conceptual requirements: those used most frequently and those which occur infrequently but must be satisfied. This judge will also want to compare the natural language system to other technologies. Furthermore, he may attach strong weight to systems which can be learned quickly, or whose use may be easily remembered, or which take time to learn but provide the user with considerable power once learned.

The characteristics of the judge are not an impediment to evaluation, but if the characteristics are not clearly understood, the meaning of the results will be confused.

3.0 TESTING WITH USERS

3.1 Who Are The Users?

It is surprising to think that natural language research has existed as long as it has and that the statement of the goals is still as vague as it is. In particular, little commitment is made on what kind of user a natural language understanding system is intended to serve. Likewise, little is specified about what the users know about the domain and the language understanding system. The taxonomy below is presented as an example of user characteristics based on what the user knows about the domain and the system.

Classes of Users of database query systems

V    Familiar with the database and its software
IV   Familiar with the database and the interaction language
III  Familiar with the contents of the database
II   Familiar with the domain of application
I    Passing knowledge of the domain of application

Of course, as users gain experience with a system, they will continually attempt to adapt to its quirks. If the purpose of the evaluation is to demonstrate that the natural language understanding system is merely useable, adaptation presents no problem. However, if natural language is being used to allow the user to express himself in his accustomed manner, adaptation does become important. Again, the goals of natural language systems have been left vague. Are natural language systems to be 1) immediately useful, 2) easily learned, 3) highly expressive or 4) readily remembered through periods of disuse? The evaluation should attempt to test for these goals specifically, and must control for factors such as adaptation.

What a user knows (either through instruction or experience) about the domain, the database and the interaction language has a significant effect on how he will express himself. Database query systems usually expect a certain level of use of domain or database specific jargon, and familiarity with constructions that are characteristic of the domain. A system may perform well for class IV users with queries like,

1) What are the NORMU for AAFs in 71 by month?

However, it may fare poorly for class I users with queries like,

2) I need to find the length of time that the attack planes could not be flown in 1971 because they were undergoing maintenance. Exclude all preventative maintenance, and give me totals for each plane for each month.

3.2 What Does Success Rate Mean?

A common method for generating data against which to test a system is to have users use it, then calculate how successful the system was at satisfying user needs. If the evaluation attempts to calculate the fraction of questions that the system understood, it is important to characterize how difficult the queries were to understand. For example, twelve queries of the form,

3) How many hours of down time did plane 3 have in January, 1971
4) How many hours of down time did plane 3 have in February, 1971

will help the success rate more than one query like,

5) How many hours of down time did plane 3 have in each month of 1971,

However, one query like 5 returns as much information as the other twelve. In testing PLANES [Tennant, 1981], the users whose questions were understood with the highest rates of success actually had less success at solving the problems they were trying to solve. They spent much of their time asking many easy, repetitive questions and so did not have time to attempt some of the problems. Other users who asked more compact questions had plenty of time to hammer away at the queries that the system had the greatest difficulty understanding.
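
The arithmetic behind this observation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the PLANES tests: the Query record, the answer_units weighting, and the two failed "hard" queries are hypothetical and were introduced here to make the comparison concrete.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Query:
    text: str
    understood: bool
    # Hypothetical measure of how much information the query asks for,
    # e.g. the number of monthly totals requested.
    answer_units: int = 1


def success_rate(queries):
    """Fraction of queries the system understood (the naive measure)."""
    return Fraction(sum(q.understood for q in queries), len(queries))


def weighted_success_rate(queries):
    """Fraction of requested information the system delivered."""
    total = sum(q.answer_units for q in queries)
    delivered = sum(q.answer_units for q in queries if q.understood)
    return Fraction(delivered, total)


MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Two invented queries that the system fails on for both users.
hard = [Query("hypothetical hard query 1", False),
        Query("hypothetical hard query 2", False)]

# User A asks twelve easy single-month queries (form 3 and 4 above).
user_a = [Query(f"How many hours of down time did plane 3 have in {m}, 1971", True)
          for m in MONTHS] + hard

# User B asks one compact query (form 5 above) covering all twelve months.
user_b = [Query("How many hours of down time did plane 3 have in each month of 1971",
                True, answer_units=12)] + hard

print(success_rate(user_a), success_rate(user_b))
print(weighted_success_rate(user_a), weighted_success_rate(user_b))
```

Under these invented numbers the naive success rate is 6/7 for user A and 1/3 for user B, yet both received the same information and both failed on the same two queries. Weighting by the information requested gives 6/7 for each, which is one way, though certainly not the only one, of characterizing query difficulty before comparing rates.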

Another difficulty with success rate measurement is the characteristics of the problems given to users compared to the kind of problems anticipated by the system. I once asked a set of users to write some problems for other users to attempt to solve using PLANES. The problem authors were familiar with the general domain of discourse of PLANES, but did not have any experience using it. The problems they devised were reasonable given the domain, but were largely beyond the scope of PLANES' conceptual coverage. Users had very low success rates when attempting to solve these problems. In contrast, problems that I had devised, fully aware of PLANES' areas of most complete coverage (and devised to be easy for PLANES), yielded much higher success rates. Small wonder. The point is that unless the match between the problems and the system's conceptual coverage can be characterized, success rates mean little.

4.0 TAXONOMY OF CAPABILITIES

Testing a natural language system with users is an engineering approach. Another approach is to compare the elements that are known to be involved in understanding language against the capabilities of the system. This has been called "sharpshooting" by some of the implementers of natural language systems. An evaluator probes the system under test to find conditions under which it fails. The evaluator should base his probes on a taxonomy of phenomena that are relevant to language understanding. A standard taxonomy could be developed for doing evaluations.

Our knowledge of language is incomplete at best. Any taxonomy is bound to generate disagreement. However, it seems that most of the disagreements describing language are not over what the phenomena of language are, but over how we might best understand and model those phenomena. The taxonomy will become quite large, but this is only representative of the fact that understanding language is a very complex process. The taxonomy approach faces the problem of complexity directly.

The taxonomy approach to evaluation forces examination of the broad range of issues of natural language processing. It provides a relatively objective means for assessing the full range of capabilities of a natural language understanding system. It also avoids the problems listed above inherent in evaluation through user testing.

It does, however, have some unpleasant attributes. First, it does not provide an easy basis for comparison of systems. Ideally an evaluation would produce a metric to allow one to say "system A is better than system B". Appealing as it is, natural language understanding is probably too complex for a simple metric to be meaningful. Second, the taxonomy approach does not provide a means for comparison of natural language understanding to other technologies. That comparison can be done rather well with user testing, however.

Third, the taxonomy approach ignores the relative importance of phenomena and the interaction between phenomena and domains of discourse. In response to this difficulty, an evaluation should include the analysis of a simulated natural language system. The simulated system would consist of a human interpreter who acts as an intermediary between users and the programs or data they are trying to use. Dialogs are recorded, then those dialogs are analyzed in light of the taxonomies of features. In this way, the capabilities of the system can be compared to the needs of the users. The relative importance of phenomena can be determined this way. Furthermore, users' language can be studied without them adapting to the system's limitations.
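
One way such an analysis might be organized is sketched below in Python. It is a rough illustration only, assuming that each recorded utterance has already been hand-annotated with the taxonomy entries it exhibits. The phenomenon labels, the sample dialogs, and the function names are hypothetical and are not drawn from any actual study.

```python
from collections import Counter
from fractions import Fraction


def phenomenon_importance(annotated_dialogs):
    """Relative frequency of each phenomenon across all recorded utterances.

    annotated_dialogs: list of dialogs; each dialog is a list of utterances;
    each utterance is the set of taxonomy labels an annotator assigned to it.
    """
    counts = Counter(label
                     for dialog in annotated_dialogs
                     for utterance in dialog
                     for label in utterance)
    total = sum(counts.values())
    return {label: Fraction(n, total) for label, n in counts.items()}


def weighted_coverage(importance, handled):
    """Share of observed phenomenon occurrences that the system can handle."""
    return sum(w for label, w in importance.items() if label in handled)


def unmet_needs(importance, handled):
    """Phenomena users needed but the system lacks, most frequent first."""
    missing = [(label, w) for label, w in importance.items() if label not in handled]
    return sorted(missing, key=lambda item: (-item[1], item[0]))


# Hypothetical annotations of two dialogs held through a human interpreter.
dialogs = [
    [{"ellipsis", "pronoun reference"}, {"conjunction"}],
    [{"pronoun reference"}, {"pronoun reference", "comparative"}],
]

# Hypothetical capabilities of the system under test.
system_handles = {"pronoun reference", "conjunction"}

importance = phenomenon_importance(dialogs)
print(weighted_coverage(importance, system_handles))
print(unmet_needs(importance, system_handles))
```

With these invented annotations, pronoun reference accounts for half of all observed phenomena, and the hypothetical system covers two thirds of what users actually needed. The frequency weighting is a deliberately simple choice. It does not capture needs that occur infrequently but must be satisfied, which an evaluator would have to weight separately.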

The taxonomy of phenomena mentioned above is intended to include both linguistic phenomena and concepts. The linguistic phenomena relate to how ideas may be understood. There is an extensive literature on this. The concepts are the ideas which must be understood. This is much more extensive, and much more domain specific. Work in knowledge representation is partially focused on learning what concepts need to be represented, then attempting to represent them. Consequently, there is a taxonomy of concepts in the knowledge representation literature.

Reference

Tennant, Harry. Evaluation of Natural Language Processors. Ph.D. Thesis, University of Illinois, Urbana, Illinois, 1981.