Thomas E. Ahlswede
Computer Science Department
Illinois Institute of Technology
Chicago, Illinois 60616, USA
ABSTRACT
This paper describes a set of interactive routines that can be used to create, maintain, and update a computer lexicon. The routines are available to the user as a set of commands resembling a simple operating system. The lexicon produced by this system is based on lexical-semantic relations, but is compatible with a variety of other models of lexicon structure. The lexicon builder is suitable for the generation of moderate-sized vocabularies and has been used to construct a lexicon for a small medical expert system. A future version of the lexicon builder will create a much larger lexicon by parsing definitions from machine-readable dictionaries.
INTRODUCTION
Natural language processing systems need much larger lexicons than those available today. Furthermore, a good computer lexicon with semantic as well as syntactic information is elaborate and hard to construct. We have created a program which enables its user to interactively build and extend a lexicon. The program sets up a user environment similar to a simple interactive operating system; in this environment lexical entries can be produced through a small set of commands, combined with prompts specified by the user for the desired kind of lexicon.
The interactive lexicon builder is being used to help construct entries for a lexicon to be used to parse and generate stroke case reports. Many terms in this medical sublanguage either do not appear in standard dictionaries or are used in the sublanguage with special meanings. The design of the lexicon builder is intended to be general enough to make it useful for others building lexicons for large natural language processing systems involving different sublanguages.
The interactive lexicon builder will be the basis for a fully automatic lexicon builder which uses Sager's Linguistic String Parser (LSP) to parse machine-readable text into a relational network based on a modified version of Werner's MTQ (Modification-Taxonomy-Queueing) schema. Initially this program will be applied to Webster's Seventh Collegiate Dictionary and the Longman Dictionary of Contemporary English, both of which are available in machine-readable form.
LEXICAL-SEMANTIC RELATIONS
The semantic component of the lexicon produced by this system consists principally of a network of lexical-semantic relations. That is, the meaning of a word in the lexicon is indicated as far as possible by its relationships with other words. These relations often have semantic content themselves and thus contribute to the definition of the words they link.
The two most familiar such relations are synonymy and antonymy, but others are interesting and important. For instance, to take an example from the vocabulary of stroke reports, the carotid is a kind of artery and an artery is a kind of blood vessel. This "is a kind of" relation is taxonomy. We express the taxonomic relations of "carotid", "artery" and "blood vessel" with the relational arcs
    carotid T artery
    artery T blood vessel
Another important relation is that of the part to the whole:
    ventricle PART heart
    Broca's area PART brain
Note that taxonomy is transitive: if the carotid is an artery and an artery is a blood vessel, then the carotid is a blood vessel. The presence or absence of the properties of transitivity, reflexivity and symmetry is important in using relations to make inferences.
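As an illustration of how transitivity supports inference, the following minimal sketch follows T arcs upward from a word. The arc list comes from the examples above; the function names and the in-memory representation are assumptions made for this sketch, not part of the lexicon builder.

```python
from collections import defaultdict

# Taxonomy ("T") arcs taken from the examples in the text.
T_ARCS = [("carotid", "artery"), ("artery", "blood vessel")]

def ancestors(word, arcs):
    """Return every term reachable from `word` by following T arcs."""
    parents = defaultdict(set)
    for child, parent in arcs:
        parents[child].add(parent)
    found, stack = set(), [word]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def is_a_kind_of(word, category, arcs=T_ARCS):
    """True if `category` can be reached from `word` through T arcs."""
    return category in ancestors(word, arcs)
```

Because the search follows arcs in one direction only, it would not be appropriate for a relation that is not transitive.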
The part-whole relation is more complicated than taxonomy in its properties; some instances of it are transitive and others are not. From this and other criteria, Iris et al. (forthcoming) distinguish four different part-whole relations.
Taxonomy and part-whole are very common relations, by no means restricted to any particular sublanguage. Sublanguages may, however, use relations that are rare or nonexistent in the general language. In the stroke vocabulary, there are many words for pathological conditions involving the failure of some physical or mental function. We have invented a relation NNABLE to express the connection between the condition and the function:
    aphasia NNABLE speech
    amnesia NNABLE memory
Relations such as T, PART, and NNABLE are especially useful in making inferences. For instance, if we have another relation FUNC, describing the typical function of a body part, we might combine the relational arc
    speech FUNC Broca's area
with the arc
    aphasia NNABLE speech
to infer that when aphasia is present, the diagnostician should check for the possibility of damage to Broca's area (as well as to any other body part which has speech as a function).
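The inference just described can be sketched as a two-step lookup over relational arcs. The triples below are the arcs quoted in the text; the triple representation and the function name are assumptions made for illustration.

```python
# (source, relation, target) triples quoted in the text.
ARCS = [
    ("aphasia", "NNABLE", "speech"),
    ("amnesia", "NNABLE", "memory"),
    ("speech", "FUNC", "Broca's area"),
]

def body_parts_to_check(condition, arcs=ARCS):
    """Body parts whose typical function is disabled by `condition`."""
    # Step 1: functions the condition disables (NNABLE arcs).
    functions = {t for s, r, t in arcs if s == condition and r == "NNABLE"}
    # Step 2: body parts that have one of those functions (FUNC arcs).
    return sorted(t for s, r, t in arcs if r == "FUNC" and s in functions)
```

With only these three arcs, amnesia yields no body part, since no FUNC arc for memory has been entered.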
Figure 1. Part of a relational network
Another kind of relation is the "collocational relation", which governs the combining of words. These are particularly useful for generating idiomatic text. Consider the "typical preposition" relation PREP:
    on PREP list
which says that an item may be "on a list" as opposed to "in a list" or "at a list."
Although the lexicon builder is based on a relational model, it can be adapted for use in connection with a variety of models of lexicon structure. A semantic-field approach can be handled by the same mechanism as relations; the lexicon builder also recognizes unary attributes of words, and these attributes can be treated as semantic features if one wishes to build a feature-based lexicon.
APPLICATIONS FOR THE LEXICON BUILDER
This project was motivated partly by theoretical questions of lexicon design and partly by projects which required the use of a lexicon.
For instance, the Michael Reese Hospital Stroke Registry includes a text generation module powered by a relational lexicon (Evens et al., 1984). This application provided a framework of goals within which the interactive lexicon builder was developed. The vocabulary required for the Stroke Registry text generator is of moderate size, about 2000 words and phrases. This is small enough that a lexicon for it can be built interactively.
One can imagine many applications for a large lexicon such as the automatic lexicon builder will construct. Question answering is one of our original areas of interest; a large, densely connected vocabulary will greatly add to the variety of inferences a question answering system can make. Another area is information retrieval, where experiments (Evens et al., forthcoming) have shown that the use of a relational thesaurus leads to improvements in both recall and precision.
On a more theoretical level, the automatic lexicon builder will add greatly to our understanding of sublanguages, notably that of the dictionary itself. We have noted that a specialized relation such as NNABLE, unusual in the general language, may be important in a sublanguage. We believe that such specific relations are part of the distinctive character of every sublanguage. The very possibility of creating a large, general-language lexicon points toward a time when sublanguages will be obsolete for many of the purposes for which they are now used; but they will still be useful and interesting for a long time to come, and the automatic lexicon builder gives us a new tool for analyzing them.
THE INTERACTIVE LEXICON BUILDER
Commands
The interactive lexicon builder consists of an operating-system-like environment in which the user may invoke the following commands:
HELP displays a set of one-line summaries of the commands, or a paragraph-length description of a specified command. This paragraph describes the command-line arguments, optional or required, for the given command, and briefly explains the function of the command.
ADDENTRY provides a series of prompts to enable the user to create a lexical entry. Some of these prompts are hard coded; others can be set up in advance by the user so that the lexicon can be tailored to the user's needs.
EDIT enables the user to modify an existing entry. It displays the existing contents of the entry item by item, prompting for changes or additions. If the desired entry is not already in the lexicon, EDIT behaves in the same way as ADDENTRY.
DELETE lets the user delete one or more entries. An entry is not physically deleted; it is removed from the directory, and all entries with arcs pointing to it are modified to eliminate those arcs. (This is simple to do, since for every such arc there is an inverse arc pointing to that entry from the deleted one.) On the next PACK operation (see below) the deleted entry will not be preserved in the lexicon.
This command can also be used to delete the defective entries that are occasionally caused by unresolved bugs in the entry-creating routines, or which might arise from other circumstances. A special option with this command searches the directory for a variety of "illegal" conditions such as nonprinting characters, zero-length names, etc.
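The reason inverse arcs make DELETE simple can be shown with a small sketch. The dictionary-of-sets layout, the `directory` set, and the inverse name `_T` are assumptions made for illustration; the actual lexicon is a file of records.

```python
# Entry name -> {relation -> set of target names}.
# "_T" is a hypothetical name for the inverse of "T".
lexicon = {
    "carotid": {"T": {"artery"}},
    "artery": {"_T": {"carotid"}, "T": {"blood vessel"}},
    "blood vessel": {"_T": {"artery"}},
}
directory = set(lexicon)

def delete_entry(name):
    """Drop `name` from the directory and remove arcs that point to it."""
    directory.discard(name)
    # Every arc leaving the deleted entry names a neighbour that holds
    # the inverse arc back, so no search of the whole lexicon is needed.
    for targets in lexicon[name].values():
        for neighbour in list(targets):
            for rel_targets in lexicon[neighbour].values():
                rel_targets.discard(name)
    # The record itself stays in `lexicon` until a later PACK.
```

Only the neighbours named in the deleted entry's own arcs are visited, which is the property the parenthetical remark above relies on.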
LIST gives one-line listings of some or all of the entries in the lexicon. The listing for each entry includes the name (the word itself), sense number, part of speech, and the first forty characters of the definition if there is one.
SHOW displays the contents of one or more entries.
RELATIONS displays a table of the lexical-semantic relations used by the lexicon builder. This table is created by the user in a separate operation.
UNDEF is a special form of EDIT. In creating an entry, the user may create relational arcs from the current word to other words that are not in the lexicon. The system keeps a queue of undefined words. UNDEF invokes EDIT for the word at the head of the queue, thus saving the user the trouble of looking up undefined words.
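A minimal sketch of the undefined-word queue follows. The names `defined`, `undefined`, `add_arc_target`, and `undef` are hypothetical, and the `edit` callback stands in for the interactive EDIT dialogue.

```python
from collections import deque

defined = {}          # name -> entry contents
undefined = deque()   # words seen only as arc targets, oldest first

def add_arc_target(word):
    """Queue `word` if it is not yet defined or already queued."""
    if word not in defined and word not in undefined:
        undefined.append(word)

def undef(edit):
    """Call `edit` on the word at the head of the queue and mark it defined."""
    if not undefined:
        return None
    word = undefined.popleft()
    defined[word] = edit(word)
    return word
```

A first-in, first-out queue matches the description of UNDEF taking the word at the head of the queue; other orderings would also be possible.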
PACK performs file management on the lexicon, sorting the entries and eliminating space left by deleted ones.
This routine works in two passes. In the first pass, the entries are copied from the existing lexicon file to a new file in lexicographic order and a table is created that maps the entries from their old locations to their new ones. At this stage, a relational arc from one entry to another still points to the other entry's old location. The second pass updates the new lexicon, modifying all relational arcs to point to the correct new locations.
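The two passes can be sketched on an in-memory list standing in for the lexicon file. Slot indices play the role of file locations, and `None` marks a deleted slot; both conventions, and the record layout, are assumptions made for this sketch.

```python
def pack(records):
    """Return a sorted, compacted copy of `records` with arcs re-pointed.

    Each live record is a dict with a "name" and a list of
    (relation, target_slot) arcs; deleted slots hold None.
    """
    # Pass 1: copy live entries in lexicographic order and build
    # the table mapping old slots to new slots.
    live = sorted((i for i, r in enumerate(records) if r is not None),
                  key=lambda i: records[i]["name"])
    new_slot = {old: new for new, old in enumerate(live)}
    packed = [dict(records[old], arcs=list(records[old]["arcs"]))
              for old in live]
    # Pass 2: rewrite every arc to the target's new slot. Arcs to
    # deleted entries should already be gone; any that remain are dropped.
    for rec in packed:
        rec["arcs"] = [(rel, new_slot[t]) for rel, t in rec["arcs"]
                       if t in new_slot]
    return packed
```

The mapping table has to be complete before any arc is rewritten, which is why a single pass would not be enough.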
QUIT exits from the lexicon builder environment. Any new entries or changes made during the lexicon building session are incorporated and the directory is updated.
Extensions to the commands
All of the commands can be abbreviated; so far they all have distinctive initials and can thus be called with a single keystroke.
Each command may be accompanied by command-line arguments to define its action more precisely. Display commands, such as HELP or SHOW, allow the user to get a printout of the display. Where an entry name is to be specified, the user can get more than one entry by means of "wild cards". For instance, the command "LIST produc=" might yield a list showing entries for "produce", "produced", "produces", "producing", "product", and "production".
Additional commands are currently being developed to help the user manage the relation table and the attribute list from within the lexicon builder environment.
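Wild-card matching of entry names, as described above for LIST, can be sketched with Python's standard `fnmatch` module. The `*` wild-card character, the sample name list, and the function name are assumptions made for illustration; they are not necessarily what the lexicon builder uses.

```python
from fnmatch import fnmatchcase

# Sample directory of entry names (illustrative only).
NAMES = ["produce", "produced", "produces", "producing", "product",
         "production", "promise"]

def list_entries(pattern, names=NAMES):
    """Return the entry names matching a shell-style wild-card pattern."""
    return sorted(n for n in names if fnmatchcase(n, pattern))
```

Under these assumptions, `list_entries("produc*")` returns the six "produc..." names and leaves out "promise".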
The design of the user interface takes into account both the available facilities and the expected users. The lexicon builder runs on a VAX 11/750, normally accessed with line-editing terminals. This suggests that a single-line command format is most appropriate. Since much of the work with the system is done over 300 baud telephone lines, conciseness is also important. The users have all had some programming experience (though not necessarily very much) so an operating-system-like interface is easy for them to get used to. If the lexicon builder becomes popular, we hope to have the opportunity to develop a more sophisticated interface, perhaps with a combination of features for beginners and more experienced users.
Structure of a lexical entry
A complete lexical entry consists of:
1. The "name" of the entry: its character-string form.
2. Its sense. We represent senses by simple numbers, not attempting to formally distinguish polysemy and homonymy, or any other degree of semantic difference. The system leaves to the user the problem of distinguishing different senses from extensions of a single sense: that is, where a word has already been entered in some sense, the user must decide whether to modify the entry for that sense or create a new entry for a new sense.
3. Part of speech, or "class." Our classification of parts of speech is basically the traditional classification with some convenient additions, largely drawn from the classification used by Sager in the LSP (Sager, 1981). Most of the additions are to the category of verbs: "verb" to the lexicon builder denotes the stem form, while the third person and past tense are distinguished as "finite verb", and the past and present participles are classified separately.
4. The text of the definition, entered by the user. At this stage in our work, the definition is not parsed or otherwise analyzed, so its presence is more for purposes of documentation than anything else. In future versions of the lexicon builder, the definition will play an important role in constructing the entry but in the entry itself will be replaced by information derived from its analysis.
5. A list of attributes (or semantic features), each with its value, which may be binary or scalar.
6. A predicate calculus definition. For example, for the most common sense of the verb "promise", the predicate calculus definition is expressed as
    promise(x,y,z) = say(x,w,z)
        event(y) => w = will happen(y)
        thing(y) => w = will receive(z,y)
or, in freer form,
    (x promises y to z) = (x says w to z)
        where w = (y will happen) if y is an event
                  (z will receive y) if y is a physical object
This is entered by the user.
We have been inclined to think of the relational lexicon as a network, since the network representation vividly brings out the interconnected quality which the relational model gives to the lexicon. Predicate calculus is better in other respects; for instance, it expresses the above definition of "promise" much more elegantly than any network notation could. The two methods of representation have traditionally been seen as alternatives rather than as supplementing each other; we believe that predicate calculus has an important supplementary role to play in defining the core vocabulary of the lexicon, although we are not sure yet how to use it.
7. Case structure (for verbs). This is a table describing, for each syntactic slot associated with the verb (subject, direct object, etc.) the semantic case or cases that may be used in that slot ("agent", "experiencer", etc.), whether it is required, optional, or may be expressed elliptically (as with the direct and indirect object in "I promise!" referring to an earlier statement).
Space is reserved in this structure for selection restrictions. A relational model gives us the much more powerful option of indicating through relations such as "permissible subject", "permissible object", etc., not only what words may go with what others, but whether the usage is literal, a conventional figure of speech, fanciful, or whatever. Selection restrictions do, however, have the virtue of conciseness, and they permit us to make generalizations. Relational arcs may then be used to mark exceptions.
8. A list of zero or more relations, each with one or more pointers to other entries, to which the current entry is connected by that relation.
We find it convenient to treat morphological derivations such as plural of nouns, tenses and participles of verbs, as relations connecting separate entries. The entry for a regularly derived form such as a noun plural is a minimal one, consisting of name, sense, part of speech, and one relational arc, linking the entry to the stem form. The lexicon builder generates these regular forms automatically. It also distinguishes these "regular" entries from "undefined" entries, which have been entered indirectly as target words of relational arcs and which are on the queue accessed by UNDEF, as well as from "defined" entries.
Figure 2. Structure of a lexical entry (fields: name, sense, class, text of definition, attribute list, predicate calculus definition, case structure table, relation list)
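The eight components of an entry can be pictured as a single record type. The sketch below uses a Python dataclass purely for illustration; the field names, types, and defaults are assumptions and do not reflect the on-disk record format.

```python
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    name: str                       # 1. character-string form
    sense: int = 1                  # 2. sense number
    word_class: str = "n"           # 3. part of speech ("class")
    definition: str = ""            # 4. text of the definition
    attributes: dict = field(default_factory=dict)      # 5. attribute -> value
    predicate_calculus: str = ""    # 6. predicate calculus definition
    case_structure: dict = field(default_factory=dict)  # 7. slot -> cases (verbs)
    relations: dict = field(default_factory=dict)       # 8. relation -> target names

    def arc_count(self):
        """Total number of relational arcs leaving this entry."""
        return sum(len(targets) for targets in self.relations.values())

# A partial entry using arcs quoted in the text.
aphasia = LexicalEntry(
    "aphasia",
    definition="a disorder of language due to injury to the brain",
    relations={"NNABLE": ["speech", "language"], "SYMPTOM": ["stroke", "TIA"]},
)
```

A regularly derived form such as a noun plural would fill only the name, sense, class, and one relation.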
File structure of the lexicon
There are four data files associated with the lexicon.
The first is the lexicon proper. The biggest complicating factor in the design of the lexicon is the extremely interconnected nature of the data; a change in one portion of the file may necessitate changes in many other places in the file. Each entry is linked through relational arcs to many other entries; and for every arc pointing from word1 to word2, there must be an inverse arc from word2 to word1. When we create a new arc in the course of building or modifying an entry for word1, we must update the entry for word2 so that it will contain the appropriate inverse arc back to word1. Word2's entry has to be updated or created from scratch; we need to structure the lexicon file so that this updating process, which may take place anywhere in the file, can be done with the least possible dislocation.
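The bookkeeping described here, where every new arc forces an update to a second entry, can be sketched as follows. The in-memory dictionary and the inverse names prefixed with an underscore are assumptions made for illustration.

```python
# Hypothetical relation names and their inverses.
INVERSE = {"T": "_T", "_T": "T", "PART": "_PART", "_PART": "PART"}

lexicon = {}  # name -> {relation -> set of target names}

def add_arc(word1, rel, word2):
    """Record word1 REL word2 plus the inverse arc back to word1.

    The entry for word2 is created from scratch if it does not exist.
    """
    lexicon.setdefault(word1, {}).setdefault(rel, set()).add(word2)
    lexicon.setdefault(word2, {}).setdefault(INVERSE[rel], set()).add(word1)
```

In a file-based lexicon the second `setdefault` line corresponds to a write at an arbitrary location in the file, which is the source of the design difficulty discussed above.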
aphasia (1) n
    definition
        a disorder of language due to injury to the brain
    attributes
        nonhuman
        collective
    predicate calculus
        have(x, aphasia) => not able(speak(x))
    relations
        TAX [aphasia is a kind of x]
            deficit
            disorder
            loss
            inability
        _TAX [x is a kind of aphasia]
            anomic
            global
            Gerstmann's
            semantic
            Wernicke's
            Broca's
            conduction
            transcortical
        SYMPTOM [aphasia is a symptom of x]
            stroke
            TIA
        ASSOC [aphasia may be associated with x]
            apraxia
        _CAUSE [x is a cause of aphasia]
            injury
            lesion
        NNABLE [aphasia is the inability to do x]
            speech
            language
Figure 3. Lexical entry for "aphasia"
The size of an entry can vary enormously. Regular derived forms contain only the name, sense, class and one relational arc (to the stem form), as well as a certain amount of overhead for the definition, predicate calculus definition and attribute list although these are not used. The smallest possible entry takes up about thirty bytes. At the other extreme, a word may have an extensive attribute list, elaborate text and predicate calculus definitions, and dozens of relational arcs. "Aphasia", a moderately large entry with 19 arcs, occupies 322 bytes. Like all entries in the current lexicon, it will be subject to updating and will certainly become much larger.
With this range of entry sizes, the choice between fixed-size and variable-size records becomes somewhat painful. Variable-size records would be highly convenient as well as efficient except for the fact that when we add a new entry that is related to existing entries, we must add new arcs to those entries. The existing entries thus no longer fit into their previous space and must be either broken up or moved to a new space. The former option creates problems of identifying the various pieces of the entry; the latter requires that yet more existing entries be modified.
Because of these problems, we have opted for a fixed-size record. Some space is wasted, either in empty space if the record is too large or through proliferation of pointers if the record is too small; but the amount of necessary updating is much less, and the file can be kept in order through frequent use of the PACK command. The choice of record size is conditioned by many factors, system requirements as well as the range of entry sizes. We are currently working on determining the best record size for the MRH application.
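The trade-off between padding and pointers can be made concrete with a sketch of one possible fixed-size scheme, in which an entry too large for one record continues in further records linked by a "next" pointer. The record size, header layout, and chaining scheme are all assumptions made for illustration; the text does not specify how the lexicon builder lays out its records.

```python
import struct

RECORD_SIZE = 32                      # assumed size, for illustration only
HEADER = struct.Struct("<iH")         # next-record index (-1 = none), payload length
PAYLOAD = RECORD_SIZE - HEADER.size   # bytes of entry data per record

def store(data, records):
    """Append `data` as a chain of fixed-size records; return the first index."""
    chunks = [data[i:i + PAYLOAD] for i in range(0, len(data), PAYLOAD)] or [b""]
    first = len(records)
    for n, chunk in enumerate(chunks):
        nxt = first + n + 1 if n + 1 < len(chunks) else -1
        # Short chunks are padded with zero bytes: the "empty space" cost.
        records.append(HEADER.pack(nxt, len(chunk)) + chunk.ljust(PAYLOAD, b"\0"))
    return first

def load(index, records):
    """Follow the chain starting at `index` and reassemble the data."""
    out = b""
    while index != -1:
        nxt, length = HEADER.unpack_from(records[index])
        out += records[index][HEADER.size:HEADER.size + length]
        index = nxt
    return out
```

A small record size wastes little padding but needs many header pointers for a large entry, while a large record size does the reverse, which is the choice the paragraph above describes.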
So far the user does not have the option of saving or rejecting the results of a lexicon building session, since entries are written to the file as soon as they are created. We are studying ways of providing this option. A brute force way would be to keep the entire lexicon in memory and rewrite it at the end of the session. This is feasible if the host computer is large and the lexicon is small. The 2000-word lexicon for the Michael Reese stroke database takes up about a third of a megabyte, so this approach would work on a mainframe or a large minicomputer such as our VAX 750, but could not readily be ported to a smaller machine; nor could we handle a much larger vocabulary such as we plan to create with the automatic lexicon builder.
The second file is a directory, showing each entry's name, sense, and status (defined, undefined or regular derivative), with a pointer to the appropriate entry in the lexicon proper. The directory entries are linked in lexicographic order. When the lexicon builder is invoked, the entire directory is read into a buffer in memory, and this buffer is updated as entries are created, modified, or deleted. At the end of the lexicon building session, the updated directory is written out to disk.
The third (optional) file is a table of attributes, with pointers into the lexicon proper. This can be extended into a feature matrix.
The fourth (also optional) is a table of pre-defined relations. This table includes, for each relation:
(1) its mnemonic name.
(2) its properties. A relation may be reflexive, symmetric or transitive; there may be other properties worth including.
(3) a pointer to the relation's inverse. If x REL y, then we can define some inverse relation _REL such that y _REL x. If REL is reflexive or symmetric, then _REL = REL.
(4) the appropriate parts of speech for the words linked by the relation. For instance, the NNABLE relation links two nouns, while the collocational PREP relation links a preposition to a noun. Taxonomy can link any two words (apart from prepositions, conjunctions, etc.) as long as they are of the same part of speech: nouns to nouns, verbs to verbs, etc.
(5) the text of a prompt. ADDENTRY uses this prompt when querying the user for the occurrence of relational arcs involving this relation. For instance, if we are entering the word "promise" and our application uses the taxonomy relation, we might choose a short prompt, in which case the query for taxonomy might take the form
    "promise" T: [user enters word2 here]
or we could use something more explicit:
    "promise" is a kind of:
Users familiar with lexical-semantic relations might prefer the shorter mnemonic prompt, whereas other users might prefer a prompt that better expressed the significance of the relation.
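The five items kept for each relation can be sketched as a small table. The field names, the underscore-prefixed inverse names, the class codes, and the NNABLE prompt wording are assumptions made for illustration; only the taxonomy prompt text and the part-of-speech constraints come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Relation:
    name: str                  # (1) mnemonic name
    transitive: bool           # (2) one of the properties
    inverse: str               # (3) name of the inverse relation
    classes: Optional[tuple]   # (4) (class of word1, class of word2); None = same class
    prompt: str                # (5) prompt text, with a slot for the current word

RELATIONS = {
    "T": Relation("T", True, "_T", None, '"{word}" is a kind of: '),
    # Treating NNABLE as non-transitive is an assumption of this sketch.
    "NNABLE": Relation("NNABLE", False, "_NNABLE", ("n", "n"),
                       '"{word}" is the inability to do: '),
}

def query_text(rel, word):
    """Build the query shown to the user when asking for arcs of `rel`."""
    return RELATIONS[rel].prompt.format(word=word)
```

Swapping in the short mnemonic style would only require changing the stored prompt string, for example to `'"{word}" T: '`.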
THE AUTOMATIC LEXICON BUILDER
Building a very large lexicon
There are numerous logistical problems in implementing the sort of very large lexicon that would result from analysis of an entire dictionary, as the work of Amsler and White (1979) or Kelly and Stone (1975) shows. Integrating the lexicon builder with the LSP, and writing preprocessors for dictionary data, will also be big jobs. Fully automatic analysis of dictionary material, then, is a long-range goal.
A major problem in the relational analysis of the dictionary is that of determining what relations to use. Noun and verb definitions rely on taxonomy to a great extent (e.g. Amsler and White, 1979) but there are definitions that do not clearly fit this pattern; furthermore, even in a taxonomic definition, much semantic information is contained in the qualifying or differentiating part of the definition.
Adjective definitions are another problem area. Adjectives are usually defined in terms of nouns or verbs rather than other adjectives, so simple taxonomy does not work neatly. In a sample of about 7,000 definitions from W7, we identified nineteen major relations unique to adjective definitions, and these covered only half of the sample. The remaining definitions were much more varied and would probably require far more than nineteen additional relations. And for each relation, we had to identify words or phrases (the "defining formulas") that signaled the presence of the relation.
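To make the idea of a defining formula concrete, the sketch below matches the opening words of an adjective definition against a small table of patterns. The three formulas and the relation labels attached to them are invented for illustration; they are not the nineteen relations identified in the sample, and the function name `match_formula` is likewise hypothetical.

```python
import re

# Illustrative formulas only. Both the phrases and the relation labels
# are assumptions, not the relations found in the W7 sample.
FORMULAS = [
    (re.compile(r"^of or relating to (?P<word2>.+)$"), "RELATING-TO"),
    (re.compile(r"^capable of being (?P<word2>.+)$"), "ABLE"),
    (re.compile(r"^full of (?P<word2>.+)$"), "FULL-OF"),
]


def match_formula(definition):
    """Return (relation, rest of definition) for the first formula that
    matches the start of the definition, or None if none applies."""
    text = definition.strip().lower()
    for pattern, relation in FORMULAS:
        m = pattern.match(text)
        if m:
            return relation, m.group("word2")
    return None


print(match_formula("of or relating to the brain"))
print(match_formula("bright red"))
```

A table of this kind grows with every new relation, which is the maintenance burden the paragraph above describes.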
The MTQ model
For these reasons as well as theoretical ones, we need a simplifying model of relations, a model that enables us either to avoid the endless identification of new relations or to conduct the identification within an orderly framework. Werner's MTQ schema (Werner, 1978; Werner and Topper, 1976) seems to provide the basis for such a model.
Werner identifies only three relations: modification, taxonomy and queueing. He asserts that all other relations can be expressed as compounds of these relations and of lexical items. For instance, the PART relation can be expressed, with the help of the lexical item "part", by the relational arcs
Broca's area T part
which say in effect that Broca's area is a kind of part, specifically a "brain-part."
Taxonomy reflects Aristotle's model of the definition as consisting of species, genus and differentiae. Taxonomy links the species to the genus and modification links the differentiae to the genus. A study of definitions in W7 and LDOCE shows that they do indeed follow this pattern, although (as in adjective definitions) the pattern is not always obvious.
The special power of MTQ in the analysis of definitions is that in a definition following the Aristotelian structure, taxonomy and modification can be identified by purely syntactic means. One (or occasionally more than one) word in the definition is modified directly or indirectly by all the other words. The core word is linked to the defined word by taxonomy; all the others are linked to the core word by modification. (Queueing so far does not seem to be important in the analysis of definitions.)
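Once a parser has picked out the core word of a definition, producing the taxonomy and modification arcs is mechanical. The sketch below assumes that step has already been done and only builds the arcs; the function name `mt_arcs`, the triple layout (source word, relation, target word), and the sample definition of "mare" are illustrative assumptions.

```python
def mt_arcs(defined_word, core_word, other_words):
    """Build T and M arcs for one definition.

    The defined word (species) is linked to the core word (genus) by T;
    every other word of the definition is linked to the core word by M.
    Finding the core word is assumed to be the parser's job.
    """
    arcs = [(defined_word, "T", core_word)]
    arcs.extend((word, "M", core_word) for word in other_words)
    return arcs


# Hypothetical definition: mare = "female horse", with "horse" as the core.
for arc in mt_arcs("mare", "horse", ["female"]):
    print(arc)
```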
In order to avoid certain ambiguities that arise in a very elaborate network such as that generated from a large dictionary, we have replaced the separate modification and taxonomy arcs with a single, ternary relational arc that keeps the species, genus and differentiating items of any particular definition linked to each other.
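One plausible reading of the ambiguity is that, with separate arcs, two definitions sharing a genus leave their modifiers attached to the genus with no record of which definition each came from. A ternary arc avoids this by storing the three parts together. The sketch below is an assumption about how such an arc might be represented; the names `DefinitionArc` and `differentiae_of` and the sample entries are hypothetical.

```python
from typing import NamedTuple, Tuple


class DefinitionArc(NamedTuple):
    """One ternary arc: the three parts of a single definition."""
    species: str                   # the defined word
    genus: str                     # the core word of the definition
    differentiae: Tuple[str, ...]  # the modifying words


def differentiae_of(arcs, species):
    """Return the differentiae recorded for one defined word."""
    return [arc.differentiae for arc in arcs if arc.species == species]


# Two hypothetical definitions that share the genus "horse".
arcs = [
    DefinitionArc("mare", "horse", ("female",)),
    DefinitionArc("stallion", "horse", ("male",)),
]

print(differentiae_of(arcs, "mare"))      # [('female',)]
print(differentiae_of(arcs, "stallion"))  # [('male',)]
```

With binary arcs alone, "female" and "male" would both simply modify "horse"; the ternary form keeps each modifier tied to its own definition.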
The problem of identifying "higher level" relations such as PART and NNABLE in an MTQ network still remains. At this point it seems to be similar to the problem of identifying higher level relations from defining formulas.
Another pleasant discovery is that the Linguistic String Parser, which we have used successfully for some years, is exceptionally well suited for this strategy, since it is geared toward an analysis of sentences and phrases in terms of "centers" or "cores" with their modifying "adjuncts", which is exactly the kind of analysis we need to do.
Design of the automatic lexicon builder
The automatic lexicon builder will contain at least the following subsystems:
1. The standard data structure for the lexical entry, as described for the interactive lexicon builder, with slight changes to adjust to the use of MTQ.
The relation list is presently structured as a linked list of relations, each pointing to a linked list of word2s. ("Word2" refers to any word related to the [...]gating.) Incorporating the ternary MTQ model, we would have two relation lists: a T list and an M list. The T list would [...] would be identical to the present relation [...]tions. Each of these lexical entry pointers would, like the relation nodes in the existing implementation, point to a linked list of word2s. The word2s in the T list would be connected to the T words by an inverse-modification relation ('M) and the word2s in the M list would be connected to the M words by inverse taxonomy ('T).
[...] preprocessor need not be intelligent; its [...]rating this from the definition proper. Part of the preprocessing phase is to generate a "dictionary" for the LSP. This [...] helpful but not necessary. Sager and her associates (1980) have created programs to do this.
[...] file in standard form, perhaps optionally noting where further information would be [...] version of the system and allows the user to "improve" on dictionary data as well as to observe the results of the dictionary parse.
[...] module, the LSP will parse the definition to produce a parse tree which will then [...] linked into the overall lexical network [...] like the preprocessor, can be tailored to the user's needs.
SUMMARY
[...] lexicon for natural language processing to generate lexical entries interactively and link them automatically to other lexical [...] of commands that allow the user to create, [...] entries, among other operations.
[...] reports by a diagnostic expert system. It can equally well be used in any other sub[...] possible, with models of lexicon structure other than the relational model on which it is based.
[...] further intended as the starting point for a fully automatic lexicon building program which will create a large, general purpose [...] dictionary text, using a slightly modified [...] Queueing relational model.
REFERENCES
Ahlswede, Thomas E., and Evens, Martha W.,
1983 "Generating a Relational Lexicon
sity, Rochester, Michigan
Ahlswede, Thomas E., and Evens, Martha W.,
1984 "A Lexicon for a Medical Expert System." Presented at the Workshop on Relational Models, Coling '84, Stanford University, Palo Alto, California
Definitions." In S Williams, ed Humans
Language, Ablex
tics Research Center, University of Texas
"Generating Case Reports from the Michael
Michigan, April
Evens, Martha W., Vandendorpe, James, and
Semantic Relations in Information Retriev-
Ablex
Iris, Madelyn, Litowitz, Bonnie, and
Investigation of Semantic Primitives."
New York
Addison-Wesley,
1980 Research into Methods for Automatic
13, New York University
Lexical/Semantic Fields." In M Loflin
Mouton, The Hague
Ethnoscience Lexicography and Ethnoscience Ethnographies." In C Rameh, ed., Seman-
Language and Linguistics