
A TOOLKIT FOR LEXICON BUILDING


Thomas E. Ahlswede

Computer Science Department

Illinois Institute of Technology

Chicago, Illinois 60616, USA

ABSTRACT

This paper describes a set of interactive routines that can be used to create, maintain, and update a computer lexicon. The routines are available to the user as a set of commands resembling a simple operating system. The lexicon produced by this system is based on lexical-semantic relations, but is compatible with a variety of other models of lexicon structure. The lexicon builder is suitable for the generation of moderate-sized vocabularies and has been used to construct a lexicon for a small medical expert system. A future version of the lexicon builder will create a much larger lexicon by parsing definitions from machine-readable dictionaries.

INTRODUCTION

Natural language processing systems need much larger lexicons than those available today. Furthermore, a good computer lexicon with semantic as well as syntactic information is elaborate and hard to construct. We have created a program which enables its user to interactively build and extend a lexicon. The program sets up a user environment similar to a simple interactive operating system; in this environment lexical entries can be produced through a small set of commands, combined with prompts specified by the user for the desired kind of lexicon.

The interactive lexicon builder is being used to help construct entries for a lexicon to be used to parse and generate stroke case reports. Many terms in this medical sublanguage either do not appear in standard dictionaries or are used in the sublanguage with special meanings. The design of the lexicon builder is intended to be general enough to make it useful for others building lexicons for large natural language processing systems involving different sublanguages.

The interactive lexicon builder will be the basis for a fully automatic lexicon builder which uses Sager's Linguistic String Parser (LSP) to parse machine-readable text into a relational network based on a modified version of Werner's MTQ (Modification-Taxonomy-Queueing) schema. Initially this program will be applied to Webster's Seventh Collegiate Dictionary and the Longman Dictionary of Contemporary English, both of which are available in machine-readable form.

LEXICAL-SEMANTIC RELATIONS

The semantic component of the lexicon produced by this system consists principally of a network of lexical-semantic relations. That is, the meaning of a word in the lexicon is indicated as far as possible by its relationships with other words. These relations often have semantic content themselves and thus contribute to the definition of the words they link. The two most familiar such relations are synonymy and antonymy, but others are interesting and important. For instance, to take an example from the vocabulary of stroke reports, the carotid is a kind of artery and an artery is a kind of blood vessel. This "is a kind of" relation is taxonomy. We express the taxonomic relations of "carotid", "artery" and "blood vessel" with the relational arcs

    carotid T artery
    artery T blood vessel

Another important relation is that of the part to the whole:

    ventricle PART heart
    Broca's area PART brain

Note that taxonomy is transitive: if the carotid is an artery and an artery is a blood vessel, then the carotid is a blood vessel. The presence or absence of the properties of transitivity, reflexivity and symmetry is important in using relations to make inferences. The part-whole relation is more complicated than taxonomy in its properties; some instances of it are transitive and others are not. From this and other criteria, Iris et al. (forthcoming) distinguish four different part-whole relations.
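The transitivity of taxonomy noted above can be illustrated with a minimal Python sketch. The names `TAXONOMY` and `is_a` are hypothetical and introduced here only for illustration; they are not part of the lexicon builder, whose internal representation is described later in terms of entries and pointers.

```python
# Taxonomy arcs taken from the text: "carotid T artery", "artery T blood vessel".
# TAXONOMY and is_a are illustrative names, not part of the original system.
TAXONOMY = {
    "carotid": {"artery"},
    "artery": {"blood vessel"},
}

def is_a(word, ancestor, arcs=TAXONOMY):
    """Return True if `ancestor` can be reached from `word` by following T arcs."""
    seen = set()
    stack = [word]
    while stack:
        current = stack.pop()
        for parent in arcs.get(current, ()):
            if parent == ancestor:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False
```

Because the search follows chains of T arcs, `is_a("carotid", "blood vessel")` holds even though no direct arc links the two words. A relation that is not transitive in all its instances, such as part-whole, could not safely be chained this way.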

Taxonomy and part-whole are very common relations, by no means restricted to any particular sublanguage. Sublanguages may, however, use relations that are rare or nonexistent in the general language. In the stroke vocabulary, there are many words for pathological conditions involving the failure of some physical or mental function. We have invented a relation NNABLE to express the connection between the condition and the function:

    aphasia NNABLE speech
    amnesia NNABLE memory

Relations such as T, PART, and NNABLE are especially useful in making inferences. For instance, if we have another relation FUNC, describing the typical function of a body part, we might combine the relational arc

    speech FUNC Broca's area

with the arc

    aphasia NNABLE speech

to infer that when aphasia is present, the diagnostician should check for the possibility of damage to Broca's area (as well as to any other body part which has speech as a function).

Figure 1. Part of a relational network

Another kind of relation is the "collocational relation", which governs the combining of words. These are particularly useful for generating idiomatic text. Consider the "typical preposition" relation PREP:

    on PREP list

which says that an item may be "on a list" as opposed to "in a list" or "at a list."

Although the lexicon builder is based on a relational model, it can be adapted for use in connection with a variety of models of lexicon structure. A semantic-field approach can be handled by the same mechanism as relations; the lexicon builder also recognizes unary attributes of words, and these attributes can be treated as semantic features if one wishes to build a feature-based lexicon.

APPLICATIONS FOR THE LEXICON BUILDER

This project was motivated partly by theoretical questions of lexicon design and partly by projects which required the use of a lexicon.

For instance, the Michael Reese Hospital Stroke Registry includes a text generation module powered by a relational lexicon (Evens et al., 1984). This application provided a framework of goals within which the interactive lexicon builder was developed. The vocabulary required for the Stroke Registry text generator is of moderate size, about 2000 words and phrases. This is small enough that a lexicon for it can be built interactively.

One can imagine many applications for a large lexicon such as the automatic lexicon builder will construct. Question answering is one of our original areas of interest; a large, densely connected vocabulary will greatly add to the variety of inferences a question answering system can make. Another area is information retrieval, where experiments (Evens et al., forthcoming) have shown that the use of a relational thesaurus leads to improvements in both recall and precision.

On a more theoretical level, the automatic lexicon builder will add greatly to our understanding of sublanguages, notably that of the dictionary itself. We have noted that a specialized relation such as NNABLE, unusual in the general language, may be important in a sublanguage. We believe that such specific relations are part of the distinctive character of every sublanguage. The very possibility of creating a large, general-language lexicon points toward a time when sublanguages will be obsolete for many of the purposes for which they are now used; but they will still be useful and interesting for a long time to come, and the automatic lexicon builder gives us a new tool for analyzing them.

THE INTERACTIVE LEXICON BUILDER

Commands

The interactive lexicon builder consists of an operating-system-like environment in which the user may invoke the following commands:

HELP displays a set of one-line summaries of the commands, or a paragraph-length description of a specified command. This paragraph describes the command-line arguments, optional or required, for the given command, and briefly explains the function of the command.

ADDENTRY provides a series of prompts to enable the user to create a lexical entry. Some of these prompts are hard coded; others can be set up in advance by the user so that the lexicon can be tailored to the user's needs.

EDIT enables the user to modify an existing entry. It displays the existing contents of the entry item by item, prompting for changes or additions. If the desired entry is not already in the lexicon, EDIT behaves in the same way as ADDENTRY.

DELETE lets the user delete one or more entries. An entry is not physically deleted; it is removed from the directory, and all entries with arcs pointing to it are modified to eliminate those arcs. (This is simple to do, since for every such arc there is an inverse arc pointing to that entry from the deleted one.) On the next PACK operation (see below) the deleted entry will not be preserved in the lexicon.

This command can also be used to delete the defective entries that are occasionally caused by unresolved bugs in the entry-creating routines, or which might arise from other circumstances. A special option with this command searches the directory for a variety of "illegal" conditions such as nonprinting characters, zero-length names, etc.
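The logical deletion performed by DELETE can be sketched as follows. This is a simplified in-memory model, not the file-based implementation: the dictionary layout, the function names, and the "~" prefix used to name inverse relations are all assumptions made for the example.

```python
def inverse(rel):
    """Hypothetical naming convention: the inverse of "T" is "~T" and vice versa."""
    return rel[1:] if rel.startswith("~") else "~" + rel

def delete_entry(lexicon, directory, name):
    """Logically delete `name`: remove arcs pointing to it and drop it from the directory."""
    for rel, targets in lexicon[name].items():
        for target in targets:
            # Each arc stored in the deleted entry has an inverse arc in the
            # target entry that points back at the deleted entry; remove it.
            lexicon[target].get(inverse(rel), set()).discard(name)
    # The record itself stays in place until the next PACK.
    directory.discard(name)

lexicon = {
    "carotid": {"T": {"artery"}},
    "artery": {"~T": {"carotid"}, "T": {"blood vessel"}},
    "blood vessel": {"~T": {"artery"}},
}
directory = set(lexicon)
delete_entry(lexicon, directory, "carotid")
```

After the call, "carotid" is no longer listed in the directory and "artery" no longer has an arc pointing to it, but the record for "carotid" still occupies its place in the lexicon, as the text describes.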

LIST gives one-line listings of some or all of the entries in the lexicon. The listing for each entry includes the name (the word itself), sense number, part of speech, and the first forty characters of the definition if there is one.

SHOW displays one or more entries.

RELATIONS displays a table of the lexical-semantic relations used by the lexicon builder. This table is created by the user in a separate operation.

UNDEF is a special form of EDIT. In creating an entry, the user may create relational arcs from the current word to other words that are not in the lexicon. The system keeps a queue of undefined words. UNDEF invokes EDIT for the word at the head of the queue, thus saving the user the trouble of looking up undefined words.
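The queue of undefined words might be modeled as below. This is only a sketch: `defined`, `undefined`, `add_arc` and `undef` are hypothetical names, and the real UNDEF command goes on to invoke EDIT for the word rather than merely returning it.

```python
from collections import deque

defined = {}          # word -> list of (relation, target) arcs
undefined = deque()   # words used as arc targets but not yet defined

def add_arc(source, relation, target):
    """Record an arc; queue the target if it has no entry yet."""
    defined.setdefault(source, []).append((relation, target))
    if target not in defined and target not in undefined:
        undefined.append(target)

def undef():
    """Return the next undefined word to edit, or None if the queue is empty."""
    while undefined:
        word = undefined.popleft()
        if word not in defined:   # skip words defined since they were queued
            return word
    return None

add_arc("aphasia", "NNABLE", "speech")
add_arc("aphasia", "SYMPTOM", "stroke")
```

Successive calls to `undef()` hand back the queued words in the order in which they were first mentioned.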

PACK performs file management on the lexicon, sorting the entries and eliminating space left by deleted ones.

This routine works in two passes. In the first pass, the entries are copied from the existing lexicon file to a new file in lexicographic order and a table is created that maps the entries from their old locations to their new ones. At this stage, a relational arc from one entry to another still points to the other entry's old location. The second pass updates the new lexicon, modifying all relational arcs to point to the correct new locations.
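The two passes of PACK can be sketched with lists standing in for the lexicon files, so that a "location" is simply a list index. The record layout and the name `pack` are assumptions for this illustration; the sketch also assumes, as the description of DELETE implies, that no surviving entry still has an arc to a deleted one.

```python
def pack(old_file, live_locations):
    """Return a new file with live entries sorted by name and arcs relocated.

    old_file:       list of records {"name": str, "arcs": [(relation, location)]}
    live_locations: set of locations of entries still listed in the directory
    """
    # Pass 1: copy live entries in lexicographic order and build the table
    # mapping old locations to new ones. Arcs still hold old locations.
    order = sorted(live_locations, key=lambda loc: old_file[loc]["name"])
    relocation = {old: new for new, old in enumerate(order)}
    new_file = [
        {"name": old_file[old]["name"], "arcs": list(old_file[old]["arcs"])}
        for old in order
    ]
    # Pass 2: rewrite every arc so that it points to the target's new location.
    for record in new_file:
        record["arcs"] = [(rel, relocation[loc]) for rel, loc in record["arcs"]]
    return new_file

old_file = [
    {"name": "carotid", "arcs": [("T", 2)]},
    {"name": "deleted-entry", "arcs": []},      # no longer in the directory
    {"name": "artery", "arcs": [("~T", 0)]},
]
packed = pack(old_file, {0, 2})
```

In the packed file "artery" comes first and "carotid" second, the space of the deleted record is gone, and each arc carries the new location of its target.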

QUIT exits from the lexicon builder environment. Any new entries or changes made during the lexicon building session are incorporated and the directory is updated.

Extensions to the commands

All of the commands can be abbreviated; so far they all have distinctive initials and can thus be called with a single keystroke.

Each command may be accompanied by command-line arguments to define its action more precisely. Display commands, such as HELP or SHOW, allow the user to get a printout of the display. Where an entry name is to be specified, the user can get more than one entry by means of "wild cards". For instance, the command "LIST produc*" might yield a list showing entries for "produce", "produced", "produces", "producing", "product", and "production".

Additional commands are currently being developed to help the user manage the relation table and the attribute list from within the lexicon builder environment.
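Command abbreviation and wild-card matching of entry names can be sketched as follows. The command names are those mentioned in the text; the functions `resolve_command` and `list_entries` are hypothetical, and the wild-card syntax is that of Python's `fnmatch` module, which may differ from the syntax the lexicon builder accepts.

```python
from fnmatch import fnmatchcase

COMMANDS = ["HELP", "ADDENTRY", "EDIT", "DELETE", "LIST", "SHOW",
            "RELATIONS", "UNDEF", "PACK", "QUIT"]

def resolve_command(typed):
    """Expand an abbreviation to the single command it is a prefix of, else None."""
    matches = [c for c in COMMANDS if c.startswith(typed.upper())]
    return matches[0] if len(matches) == 1 else None

def list_entries(pattern, names):
    """Return the entry names matching a wild-card pattern, in sorted order."""
    return sorted(n for n in names if fnmatchcase(n, pattern))
```

Because the commands all have distinct initials, a single letter such as "l" resolves to LIST; a pattern such as "produc*" selects every entry name beginning with "produc".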


The design of the command interface takes into account both the available facilities and the expected users. The lexicon builder runs on a VAX 11/750, normally accessed with line-editing terminals. This suggests that a single-line command format is most appropriate. Since much of the work with the system is done over 300 baud telephone lines, conciseness is also important. The users have all had some programming experience (though not necessarily very much) so an operating-system-like interface is easy for them to get used to. If the lexicon builder becomes popular, we hope to have the opportunity to develop a more sophisticated interface, perhaps with a combination of features for beginners and more experienced users.

Structure of a lexical entry

A complete lexical entry consists of:

1. The "name" of the entry: its character-string form.

2. Its sense. We represent senses by simple numbers, not attempting to formally distinguish polysemy and homonymy, or any other degree of semantic difference. The system leaves to the user the problem of distinguishing different senses from extensions of a single sense: that is, where a word has already been entered in some sense, the user must decide whether to modify the entry for that sense or create a new entry for a new sense.

3. Part of speech, or "class." Our classification of parts of speech is basically the traditional classification with some convenient additions, largely drawn from the classification used by Sager in the LSP (Sager, 1981). Most of the additions are to the category of verbs: "verb" to the lexicon builder denotes the stem form, while the third person and past tense are distinguished as "finite verb", and the past and present participles are classified separately.

4. The text of the definition, entered by the user.

At this stage in our work, the definition is not parsed or otherwise analyzed, so its presence is more for purposes of documentation than anything else. In future versions of the lexicon builder, the definition will play an important role in constructing the entry but in the entry itself will be replaced by information derived from its analysis.

5. A list of attributes (or semantic features), each with its value, which may be binary or scalar.

6. A predicate calculus definition. For example, for the most common sense of the verb "promise", the predicate calculus definition is expressed as

    promise(x,y,z) = say(x,w,z)
        event(y) => w = will happen(y)
        thing(y) => w = will receive(z,y)

or, in freer form,

    (x promises y to z) = (x says w to z)
        where w = (y will happen) if y is an event
                  (z will receive y) if y is a physical object

This is entered by the user.

We have been inclined to think of the relational lexicon as a network, since the network representation vividly brings out the interconnected quality which the relational model gives to the lexicon. Predicate calculus is better in other respects; for instance, it expresses the above definition of "promise" much more elegantly than any network notation could. The two methods of representation have traditionally been seen as alternatives rather than as supplementing each other; we believe that predicate calculus has an important supplementary role to play in defining the core vocabulary of the lexicon, although we are not sure yet how to use it.

7. Case structure (for verbs). This is a table describing, for each syntactic slot associated with the verb (subject, direct object, etc.) the semantic case or cases that may be used in that slot ("agent", "experiencer", etc.), whether it is required, optional, or may be expressed elliptically (as with the direct and indirect object in "I promise!" referring to an earlier statement).

Space is reserved in this structure for selection restrictions. A relational model gives us the much more powerful option of indicating through relations such as "permissible subject", "permissible object", etc., not only what words may go with what others, but whether the usage is literal, a conventional figure of speech, fanciful, or whatever. Selection restrictions do, however, have the virtue of conciseness, and they permit us to make generalizations. Relational arcs may then be used to mark exceptions.

8. A list of zero or more relations, each with one or more pointers to other entries, to which the current entry is connected by that relation.

We find it convenient to treat morphological derivations such as plural of nouns, tenses and participles of verbs, as relations connecting separate entries. The entry for a regularly derived form such as a noun plural is a minimal one, consisting of name, sense, part of speech, and one relational arc, linking the entry to the stem form. The lexicon builder generates these regular forms automatically. It also distinguishes these "regular" entries from "undefined" entries, which have been entered indirectly as target words of relational arcs and which are on the queue accessed by UNDEF, as well as from "defined" entries.

[Figure 2 shows the fields of a lexical entry: name, sense, class, text of definition, attribute list, predicate calculus definition, case structure table, relation list.]

Figure 2. Structure of a lexical entry
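The entry structure of Figure 2 maps naturally onto a record type. The following Python sketch is an illustration only: the field names, the pluralization rule and the relation name "STEM" are hypothetical, and the actual lexicon builder stores entries as fixed-size file records rather than in-memory objects.

```python
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    name: str                     # character-string form
    sense: int                    # simple sense number
    word_class: str               # part of speech, e.g. "n"
    definition: str = ""          # text of the definition
    attributes: dict = field(default_factory=dict)      # attribute -> value
    predicate_calculus: str = ""  # predicate calculus definition
    case_structure: dict = field(default_factory=dict)  # slot -> cases (verbs)
    relations: dict = field(default_factory=dict)       # relation -> [(name, sense)]

def regular_plural(stem):
    """Build the minimal entry for a regular noun plural of `stem`."""
    return LexicalEntry(
        name=stem.name + "s",     # naive rule, for illustration only
        sense=stem.sense,
        word_class=stem.word_class,
        relations={"STEM": [(stem.name, stem.sense)]},  # hypothetical relation name
    )

lesion = LexicalEntry("lesion", 1, "n")
lesions = regular_plural(lesion)
```

The derived entry carries only a name, sense, class and one arc back to the stem; its other fields are present but empty, which corresponds to the overhead for unused fields in regular derived forms.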

File structure of the lexicon

There are four data files associated with the lexicon.

The first is the lexicon proper. The biggest complicating factor in the design of the lexicon is the extremely interconnected nature of the data; a change in one portion of the file may necessitate changes in many other places in the file. Each entry is linked through relational arcs to many other entries; and for every arc pointing from word1 to word2, there must be an inverse arc from word2 to word1. Whenever we create a new arc in the course of building or modifying an entry for word1, we must update the entry for word2 so that it will contain the appropriate inverse arc back to word1. Word2's entry has to be updated or created from scratch; we need to structure the lexicon file so that this updating process, which may take place anywhere in the file, can be done with the least possible dislocation.
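The bookkeeping for inverse arcs can be made concrete with a small sketch. It uses an in-memory dictionary instead of the lexicon file, and the "~" prefix for naming an inverse relation is a convention assumed here for illustration.

```python
def inverse(rel):
    """Hypothetical convention: the inverse of "NNABLE" is "~NNABLE" and vice versa."""
    return rel[1:] if rel.startswith("~") else "~" + rel

def add_arc(lexicon, word1, rel, word2):
    """Add the arc word1 REL word2 together with its inverse arc from word2."""
    lexicon.setdefault(word1, {}).setdefault(rel, set()).add(word2)
    # word2's entry is updated, or created from scratch if it does not exist.
    lexicon.setdefault(word2, {}).setdefault(inverse(rel), set()).add(word1)

lexicon = {}
add_arc(lexicon, "aphasia", "NNABLE", "speech")
```

A single call thus touches two entries, which is why every new arc may require an update at an arbitrary place in the file.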

aphasia (1) n

    definition
        a disorder of language due to injury to the brain

    attributes
        nonhuman
        collective

    predicate calculus
        have(x, aphasia) => ~able(speak(x))

    relations
        TAX [aphasia is a kind of x]
            deficit, disorder, loss, inability
        inverse TAX [x is a kind of aphasia]
            anomic, global, Gerstmann's, semantic, Wernicke's, Broca's, conduction, transcortical
        SYMPTOM [aphasia is a symptom of x]
            stroke, TIA
        ASSOC [aphasia may be associated with x]
            apraxia
        inverse CAUSE [x is a cause of aphasia]
            injury, lesion
        NNABLE [aphasia is the inability to do x]
            speech, language

Figure 3. Lexical entry for "aphasia"

The size of an entry can vary enormously. Regular derived forms contain only the name, sense, class and one relational arc (to the stem form), as well as a certain amount of overhead for the definition, predicate calculus definition and attribute list, although these are not used. The smallest possible entry takes up about thirty bytes. At the other extreme, a word may have an extensive attribute list, elaborate text and predicate calculus definitions, and dozens of relational arcs. "Aphasia", a moderately large entry with 19 arcs, occupies 322 bytes. Like all entries in the current lexicon, it will be subject to updating and will certainly become much larger.

With this range of entry sizes, the choice between fixed-size and variable-size records becomes somewhat painful. Variable-size records would be highly convenient as well as efficient except for the fact that when we add a new entry that is related to existing entries, we must add new arcs to those entries. The existing entries thus no longer fit into their previous space and must be either broken up or moved to a new space. The former option creates problems of identifying the various pieces of the entry; the latter requires that yet more existing entries be modified.

Because of these problems, we have opted for a fixed-size record. Some space is wasted, either in empty space if the record is too large or through proliferation of pointers if the record is too small; but the amount of necessary updating is much less, and the file can be kept in order through frequent use of the PACK command. The choice of record size is conditioned by many factors, system requirements as well as the range of entry sizes. We are currently working on determining the best record size for the MRH application.

So far the user does not have the option of saving or rejecting the results of a lexicon building session, since entries are written to the file as soon as they are created. We are studying ways of providing this option. A brute force way would be to keep the entire lexicon in memory and rewrite it at the end of the session. This is feasible if the host computer is large and the lexicon is small. The 2000-word lexicon for the Michael Reese stroke database takes up about a third of a megabyte, so this approach would work on a mainframe or a large minicomputer such as our VAX 750, but could not readily be ported to a smaller machine; nor could we handle a much larger vocabulary such as we plan to create with the automatic lexicon builder.

The second file is a directory, showing each entry's name, sense, and status (defined, undefined or regular derivative), with a pointer to the appropriate entry in the lexicon proper. The directory entries are linked in lexicographic order. When the lexicon builder is invoked, the entire directory is read into a buffer in memory, and this buffer is updated as entries are created, modified, or deleted. At the end of the lexicon building session, the updated directory is written out to disk.

The third (optional) file is a table of attributes, with pointers into the lexicon proper. This can be extended into a feature matrix.

The fourth (also optional) is a table of pre-defined relations. This table includes, for each relation:

(1) its mnemonic name.

(2) its properties. A relation may be reflexive, symmetric or transitive; there may be other properties worth including.

(3) a pointer to the relation's inverse. If x REL y, then we can define an inverse relation REL' such that y REL' x. If REL is reflexive or symmetric, then REL' = REL.

(4) the appropriate parts of speech for the words linked by the relation. For instance, the NNABLE relation links two nouns, while the collocational PREP relation links a preposition to a noun. Taxonomy can link any two words (apart from prepositions, conjunctions, etc.) as long as they are of the same part of speech: nouns to nouns, verbs to verbs, etc.

(5) the text of a prompt. ADDENTRY uses this prompt when querying the user for the occurrence of relational arcs involving this relation. For instance, if we are entering the word "promise" and our application uses the taxonomy relation, we might choose a short prompt, in which case the query for taxonomy might take the form

    "promise" T: [user enters word2 here]

or we could use something more explicit:

    "promise" is a kind of:

Users familiar with lexical-semantic relations might prefer the shorter mnemonic prompt, whereas other users might prefer a prompt that better expressed the significance of the relation.
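The five items of the relation table can be gathered into one record per relation, as in the sketch below. The field names, the "~" prefix for inverse names, the class labels and the value "same" (meaning that both words must share a part of speech) are assumptions made for this example. Only the transitivity of taxonomy and the parts of speech linked by each relation come from the text; the other property values are placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    name: str           # (1) mnemonic name
    transitive: bool    # (2) one of the relation's properties
    inverse: str        # (3) name of the inverse relation
    source_class: str   # (4) part of speech of word1, or "same"
    target_class: str   # (4) part of speech of word2, or "same"
    prompt: str         # (5) prompt text used when querying the user

RELATIONS = {
    "T": Relation("T", True, "~T", "same", "same", '"{word}" is a kind of: '),
    "NNABLE": Relation("NNABLE", False, "~NNABLE", "n", "n", '"{word}" NNABLE: '),
    "PREP": Relation("PREP", False, "~PREP", "prep", "n", '"{word}" PREP: '),
}

def arc_allowed(rel_name, class1, class2):
    """Check that an arc word1 REL word2 links appropriate parts of speech."""
    rel = RELATIONS[rel_name]
    if rel.source_class == "same":
        return class1 == class2
    return (class1, class2) == (rel.source_class, rel.target_class)

def prompt_for(rel_name, word):
    """Build the query shown to the user for one relation."""
    return RELATIONS[rel_name].prompt.format(word=word)
```

Swapping the prompt string of a relation is all that is needed to move between the short mnemonic query and the more explicit one.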

THE AUTOMATIC LEXICON BUILDER

Building a very large lexicon

There are numerous logistical problems in implementing the sort of very


large lexicon that would result from analysis of an entire dictionary, as the work of Amsler and White (1979) or Kelly and Stone (1975) shows. Integrating the lexicon builder with the LSP, and writing preprocessors for dictionary data, will also be big jobs. Fully automatic analysis of dictionary material, then, is a long-range goal.

A major problem in the relational analysis of the dictionary is that of determining what relations to use. Noun and verb definitions rely on taxonomy to a great extent (e.g., Amsler and White, 1979), but there are definitions that do not clearly fit this pattern; furthermore, even in a taxonomic definition, much semantic information is contained in the qualifying or differentiating part of the definition.

Adjective definitions are another problem area. Adjectives are usually defined in terms of nouns or verbs rather than other adjectives, so simple taxonomy does not work neatly. In a sample of about 7,000 definitions from W7, we identified nineteen major relations unique to adjective definitions, and these covered only half of the sample. The remaining definitions were much more varied and would probably require far more than nineteen additional relations. And for each relation, we had to identify words or phrases (the "defining formulas") that signaled the presence of the relation.
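The use of defining formulas can be sketched as a table lookup on the opening words of a definition. The Python fragment below is a minimal illustration only: the three formulas and their relation mnemonics are placeholders invented for the example, not the nineteen relations identified in the W7 sample, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

# Placeholder formula table: (defining formula, relation mnemonic).
# Both columns are invented for illustration.
FORMULAS: List[Tuple[str, str]] = [
    ("of or relating to", "RELATED-TO"),
    ("capable of being", "ABLE"),
    ("full of", "FULL-OF"),
]


def match_formula(definition: str) -> Optional[Tuple[str, str]]:
    """Return (relation, remainder) if a defining formula opens the definition."""
    # Normalize case and whitespace before comparing.
    text = " ".join(definition.lower().split())
    # Try longer formulas first so a short formula cannot shadow a longer one.
    for formula, relation in sorted(FORMULAS, key=lambda fr: -len(fr[0])):
        if text == formula or text.startswith(formula + " "):
            return relation, text[len(formula):].strip()
    return None


print(match_formula("of or relating to the sun"))
print(match_formula("a small dog"))
```

A definition that opens with no known formula returns None, which corresponds to the half of the sample that the nineteen relations did not cover.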

The MTQ model

For these reasons as well as theoretical ones, we need a simplifying model of relations, a model that enables us either to avoid the endless identification of new relations or to conduct the identification within an orderly framework. Werner's MTQ schema (Werner, 1978; Werner and Topper, 1976) seems to provide the basis for such a model.

Werner identifies only three relations: modification, taxonomy and queueing. He asserts that all other relations can be expressed as compounds of these relations and of lexical items. For instance, the PART relation can be expressed, with the help of the lexical item "part", by the relational arcs

Broca's area T part

which say in effect that Broca's area is a kind of part, specifically a "brain-part."

Taxonomy reflects Aristotle's model of the definition as consisting of species, genus and differentiae. Taxonomy links the species to the genus and modification links the differentiae to the genus. A study of definitions in W7 and LDOCE shows that they do indeed follow this pattern, although (as in adjective definitions) the pattern is not always obvious.

The special power of MTQ in the analysis of definitions is that in a definition following the Aristotelian structure, taxonomy and modification can be identified by purely syntactic means. One (or occasionally more than one) word in the definition is modified directly or indirectly by all the other words. The core word is linked to the defined word by taxonomy; all the others are linked to the core word by modification. (Queueing so far does not seem to be important in the analysis of definitions.)
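Once a parse has located the core word, the arcs follow mechanically. The sketch below assumes that the parser supplies the position of the core word, and it flattens direct and indirect modification into a single M arc per word. The function name, the arc representation and the sample definition ("puppy" defined as "young dog") are assumptions made for illustration.

```python
from typing import List, Tuple

Arc = Tuple[str, str, str]  # (word1, relation mnemonic, word2)


def mtq_arcs(headword: str, definition: List[str], core_index: int) -> List[Arc]:
    """Derive taxonomy (T) and modification (M) arcs from a parsed definition.

    core_index is assumed to come from the parser: it marks the one word
    that all other words in the definition modify, directly or indirectly.
    """
    core = definition[core_index]
    # The defined word (species) is linked to the core word (genus) by taxonomy.
    arcs: List[Arc] = [(headword, "T", core)]
    # Every other word (differentia) is linked to the core word by modification.
    arcs += [(w, "M", core) for i, w in enumerate(definition) if i != core_index]
    return arcs


# Invented example: "puppy" defined as "young dog", with "dog" as the core word.
for arc in mtq_arcs("puppy", ["young", "dog"], core_index=1):
    print(arc)
```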

In order to avoid certain ambiguities that arise in a very elaborate network such as that generated from a large dictionary, we have replaced the separate modification and taxonomy arcs with a single, ternary relational arc that keeps the species, genus and differentiating items of any particular definition linked to each other.
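One plausible reading of the ambiguity is that, with separate arcs, two definitions sharing a genus leave their modifiers attached to the same genus node, so that it is no longer clear which modifier belongs to which species. A ternary arc can be sketched as a three-field record; the type name, field names and the two sample definitions below are assumptions made for illustration, not the implemented structure.

```python
from typing import List, NamedTuple, Tuple


class DefinitionArc(NamedTuple):
    """A single ternary arc tying together the three parts of one definition."""
    species: str                   # the defined word
    genus: str                     # the core word of the definition
    differentiae: Tuple[str, ...]  # the modifying words of that same definition


def differentiae_of(arcs: List[DefinitionArc], species: str) -> List[Tuple[str, ...]]:
    """Return the differentiae recorded for each definition of a given species."""
    return [a.differentiae for a in arcs if a.species == species]


# Invented definitions sharing the genus "dog". With separate M arcs, both
# "young" and "hunting" would simply modify "dog"; the ternary arc keeps each
# modifier with its own species.
network = [
    DefinitionArc("puppy", "dog", ("young",)),
    DefinitionArc("hound", "dog", ("hunting",)),
]

print(differentiae_of(network, "puppy"))
print(differentiae_of(network, "hound"))
```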

The problem of identifying "higher level" relations such as PART and NNABLE in an MTQ network still remains. At this point it seems to be similar to the problem of identifying higher level relations from defining formulas.

Another pleasant discovery is that the Linguistic String Parser, which we have used successfully for some years, is exceptionally well suited for this strategy, since it is geared toward an analysis of sentences and phrases in terms of "centers" or "cores" with their modifying "adjuncts", which is exactly the kind of analysis we need to do.

Design of the automatic lexicon builder

The automatic lexicon builder will contain at least the following subsystems:

1. The standard data structure for the lexical entry, as described for the interactive lexicon builder, with slight changes to adjust to the use of MTQ.

The relation list is presently structured as a linked list of relations, each pointing to a linked list of word2s. ("Word2" refers to any word related to the


gating.) Incorporating the ternary MTQ model, we would have two relation lists: a T list and an M list. The T list would

would be identical to the present relation

tions. Each of these lexical entry pointers would, like the relation nodes in the existing implementation, point to a linked list of word2s. The word2s in the T list would be connected to the T words by an inverse-modification relation ('M) and the word2s in the M list would be connected to the M words by inverse taxonomy ('T).

preprocessor need not be intelligent; its

rating this from the definition proper.

Part of the preprocessing phase is to generate a "dictionary" for the LSP. This

helpful but not necessary. Sager and her associates (198B) have created programs to do this.

file in standard form, perhaps optionally noting where further information would be

version of the system and allows the user to "improve" on dictionary data as well as to observe the results of the dictionary parse.

module, the LSP will parse the definition to produce a parse tree which will then

linked into the overall lexical network

like the preprocessor, can be tailored to the user's needs.

SUMMARY

lexicon for natural language processing to generate lexical entries interactively and link them automatically to other lexical

of commands that allow the user to create,

entries, among other operations.

reports by a diagnostic expert system. It can equally well be used in any other sub-

possible, with models of lexicon structure other than the relational model on which it is based.

further intended as the starting point for a fully automatic lexicon building program which will create a large, general purpose

dictionary text, using a slightly modified

Queueing relational model.

REFERENCES

Ahlswede, Thomas E., and Evens, Martha W.,

1983 "Generating a Relational Lexicon

sity, Rochester, Michigan

Ahlswede, Thomas E., and Evens, Martha W.,

1984. "A Lexicon for a Medical Expert System." Presented at the Workshop on Relational Models, Coling '84, Stanford University, Palo Alto, California.

Definitions." In S Williams, ed Humans

Language, Ablex

tics Research Center, University of Texas

"Generating Case Reports from the Michael

Michigan, April

Evens, Martha W., Vandendorpe, James, and

Semantic Relations in Information Retriev-

Ablex


Iris, Madelyn, Litowitz, Bonnie, and

Investigation of Semantic Primitives."

New York

Addison-Wesley,

198~ Research into Methods for Automatic

13, New York University

Lexical/Semantic Fields." In M Loflin

Mouton, The Hague

Ethnoscience Lexicography and Ethnoscience Ethnographies." In C Rameh, ed., Seman-

Language and Linguistics
