
Sense-Linking in a Machine Readable Dictionary

Robert Krovetz

Department of Computer Science

University of Massachusetts, Amherst, MA 01003

Abstract

Dictionaries contain a rich set of relationships between their senses, but often these relationships are only implicit. We report on our experiments to automatically identify links between the senses in a machine-readable dictionary. In particular, we automatically identify instances of zero-affix morphology, and use that information to find specific linkages between senses. This work has provided insight into the performance of a stochastic tagger.

1 Introduction

Machine-readable dictionaries contain a rich set of relationships between their senses, and indicate them in a variety of ways. Sometimes the relationship is provided explicitly, such as with a synonym or antonym reference. More commonly the relationship is only implicit, and needs to be uncovered through outside mechanisms. This paper describes our efforts at identifying these links.

The purpose of the research is to obtain a better understanding of the relationships between word meanings, and to provide data for our work on word-sense disambiguation and information retrieval. Our hypothesis is that retrieving documents on the basis of word senses (instead of words) will result in better performance. Our approach is to treat the information associated with dictionary senses (part of speech, subcategorization, subject area codes, etc.) as multiple sources of evidence (cf. Krovetz [3]). This process is fundamentally a divisive one, and each of the sources of evidence has exceptions (i.e., instances in which senses are related in spite of being separated by part of speech, subcategorization, or morphology). Identifying related senses will help us to test the hypothesis that unrelated meanings will be more effective at separating relevant from nonrelevant documents than meanings which are related.

We will first discuss some of the explicit indications of sense relationships as found in usage notes and deictic references. We will then describe our efforts at uncovering the implicit relationships via stochastic tagging and word collocation.

2 Explicit Sense Links

The dictionary we are using in our research, the Longman Dictionary of Contemporary English (LDOCE), is a dictionary for learners of English as a second language. As such, it provides a great deal of information about word meanings in the form of example sentences, usage notes, and grammar codes. The Longman dictionary is also unique among learner's dictionaries in that its definitions are generally written using a controlled vocabulary of approximately 2200 words. When exceptions occur they are indicated by means of a different font. For example, consider the definition of the word gravity:

• gravity n 1b worrying importance: He doesn't understand the gravity of his illness - see GRAVE²

• grave adj 2 important and needing attention and (often) worrying: This is grave news - The sick man's condition is grave

These definitions serve to illustrate how words can be synonymous¹ even though they have different parts of speech. They also indicate how the Longman dictionary not only indicates that a word is a synonym, but sometimes specifies the sense of that word (indicated in this example by the superscript following the word 'GRAVE'). This is extremely important because synonymy is not a relation that holds between words, but between the senses of words. Unfortunately these explicit sense indications are not always consistently provided. For example, the definition of 'marbled' provides an explicit indication of the appropriate sense of 'marble' (the stone instead of the child's toy), but this is not done within the definition of 'marbles'.

LDOCE also provides explicit indications of sense relationships via usage notes. For example, the definition for argument mentions that it derives from both senses of argue: to quarrel (to have an argument), and to reason (to present an argument). The notes also provide advice regarding similar looking variants (e.g., the difference between distinct and distinctive, or the fact that an attendant is not someone who attends a play, concert, or religious service). Usage notes can also specify information that is shared among some word meanings, but not others (e.g., the note for venture mentions that both verb and noun carry a connotation of risk, but this isn't necessarily true for adventure).

Finally, LDOCE provides explicit connections between senses via deictic reference (links created by 'this', 'these', 'that', 'those', 'its', 'itself', and 'such a/an'). That is, some of the senses use these words to refer to a previous sense (e.g., 'the fruit of this tree', or 'a plant bearing these seeds'). These relationships are important because they allow us to get a better understanding of the nature of polysemy (related word meanings). Most of the literature on polysemy only provides anecdotal examples; it usually does not provide information about how to determine whether word meanings are related, what kind of relationships there are, or how frequently they occur. The grouping of senses in a dictionary is generally based on part of speech and etymology, but part of speech is orthogonal to a semantic relationship (cf. Krovetz [3]), and word senses can be related etymologically, but be perceived as distinct at the present time (e.g., the 'cardinal' of a church and 'cardinal' numbers are etymologically related). By examining deictic reference we gain a better understanding of senses that are truly related, and it also helps us to understand how language can be used creatively (i.e., how senses can be productively extended). Deictic references are also important in the design of an algorithm for word-sense disambiguation (e.g., exceptions to subcategorization).

¹ We take two words to be synonymous if they have the same or closely related meanings.

The primary relations we have identified so far are: substance/product (tree:fruit or wood, plant:flower or seeds), substance/color (jade, amber, rust), object/shape (pyramid, globe, lozenge), animal/food (chicken, lamb, tuna), count-noun/mass-noun,² language/people (English, Spanish, Dutch), animal/skin or fur (crocodile, beaver, rabbit), and music/dance (waltz, conga, tango).³
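The deictic links described above can be located mechanically. The following is a minimal sketch, not the procedure used in this work: the function name `deictic_links`, the rule that a marker in sense k points at sense k-1, and the sample entry are all assumptions made for illustration.

```python
import re

# Deictic markers listed in the text; 'such a/an' is matched as a phrase.
DEICTIC = re.compile(r"\b(this|these|that|those|its|itself|such an?)\b", re.IGNORECASE)

def deictic_links(senses):
    """Return (sense_number, referenced_sense_number, marker) triples.

    Assumption: a marker in sense k refers to sense k-1. The text only says
    that the reference is to "a previous sense", so this is a simplification.
    """
    links = []
    for number, text in enumerate(senses, start=1):
        match = DEICTIC.search(text)
        if match and number > 1:
            links.append((number, number - 1, match.group(1).lower()))
    return links

# Hypothetical entry, written for this example (not quoted from LDOCE).
olive = [
    "a tree grown in Mediterranean countries",
    "the fruit of this tree",
    "a dull green colour",
]
print(deictic_links(olive))  # [(2, 1, 'this')]
```

A pattern match of this kind only proposes candidates. In particular, 'that' also occurs as a relative pronoun, so each hit would still need checking before it is treated as a sense link.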

3 Zero-Affix Morphology

Deictic reference provides us with different types of relationships within the same part of speech. We can also get related senses that differ in part of speech, and these are referred to as instances of zero-affix morphology or functional shift. The Longman dictionary explicitly indicates some of these relationships by homographs that have more than one part of speech. It usually provides an indication of the relationship by a leading parenthesized expression. For example, the word bay is defined as N,ADJ, and the definition reads '(a horse whose color is) reddish-brown'. However, out of the 41122 homographs defined, there are only 695 that have more than one part of speech. Another way in which LDOCE provides these links is by an explicit sense reference for a word outside the controlled vocabulary; the definition of anchor (v) reads: 'to lower an anchor¹ (1) to keep (a ship) from moving'. This indicates a reference to sense 1 of the first homograph.

² These may or may not be related; consider 'computer vision' vs 'visions of computers'. The related senses are usually indicated by the defining formula: 'an example of this'.

³ The related senses are sometimes merged into one; for example, the definition of foxtrot is '(a piece of music for) a type of formal dance'.

Zero-affix morphology is also present implicitly, and we conducted an experiment to try to identify instances of it using a probabilistic tagger [2]. The hypothesis is that if the word that's being defined (the definiendum) occurs within the text of its own definition, but occurs with a different part of speech, then it will be an instance of zero-affix morphology.
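The hypothesis amounts to a simple filter over tagged definitions. The sketch below assumes the tagger output is already available as (token, tag) pairs; the function name `zero_affix_candidate`, the tag labels, and the hand-tagged examples are hypothetical and do not reflect the tagger's actual tag set or interface.

```python
def zero_affix_candidate(headword, head_pos, tagged_definition):
    """Return True if the headword occurs in its own definition with a
    part of speech different from the one being defined.

    Only exact (case-insensitive) matches are checked; inflected forms of
    the headword are ignored in this sketch.
    """
    return any(
        token.lower() == headword.lower() and pos != head_pos
        for token, pos in tagged_definition
    )

# Hand-tagged stand-ins for tagger output (tag names are assumptions).
wad_verb = [("to", "TO"), ("make", "V"), ("a", "DET"), ("wad", "N"), ("of", "PREP")]
walk_verb = [("to", "TO"), ("move", "V"), ("on", "PREP"), ("foot", "N")]

print(zero_affix_candidate("wad", "V", wad_verb))    # True
print(zero_affix_candidate("walk", "V", walk_verb))  # False
```

A positive result is only as reliable as the tags it is computed from, so the filter yields candidates rather than confirmed zero-morphs.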

The question is: How do we tell whether or not we have an instance of zero-affix morphology when there is no explicit indication of a suffix? Part of the answer is to rely on subjective judgment, but we can also support these judgments by making an analogy with derivational morphology. For example, the word wad is defined as 'to make a wad of'. That is, the noun bears the semantic relation of formation to the verb that defines it. This is similar to the effect that the morpheme -ize has on the noun union in order to make the verb unionize (cf. Marchand [5]).

The experiment not only gives us insight into semantic relatedness across part of speech, it also enabled us to determine the effectiveness of tagging. We initially examined the results of the tagger on all words starting with the letter 'W'; this letter was chosen because it provided a sufficient number of words for examination, but wasn't so small as to be trivial. There were a total of 1141 words that were processed, which amounted to 1309 homographs and 2471 word senses; of these senses, 209 were identified by the tagger as containing the definiendum with a different part of speech. We analyzed these instances and the result was that only 51 of the 209 instances were found to be correct (i.e., actual zero-morphs).

The instances that are indicated as correct are currently based on our subjective judgment; we are in the process of examining them to identify the type of semantic relation and any analog to a derivational suffix. The instances that were not found to be correct (78 percent of the total) were due to incorrect tagging; that is, we had a large number of false positives because the tagger did not correctly identify the part of speech. We were surprised that the number of incorrect tags was so high given the performance figures cited in the literature (more than a 90 percent accuracy rate). However, the figures reported in the literature were based on word tokens, and 60 percent of all word tokens have only one part of speech to begin with. We feel that the performance figures should be supplemented with the tagger's performance on word types as well. Most word types are rare, and the stochastic methods do not perform as well on them because they do not have sufficient information. Church has plans for improving the smoothing algorithms used in his tagger, and this would help on these low frequency words. In addition, we conducted a failure analysis and it indicated that 91% of the errors occurred in idiomatic expressions (45 instances) or example sentences (98 instances). We therefore eliminated these from further processing and tagged the rest of the dictionary. We are still in the process of analyzing these results.

4 Derivational Morphology

Word collocation is one method that has been proposed as a means for identifying word meanings. The basic idea is to take two words in context, and find the definitions that have the most words in common. This strategy was tried by Lesk using the Oxford Advanced Learner's Dictionary [4]. For example, the word 'pine' can have two senses: a tree, or sadness (as in 'pine away'), and the word 'cone' may be a geometric structure, or a fruit of a tree. Lesk's program computes the overlap between the senses of 'pine' and 'cone', and finds that the senses meaning 'tree' and 'fruit of a tree' have the most words in common. Lesk gives a success rate of fifty to seventy percent in disambiguating the words over a small collection of text. Later work by Becker on the New OED indicated that Lesk's algorithm did not perform as well as expected [1].
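The overlap computation can be illustrated in a few lines. This is a bare-bones sketch of the idea, not Lesk's program: the glosses are paraphrases written for this example rather than quoted dictionary text, and the names `overlap` and `best_sense_pair` are assumptions.

```python
def overlap(gloss_a, gloss_b):
    """Number of distinct words shared by two definitions."""
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

def best_sense_pair(senses_a, senses_b):
    """Return (score, index_a, index_b) for the pair of senses whose
    definitions share the most words (indices are 0-based)."""
    return max(
        (overlap(a, b), i, j)
        for i, a in enumerate(senses_a)
        for j, b in enumerate(senses_b)
    )

# Illustrative glosses (assumed wording).
pine = [
    "kind of evergreen tree with needle-shaped leaves",
    "waste away through sorrow or illness",
]
cone = [
    "solid body which narrows to a point",
    "fruit of certain evergreen tree",
]

print(best_sense_pair(pine, cone))  # (3, 0, 1)
```

Here the tree sense of 'pine' and the fruit sense of 'cone' share 'of', 'evergreen', and 'tree'. Because nothing is filtered out, a function word such as 'of' counts toward the score just as a content word does.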

The difficulty with the word overlap approach is that a wide range of vocabulary can be used in defining a word's meaning. It is possible that we will be more likely to have an overlap in a dictionary with a restricted defining vocabulary. When the senses to be matched are further restricted to be morphological variants, the approach seems to work very well. For example, consider the definitions of the word 'appreciate' and 'appreciation':

• appreciate

  1 to be thankful or grateful for
  2 to understand and enjoy the good qualities of
  3 to understand fully
  4 to understand the high worth of
  5 (of property, possessions, etc.) to increase in value

• appreciation

  1 judgment, as of the quality, worth, or facts of something
  2 a written account of the worth of something
  3 understanding of the qualities or worth of something
  4 grateful feelings
  5 rise in value, esp of land or possessions

The word overlap approach pairs up sense 1 with sense 4 (grateful), sense 2 with sense 3 (understand; qualities), sense 3 with sense 3 (understand), sense 4 with sense 1 (worth), and sense 5 with sense 5 (value; possessions). The matcher we are using ignores closed class words, and makes use of a simple morphological analyzer (for inflectional morphology). It ignores words found in example sentences (preliminary experiments indicated that this didn't help and sometimes made matches worse), and it also ignores typographical codes and usage labels (formal/informal, poetic, literary, etc.). It also doesn't try to make matches between word senses that are idiomatic (these are identified by font codes). We are currently in the process of determining the effectiveness of the approach. The experiment involves comparing the morphological variations for a set of queries used in an information retrieval test collection. We have manually identified all variations of the words in the queries as well as the root forms. Those variants that appear in LDOCE will be compared against all root forms and the result will be examined to see how well the overlap method was able to identify the correct sense of the variant with the correct sense of the root.
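A matcher of this general shape can be sketched as follows, using the 'appreciate' and 'appreciation' definitions given above. The stop list, the crude suffix stripper standing in for the morphological analyzer, and the rule of taking the highest-scoring variant sense (lowest sense number on ties) are all assumptions; the scoring details of the actual matcher are not specified here.

```python
import re

# Assumed closed-class stop list, just large enough for this example.
CLOSED_CLASS = {"a", "an", "the", "of", "to", "be", "or", "and", "for",
                "in", "as", "etc", "esp", "something"}

def stem(word):
    """Crude suffix stripper standing in for a morphological analyzer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def content_words(definition):
    """Lower-case, drop closed-class words, and reduce the rest to stems."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return {stem(t) for t in tokens if t not in CLOSED_CLASS}

def pair_senses(root_senses, variant_senses):
    """Map each root sense number to the variant sense number with the
    largest word overlap (lowest sense number wins ties)."""
    variant_sets = [content_words(d) for d in variant_senses]
    pairing = {}
    for i, definition in enumerate(root_senses, start=1):
        words = content_words(definition)
        scores = [len(words & v) for v in variant_sets]
        pairing[i] = scores.index(max(scores)) + 1
    return pairing

APPRECIATE = [
    "to be thankful or grateful for",
    "to understand and enjoy the good qualities of",
    "to understand fully",
    "to understand the high worth of",
    "(of property, possessions, etc.) to increase in value",
]
APPRECIATION = [
    "judgment, as of the quality, worth, or facts of something",
    "a written account of the worth of something",
    "understanding of the qualities or worth of something",
    "grateful feelings",
    "rise in value, esp of land or possessions",
]

print(pair_senses(APPRECIATE, APPRECIATION))  # {1: 4, 2: 3, 3: 3, 4: 3, 5: 5}
```

Under these assumptions senses 1, 2, 3, and 5 pair as described above, while sense 4 scores highest against sense 3 (sharing 'understand' and 'worth') rather than sense 1, so the sketch should not be read as reproducing the matcher's exact behavior.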

5 Conclusion

The purpose of this work is to gain a better understanding of the relationships between word meanings, and to help in the development of an algorithm for word sense disambiguation. Our approach is based on treating the information associated with dictionary senses (part of speech, subcategorization, subject area codes, etc.) as multiple sources of evidence (cf. Krovetz [3]). This process is fundamentally a divisive one, and each of the sources of evidence has exceptions (i.e., instances in which senses are related in spite of being separated by part of speech, subcategorization, or morphology). Identifying the relationships we have described will help us to determine these exceptions.

References

[1] Becker B., "Sense Disambiguation using the [...]sis, University of Waterloo, 1989

[2] Church K., "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text", [...] 1988

[3] Krovetz R., "Lexical Acquisition and Information Retrieval", in Lexical Acquisition: Building the Lexicon Using On-Line Resources, U. Zernik (ed), pp. 45-64, 1991

[4] Lesk M., "Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to tell a Pine Cone from an Ice Cream Cone", Proceed[...]

[5] Marchand H., "On a Question of Contrary Analysis with Derivational Connected but Morphologically Uncharacterized Words", English [...]
