RAISINS, SULTANAS AND CURRANTS: LEXICAL
CLASSIFICATION AND ABSTRACTION VIA CONTEXT PRIMING

David J. Hutches

Department of Computer Science and Engineering, Mail Code 0114
University of California, San Diego
La Jolla, CA 92093-0114
dhutches@ucsd.edu
Abstract

In this paper we discuss the results of experiments which use a context, essentially an ordered set of lexical items, as the seed from which to build a network representing statistically important relationships among lexical items in some corpus. A metric is then applied to the nodes in the network in order to discover those pairs of items related by high indices of similarity. The goal of this research is to instantiate a class of items corresponding to each item in the priming context. We believe that this instantiation process is ultimately a special case of abstraction over the entire network; in this abstraction, similar nodes are collapsed into meta-nodes which may then function as if they were single lexical items.
I Motivation and Background

With respect to the processing of language, one of the tasks at which human beings seem relatively adept is the ability to determine when it is appropriate to make generalizations and when it is appropriate to preserve distinctions. The process of abstraction, and knowing when it might reasonably be used, is a necessary tool in reducing the complexity of the task of processing natural language. Part of our current research is an investigation into how the process of abstraction might be realized using relatively low-level statistical information extracted from large textual corpora.
Our experiments are an attempt to discover a method by which class information about the members of some sequence of lexical items may be obtained using strictly statistical methods. For our purposes, the class to which a lexical item belongs is defined by its instantiation. Given some context such as he walked across the room, we would like to be able to instantiate classes of items corresponding to each item in the context (e.g., the class associated with walked might include items such as paced, stepped, or sauntered).
The corpora used in our experiments are the Lancaster-Oslo-Bergen (LOB) corpus and a subset of the ACL/DCI Wall Street Journal (WSJ) corpus. The LOB corpus consists of a total of 1,008,035 words, composed of 49,174 unique words. The subset of the WSJ corpus that we use has been pre-processed such that all letters are folded to lower case and numbers have been collapsed to a single token; the subset consists of 18,188,548 total words and 159,713 unique words.
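The pre-processing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the placeholder token name `<num>` and the numeral pattern are our own assumptions.

```python
import re

def preprocess(tokens):
    """Fold case and collapse numbers, mirroring the WSJ pre-processing
    described in the text. The token '<num>' is an illustrative choice."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        # Collapse any token that looks like a numeral (digits possibly
        # mixed with separators) into a single placeholder token.
        if re.fullmatch(r"\d[\d.,/-]*", tok):
            tok = "<num>"
        out.append(tok)
    return out

print(preprocess(["The", "DJIA", "rose", "39.18", "points"]))
# → ['the', 'djia', 'rose', '<num>', 'points']
```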
II Context Priming

It is not an uncommon notion that a word may be defined not rigorously, as by the assignment of static syntactic and semantic classes, but dynamically, as a function of its usage (Firth 1957, 11). Such usage may be derived from co-occurrence information over the course of a large body of text. For each unique lexical item in a corpus, there exists an "association neighbourhood" in which that item lives; such a neighbourhood is the probability distribution of the words with which the item has co-occurred. If one posits that similar lexical items will have similar neighbourhoods, one possible method of instantiating a class of lexical items would be to examine all unique items in a corpus and find those whose neighbourhoods are most similar to the neighbourhood of the item whose class is being instantiated. However, the potential computational problems of such an approach are clear. In the context of our approach to this problem, most lexical items in the search space are not even remotely similar to the item for which a class is being instantiated. Furthermore, a substantial part of a lexical item's association neighbourhood provides only superficial information about that item. What is required is a process whereby the search space is reduced dramatically. One method of accomplishing this pruning is via context priming.

In context priming, we view a context as the seed upon which to build a network describing that part of the corpus which is, in some sense, close to the context. Thus, just as an individual lexical item has associated with it a unique neighbourhood, so too does a context have such a neighbourhood. The basic process of building a network is straightforward. Each item in the priming context has associated with it a unique neighbourhood defined in terms of those lexical items with which it has co-occurred. Similarly, each of these
latter items also has a unique association neighbourhood. Generating a network based on some context consists in simply expanding nodes (lexical items) further and further away from the context until some threshold, called the depth of the network, is reached.
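The expansion procedure just described can be sketched as a breadth-first traversal from the seed context. The data structures and names below are illustrative assumptions, since the paper does not give an implementation; here a neighbourhood is simply a set of co-occurring items.

```python
from collections import deque

def build_network(cooccur, context, depth):
    """Expand a co-occurrence network outward from a seed context.
    `cooccur` maps each lexical item to the set of items it has
    co-occurred with; expansion stops at the given network depth."""
    network = set(context)
    frontier = deque((item, 0) for item in context)
    while frontier:
        item, d = frontier.popleft()
        if d >= depth:
            continue  # depth threshold reached; do not expand further
        for neighbour in cooccur.get(item, ()):
            if neighbour not in network:
                network.add(neighbour)
                frontier.append((neighbour, d + 1))
    return network

cooccur = {"walked": {"paced", "room"}, "room": {"door"}, "door": {"key"}}
print(sorted(build_network(cooccur, ["walked"], depth=2)))
# → ['door', 'paced', 'room', 'walked']
```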
Just as we prune the total set of unique lexical items by context priming, we also prune the neighbourhood of each node in the network by using a statistical metric which provides some indication of how important the relationship is between each lexical item and the items in its neighbourhood. In the results we describe here, we use mutual information (Fano 1961, 27-28; Church and Hanks 1990) as the metric for neighbourhood pruning, pruning which occurs as the network is being generated. Yet another parameter controlling the topology of the network is the extent of the "window" which defines the neighbourhood of a lexical item (e.g., does the neighbourhood of a lexical item consist of only those items which have co-occurred at a distance of up to 3, 5, 10, or 1000 words from the item).
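The pruning metric can be stated concretely. This sketch uses the standard pointwise mutual information formula popularized for word pairs by Church and Hanks (1990); the dictionary-based neighbourhood representation and the function names are our own assumptions, not the paper's code.

```python
import math

def mutual_information(p_xy, p_x, p_y):
    """Pointwise mutual information I(x; y) = log2(P(x,y) / (P(x)P(y))).
    Positive values indicate the pair co-occurs more often than chance."""
    return math.log2(p_xy / (p_x * p_y))

def prune_neighbourhood(neighbourhood, threshold):
    """Keep only neighbours whose mutual information with the node meets
    the threshold (the experiments in the text use a threshold of 6.0)."""
    return {w: mi for w, mi in neighbourhood.items() if mi >= threshold}
```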
III Operations on the Network

The network primed by a context consists merely of those lexical items which are closely reachable via co-occurrence from the priming context. Nodes in the network are lexical items; arcs represent co-occurrence relations and carry the value of the statistical metric mentioned above and the distance of co-occurrence. With such a network we attempt to approximate the statistically relevant neighbourhood in which a particular context might be found.
In the tests performed on the network thus far we use the similarity metric

    S(x, y) = |A ∩ B|² / |A ∪ B|

where x and y are two nodes representing lexical items, the neighbourhoods of which are expressed as the sets of arcs A and B respectively. The metric S is thus defined in terms of the cardinalities of sets of arcs. Two arcs are said to be equal if they reference (point to) the same lexical item at the same offset distance. Our metric is a modification of the Tanimoto coefficient (Bensch and Savitch 1992); the numerator is squared in order to assign a higher index of similarity to those nodes which have a higher percentage of arcs in common.
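Given arc sets, the metric S is simple to compute. Representing an arc as a (lexical item, offset distance) pair follows the equality condition stated above; the convention of returning 0.0 for two empty neighbourhoods is our own assumption.

```python
def similarity(arcs_a, arcs_b):
    """The metric S(x, y) = |A ∩ B|**2 / |A ∪ B|, where each arc is a
    (lexical item, offset distance) pair; two arcs are equal when both
    components match."""
    a, b = set(arcs_a), set(arcs_b)
    union = a | b
    if not union:
        return 0.0  # convention for two empty neighbourhoods
    return len(a & b) ** 2 / len(union)

a = {("room", 1), ("across", 2), ("the", 3)}
b = {("room", 1), ("across", 2), ("hall", 3)}
print(similarity(a, b))  # |A ∩ B| = 2, |A ∪ B| = 4 → 1.0
```

Note that squaring the numerator means S is not bounded by 1: nodes sharing many arcs relative to their union can score well above 1, which is why the thresholds discussed in Section IV sit around 1.0 rather than in [0, 1].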
Our first set of tests concentrated directly on items in the seed context. Using the metric above, we attempted to instantiate classes of lexical items for each item in the context. In those cases where there were matches, the results were often encouraging. For example, in the LOB corpus, using the seed context John walked across the room, a network depth of 6, a mutual information threshold of 6.0 for neighbourhood pruning, and a window of 5, for the item John we instantiated the class {Edward, David, Charles, Thomas}. A similar test on the WSJ corpus yielded the following class for john:

    {richard, paul, thomas, edward, david, donald, daniel, frank, michael, dennis, joseph, jim, alan, dan, roger}

Recall that the subset of the WSJ corpus we use has had all items folded to lower case as part of the pre-processing phase; thus all items in an instantiated class will also be folded to lower case.
In other tests, the instantiated classes were less satisfying, such as the following class generated for wife using the parameters above, the LOB, and the context his wife walked across the room:

    {mouth, father, uncle, lordship, fingers, mother, husband, father's, shoulder, mother's, brother}

In still other cases, a class could not be instantiated at all, typically for items whose neighbourhoods were too small to provide meaningful matching information.
IV Abstraction

It is clear that even the most perfectly derived lexical classes will have members in common. The different senses of bank are often given as the classic example of a lexically ambiguous word. From our own data, we observed this problem because of our preprocessing of the WSJ corpus; the instantiation of the class associated with mark included some proper names, but also included items such as marks, currencies, yen, and dollar, a confounding of class information that would not have occurred had case folding not taken place. Ideally, it would be useful if a context could be made to exert a more constraining influence during the course of instantiating classes. For example, if it is reasonably clear from a context, such as mark loves mary, that the "mark" in question is the human rather than the financial variety, how may we ensure that the context provides the proper constraining information if loves has never co-occurred with mark in the original corpus?
In the case of the ambiguous mark above, while this item does not appear in the neighbourhood of loves, other lexical items do (e.g., everyone, who, him, mr), items which may be members of a class associated with mark. What is proposed, then, is to construct incrementally classes of items over the network, such that these classes may then function as a single item for the purpose of deriving indices of similarity. In this way, we would not be looking for a specific match between mark and loves, but rather a match among items in the same class as mark, items in the same class as loves, and items in the same class as mary. With this in mind, our second set of experiments concentrated not specifically on items in the priming context, but on the entire network, searching for candidate items to be collapsed into meta-nodes representing classes of items.
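One greedy pass of the pairwise collapsing just described might be sketched as follows. This is our own illustration of the idea, not the paper's procedure; in particular, the assumption that a meta-node's arc set is the union of its members' arcs is ours.

```python
def merge_pass(neighbourhoods, threshold):
    """One greedy pass of pairwise merging: collapse each pair of nodes
    whose similarity S exceeds the threshold into a meta-node whose arc
    set is the union of its members' arcs, so that the meta-node can
    then function as a single item in later comparisons."""
    def S(a, b):
        u = a | b
        return len(a & b) ** 2 / len(u) if u else 0.0

    merged = dict(neighbourhoods)
    names = sorted(merged)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if x in merged and y in merged and S(merged[x], merged[y]) > threshold:
                # Collapse x and y into a meta-node keyed by the pair.
                merged[(x, y)] = merged.pop(x) | merged.pop(y)
    return merged

nb = {
    "currants": {("dried", 1), ("fruit", 2), ("cake", 3)},
    "sultanas": {("dried", 1), ("fruit", 2), ("cake", 3), ("tea", 4)},
    "walked": {("room", 2)},
}
result = merge_pass(nb, threshold=1.0)
# the highly similar pair collapses into one meta-node; 'walked' survives alone
```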
Our initial experiments in the generation of pairs of items which could be collapsed into meta-nodes were more successful than the tests based on items in the priming context. Using the LOB corpus, the same parameters as before, and the priming context John walked across the room, the following set of pairs represents some of the good matches over the generated network:

    (minutes, days), (three, five), (few, five), (2, 3), (fig, table), (days, years), (40, 50), (me, him), (three, few), (4, 5), (50, 100), (currants, sultanas), (sultanas, raisins), (currants, raisins),
Using the WSJ corpus, again the same parameters, and the context john walked across the room, part of the set of good matches generated was:

    (months, weeks), (rose, fell), (days, weeks), (single-a-plus, triple-b-plus), (single-a-minus, triple-b-plus), (lawsuit, complaint), (analyst, economist), (john, robert), (next, past), (six, five), (lower, higher), (goodyear, firestone), (profit, loss), (billion, million),
It should be noted that the sets given above represent the best good matches. Empirically, we found that a value of S > 1.0 tends to produce the most meaningful pairings. At S < 1.0, the amount of "noisy" pairings increases dramatically. This is not an absolute threshold, however, as apparently unacceptable pairings do occur at S > 1.0, such as, for example, the pairs (catching, teamed), (accumulating, rebuffed), and (father, mind).
V Future Research

The results of our initial experiments in generating classes of lexical items are encouraging, though not conclusive. We believe that by incrementally collapsing pairs of very similar items into meta-nodes, we may accomplish a kind of abstraction over the network which will ultimately allow the more accurate instantiation of classes for the priming context. The notion of incrementally merging classes of lexical items is intuitively satisfying and is explored in detail in (Brown, et al. 1992). The approach taken in the cited work is somewhat different from ours, and while our method is no less computationally complex than that of Brown, et al., we believe that it is somewhat more manageable because of the pruning effect provided by context priming. On the other hand, unlike the work described by Brown, et al., we as yet have no clear criterion for stopping the merging process, save an arbitrary threshold. Finally, it should be noted that our goal is not, strictly speaking, to generate classes over an entire vocabulary, but only that portion of the vocabulary relevant for a particular context. It is hoped that, by priming with a context, we may be able to effect some manner of word sense disambiguation in those cases where the meaning of a potentially ambiguous item may be resolved by hints in the context.
VI References

Bensch, Peter A., and Walter J. Savitch. 1992. "An Occurrence-Based Model of Word Categorization." Third Meeting on Mathematics of Language. Austin, Texas: Association for Computational Linguistics, Special Interest Group on the Mathematics of Language.

Brown, Peter F., et al. 1992. "Class-Based n-gram Models of Natural Language." Computational Linguistics 18.4: 467-479.

Church, Kenneth Ward, and Patrick Hanks. 1990. "Word Association Norms, Mutual Information, and Lexicography." Computational Linguistics 16.1: 22-29.

Fano, Robert M. 1961. Transmission of Information: A Statistical Theory of Communications. New York: MIT Press.

Firth, J[ohn] R[upert]. 1957. "A Synopsis of Linguistic Theory, 1930-55." Studies in Linguistic Analysis. Philological Society, London. Oxford, England: Basil Blackwell. 1-32.