FROM N-GRAMS TO COLLOCATIONS
AN EVALUATION OF XTRACT
Frank A. Smadja
Department of Computer Science
Columbia University
New York, NY 10027
Abstract
In previous papers we presented methods for retrieving collocations from large samples of texts. We described a tool, Xtract, that implements these methods and is able to retrieve a wide range of collocations in a two stage process. These methods, as well as other related methods, however, have some limitations. Mainly, the produced collocations do not include any kind of functional information and many of them are invalid. In this paper we introduce methods that address these issues. These methods are implemented in an added third stage to Xtract that examines the set of collocations retrieved during the previous two stages to both filter out a number of invalid collocations and add useful syntactic information to the retained ones. By combining parsing and statistical techniques, the addition of this third stage has raised the overall precision level of Xtract from 40% to 80%, with a recall of 94%. In the paper we describe the methods and the evaluation experiments.
1 INTRODUCTION
In the past, several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. Pairwise associations (bigrams or 2-grams) (e.g., [Smadja, 1988], [Church and Hanks, 1989]) as well as n-word (n > 2) associations (or n-grams) (e.g., [Choueka et al., 1983], [Smadja and McKeown, 1990]) were retrieved. These techniques automatically produced large numbers of collocations along with statistical figures intended to reflect their relevance. However, none of these techniques provides functional information along with the collocation. Also, the results produced often contained improper word associations reflecting some spurious aspect of the training corpus that did not stand for true collocations. This paper addresses these two problems.
Our previous work ([Smadja and McKeown, 1990]) introduced a set of techniques and a tool, Xtract, that produces various types of collocations from a two-stage statistical analysis of large textual corpora, briefly sketched in the next section. In Sections 3 and 4, we show how robust parsing technology can be used to both filter out a number of invalid collocations and add useful syntactic information to the retained ones. This filter/analyzer is implemented in a third stage of Xtract that automatically goes over the output collocations to reject the invalid ones and label the valid ones with syntactic information. For example, if the first two stages of Xtract produce the collocation "make-decision," the goal of this third stage is to identify it as a verb-object collocation. If no such syntactic relation is observed, then the collocation is rejected. In Section 5 we present an evaluation of Xtract as a collocation retrieval system. The addition of the third stage of Xtract has been evaluated to raise the precision of Xtract from 40% to 80%, with a recall of 94%. In this paper we use examples related to the word "takeover" from a 10 million word corpus containing stock market reports originating from the Associated Press newswire.
2 FIRST 2 STAGES OF XTRACT, PRODUCING N-GRAMS
In a first stage, Xtract uses statistical techniques to retrieve pairs of words (or bigrams) whose common appearances within a single sentence are correlated in the corpus. A bigram is retrieved if its frequency of occurrence is above a certain threshold and if the words are used in relatively rigid ways. Some bigrams produced by the first stage of Xtract are given in Table 1: the bigrams all contain the word "takeover" and an adjective. In the table, the distance parameter indicates the usual distance between the two words. For example, distance = 1 indicates that the two words are frequently adjacent in the corpus.
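As an illustration only, here is a minimal Python sketch of such a first stage. The window size, frequency threshold, and the simple rigidity test (one relative distance accounting for most of the occurrences) are assumptions standing in for Xtract's actual statistics, which are described in [Smadja and McKeown, 1990]:

```python
from collections import Counter, defaultdict

def stage1_bigrams(sentences, min_freq=50, rigidity=0.5, window=5):
    """Retrieve bigrams: word pairs that co-occur within a sentence
    frequently and in relatively rigid positions."""
    # histogram of relative distances for each ordered word pair
    distances = defaultdict(Counter)
    for sent in sentences:
        for i, w1 in enumerate(sent):
            for j in range(i + 1, min(i + 1 + window, len(sent))):
                distances[(w1, sent[j])][j - i] += 1

    bigrams = []
    for (w1, w2), hist in distances.items():
        freq = sum(hist.values())
        best_dist, best_count = hist.most_common(1)[0]
        # keep the pair if it is frequent and one distance dominates
        if freq >= min_freq and best_count / freq >= rigidity:
            bigrams.append((w1, w2, best_dist, freq))
    return bigrams
```

On a corpus of tokenized sentences, such a procedure would return entries like ("hostile", "takeover", 1, 93), i.e., the seed bigrams with their dominant distance and frequency.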
In a second stage, Xtract uses the output bigrams to produce collocations involving more than two words (or n-grams). It examines all the sentences containing the bigram and analyzes the statistical distribution of words and parts of speech for each position around the pair. It retains words (or parts of speech) occupying a position with probability greater than a given threshold. For example, the bigram "average-industrial" produces the n-gram "the Dow Jones industrial average" since the words are always used within this compound in the training corpus. Example outputs of the second stage of Xtract are given in Figure 1. In the figure, the numbers on the left indicate the frequency of the n-grams in the corpus, NN indicates that a noun is expected at this position, AT indicates that an article is expected, NP stands for a proper noun and VBD stands for a verb in the past tense. See [Smadja and McKeown, 1990] and [Smadja, 1991] for more details on these two stages.
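To make the positional analysis concrete, here is a hedged Python sketch of such a second stage. The 75% position threshold, the span of four positions around the pair, and the pre-tagged input format are illustrative assumptions, not Xtract's exact parameters:

```python
from collections import Counter

def stage2_ngram(tagged_sentences, w1, w2, distance, threshold=0.75, span=4):
    """Expand a bigram (w1, w2, distance) into an n-gram pattern.
    `tagged_sentences` is a list of [(word, pos), ...] sentences.
    For each position around the pair, keep a specific word (or,
    failing that, a part of speech) if it occupies the position
    with relative frequency above `threshold`."""
    # collect every occurrence of the bigram at the given distance
    contexts = []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        for i, w in enumerate(words):
            j = i + distance
            if w == w1 and j < len(words) and words[j] == w2:
                contexts.append((sent, i))

    pattern = {}
    for offset in range(-span, distance + span + 1):
        word_tally, pos_tally, total = Counter(), Counter(), 0
        for sent, i in contexts:
            k = i + offset
            if 0 <= k < len(sent):
                word, pos = sent[k]
                word_tally[word] += 1
                pos_tally[pos] += 1
                total += 1
        if total == 0:
            continue
        word, wcount = word_tally.most_common(1)[0]
        pos, pcount = pos_tally.most_common(1)[0]
        if wcount / total >= threshold:
            pattern[offset] = word          # a fixed word, e.g. "industrial"
        elif pcount / total >= threshold:
            pattern[offset] = "<%s>" % pos  # a POS class, e.g. "<NN>"
    return pattern
```

Called as stage2_ngram(tagged, "industrial", "average", 1), a procedure of this kind would recover the fixed compound "the Dow Jones industrial average" from the example above.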
Table 1: Output of Stage 1

wi             wj          distance
hostile        takeovers       1
hostile        takeover        1
corporate      takeovers       1
hostile        takeovers       2
unwanted       takeover        1
potential      takeover        1
unsolicited    takeover        1
unsuccessful   takeover        1
friendly       takeover        1
takeover       expensive       2
takeover       big             4
big            takeover        1
3 STAGE THREE: SYNTACTICALLY LABELING COLLOCATIONS
In the past, Debili [Debili, 1982] parsed corpora of French texts to identify non-ambiguous predicate argument relations. He then used these relations for disambiguation in parsing. Since then, the advent of robust parsers such as Cass [Abney, 1990] and Fidditch [Hindle, 1983] has made it possible to process large amounts of text with good performance. This enabled Hindle and Rooth [Hindle and Rooth, 1990] to improve on Debili's work by using bigram statistics to enhance the task of prepositional phrase attachment. Combining statistical and parsing methods has also been done by Church and his colleagues. In [Church et al., 1989] and [Church et al., 1991] they consider predicate argument relations in the form of questions such as What does a boat typically do? They preprocess a corpus with the Fidditch parser in order to statistically analyze the distribution of the predicates used with a given argument such as "boat."
Our goal is different, since we analyze a set of collocations automatically produced by Xtract to either enrich them with syntactic information or reject them. For example, if a bigram collocation produced by Xtract involves a noun and a verb, the role of Stage 3 of Xtract is to determine whether it is a subject-verb or a verb-object collocation. If no such relation can be identified, then the collocation is rejected. This section presents the algorithm for Xtract Stage 3 in some detail. For illustrative purposes we use the example words takeover and thwart with a distance of 2.
3.1 DESCRIPTION OF THE ALGORITHM
Input: A bigram with some distance information indicating the most probable distance between the two words. For example, takeover and thwart with a distance of 2.

Output/Goal: Either a syntactic label for the bigram or a rejection. In the case of takeover and thwart the collocation is accepted and its produced label is VO, for verb-object.
The algorithm works in the following 3 steps:
3.1.1 Step 1: PRODUCE TAGGED CONCORDANCES
All the sentences in the corpus that contain the two words in this given position are produced. This is done with a concordancing program which is part of Xtract (see [Smadja, 1991]). The sentences are labeled with part of speech information by preprocessing the corpus with an automatic stochastic tagger [1].
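A toy version of this concordancing step could look like the following Python sketch; the tagged-sentence representation is an assumption, and the actual tagger of [Church, 1988] is treated as a black box:

```python
def concordance(tagged_sentences, w1, w2, distance):
    """Return every tagged sentence in which w1 and w2 occur
    exactly `distance` positions apart, e.g. ("thwart", "takeover", 2)."""
    hits = []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        if any(w == w1 and i + distance < len(words)
               and words[i + distance] == w2
               for i, w in enumerate(words)):
            hits.append(sent)
    return hits
```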
3.1.2 Step 2: PARSE THE SENTENCES

Each sentence is then processed by Cass, a bottom-up incremental parser [Abney, 1990] [2]. Cass takes input sentences labeled with part of speech and attempts to identify syntactic structure. One of Cass's modules identifies predicate argument relations. We use this module to produce binary syntactic relations (or labels) such as "verb-object" (VO), "verb-subject" (VS), "noun-adjective" (NJ), and "noun-noun" (NN). Consider Sentence (1) below and all the labels as produced by Cass on it.
(1) "Under the recapitalization plan it proposed to
t h w a r t the t a k e o v e r "
S V it proposed
N N recapitalization plan
V O thwart takeover
For each sentence in the concordance set, Xtract determines from the output of Cass the syntactic relation of the two words among VO, SV, NJ, and NN, and assigns this label to the sentence. If no such relation is observed, Xtract associates the label U (for undefined) with the sentence. We note label[i] the label associated with Sentence i. For example, the label for Sentence (1) is: label[1] = VO.

[1] For this, we use the part of speech tagger described in [Church, 1988]. This program was developed at Bell Laboratories by Ken Church.
[2] The parser was developed at Bell Communications Research by Steve Abney. Cass stands for Cascaded Analysis of Syntactic Structure. I am grateful to Steve Abney for helping us use and customize Cass for this work.

681  takeover bid
310  takeover offer
258  takeover attempt
177  takeover battle
154  NN NN takeover defense
153  takeover target
119  a possible takeover NN
118  takeover law
109  takeover rumors
102  takeover speculation
 84  takeover strategist
 69  AT takeover fight
 62  corporate takeover
 50  takeover proposals
 40  Federated's poison pill takeover defense
 33  NN VBD a sweetened takeover offer from NP

Figure 1: Some n-grams containing "takeover"
3.1.3 Step 3: REJECT OR LABEL COLLOCATION
This last step consists of deciding on a label for the bigram from the set of label[i]'s. For this, we count the frequency of each label for the bigram and perform a statistical analysis of this distribution. A collocation is accepted if the two seed words are consistently used with the same syntactic relation. More precisely, the collocation is accepted if and only if there is a label L ≠ U satisfying the following inequality:

    probability(label[i] = L) > T

in which T is a given threshold to be determined by the experimenter. A collocation is thus rejected if no valid label satisfies the inequality or if U satisfies it.
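Expressed as code, the decision rule is straightforward. This sketch assumes each concordance sentence has already been reduced to its label string ("VO", "SV", ..., or "U"):

```python
from collections import Counter

def label_or_reject(sentence_labels, T=0.8):
    """Accept the bigram with label L if probability(label = L) > T
    for some valid label L != "U"; reject it (return None) otherwise,
    in particular when "U" itself satisfies the inequality."""
    counts = Counter(sentence_labels)
    n = len(sentence_labels)
    if counts["U"] / n > T:
        return None  # the pair is mostly syntactically unrelated
    for label, count in counts.most_common():
        if label != "U" and count / n > T:
            return label  # consistently used with one syntactic relation
    return None  # no label is dominant enough

# e.g. label_or_reject(["VO"] * 40 + ["U"] * 4)  ->  "VO"
```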
Figure 2 lists some accepted collocations in the format produced by Xtract with their syntactic labels. For these examples, the threshold T was set to 80%. For each collocation, the first line is the output of the first stage of Xtract: it is the seed bigram with the distance between the two words. The second line is the output of the second stage of Xtract: it is a multiple word collocation (or n-gram). The numbers on the left indicate the frequency of occurrence of the n-gram in the corpus. The third line indicates the syntactic label as determined by the third stage of Xtract. Finally, the last lines simply list an example sentence and the position of the collocation in the sentence.
Such collocations can then be used for various purposes including lexicography, spelling correction, speech recognition and language generation. In [Smadja and McKeown, 1990] and [Smadja, 1991] we describe how they are used to build a lexicon for language generation in the domain of stock market reports.
4 A LEXICOGRAPHIC EVALUATION

The third stage of Xtract can thus be considered as a retrieval system which retrieves valid collocations from a set of candidates. Evaluation of retrieval systems is usually done with the help of two parameters: precision and recall [Salton, 1989]. Precision of a retrieval system is defined as the ratio of retrieved valid elements divided by the total number of retrieved elements; it measures the quality of the retrieved material. Recall is defined as the ratio of retrieved valid elements divided by the total number of valid elements; it measures the effectiveness of the system. This section presents an evaluation of the retrieval performance of the third stage of Xtract.
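In code, the two measures are one-liners; in this sketch `retrieved` and `valid` are assumed to be sets of collocations:

```python
def precision(retrieved, valid):
    """Fraction of the retrieved elements that are valid."""
    return len(retrieved & valid) / len(retrieved)

def recall(retrieved, valid):
    """Fraction of the valid elements that are retrieved."""
    return len(retrieved & valid) / len(valid)
```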
4.1 THE EVALUATION EXPERIMENT
Deciding whether a given word combination is a valid or invalid collocation is actually a difficult task that is best done by a lexicographer. Jeffery Triggs is a lexicographer working for the Oxford English Dictionary (OED), coordinating the North American Readers program of the OED at Bell Communications Research. Jeffery Triggs agreed to manually go over several thousand collocations [3].

We randomly selected a subset of about 4,000 collocations that contained the information compiled by Xtract after the first 2 stages. This data set was then the subject of the following experiment.

We gave the 4,000 collocations to the lexicographer, asking him to select the ones that he would consider for a domain specific dictionary and to cross out the others.

[3] I am grateful to Jeffery, whose professionalism and kindness helped me understand some of the difficulty of lexicography. Without him this evaluation would not have been possible.
takeover bid -1
681 takeover bid IN
Syntactic Label: NN
10 11
An investment partnership on Friday offered to sweeten its takeover bid for Gencorp Inc.

takeover fight -1
69 AT takeover fight IN 69
Syntactic Label: NN
10 11
Later last year Hanson won a hostile 3.9 billion takeover fight for Imperial Group, the giant British food, tobacco and brewing conglomerate, and raised more than 1.4 billion pounds from the sale of Imperial's Courage brewing operation and its leisure products businesses.

takeover thwart 2
44 to thwart AT takeover NN 44
Syntactic Label: VO
13 11
The 48.50 a share offer announced Sunday is designed to thwart a takeover bid by GAF Corp.

takeover make 2
68 MD make a takeover NN JJ 68
Syntactic Label: VO
14 12
Meanwhile the North Carolina Senate approved a bill Tuesday that would make a takeover of North Carolina based companies more difficult and the House was expected to approve the measure before the end of the week.

takeover related -1
59 takeover related 59
Syntactic Label: SV
2 3
Among takeover related issues Kidde jumped 2 to 66.

Figure 2: Some examples of collocations with "takeover"
Figure 3: Overlap of the manual and automatic evaluations
The lexicographer came up with three simple tags: YY, Y, and N. Both Y and YY denote good collocations, and N denotes bad collocations. The difference between YY and Y is that Y collocations are of better quality than YY collocations; YY collocations are often too specific to be included in a dictionary, or some words are missing, etc. After Stage 2, about 20% of the collocations are Y, about 20% are YY, and about 60% are N. This told us that the precision of Xtract at Stage 2 was only about 40%.
Although this would seem like a poor precision, one should compare it with the much lower rates currently in practice in lexicography. For the OED, for example, the first stage roughly consists of reading numerous documents to identify new or interesting expressions. This task is performed by professional readers. For the OED, the readers for the American program alone produce some 10,000 expressions a month. These lists are then sent off to the dictionary and go through several rounds of careful analysis before actually being submitted to the dictionary. The ratio of proposed candidates to good candidates is usually low. For example, out of the 10,000 expressions proposed each month, fewer than 400 are serious candidates for the OED, which represents a rate of 4%. Automatically producing lists of candidate expressions could actually be of great help to lexicographers, and even a precision of 40% would be helpful. Such lexicographic tools could, for example, help readers retrieve sublanguage specific expressions by providing them with lists of candidate collocations. The lexicographer then manually examines the list to remove the irrelevant data. Even low precision is useful for lexicographers, as manual filtering is much faster than manual scanning of the documents [Marcus, 1990]. Such techniques are not able to replace readers though, as they are not designed to identify low frequency expressions, whereas a human reader immediately identifies interesting expressions with as few as one occurrence.
The second stage of this experiment was to use Xtract Stage 3 to filter out and label the sample set of collocations. As described in Section 3, there are several valid labels (VO, VS, NN, etc.). In this experiment, we grouped them under a single label: T. There is only one non-valid label: U (for unlabeled). A T collocation is thus accepted by Xtract Stage 3, and a U collocation is rejected. The results of the use of Stage 3 on the sample set of collocations are similar to the manual evaluation in terms of numbers: about 40% of the collocations were labeled (T) by Xtract Stage 3, and about 60% were rejected (U).
Figure 3 shows the overlap of the classifications made by Xtract and the lexicographer. In the figure, the first diagram on the left represents the breakdown into T and U of each of the manual categories (YY, Y, and N). The diagram on the right represents the breakdown into YY, Y, and N of the T and U categories. For example, the first column of the diagram on the left represents the application of Xtract Stage 3 to the YY collocations. It shows that 94% of the collocations accepted by the lexicographer were also accepted by Xtract; in other words, the recall of the third stage of Xtract is 94%. The first column of the diagram on the right represents the lexicographic evaluation of the collocations automatically accepted by Xtract. It shows that about 80% of the T collocations were accepted by the lexicographer and that about 20% were rejected. This shows that precision was raised from 40% to 80% with the addition of Xtract Stage 3. In summary, these experiments allowed us to evaluate Stage 3 as a retrieval system. The results are:
Precision = 80%    Recall = 94%
5 SUMMARY AND CONTRIBUTIONS
In this paper, we described a new set of techniques for syntactically filtering and labeling collocations. Using such techniques for post processing the set of collocations produced by Xtract has two major results. First, it adds syntax to the collocations, which is necessary for computational use. Second, it provides considerable improvement to the quality of the retrieved collocations, as the precision of Xtract is raised from 40% to 80% with a recall of 94%.

By combining statistical techniques with a sophisticated robust parser we have been able to design and implement some original techniques for the automatic extraction of collocations. Results so far are very encouraging, and they indicate that more effort should be made at combining statistical techniques with more symbolic ones.
ACKNOWLEDGMENTS

The research reported in this paper was partially supported by DARPA grant N00039-84-C-0165, by NSF grant IRT-84-51438 and by ONR grant N00014-89-J-1782. Most of this work is also done in collaboration with Bell Communications Research, 445 South Street, Morristown, NJ 07960-1910. I wish to express my thanks to Kathy McKeown for her comments on the research presented in this paper. I also wish to thank Dorée Seligmann and Michael Elhadad for the time they spent discussing this paper and other topics with me.
References

[Abney, 1990] S. Abney. Rapid Incremental Parsing with Repair. In Proceedings of the 6th New OED Conference: Electronic Text Research, University of Waterloo, 1990.

[Choueka et al., 1983] Y. Choueka, S.T. Klein, and E. Neuwitz. Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus. Journal for Literary and Linguistic Computing, 4:34-38, 1983.

[Church and Hanks, 1989] K. Church and P. Hanks. Word Association Norms, Mutual Information, and Lexicography. In Proceedings of the 27th Meeting of the ACL, pages 76-83. Association for Computational Linguistics, 1989. Also in Computational Linguistics, vol. 16.1, March 1990.

[Church et al., 1989] K.W. Church, W. Gale, P. Hanks, and D. Hindle. Parsing, Word Associations and Typical Predicate-Argument Relations. In Proceedings of the International Workshop on Parsing Technologies, pages 103-112, Carnegie Mellon University, Pittsburgh, PA, 1989. Also appears in Masaru Tomita (ed.), Current Issues in Parsing Technology, pp. 103-112, Kluwer Academic Publishers, Boston, MA, 1991.

[Church et al., 1991] K.W. Church, W. Gale, P. Hanks, and D. Hindle. Using Statistics in Lexical Analysis. In Uri Zernik, editor, Lexical Acquisition: Using On-line Resources to Build a Lexicon. Lawrence Erlbaum, 1991. In press.

[Church, 1988] K. Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas, 1988.

[Debili, 1982] F. Debili. Analyse Syntactico-Sémantique Fondée sur une Acquisition Automatique de Relations Lexicales Sémantiques. PhD thesis, Paris XI University, Orsay, France, 1982. Thèse de Doctorat d'État.

[Hindle and Rooth, 1990] D. Hindle and M. Rooth. Structural Ambiguity and Lexical Relations. In DARPA Speech and Natural Language Workshop, Hidden Valley, PA, June 1990.

[Hindle, 1983] D. Hindle. User Manual for Fidditch, a Deterministic Parser. Technical Memorandum 7590-142, Naval Research Laboratory, 1983.

[Marcus, 1990] M. Marcus. Tutorial on Tagging and Processing Large Textual Corpora. Presented at the 28th Annual Meeting of the ACL, June 1990.

[Salton, 1989] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, NY, 1989.

[Smadja and McKeown, 1990] F. Smadja and K. McKeown. Automatically Extracting and Representing Collocations for Language Generation. In Proceedings of the 28th Annual Meeting of the ACL, Pittsburgh, PA, June 1990. Association for Computational Linguistics.

[Smadja, 1988] F. Smadja. Lexical Co-occurrence: The Missing Link in Language Acquisition. In Program and Abstracts of the 15th International ALLC Conference of the Association for Literary and Linguistic Computing, Jerusalem, Israel, June 1988.

[Smadja, 1991] F. Smadja. Retrieving Collocational Knowledge from Textual Corpora. An Application: Language Generation. PhD thesis, Computer Science Department, Columbia University, New York, NY, April 1991.