FROM N-GRAMS TO COLLOCATIONS
AN EVALUATION OF XTRACT
Frank A. Smadja
Department of Computer Science
Columbia University
New York, NY 10027
Abstract
In previous papers we presented methods for retrieving collocations from large samples of texts. We described a tool, Xtract, that implements these methods and is able to retrieve a wide range of collocations in a two stage process. These methods, as well as other related methods, however, have some limitations. Mainly, the produced collocations do not include any kind of functional information and many of them are invalid. In this paper we introduce methods that address these issues. These methods are implemented in an added third stage to Xtract that examines the set of collocations retrieved during the previous two stages to both filter out a number of invalid collocations and add useful syntactic information to the retained ones. By combining parsing and statistical techniques, the addition of this third stage has raised the overall precision level of Xtract from 40% to 80%, with a recall of 94%. In the paper we describe the methods and the evaluation experiments.
1 INTRODUCTION
In the past, several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. Pairwise associations (bigrams or 2-grams) (e.g., [Smadja, 1988], [Church and Hanks, 1989]) as well as n-word (n > 2) associations (or n-grams) (e.g., [Choueka et al., 1983], [Smadja and McKeown, 1990]) were retrieved. These techniques automatically produced large numbers of collocations along with statistical figures intended to reflect their relevance. However, none of these techniques provides functional information along with the collocation. Also, the results produced often contained improper word associations reflecting some spurious aspect of the training corpus that did not stand for true collocations. This paper addresses these two problems.
Our previous work ([Smadja and McKeown, 1990]) introduced a set of techniques and a tool, Xtract, that produces various types of collocations from a two-stage statistical analysis of large textual corpora, briefly sketched in the next section. In Sections 3 and 4, we show how robust parsing technology can be used to both filter out a number of invalid collocations and add useful syntactic information to the retained ones. This filter/analyzer is implemented in a third stage of Xtract that automatically goes over the output collocations to reject the invalid ones and label the valid ones with syntactic information. For example, if the first two stages of Xtract produce the collocation "make-decision," the goal of this third stage is to identify it as a verb-object collocation. If no such syntactic relation is observed, then the collocation is rejected. In Section 5 we present an evaluation of Xtract as a collocation retrieval system. The addition of the third stage of Xtract has been evaluated to raise the precision of Xtract from 40% to 80%, with a recall of 94%. In this paper we use examples related to the word "takeover" from a 10 million word corpus containing stock market reports originating from the Associated Press newswire.
2 FIRST 2 STAGES OF XTRACT, PRODUCING N-GRAMS
In a first stage, Xtract uses statistical techniques to retrieve pairs of words (or bigrams) whose common appearances within a single sentence are correlated in the corpus. A bigram is retrieved if its frequency of occurrence is above a certain threshold and if the words are used in relatively rigid ways. Some bigrams produced by the first stage of Xtract are given in Table 1: the bigrams all contain the word "takeover" and an adjective. In the table, the distance parameter indicates the usual distance between the two words. For example, distance = 1 indicates that the two words are frequently adjacent in the corpus.
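As an illustration only, here is a minimal Python sketch of such a first stage. The window size, frequency threshold, and the simple rigidity test (one relative distance accounting for most of the occurrences) are assumptions standing in for Xtract's actual statistics, which are described in [Smadja and McKeown, 1990]:

```python
from collections import Counter, defaultdict

def stage1_bigrams(sentences, min_freq=50, rigidity=0.5, window=5):
    """Retrieve bigrams: word pairs that co-occur within a sentence
    frequently and in relatively rigid positions."""
    # histogram of relative distances for each ordered word pair
    distances = defaultdict(Counter)
    for sent in sentences:
        for i, w1 in enumerate(sent):
            for j in range(i + 1, min(i + 1 + window, len(sent))):
                distances[(w1, sent[j])][j - i] += 1

    bigrams = []
    for (w1, w2), hist in distances.items():
        freq = sum(hist.values())
        best_dist, best_count = hist.most_common(1)[0]
        # keep the pair if it is frequent and one distance dominates
        if freq >= min_freq and best_count / freq >= rigidity:
            bigrams.append((w1, w2, best_dist, freq))
    return bigrams
```

On a corpus of tokenized sentences, such a procedure would return entries like ("hostile", "takeover", 1, 93), i.e., the seed bigrams with their dominant distance and frequency.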
In a second stage, Xtract uses the output bigrams to produce collocations involving more than two words (or n-grams). It examines all the sentences containing the bigram and analyzes the statistical distribution of words and parts of speech for each position around the pair. It retains words (or parts of speech) occupying a position with probability greater than a given threshold. For example, the bigram "average-industrial" produces the n-gram "the Dow Jones industrial average" since the words are always used within this compound in the training corpus. Example outputs of the second stage of Xtract are given in Figure 1. In the figure, the numbers on the left indicate the frequency of the n-grams in the corpus, NN indicates that a noun is expected at this position, AT indicates that an article is expected, NP stands for a proper noun and VBD stands for a verb in the past tense. See [Smadja and McKeown, 1990] and [Smadja, 1991] for more details on these two stages.
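To make the positional analysis concrete, here is a hedged Python sketch of such a second stage. The 75% position threshold, the span of four positions around the pair, and the pre-tagged input format are illustrative assumptions, not Xtract's exact parameters:

```python
from collections import Counter

def stage2_ngram(tagged_sentences, w1, w2, distance, threshold=0.75, span=4):
    """Expand a bigram (w1, w2, distance) into an n-gram pattern.
    `tagged_sentences` is a list of [(word, pos), ...] sentences.
    For each position around the pair, keep a specific word (or,
    failing that, a part of speech) if it occupies the position
    with relative frequency above `threshold`."""
    # collect every occurrence of the bigram at the given distance
    contexts = []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        for i, w in enumerate(words):
            j = i + distance
            if w == w1 and j < len(words) and words[j] == w2:
                contexts.append((sent, i))

    pattern = {}
    for offset in range(-span, distance + span + 1):
        word_tally, pos_tally, total = Counter(), Counter(), 0
        for sent, i in contexts:
            k = i + offset
            if 0 <= k < len(sent):
                word, pos = sent[k]
                word_tally[word] += 1
                pos_tally[pos] += 1
                total += 1
        if total == 0:
            continue
        word, wcount = word_tally.most_common(1)[0]
        pos, pcount = pos_tally.most_common(1)[0]
        if wcount / total >= threshold:
            pattern[offset] = word          # a fixed word, e.g. "industrial"
        elif pcount / total >= threshold:
            pattern[offset] = "<%s>" % pos  # a POS class, e.g. "<NN>"
    return pattern
```

Called as stage2_ngram(tagged, "industrial", "average", 1), a procedure of this kind would recover the fixed compound "the Dow Jones industrial average" from the example above.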
Table 1: Output of Stage 1

wi             wj          distance
hostile        takeovers       1
hostile        takeover        1
corporate      takeovers       1
hostile        takeovers       2
unwanted       takeover        1
potential      takeover        1
unsolicited    takeover        1
unsuccessful   takeover        1
friendly       takeover        1
takeover       expensive       2
takeover       big             4
big            takeover        1
3 STAGE THREE: SYNTACTICALLY LABELING COLLOCATIONS
In the past, Debili [Debili, 1982] parsed corpora of French texts to identify non-ambiguous predicate argument relations. He then used these relations for disambiguation in parsing. Since then, the advent of robust parsers such as Cass [Abney, 1990] and Fidditch [Hindle, 1983] has made it possible to process large amounts of text with good performance. This enabled Hindle and Rooth [Hindle and Rooth, 1990] to improve on Debili's work by using bigram statistics to enhance the task of prepositional phrase attachment. Combining statistical and parsing methods has also been done by Church and his colleagues. In [Church et al., 1989] and [Church et al., 1991] they consider predicate argument relations in the form of questions such as What does a boat typically do? They preprocess a corpus with the Fidditch parser in order to statistically analyze the distribution of the predicates used with a given argument such as "boat."
Our goal is different, since we analyze a set of collocations automatically produced by Xtract to either enrich them with syntactic information or reject them. For example, if a bigram collocation produced by Xtract involves a noun and a verb, the role of Stage 3 of Xtract is to determine whether it is a subject-verb or a verb-object collocation. If no such relation can be identified, then the collocation is rejected. This section presents the algorithm for Xtract Stage 3 in some detail. For illustrative purposes we use the example words takeover and thwart with a distance of 2.
3.1 DESCRIPTION OF THE ALGORITHM
Input: A bigram with some distance information indicating the most probable distance between the two words. For example, takeover and thwart with a distance of 2.

Output/Goal: Either a syntactic label for the bigram or a rejection. In the case of takeover and thwart the collocation is accepted and its produced label is VO, for verb-object.
The algorithm works in the following 3 steps:
3.1.1 Step 1: PRODUCE TAGGED CONCORDANCES
All the sentences in the corpus that contain the two words in this given position are produced. This is done with a concordancing program which is part of Xtract (see [Smadja, 1991]). The sentences are labeled with part of speech information by preprocessing the corpus with an automatic stochastic tagger [1].
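A toy version of this concordancing step could look like the following Python sketch; the tagged-sentence representation is an assumption, and the actual tagger of [Church, 1988] is treated as a black box:

```python
def concordance(tagged_sentences, w1, w2, distance):
    """Return every tagged sentence in which w1 and w2 occur
    exactly `distance` positions apart, e.g. ("thwart", "takeover", 2)."""
    hits = []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        if any(w == w1 and i + distance < len(words)
               and words[i + distance] == w2
               for i, w in enumerate(words)):
            hits.append(sent)
    return hits
```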
3.1.2 Step 2: PARSE THE SENTENCES

Each sentence is then processed by Cass, a bottom-up incremental parser [Abney, 1990] [2]. Cass takes input sentences labeled with part of speech and attempts to identify syntactic structure. One of Cass's modules identifies predicate argument relations. We use this module to produce binary syntactic relations (or labels) such as "verb-object" (VO), "verb-subject" (VS), "noun-adjective" (NJ), and "noun-noun" (NN). Consider Sentence (1) below and all the labels as produced by Cass on it.
(1) "Under the recapitalization plan it proposed to
t h w a r t the t a k e o v e r "
S V it proposed
N N recapitalization plan
V O thwart takeover
For each sentence in the concordance set, Xtract determines from the output of Cass the syntactic relation of the two words among VO, SV, NJ, and NN, and assigns this label to the sentence. If no such relation is observed, Xtract associates the label U (for undefined) with the sentence. We note label[i] the label associated with Sentence i. For example, the label for Sentence (1) is: label[1] = VO.

[1] For this, we use the part of speech tagger described in [Church, 1988]. This program was developed at Bell Laboratories by Ken Church.
[2] The parser was developed at Bell Communications Research by Steve Abney. Cass stands for Cascaded Analysis of Syntactic Structure. I am grateful to Steve Abney for helping us use and customize Cass for this work.

681  takeover bid
310  takeover offer
258  takeover attempt
177  takeover battle
154  NN NN takeover defense
153  takeover target
119  a possible takeover NN
118  takeover law
109  takeover rumors
102  takeover speculation
 84  takeover strategist
 69  AT takeover fight
 62  corporate takeover
 50  takeover proposals
 40  Federated's poison pill takeover defense
 33  NN VBD a sweetened takeover offer from NP

Figure 1: Some n-grams containing "takeover"
3.1.3 Step 3: REJECT OR LABEL COLLOCATION
This last step consists of deciding on a label for the bigram from the set of label[i]'s. For this, we count the frequency of each label for the bigram and perform a statistical analysis of this distribution. A collocation is accepted if the two seed words are consistently used with the same syntactic relation. More precisely, the collocation is accepted if and only if there is a label L ≠ U satisfying the following inequality:

    probability(label[i] = L) > T

in which T is a given threshold to be determined by the experimenter. A collocation is thus rejected if no valid label satisfies the inequality or if U satisfies it.
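Expressed as code, the decision rule is straightforward. This sketch assumes each concordance sentence has already been reduced to its label string ("VO", "SV", ..., or "U"):

```python
from collections import Counter

def label_or_reject(sentence_labels, T=0.8):
    """Accept the bigram with label L if probability(label = L) > T
    for some valid label L != "U"; reject it (return None) otherwise,
    in particular when "U" itself satisfies the inequality."""
    counts = Counter(sentence_labels)
    n = len(sentence_labels)
    if counts["U"] / n > T:
        return None  # the pair is mostly syntactically unrelated
    for label, count in counts.most_common():
        if label != "U" and count / n > T:
            return label  # consistently used with one syntactic relation
    return None  # no label is dominant enough

# e.g. label_or_reject(["VO"] * 40 + ["U"] * 4)  ->  "VO"
```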
Figure 2 lists some accepted collocations in the format produced by Xtract with their syntactic labels. For these examples, the threshold T was set to 80%. For each collocation, the first line is the output of the first stage of Xtract: it is the seed bigram with the distance between the two words. The second line is the output of the second stage of Xtract: it is a multiple word collocation (or n-gram). The numbers on the left indicate the frequency of occurrence of the n-gram in the corpus. The third line indicates the syntactic label as determined by the third stage of Xtract. Finally, the last lines simply list an example sentence and the position of the collocation in the sentence.
Such collocations can then be used for various purposes including lexicography, spelling correction, speech recognition and language generation. In [Smadja and McKeown, 1990] and [Smadja, 1991] we describe how they are used to build a lexicon for language generation in the domain of stock market reports.
4 A LEXICOGRAPHIC EVALUATION

The third stage of Xtract can thus be considered as a retrieval system which retrieves valid collocations from a set of candidates. Evaluation of retrieval systems is usually done with the help of two parameters: precision and recall [Salton, 1989]. Precision of a retrieval system is defined as the ratio of retrieved valid elements divided by the total number of retrieved elements; it measures the quality of the retrieved material. Recall is defined as the ratio of retrieved valid elements divided by the total number of valid elements; it measures the effectiveness of the system. This section presents an evaluation of the retrieval performance of the third stage of Xtract.
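In code, the two measures are one-liners; in this sketch `retrieved` and `valid` are assumed to be sets of collocations:

```python
def precision(retrieved, valid):
    """Fraction of the retrieved elements that are valid."""
    return len(retrieved & valid) / len(retrieved)

def recall(retrieved, valid):
    """Fraction of the valid elements that are retrieved."""
    return len(retrieved & valid) / len(valid)
```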
4.1 THE EVALUATION EXPERIMENT
Deciding whether a given word combination is a valid or invalid collocation is actually a difficult task that is best done by a lexicographer. Jeffery Triggs is a lexicographer working for the Oxford English Dictionary (OED), coordinating the North American Readers program of the OED at Bell Communications Research. Jeffery Triggs agreed to manually go over several thousand collocations [3].

We randomly selected a subset of about 4,000 collocations that contained the information compiled by Xtract after the first 2 stages. This data set was then the subject of the following experiment.

We gave the 4,000 collocations to the lexicographer, asking him to select the ones that he would consider for a domain specific dictionary and to cross out the others.

[3] I am grateful to Jeffery, whose professionalism and kindness helped me understand some of the difficulty of lexicography. Without him this evaluation would not have been possible.
takeover bid -1
681 takeover bid IN
Syntactic Label: NN
10 11
An investment partnership on Friday offered to sweeten its takeover bid for Gencorp Inc.

takeover fight -1
69 AT takeover fight IN 69
Syntactic Label: NN
10 11
Later last year Hanson won a hostile 3.9 billion takeover fight for Imperial Group, the giant British food, tobacco and brewing conglomerate, and raised more than 1.4 billion pounds from the sale of Imperial's Courage brewing operation and its leisure products businesses.

takeover thwart 2
44 to thwart AT takeover NN 44
Syntactic Label: VO
13 11
The 48.50 a share offer announced Sunday is designed to thwart a takeover bid by GAF Corp.

takeover make 2
68 MD make a takeover NN JJ 68
Syntactic Label: VO
14 12
Meanwhile the North Carolina Senate approved a bill Tuesday that would make a takeover of North Carolina based companies more difficult and the House was expected to approve the measure before the end of the week.

takeover related -1
59 takeover related 59
Syntactic Label: SV
2 3
Among takeover related issues Kidde jumped 2 to 66.

Figure 2: Some examples of collocations with "takeover"
Figure 3: Overlap of the manual and automatic evaluations
The lexicographer came up with three simple tags: YY, Y, and N. Both Y and YY denote good collocations, and N denotes bad collocations. The difference between YY and Y is that Y collocations are of better quality than YY collocations; YY collocations are often too specific to be included in a dictionary, or some words are missing, etc. After Stage 2, about 20% of the collocations are Y, about 20% are YY, and about 60% are N. This told us that the precision of Xtract at Stage 2 was only about 40%.
Although this would seem like a poor precision, one should compare it with the much lower rates currently in practice in lexicography. For the OED, for example, the first stage roughly consists of reading numerous documents to identify new or interesting expressions. This task is performed by professional readers. For the OED, the readers for the American program alone produce some 10,000 expressions a month. These lists are then sent off to the dictionary and go through several rounds of careful analysis before actually being submitted to the dictionary. The ratio of proposed candidates to good candidates is usually low. For example, out of the 10,000 expressions proposed each month, fewer than 400 are serious candidates for the OED, which represents a rate of 4%. Automatically producing lists of candidate expressions could actually be of great help to lexicographers, and even a precision of 40% would be helpful. Such lexicographic tools could, for example, help readers retrieve sublanguage specific expressions by providing them with lists of candidate collocations. The lexicographer then manually examines the list to remove the irrelevant data. Even low precision is useful for lexicographers, as manual filtering is much faster than manual scanning of the documents [Marcus, 1990]. Such techniques are not able to replace readers though, as they are not designed to identify low frequency expressions, whereas a human reader immediately identifies interesting expressions with as few as one occurrence.
The second stage of this experiment was to use Xtract Stage 3 to filter out and label the sample set of collocations. As described in Section 3, there are several valid labels (VO, VS, NN, etc.). In this experiment, we grouped them under a single label: T. There is only one non-valid label: U (for unlabeled). A T collocation is thus accepted by Xtract Stage 3, and a U collocation is rejected. The results of the use of Stage 3 on the sample set of collocations are similar to the manual evaluation in terms of numbers: about 40% of the collocations were labeled (T) by Xtract Stage 3, and about 60% were rejected (U).
Figure 3 shows the overlap of the classifications made by Xtract and the lexicographer. In the figure, the first diagram on the left represents the breakdown into T and U of each of the manual categories (YY, Y, and N). The diagram on the right represents the breakdown into YY, Y, and N of the T and U categories. For example, the first column of the diagram on the left represents the application of Xtract Stage 3 to the YY collocations. It shows that 94% of the collocations accepted by the lexicographer were also accepted by Xtract; in other words, the recall of the third stage of Xtract is 94%. The first column of the diagram on the right represents the lexicographic evaluation of the collocations automatically accepted by Xtract. It shows that about 80% of the T collocations were accepted by the lexicographer and that about 20% were rejected. This shows that precision was raised from 40% to 80% with the addition of Xtract Stage 3. In summary, these experiments allowed us to evaluate Stage 3 as a retrieval system. The results are:
Precision = 80%    Recall = 94%
5 SUMMARY AND CONTRIBUTIONS
In this paper, we described a new set of techniques for syntactically filtering and labeling collocations. Using such techniques for post processing the set of collocations produced by Xtract has two major results. First, it adds syntax to the collocations, which is necessary for computational use. Second, it provides considerable improvement to the quality of the retrieved collocations, as the precision of Xtract is raised from 40% to 80% with a recall of 94%.

By combining statistical techniques with a sophisticated robust parser we have been able to design and implement some original techniques for the automatic extraction of collocations. Results so far are very encouraging, and they indicate that more effort should be made at combining statistical techniques with more symbolic ones.
ACKNOWLEDGMENTS

The research reported in this paper was partially supported by DARPA grant N00039-84-C-0165, by NSF grant IRT-84-51438 and by ONR grant N00014-89-J-1782. Most of this work is also done in collaboration with Bell Communications Research, 445 South Street, Morristown, NJ 07960-1910. I wish to express my thanks to Kathy McKeown for her comments on the research presented in this paper. I also wish to thank Dorée Seligmann and Michael Elhadad for the time they spent discussing this paper and other topics with me.
References

[Abney, 1990] S. Abney. Rapid Incremental Parsing with Repair. In Proceedings of the 6th New OED Conference: Electronic Text Research, University of Waterloo, 1990.

[Choueka et al., 1983] Y. Choueka, S.T. Klein, and E. Neuwitz. Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus. Journal for Literary and Linguistic Computing, 4:34-38, 1983.

[Church and Hanks, 1989] K. Church and P. Hanks. Word Association Norms, Mutual Information, and Lexicography. In Proceedings of the 27th Meeting of the ACL, pages 76-83. Association for Computational Linguistics, 1989. Also in Computational Linguistics, vol. 16.1, March 1990.

[Church et al., 1989] K.W. Church, W. Gale, P. Hanks, and D. Hindle. Parsing, Word Associations and Typical Predicate-Argument Relations. In Proceedings of the International Workshop on Parsing Technologies, pages 103-112, Carnegie Mellon University, Pittsburgh, PA, 1989. Also appears in Masaru Tomita (ed.), Current Issues in Parsing Technology, pp. 103-112, Kluwer Academic Publishers, Boston, MA, 1991.

[Church et al., 1991] K.W. Church, W. Gale, P. Hanks, and D. Hindle. Using Statistics in Lexical Analysis. In Uri Zernik, editor, Lexical Acquisition: Using On-line Resources to Build a Lexicon. Lawrence Erlbaum, 1991. In press.

[Church, 1988] K. Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas, 1988.

[Debili, 1982] F. Debili. Analyse Syntactico-Sémantique Fondée sur une Acquisition Automatique de Relations Lexicales Sémantiques. PhD thesis, Paris XI University, Orsay, France, 1982. Thèse de Doctorat d'État.

[Hindle and Rooth, 1990] D. Hindle and M. Rooth. Structural Ambiguity and Lexical Relations. In DARPA Speech and Natural Language Workshop, Hidden Valley, PA, June 1990.

[Hindle, 1983] D. Hindle. User Manual for Fidditch, a Deterministic Parser. Technical Memorandum 7590-142, Naval Research Laboratory, 1983.

[Marcus, 1990] M. Marcus. Tutorial on Tagging and Processing Large Textual Corpora. Presented at the 28th Annual Meeting of the ACL, June 1990.

[Salton, 1989] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, NY, 1989.

[Smadja and McKeown, 1990] F. Smadja and K. McKeown. Automatically Extracting and Representing Collocations for Language Generation. In Proceedings of the 28th Annual Meeting of the ACL, Pittsburgh, PA, June 1990. Association for Computational Linguistics.

[Smadja, 1988] F. Smadja. Lexical Co-occurrence: The Missing Link in Language Acquisition. In Program and Abstracts of the 15th International ALLC Conference of the Association for Literary and Linguistic Computing, Jerusalem, Israel, June 1988.

[Smadja, 1991] F. Smadja. Retrieving Collocational Knowledge from Textual Corpora. An Application: Language Generation. PhD thesis, Computer Science Department, Columbia University, New York, NY, April 1991.