Collocation extraction algorithms are then designed to target different types of collocations, using the features and criteria associated with each type.
Basic Concepts
Even though collocations occur naturally in text and are easily understood by human beings, they are difficult to define clearly [Benson 1989; Manning et al 1999]. Previous work on collocation proposed various definitions. According to [Gitsak 2000] and [Lin 1998], "collocation is habitual and recurrent word combination". [Manning et al 1999] defined "a collocation is an expression consisting of two or more words that correspond to some conventional way of saying things". Many researchers adopted Benson's definition of collocation, that is, "a collocation is an arbitrary and recurrent word combination" [Benson 1990]; this definition is adopted by [Smadja 1993] and [Sun 1997]. Some researchers restrict collocations to consecutive word sequences, while others also consider interrupted forms such as "make ... decision", where other words may intervene between the components. Semantically, some researchers considered that collocations should only be word combinations whose whole meaning cannot be predicted from their components, whereas others regarded such a restriction as too strict [Allerton 1984] and not operable.
In this study, collocation is defined based on Manning’s definition [Manning 1999] and is stated as follows:
A collocation is a recurrent and conventional expression containing two or more content words that hold syntactic and/or semantic relations.
More specifically, content words in Chinese include noun, verb, adjective, adverb, determiner, directional word, gerund, and so on [Zhang et al 1992].
Collocations have certain properties which can be used to identify them in collocation extraction algorithms. Firstly, a collocation is the co-occurrence of two or more words within a limited context [Sinclair 1991]. These words can appear adjacent, referred to as an uninterrupted collocation, or distant, referred to as an interrupted collocation [Manning et al 1999]. According to the number of components, collocations are divided into bi-gram collocations, which contain two words, and n-gram collocations, which contain more than two words [Smadja et al 1990].
Secondly, collocation is recurrent [Smadja 1993]. As rightfully pointed out by [Hoey 1991], "collocation has long been the name given to the relationship a lexical item has with items that appear with greater than random probability in its context". This means that collocations should occur frequently in similar contexts because they are of conventional use. Thus, word combinations with a certain number of occurrences can be considered as collocations.
Thirdly, collocation is limitedly compositional according to [Benson 1989; Brundage 1992; Manning et al 1999]. [Brundage 1992] introduced the term "compositional" for a natural language expression whose meaning can be predicted from the meanings of its components. Generally speaking, free word combinations can be generated by linguistic rules, and their meaning is simply the combination of the meanings of their components. Collocations, in contrast, are only limitedly compositional; in other words, collocations are expected to carry additional meaning, i.e., meaning that cannot be derived directly from their components.
On the other hand, word combinations that have little additional meaning beyond the combination of their words are also regarded as collocations if their components show close semantic restrictions on one another. As an example of the non-compositional case, the literal meaning of the components of "缘木求鱼" is to climb a tree to catch fish, whereas the actual meaning of this collocation is "a fruitless approach".
Fourthly, collocation is limitedly substitutable and limitedly modifiable. Limited substitutability means that a word in a collocation cannot be freely replaced by its synonyms in that context. For example, the word "strong" in the collocation "strong tea" cannot be substituted by "hot"; likewise, in Chinese the word meaning "baggage" in its collocation cannot be substituted by its synonym meaning "luggage". Also, collocations cannot be modified freely by adding modifiers or through grammatical transformations. For example, the collocation "浓茶" (strong tea) does not allow further modification or insertion.
Lastly, collocation is domain-dependent [Smadja 1993]. In specific domains, many collocations tend to be terms. Furthermore, some word combinations are regarded as collocations only in certain domains, while in other domains they are merely free combinations with high co-occurrence.
Motivation and Problem Statement
As much as everyone understands the importance of collocation knowledge, it cannot be compiled easily into a collocation dictionary. Traditionally, linguists manually identified and compiled collocations from printed text [Sinclair 1995; Mei 1999]. However, the coverage and consistency of this manual work were not ideal [Smadja 1993], and the results were not suitable for use by natural language processing systems. With the increasing availability of text in electronic form, a number of works on automatic collocation extraction have been reported in the past two decades or so [Choueka 1983; Shimohata 1997; Sun et al 1997; Yu et al 2003].
Most of these works can be generally categorized into two main types: the window-based statistical approach and the syntax-based approach. In the window-based statistical approach, a headword is first identified. Lexical statistics are then collected within a fixed context window around the headword to identify significant word combinations as collocations. Since collocations are recurrent in a limited context, as stated in their properties, a fixed window can be used to extract collocations from a fixed context. These lexical statistics use co-occurrence frequency as the main discriminative feature, and different statistical metrics and extraction strategies have been employed, such as absolute and relative co-occurrence frequency [Choueka 1983; Choueka 1988; Justeson 1995], point-wise mutual information [Church 1990; Sun et al 1997], and z-score mean and variance [Rogghe 1973; Smadja 1993; Fung 1998].
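As an illustration of the window-based counting step described above, the sketch below collects co-occurrence counts within a fixed window around a headword. It is a minimal sketch, not a reconstruction of any cited system; the window size and the toy corpus are illustrative assumptions.

```python
from collections import Counter

def window_cooccurrences(tokens, headword, window=5):
    """Count words co-occurring with `headword` within +/- `window` tokens.

    `tokens` is a tokenised corpus (list of words); the window size of 5
    is an illustrative choice, not one prescribed by the cited systems.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

# Example: candidate collocates of "tea" ranked by raw co-occurrence frequency
corpus = "he drank strong tea and she drank weak tea with milk".split()
print(window_cooccurrences(corpus, "tea", window=3).most_common(3))
```

Raw counts like these are only the starting point; the statistical metrics reviewed in this section are then applied to separate significant combinations from chance co-occurrences.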
Hypothesis testing has been widely employed to estimate the probability of two-word combinations and discover bi-gram collocations. [Church et al 1989] first applied the t-test to collocation extraction, and later [Church et al 1993] and [Manning et al 1999] improved this method. Pearson's χ² test (chi-square test) was employed by [Manning et al 1999; Yu et al 2003] and was shown to perform better than the t-test. [Dunning 1993] and [Manning et al 1999] suggested using the likelihood ratio test for collocation extraction, and [Zhou et al 2001; Lv et al 2004] showed that this test is an appropriate metric, especially on sparse data. In the window-based statistical approach, lexical statistics based on the co-occurrence distribution are another important discriminative feature, because the distribution histograms of collocated words usually have peaks [Smadja 1993].
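To make the hypothesis-testing metrics concrete, the sketch below computes the log-likelihood ratio for a candidate bigram from simple contingency counts, following Dunning's binomial formulation; the counts used in the example are invented for illustration only.

```python
from math import log

def log_likelihood_ratio(c12, c1, c2, n):
    """Dunning-style log-likelihood ratio for a bigram (w1, w2).

    c12: count of the bigram, c1: count of w1, c2: count of w2,
    n: total number of bigrams in the corpus.
    """
    def ll(k, total, p):
        # binomial log-likelihood; assumes 0 < p < 1
        return k * log(p) + (total - k) * log(1 - p)

    p = c2 / n                      # H0: w2 is independent of w1
    p1 = c12 / c1                   # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)      # P(w2 | not w1)
    return 2 * (ll(c12, c1, p1) + ll(c2 - c12, n - c1, p2)
                - ll(c12, c1, p) - ll(c2 - c12, n - c1, p))

# Illustrative counts only: the bigram occurs 30 times in a 100,000-bigram corpus.
print(round(log_likelihood_ratio(c12=30, c1=200, c2=150, n=100_000), 2))
```

Higher scores indicate stronger evidence against independence; because the statistic remains well behaved for small counts, it is the metric reported as most suitable for sparse data in the works cited above.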
Generally speaking, systems following the window-based approach can extract both bi-gram and n-gram collocations easily. However, this approach requires a huge training corpus. Window-based statistical algorithms employed on their own cannot achieve high precision, because they extract many frequently occurring word combinations that are not true collocations, which we call pseudo collocations, such as the "医生–护士" pair (doctor-nurse).
The most-promising collocation candidates extracted by [Yu et al 2003] are listed below (given by their English glosses).

Jiang Zemin           jobless worker            news from this paper    re-employment
Zhou Enlai            attach a picture          National People's       reform and open up
Xinhua News Agency    Board of National         Li Peng                 this is
cannot                reporter of this paper    relevant departments    can
the biggest           the leaders               he said

Table 2.1 The most-promising collocation candidates in [Yu et al 2003]
It is observed that even among these most-promising collocation candidates, 8 are not true collocations. Furthermore, as reported by [Smadja 1993], the accuracy of English collocation extraction based on lexical statistics was only 40%. As reported by [Sun et al 1997], the accuracy of Chinese collocation extraction using a method similar to Smadja's was 29.3%. [Sun 1998] improved the performance by changing the "watch window" used in the statistics and obtained a 46.1% accuracy rate and a 64.5% recall. Meanwhile, many collocations with low co-occurrence frequency cannot be extracted because they lack statistical significance.
With recent increases in parsing efficiency and accuracy, syntax and dependency information are naturally employed in the syntax-based collocation extraction approach. [Justeson et al 1995] used a set of consecutive part-of-speech tag patterns to filter out pseudo collocations with high co-occurrence frequency. [Goldman 2001] emphasized the necessity of using dependency information in collocation extraction. By applying a dependency parser to the text, dependency triples consisting of two words with a syntactic relation are identified. [Lin 1997] used point-wise mutual information as the metric to find significant dependency triples, such as "carefully-check-V:jvab:A" (V:jvab:A denotes the dependency between a verb and its adverbial modifier), as collocations. [Wu et al 2003] and [Lv et al 2004] conducted similar work, selecting the log-likelihood ratio as the statistical metric in their systems. By restricting the candidate search to syntactically dependent word combinations, rather than all context words as in the window-based approach, the syntax-based approach achieves good accuracy even for low-frequency collocations. However, its performance relies heavily on the quality and capability of the dependency parser employed, because dependency analysis errors propagate and thus degrade collocation extraction. In Lin's work [Lin 1997], the employed dependency parser could parse only 22% of the whole 100-million-word corpus. On the other hand, even for a state-of-the-art dependency parser, NLPWinParser by Microsoft Corp., the parsing accuracy on a large corpus is only around 90% [Lv et al 2004]. Naturally, parsing errors propagate to the collocation extraction stage, decreasing precision and recall. Meanwhile, current dependency parsers can process only a limited set of dependency relation types; thus, n-gram collocations and bi-gram collocations with unfamiliar dependencies are lost. In particular, the outputs of existing collocation extraction systems cannot distinguish true collocations from insignificant collocations, since they do not take semantic association into consideration.
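A minimal sketch of the syntax-based idea follows: restrict candidates to syntactically related word pairs (dependency triples) and score only those, rather than every pair inside a window. The triple format, the relation labels, and the use of raw counts as a stand-in score are illustrative assumptions; real systems obtain triples from a dependency parser and rank them with metrics such as point-wise mutual information or the log-likelihood ratio.

```python
from collections import Counter

# Toy dependency triples of the form (head, relation, dependent), as if produced
# by a dependency parser; the relation labels here are invented for illustration.
triples = [
    ("check", "ADV", "carefully"),
    ("check", "OBJ", "report"),
    ("check", "ADV", "carefully"),
    ("drink", "OBJ", "tea"),
]

# Keep only relation types we trust, then rank (head, relation, dependent)
# combinations by frequency; a statistical metric would replace raw counts.
ALLOWED_RELATIONS = {"ADV", "OBJ"}
candidates = Counter(t for t in triples if t[1] in ALLOWED_RELATIONS)
for (head, rel, dep), freq in candidates.most_common():
    print(f"{dep}-{head} [{rel}]: {freq}")
```

The restriction to parsed relations is exactly what gives the approach its precision advantage, and also why parser coverage and accuracy bound its recall.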
Semantic information can also be employed in collocation extraction systems. Since a true collocation is limitedly substitutable, its components cannot be freely substituted by other words, including their synonyms. [Pearce 2001] and [Li 2005] used a synonym substitution test to improve a syntax-based collocation extraction system and a window-based one, respectively. Another effective approach is to make use of translation information if available: if a word combination cannot be translated word by word, it tends to be a true collocation. [Manning et al 1999] gave examples of English collocations that can be identified by comparing their translations in French. Furthermore, [Wu et al 2003] applied synonym relationships between Chinese and English to automatically acquire synonymous collocations in English, with promising results.
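The synonym-substitution idea can be sketched as follows: if replacing one component of a candidate with its synonyms produces combinations that are also well attested in the corpus, the candidate behaves like a free combination rather than a true collocation. The synonym lists, counts, and threshold below are illustrative assumptions; [Pearce 2001] and [Li 2005] integrate this test into full extraction systems rather than using it in isolation.

```python
def substitution_test(pair, synonyms, bigram_freq, ratio=0.1):
    """Flag `pair` as collocation-like if its synonym variants are rare.

    `pair` is (w1, w2); `synonyms` maps a word to its synonym list;
    `bigram_freq` maps (w1, w2) to its corpus frequency.  The ratio
    threshold of 0.1 is an illustrative assumption.
    """
    w1, w2 = pair
    base = bigram_freq.get(pair, 0)
    if base == 0:
        return False
    variants = [(w1, s) for s in synonyms.get(w2, [])] + \
               [(s, w2) for s in synonyms.get(w1, [])]
    variant_freq = max((bigram_freq.get(v, 0) for v in variants), default=0)
    return variant_freq <= ratio * base

# "strong tea" passes the test because "powerful tea" is (almost) unattested.
freqs = {("strong", "tea"): 120, ("powerful", "tea"): 1}
syns = {"strong": ["powerful"]}
print(substitution_test(("strong", "tea"), syns, freqs))
```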
Even though there have been many attempts at collocation extraction, the performance is not good enough for direct use in practical natural language applications. The problems affecting the performance of current systems are summarized as follows.
1. Purely lexical-statistics-based extraction systems return too many pseudo collocations. The precision in all reported work is no more than 50%, making them unreliable for use in natural language processing systems.
2. The sparseness of lexical statistics and some abnormalities of collocations lead to unsatisfactory recall. Even in a huge corpus of nearly one hundred million words, there are still many true collocations with very low occurrence; for instance, one collocation glossed as "venture, increase and decrease" occurs only 4 times. Nevertheless, these are true collocations and should be extracted by a good extraction system.
3. Most existing collocation extraction systems cannot distinguish true collocations from free combinations. If a word combination can be generated by syntactic rules without any additional meaning, this kind of free word combination is regarded as an insignificant collocation even though it occurs frequently. For example, "吃" (eat) can be combined with many foods, like "苹果" (apple), "面包" (bread), "米饭" (rice), and so on, yet these combinations are not useful to extract because they are completely compositional and can be generated by rules. Since current algorithms do not consider the semantic associations between words, good precision cannot be achieved.
4. Most existing algorithms identify collocations using a single criterion and a single threshold. Only word combinations whose confidence is higher than a pre-defined threshold, e.g., a mutual information value above the threshold, are extracted as collocations; the others are eliminated. Such a strategy is based on the assumption that true collocations can be separated by a linear discriminator. Obviously, this assumption does not hold, as true collocations vary in their lexical statistics, syntax and semantics. Naturally, the performance of these algorithms is limited.
Review of Automatic Collocation Extraction Techniques
Window-based Statistical Collocation Extraction Approach
Using a headword as the starting point, the words that co-occur with it within a context window are retrieved from the corpus. The window-based approach then identifies word combinations with significant lexical statistics as collocations. No doubt, statistical features based on lexical co-occurrence frequency are the main discriminative features in the window-based approach. Thus, previous work following the window-based approach is reviewed here according to the statistical metric applied to co-occurrence frequency.
Early work on automatic collocation extraction was introduced by Choueka [Choueka 1983; Choueka 1988], who proposed a simple method to identify collocations in a text corpus by counting the frequencies of word combinations. In the experiments, he retrieved sequences of two to six adjacent words that occur more frequently than a given threshold. Similarly, Justeson conducted an experiment on extracting bi-gram collocations based on co-occurrence frequency [Justeson et al 1995]. The advantage of these works was that their design and implementation are clear and simple; however, the accuracy was not good. Furthermore, using a fixed absolute frequency value as the sole discriminative threshold made these systems sensitive to corpus size.
Sayori Shimohata proposed a method to automatically retrieve collocations using relative frequency [Shimohata 1999]. He extracted recurrent combinations of strings, in accordance with their word order in the corpus, as collocations. In order to retrieve both interrupted and uninterrupted collocations, co-occurrence frequencies and word order constraints from the corpus were employed. Firstly, the recurrent word strings surrounding the headword are retrieved from the corpus. Secondly, every pair of strings, string i and string j, is examined. Thirdly, these strings are refined by the following steps.
String i and string j are combined when they overlap or adjoin each other and satisfy

    \frac{freq(string_i, string_j)}{freq(string_i)} \ge T_{ratio} = 0.8    (3.1)

or string i is filtered out if string j subsumes string i and satisfies

    \frac{freq(string_j)}{freq(string_i)} \ge T_{ratio}    (3.2)

where freq(string_i), freq(string_j) and freq(string_i, string_j) are respectively the occurrence frequencies of string i and string j and their co-occurrence frequency in the corpus, and T_ratio is a pre-defined threshold. Here, relative frequency is used to examine the probability that string i and string j can be combined. Finally, the identified string i and the headword constitute a collocation.
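A minimal sketch of these two refinement rules is given below, assuming the string frequencies have already been counted. The threshold value of 0.8 follows the text, while the function names, data, and the reading of rule (3.2) as dropping a string whose occurrences are mostly covered by a longer string are assumptions made for illustration.

```python
T_RATIO = 0.8  # threshold from the text

def should_combine(freq_i, freq_ij, t_ratio=T_RATIO):
    """Rule (3.1): combine adjoining/overlapping strings i and j when their
    co-occurrence accounts for most occurrences of string i."""
    return freq_ij / freq_i >= t_ratio

def should_filter(freq_i, freq_j, t_ratio=T_RATIO):
    """Rule (3.2): drop string i when a longer string j subsumes it and
    covers most of its occurrences (interpretation assumed from the text)."""
    return freq_j / freq_i >= t_ratio

# Illustrative counts: string i occurs 50 times, 45 of them together with
# an adjoining string j, and 47 of them inside a longer subsuming string.
print(should_combine(freq_i=50, freq_ij=45))   # True: merge i and j
print(should_filter(freq_i=50, freq_j=47))     # True: keep only the longer string
```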
Another work to be mentioned here is Richardson's, which was concerned with extracting semantic relationships from machine-readable dictionaries [Richardson et al 1998]. The problem of assigning weights to extracted semantic relationships is very similar to that of ranking extracted collocations. He proposed a technique that uses a fitted exponential curve, instead of the observed frequency, to estimate the joint probabilities of word pairs, with good results.
Church and Hanks used a correlation-based metric to extract collocations. In their works, two-word pairs that appear together more often than expected by chance were regarded as bi-gram collocations [Church et al 1989; Hindle 1990; Church et al 1991]. Point-wise mutual information, borrowed from information theory [Fano 1961], was employed to estimate the correlation between word pairs.
If two words, w_a and w_b, have occurrence probabilities P(w_a) and P(w_b) in the corpus, and their co-occurrence probability is P(w_a, w_b), then their mutual information I(w_a, w_b) is defined as

    I(w_a, w_b) = \log_2 \frac{P(w_a, w_b)}{P(w_a)\,P(w_b)}
It is observed that I(w_a, w_b) >> 0 means w_a and w_b are highly related; I(w_a, w_b) ≈ 0 means w_a and w_b are nearly independent; and I(w_a, w_b) << 0 means w_a and w_b are in complementary distribution.
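A small sketch of the point-wise mutual information score from the formula above, computed from raw counts, is given below; the counts in the example are invented for illustration only.

```python
from math import log2

def pmi(c_ab, c_a, c_b, n):
    """Point-wise mutual information I(w_a, w_b) = log2(P(w_a, w_b) / (P(w_a) P(w_b))).

    c_ab: co-occurrence count of w_a and w_b; c_a, c_b: individual counts;
    n: corpus size used to turn counts into probabilities.
    """
    p_ab = c_ab / n
    p_a, p_b = c_a / n, c_b / n
    return log2(p_ab / (p_a * p_b))

# "strong tea": 30 co-occurrences against 1,000 and 500 individual occurrences
# in a 1,000,000-word corpus (illustrative numbers only).
print(round(pmi(c_ab=30, c_a=1_000, c_b=500, n=1_000_000), 2))
```

A strongly positive score, as in this example, indicates that the pair co-occurs far more often than chance would predict, which is exactly the behaviour expected of a collocation under this metric.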