Extraction of Vietnamese Collocation from Text Corpora

VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY

DO THI NGOC QUYNH

EXTRACTION OF VIETNAMESE COLLOCATION

FROM TEXT CORPORA

MASTER THESIS

Hanoi – 2011

Major: Computer Science

Code: 60 48 01

SUPERVISOR: Doctor Le Anh Cuong

Table of Contents

1.1 Definitions

1.2 Related works and motivation

1.3 Contribution of the thesis

2 Collocation: concept, roles and applications

2.1 Collocations’ characteristics

2.1.1 Recurrent

2.1.2 Arbitrary

2.1.3 Domain-dependent

2.1.4 Non-substitutability (the closely linked in terms of vocabulary)

2.2 Classification of collocations

2.2.1 Idiomatic Phrases

2.2.2 Support Verb Construction

2.2.3 Fixed Phrases

2.3 Applications

2.4 Vietnamese collocations

3 Basic methods in Collocation extraction

3.1 Frequency

3.2 Hypothesis testing

3.2.1 T-Test

3.2.2 Chi-Square

3.3 Point-wise Mutual Information (PMI)

4 Our proposal for extracting Vietnamese collocation

4.1 Patterns for Vietnamese collocation

4.2 The Linguistic Measure

4.3 Designed model

5 Experiments

5.1 Data preparation

5.1.1 Collecting corpora

5.1.2 Extracting bi-grams

5.1.3 Adding syntactic information to bi-grams

5.2 The test models

5.3 Experimental results with statistical methods

5.3.1 Bi-grams with syntactic information

5.4 The experiments of our proposal

ABSTRACT

Collocations have wide application in linguistics, in dictionary compilation, and in natural language processing. Therefore, extracting collocations for each language is necessary, both to improve the accuracy and naturalness of natural language processing applications and to make learning a new language easier. However, in Vietnam, the study of collocation is still quite a new field. This thesis studies several collocation extraction methods in order to find an efficient model for Vietnamese collocation extraction. The methods considered are based on commonly used classical statistical measures: frequency, t-test, chi-square, and mutual information. We also propose a general method that uses a linguistic measure to increase the accuracy of the extraction process. The input data consist of POS-tagged data and parsed data. By running the program with different methods and with combinations of multiple methods, and by comparing their accuracy, we identify an efficient method for extracting Vietnamese collocations from text corpora.

CHAPTER 1: Introduction

Firth [7] regards collocation as an abstraction at the syntagmatic level that is not directly related to the meaning of the words that constitute it. Choueka [5] describes a collocation as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose meaning cannot be inferred directly from the meanings of its component words. According to Benson [2], a collocation is a fixed and recurrent combination of words. Thus, Firth pays attention to the lexical aspect of collocation, while Choueka emphasizes the syntactic function of a collocation in the text. Benson's definition is one of the most widely used, but it ignores a number of features and attributes of collocations that matter in applications such as machine translation; for example, a collocation cannot be translated from English into Vietnamese word by word.

A collocation is an expression of two or more words that corresponds to a conventional way of saying things. Collocations are also described as a class of word groups that lies between idioms and free word combinations [4]. However, it is usual to draw a line between a phrase and a collocation. An idiom or phrase may be defined as an expression that is peculiar to the language, either grammatically or, especially, in having a meaning that cannot be derived from the sum of the meanings of its elements. It is practically impossible to guess the meaning of an idiom from the words it contains. Moreover, the meanings of idioms are often stronger than the meanings of non-idiomatic phrases.

Many studies of collocation have been conducted for English, but no standard definition of collocation has been established; the definition adopted depends on the viewpoint and purpose of each researcher.

In this thesis, we adopt the following definition: a collocation is a combination of words that often appear together within a normal range in the text, and whose positions and grammatical relations are relatively fixed. Collocations have wide application in linguistics [2, 21, 23], in dictionary compilation [11], and in natural language processing [4, 16, 18, 25, 27]. Therefore, extracting collocations for each language is necessary, both to improve the accuracy and naturalness of natural language processing applications and to make learning a new language easier. In addition, collocation translation improves the quality of machine translation. Automatic identification of important collocations to be listed in a dictionary is a task of computational lexicography. Knowledge of collocations can also improve the performance of information retrieval systems.

Statistical methods have played a prominent role in collocation extraction. The frequency measure has been used to identify particular types of collocations. Mutual information has been used to extract word pairs that tend to co-occur within a fixed-size window (normally 5 words), in which the extracted words may not be directly related. The t-test has been suggested for finding the words whose co-occurrence patterns best distinguish between two words. The likelihood ratio test has also been applied to collocation discovery.
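The window-based mutual information measure mentioned above can be illustrated with a minimal Python sketch. The function name `window_pmi`, the toy input, and the way the probabilities are estimated (plain relative frequencies) are assumptions made for illustration; they are not taken from the thesis.

```python
import math
from collections import Counter


def window_pmi(tokens, window=5):
    """Score ordered word pairs (w1, w2) by pointwise mutual information.

    A pair is counted whenever w2 occurs at most `window` positions after w1.
    Probabilities are plain relative frequencies, which is an assumption of
    this sketch rather than a detail given in the text.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w1 in enumerate(tokens):
        # Look ahead at the next `window` tokens only.
        for w2 in tokens[i + 1 : i + 1 + window]:
            pair_counts[(w1, w2)] += 1

    n_words = len(tokens)
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores
```

Calling `window_pmi(tokens, window=5)` on a tokenized corpus returns a dictionary from word pairs to scores. Pairs made of rare words tend to receive high scores under this measure, so in practice a minimum frequency is often required before ranking.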

CHAPTER 2: Related works

A good example of this type of problem is Halliday's example of strong vs. powerful tea (Halliday, 1966, p. 150). It is a convention in English to talk about strong tea, not powerful tea, although any speaker of English would also understand the latter, unconventional expression. On this view, a collocation can be defined as a combination of words that does not follow a general rule of grammar or semantics. From some points of view, collocations are fixed and inflexible. The meaning of a collocation usually cannot be inferred from the meanings of its component words, and replacing one component with a synonym can completely change the meaning of the collocation.

Collocations are also understood as idiosyncratic syntagmatic combinations of lexical items (Fontenelle, 1992, p. 222): heavy rain, light breeze, great difficulty, grow steadily, meet requirement, reach consensus, pay attention, ask a question. Unlike idioms (kick the bucket, lend a hand, pull someone’s leg), their meaning is fairly transparent and easy to decode. Unlike regular productions (big house, cultural activity, read a book), collocational expressions are highly idiosyncratic, since the lexical items that a headword combines with in order to express a given meaning are contingent upon that word (Mel’čuk, 2003).

As has been pointed out by many researchers (Cruse, 1986; Benson, 1990; McKeown and Radev, 2000), collocations cannot be described by means of general syntactic and semantic rules. They are arbitrary and unpredictable, and therefore need to be memorized. They constitute the so-called semi-finished products of language (Hausmann, 1985) or the islands of reliability (Lewis, 2000) on which speakers build their utterances.

In addition, collocation is a special problem in linguistics. Syntax imposes constraints on word order or on the occurrence of particular phrasal types such as PPs or NPs, and lexical semantics imposes [...]. Joachim Wermter and Udo Hahn [1] introduced a linguistic measure for identifying PP-verb collocations in German, which is based on the property of non- or limited modifiability.

Because of the popularity of English, a large share of the work on collocation extraction concerns the English language (Choueka, 1988; Church et al., 1989; Church and Hanks, 1990; Smadja, 1993; Justeson and Katz, 1995; Kjellmer, 1994; Sinclair, 1995; Lin, 1998), among many others. Choueka (1988) provides a method for detecting (consecutive) n-grams simply by calculating co-occurrence frequency. Justeson and Katz (1995) apply a POS filter to the pairs they extract (Kjellmer, 1994). Smadja (1993) uses the z-score in conjunction with several heuristics (e.g., the systematic occurrence of two lexical items at the same distance in the text) and extracts predicative collocations, rigid noun phrases, and phrasal templates. He then uses a parser to validate the results. Parsing is shown to lead to an increase in accuracy from 40% to 80%. Church et al. (1989) and Church and Hanks (1990) use POS information and a parser to extract verb-object pairs, which are then ranked according to the mutual information (MI) measure. Lin (1998) also proposes a hybrid approach based on a dependency parser. The extracted candidates are then assessed with MI.
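The combination of plain co-occurrence frequency with a POS filter, as described above, can be sketched in Python as follows. The tag names (`"A"`, `"N"`), the allowed tag patterns, the `(word, tag)` input format, and the function name `frequent_bigrams` are all hypothetical choices for this sketch and are not taken from the cited works.

```python
from collections import Counter

# Hypothetical tag patterns for this sketch: adjective-noun and noun-noun.
ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}


def frequent_bigrams(tagged_tokens, allowed=ALLOWED_PATTERNS, min_count=2):
    """Count adjacent word pairs whose POS tags match an allowed pattern.

    `tagged_tokens` is a list of (word, tag) tuples. The result lists the
    surviving pairs with their counts, most frequent first; pairs seen
    fewer than `min_count` times are dropped.
    """
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in allowed:
            counts[(w1, w2)] += 1
    return [(pair, c) for pair, c in counts.most_common() if c >= min_count]
```

The filter removes frequent but uninteresting pairs such as function-word sequences before ranking, which is the motivation for adding syntactic information to raw frequency counts.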

Collocations are also important in text production tasks such as machine translation [2, 21, 23] and in natural language processing [4, 16, 18, 25, 27]. Furthermore, they are useful in a variety of other applications, such as word sense disambiguation (Brown et al., 1991) and parsing (Alshawi and Carter, 1994). Collocations are particularly important because of their prevalence in language, across all domains and genres. According to Jackendoff (1997, p. 156) and Mel'čuk (1998, p. 24), a large number of collocations appear in the vocabulary of a language. The past decade has witnessed considerable development of collocation extraction techniques for both monolingual and (parallel) multilingual corpora. We can mention here only part of this work: (Berry-Rogghe, 1973; Church et al., 1989; Smadja, 1993; Lin, 1998; Krenn and Evert, 2001) for monolingual extraction, and (Kupiec, 1993; Wu, 1994; Smadja et al., 1996; Kitamura and Matsumoto, 1996; Melamed, 1997) for bilingual extraction via alignment.

In the first paper on fuzzy decision making, Raj Kishor Bisht and H. S. Dhami [3] suggest a way to check whether a word combination can be considered a collocation. Fuzzy logic allows a logic-based model to be built from the reasoning behind the existing methods. The resulting model has the simplicity of a logic-based model and performs better than the existing statistical models.

In the study of collocation, German is the second most investigated language. The earliest work is that of Breidt (1993), followed more recently by Krenn and Evert (Krenn and Evert, 2001; Evert and Krenn, 2001; Evert, 2004). Breidt uses MI and the t-score and compares the accuracy of the results as different parameters change, such as window size, the presence versus absence of lemmatization, corpus size, and the presence versus absence of POS and syntactic information. Krenn and Evert (2001) then used a German chunker to extract syntactic pairs such as P-N-V. Their work laid the basis for formal methods and for the evaluation methodology in collocation extraction. Zinsmeister and Heid (2003, 2004) focused on N-V and A-N-V combinations identified using a stochastic parser.

Thanks to the outstanding work of Gross on lexicon-grammar (1984), French is one of the most studied languages with respect to the distributional and transformational properties of words. This work was done before the computer era and the advent of corpus linguistics; automatic extraction was performed later, for example in (Lafon, 1984; Daille, 1994; Bourigault, 1992; Goldman et al., 2001).

There are also a number of collocation extraction studies for other languages. Over the past 20 years, the field of natural language processing has achieved many accomplishments (such as tagging, topic detection, and information retrieval). However, most of this work was done for Western languages, and its value is lost when applied to other languages. Only very recently have Vietnamese researchers been drawn to linguistics and Vietnamese standard grade [...]

[...] using Chi-Square to find collocations. Chi-Square values are calculated from a large data set (VnExpress data (199MB) and Wikipedia data (270MB), on about 200 subjects), and a threshold value, which the authors call coloThreshold, is used to determine the collocations.
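As a rough illustration of thresholding on a Chi-Square value, the following Python sketch computes Pearson's chi-square statistic for the 2x2 contingency table of a bigram and compares it with a threshold. The function names and the default threshold of 3.841 (the standard critical value for one degree of freedom at the 0.05 level) are assumptions of this sketch; the text does not give the value of coloThreshold or the authors' exact procedure.

```python
def chi_square(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2x2 contingency table.

    o11: number of bigrams (w1, w2)
    o12: number of bigrams (w1, not w2)
    o21: number of bigrams (not w1, w2)
    o22: number of bigrams (not w1, not w2)
    """
    n = o11 + o12 + o21 + o22
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator


def is_collocation(o11, o12, o21, o22, threshold=3.841):
    """Accept a bigram when its chi-square value exceeds the threshold.

    The default threshold is an assumption of this sketch, not the
    coloThreshold value used by the authors.
    """
    return chi_square(o11, o12, o21, o22) > threshold
```

A higher threshold keeps fewer candidates but makes the accepted pairs more reliable, so the choice of threshold trades recall against precision.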

According to English-Vietnamese dictionaries, collocation means "an arrangement in place, the placement order." In linguistics, collocation can be understood as "(a) use of words, (a) combination of words". In Vietnamese, there is a concept very close in meaning to collocation: the fixed phrase [38]. A fixed phrase is a combination of several words that exists as a ready-made unit, like a word, and that has semantic constituents and stability, as a word does.

The meaning of a fixed phrase is built up and organized through the structure of the phrase, and it is generally figurative. Therefore, the phrase as a whole generally cannot be understood only from its surface, that is, from the meaning of each constituent. For example: anh hùng rơm, đồng không mông quạnh, tiếng bấc tiếng chì…

Furthermore, the meaning of a fixed phrase as a whole corresponds to the whole of its material structure. This means that it is highly expressive; for example, the fixed phrases rán sành ra mỡ, méo miệng đòi ăn xôi vò, and say như điếu đổ are expressive to the fullest extent. Fixed phrases should be distinguished from neighbouring units, because they are easily confused with compound words and free phrases.

If we temporarily accept names whose conceptual content is not immediately defined, the classification of Vietnamese fixed phrases can be summarized as follows [36]:

Figure 3.1: Some types of Vietnamese fixed phrase

The classification of Vietnamese fixed phrases above does not draw absolute boundaries between the categories, and not every unit in a category shows the properties of a pure type. There are intermediate units formed in the manner of free expressions, which are still less stable. There are also units that are highly expressive but whose durability and structural integrity are low.
