Tagset Evaluation and Automatic Error Verification in POS-Tagged Corpus


VIETNAM NATIONAL UNIVERSITY HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY

THI-THANH-TAM DO

TAGSET EVALUATION

AND AUTOMATIC ERROR VERIFICATION

IN POS TAGGED CORPUS

MASTER THESIS

(Natural language processing)

Ha Noi - 2012

Branch of knowledge: Information technology

Major: Computer science

Code: 60 48 01

Supervisor: Dr Nguyen Phuong Thai

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
ABSTRACT

CHAPTER 1: INTRODUCTION AND MOTIVATION
1.1 Characteristics of Vietnamese language
1.2 Vietnamese part of speech
1.2.1 Criteria to classify
1.2.2 The ways to build up tagset
1.3 Corpora
1.3.1 VietTreeBank
1.3.2 VnQtag
1.4 Motivation
1.5 Organization of the thesis

CHAPTER 2: EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE
2.1 Tagset evaluation
2.1.1 Introduction
2.1.2 Tagset
2.1.3 A method for evaluating distributional properties of tagsets
2.1.3.1 Internal criterion
2.1.3.2 External benchmark
2.1.3.3 Algorithm
2.1.4 Result of tagset evaluation
2.2 Possibility of Tagsets convertibility
Result of tagset convertibility

CHAPTER 3: AUTOMATIC ERROR VERIFICATION OF POS-TAGGED CORPUS
3.1 Concept related to variation n-gram method
3.2 Types of Vietnamese tagging error
3.3 An algorithm for detecting errors
3.4 Classifying variations
3.5 Result of detecting errors in POS tagging
3.6 Word segmentation
3.6.1 Word in Vietnamese
3.6.2 N-gram in word segmentation
3.6.3 Result of detecting errors in word segmentation

CHAPTER 4: CONCLUSION AND SUMMARY

BIBLIOGRAPHY

APPENDIX
A.1 The Vietnamese treebank tagset
A.2 Vietnamese Tagset (VietTreeBank)
A.3 Tagset 3 (25 tags)
A.4 Tagset 4 (40 tags)
A.5 Syntax function tags in VTB
A.6 Adverbial classification tag of verb in VTB
A.7 Phrase tagset in VTB
A.8 Clause tagset in VTB

LIST OF FIGURES

Figure 1 The features of Vietnamese type
Figure 2 Purity as external evaluation criterion for cluster quality
Figure 3 N-gram and variation nuclei in VTB corpus with n up to 29

LIST OF TABLES

Table 1 The expression of grammatical meaning in Vietnamese
Table 2 Corpus with VnQtag tagset annotation
Table 3 Principal differences between Vietnamese and English
Table 4 Some frames found in the corpus
Table 5 Result of tagset evaluation method
Table 6 Some properties in tagset convertibility method in Hoangtube
Table 7 Statistics of ambiguous word types in the VnQtag corpus
Table 8 Statistics of ambiguous tokens in the VnQtag corpus
Table 9 Detailed statistics of ambiguous word types in the VnQtag corpus
Table 10 Statistics of errors in the corpus
Table 11 Details of n-grams in the tagged corpus
Table 12 Error and ambiguity statistics for the word segmentation algorithm
Table 13 Details of contexts and variations in the VTB corpus

CHAPTER 1 INTRODUCTION AND MOTIVATION

1.1 Characteristics of Vietnamese language

Every language has its own features, and Vietnamese is no exception. To understand Vietnamese better, we list some of its prominent features and compare it with other languages such as Chinese and English.

According to M. Ferlus and other domestic and international researchers, Vietnamese is an indigenous language that belongs to the Mon-Khmer family of the South Asian languages and is closely related to the Muong language. Vietnamese is also an isolating language with three prominent features. Firstly, the syllable is the basic unit from which words and sentences are formed. A syllable may be a single word or an element of a complex word, a compound word or a reduplicative word. Secondly, Vietnamese words are not inflected. In particular, there is no difference between singular and plural nouns; compare “hai cuốn sách” (two books) and “một cuốn sách” (one book). Thirdly, grammatical meaning is expressed mainly through word order and expletives (function words). Given the expletives “sẽ”, “đã” and “không” and the sentence “Tôi ra ngoài”, we can form three sentences with different meanings: “Tôi sẽ ra ngoài”, “Tôi đã ra ngoài” and “Tôi không ra ngoài”.

Figure 1 The features of Vietnamese type (the characteristics of Vietnamese)

Other languages, such as Chinese and Thai, are also isolating languages, whereas English, French and Russian are inflectional languages. The two types therefore differ in several respects, as a comparison of Vietnamese, Chinese and English sentences shows.

Table 1 The expression of grammatical meaning in Vietnamese

             Vietnamese              Chinese        English
Word order   Tôi yêu anh ấy          Wo ai ta       I love him
Expletive    Tôi không yêu anh ấy    Wo bu ai ta    I do not love him

Unlike in Vietnamese and Chinese, when the word order of the English sentence above changes, the object pronoun turns into the subject pronoun (him → he).

1.2 Vietnamese part of speech

1.2.1 Criteria to classify

In European languages, the notion of POS is tied to morphological categories such as gender, number, mood, and so on. In Vietnam, there are two views:

morphological modification (Le Quang Trinh, Nguyen Hien Le, Ho Huu Tung)

words in tags, or to define the POS of words, it is necessary to rely on certain criteria. So far, Vietnamese linguists have largely agreed on the following criteria (Diep Quang Ban and Hoang Van Thung, 2010):

a General meaning: “The meaning of a POS is the general meaning of a group of words, formed by generalizing over vocabulary into a common grammatical category (lexical-grammatical category).” POSs fit the definition of a classification category: they are groups with a very large number of words, and each group shares a classifying feature such as object, quality, action or state. Therefore, nhà, bàn, chim, học sinh, con, quyển, sự, and so on are classified as nouns, because their lexical meaning is generalized and abstracted as objects. Their grammatical category is the noun.

b Combination ability: Given their general meaning, words can take part in a meaningful combination. Some words can replace each other in a certain position of the combination, while the rest of the combination provides the context in which the replacement can occur. For example, nhà, bàn, chim, cát, and so on can appear in and replace each other in combinations such as nhà này, chim này, cát này, and are therefore classified as nouns.

c Syntactic function: When taking part in the composition of a sentence, words that can stand in one or several particular positions, can replace each other in those positions, and express the same syntactic relation to the other parts of the sentence can be classified into one POS. For instance, words such as nhà, bàn, chim, cát are nouns: they can be subjects of sentences, and the subject function is a syntactic function that classifies them as nouns.

1.2.2 The ways to build up tagset

Nowadays, two kinds of POS tagsets have been developed, of which the first has received much more attention from linguistic researchers.

The first kind is based on 8 basic POS tags that are widely used in dictionaries and linguistic materials: noun, verb, adjective, pronoun, adverb, conjunction, interjection and emotive word. From these 8 basic tags, several finer POS tagsets have been built. Each researcher relies on certain criteria to refine the tagset (the criteria are discussed in section 1.2.1). Notably, the VnPOS tagset of Tran Thi Oanh contains 14 tags; VietTreeBank consists of 17 tags; VnQtag has 59 tags (see appendix).

The second kind is built by mapping a tagset from another language onto Vietnamese, based on the correspondence between words of the two languages (Dinh Dien and Hoang Kiem, 2003).

1.3 Corpora

Annotated corpora are large bodies of text with linguistically informative mark-up. They play an important role in current work in computational linguistics, so great attention has gone into developing them. Many countries have their own corpora. Well-known examples include the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997) and the Lancaster Corpus of Mandarin Chinese (Tony McEnery and Richard Xiao, 2005). In Vietnam, the notable corpora are VnQtag, VnPos and VTB.

To build a corpus, some obligatory criteria must be met (McEnery and Wilson, 2001, p. 29):

• Sampling and representativeness: the elements of a corpus must be general, diverse and plentiful. A sample is representative if what we find for the sample also holds for the general population.

• Finite size: the bigger a corpus is, the more highly it is valued, but its size is still finite.

Building a large corpus manually takes a lot of time and requires extensive linguistic knowledge. Even so, the quality of a manually built large corpus is not guaranteed. Our thesis therefore looks for ways to detect such problems and improve corpus quality.

The two corpora used in our experiments are VietTreeBank and VnQtag. Below, we discuss how each corpus was built.

1.3.1 VietTreeBank

VietTreeBank is the result of the national project VLSP and was developed by the VTB group (Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen and annotators). The corpus includes 142 documents on political and social topics from the Youth newspaper, corresponding to 10,000 syntactically annotated Vietnamese sentences (word segmentation, POS tagging, syntactic structure). The group used MEM and CRF machine learning models to assign POS tags; the accuracy of the model is over 93%. VTB was developed to support the building of tools for word segmentation, POS tagging, syntactic parsing, and so on. The VTB group chose two criteria to classify POS: combination ability and syntactic function. For instance, a noun can serve as the subject or object of a sentence. Besides, a noun can combine with numerals (three, four) and attributes (each, every).

A POS tag can contain information about the basic word class (noun, verb, adjective, and so on), morphological information (countable or uncountable), subcategory (verb taking a noun, verb taking a clause, etc.), semantic information or other syntactic information. The VTB group built its tagset on the basic word classes only, without other information such as morphology or subcategory (see the tagset in the appendix).

In addition to POS information, the group describes the basic syntactic elements, namely phrases and clauses. Syntactic tags are the most fundamental information in a syntax tree; they form the spine of the tree. A7 and A8 in the appendix list the phrase and clause tagsets, respectively.

The function tag of a syntactic element expresses its role in the syntactic element at the next higher level. Function tags are assigned to the main elements of the sentence, such as subject, predicate and object. They provide information that helps us identify basic grammatical relations.

1.3.2 VnQtag

The VnQtag tagset was built as part of the KC01 national project by a development group including Nguyen Thi Minh Huyen, Vu Xuan Luong and Le Hong Phuong. The group based its work on a printed dictionary (the Vietnamese dictionary of the Institute of Linguistics, 2000). First, they segmented sentences into words using a syllable automaton and a lexical automaton. Then, they used the Qtag tagger to assign POS labels to Vietnamese words. There are 59 POS labels (see appendix). In addition to grammatical information, the group used semantic information (the general meaning of a word) to arrive at the 59 word class labels. For example, words are considered verbs if they express the general meaning of a process. Process meaning is expressed directly in the actions of an object; this is action meaning. State meaning is generalized from the relation to the action of an object in time and space (Vietnamese grammar, Diep Quang Ban and Hoang Van Thung). The automatic tagging experiment was carried out on 7 documents, listed in table 2. The annotated corpus plays an important role in NLP: it is a database of high-quality linguistic resources, and it follows international standards for data representation.

The resulting corpus has the following format: each lexical unit and its POS stand on one line; the syllables of a lexical unit are separated by spaces, and the word and its POS are separated by a tab. Punctuation marks and other symbols in the text are treated as lexical units labelled with the corresponding punctuation tag. The corpus includes 7 documents of different types, such as stories, novels, science and press. It gathers common words used in daily life and in the press, as well as words usually seen in literary works and scientific and technical terms.
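The one-unit-per-line corpus format described here is simple to read programmatically. The following is a minimal sketch in Python, assuming UTF-8 text with a tab between the lexical unit and its tag. The function name `parse_tagged_lines` and the sample lines are illustrative and are not taken from the VnQtag distribution; only the tag Pp appears in the thesis, while the tag N in the sample is a placeholder.

```python
def parse_tagged_lines(lines):
    """Parse 'lexical unit<TAB>tag' lines into (unit, tag) pairs.

    Assumption: the last tab on a line separates the unit from its tag,
    and syllables inside a unit are separated by single spaces.
    """
    pairs = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        unit, sep, tag = line.rpartition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        pairs.append((unit.strip(), tag.strip()))
    return pairs


# Hypothetical sample lines, made up for illustration.
sample = ["tôi\tPp\n", "\n", "học sinh\tN\n"]
pairs = parse_tagged_lines(sample)

# Number of syllables per lexical unit (syllables are space-separated).
syllable_counts = [len(unit.split(" ")) for unit, _ in pairs]
```

Splitting on the last tab rather than on any whitespace keeps multi-syllable units such as "học sinh" intact.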


Table 2. Corpus with VnQtag tagset annotation

Id   Document              Type   Number of lexical units   Number of processing units (including punctuation)
     rang dong - part I
     rang dong - part II
     phong thu quoc gia

1.4 Motivation

So far, the reader may not yet see which problems this thesis addresses or why we chose them; this section discusses both. Linguistic theories were first developed to describe Indo-European languages, and many significant achievements have been made there. In our country, work in the NLP field has been under way since 1990; however, the results achieved are still limited. Moreover, processing Vietnamese is the responsibility of the Vietnamese; we cannot expect foreign researchers to solve it for us (Ho Tu Bao, 2001). This thesis therefore aims to contribute to Vietnamese language processing by concentrating on evaluating tagsets and detecting tagging errors.

Natural language processing is done in five stages:

• Morphological and lexical analysis: The lexicon of a language is its vocabulary, which includes its words and expressions. Morphology is the identification, analysis and description of the structure of words. Words are generally accepted as the smallest units of syntax. Syntax refers to the rules and principles that govern the sentence structure of any individual language.

Lexical analysis: The aim is to divide the text into paragraphs, sentences and words. Lexical analysis cannot be performed in isolation from morphological and syntactic analysis.

• Syntactic analysis: The words in a sentence are analyzed to determine the grammatical structure of the sentence. The words are transformed into structures that show how they relate to each other. Some word sequences may be rejected if they violate the rules of the language for how words may be combined.

• Semantic analysis: It derives an absolute meaning from context; it determines the possible meanings of a sentence in a context.

• Discourse integration: The meaning of an individual sentence may depend on the sentences that precede it and may influence the meaning of the sentences that follow it.

• Pragmatic analysis: It derives knowledge from external commonsense information; it means understanding the purposeful use of language in situations, particularly those aspects of language that require world knowledge. For example, the sentence “Do you know what time it is?” should be interpreted as a request.

Our thesis concentrates on the first stage (i.e. morphological analysis) of natural language processing. It is a very important preprocessing step for the following stages, such as syntactic analysis and semantic analysis.

Our thesis addresses two main problems and two smaller ones. The main problems are evaluating tagsets and detecting tagging errors automatically; the smaller ones are checking the convertibility of tagsets and detecting segmentation errors automatically.

a Evaluation and convertibility of tagsets

In the previous section, we mentioned several tagsets, such as VietTreeBank (17 tags), VNPOS (15 tags) and VNQTag (59 tags). Such inconsistent tagsets raise several questions: which tagset is better? What methods can evaluate these tagsets, and how can we choose the right POS tagset for a given application? The first part of this thesis focuses on answering these questions.

Another aspect we discuss is the convertibility of tagsets. The choice of tagset strongly affects the difficulty of POS tagging. In particular, a large tagset increases the difficulty, while a smaller one may not be sufficient for a given purpose. It is therefore necessary to balance quality and quantity in a tagset, that is:

• Clearer information (i.e. classification into more parts of speech based on concrete meaning)

• Ease of tagging (i.e. as few POS tags as possible)

To balance the two, we carry out experiments on a source tagset (ST) and a target tagset (TT): we count the number of words that become ambiguous after conversion and draw our conclusions from the result.

b Detecting POS tagging and word segmentation errors

• If each word belonged to only one label, a finite dictionary of words and their labels would solve the POS tagging problem completely. In fact, however, one word can belong to more than one label, which leads to ambiguity and errors in POS tagging. Fixing such errors manually costs a lot of time and money. We want to find a method that detects errors automatically in order to reduce this cost.

• A sentence may have several different segmentations. Take, for example, the sentence chiếc xe đạp nặng quá. Way 1: chiếc/ xe/ đạp/ nặng/ quá. Way 2: chiếc/ xe đạp/ nặng/ quá. Here, “/” separates words. Both segmentations are acceptable because each yields a meaningful sentence.

One of the reasons for this difference is listed in the following table. The last problem in our thesis is word segmentation:

Table 3. Principal differences between Vietnamese and English

Character   Vietnamese   English

combination of syllable

All of the above reasons motivated me to look for answers.

1.5 Organization of the thesis

The thesis is organized into four main chapters, as follows:

Chapter 1: Introduction and motivation.

Chapter 1 gives a general picture of Vietnamese, including the features of the language and its parts of speech. It also discusses the reasons for choosing the topic of the thesis.

Chapter 2: Evaluating distributional properties and conversion possibility of tagsets in Vietnamese.

Chapter 2 takes a closer look at tagsets, for instance how a tagset is built up and how labels are merged, and introduces the basic notions needed to evaluate the properties of tagsets.

Chapter 3: Automatic error verification of POS-tagged corpus

This chapter introduces the notions related to the error detection method, then presents the algorithm and discusses how variations are classified into errors and ambiguities.

Chapter 4: Summary and conclusion

This chapter discusses three issues: the theoretical contributions of the thesis, its experimental contributions and future directions. It sums up the results we achieved and discusses work that remains to be done.

CHAPTER 2:

EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE

2.1 Tagset evaluation

2.1.1 Introduction

Tagset evaluation has received much attention from NLP researchers for more than 20 years. It allows us to test and assess the impact of tagset modifications on results by using different versions of a given tagset on the same texts (Martin Volk and Gerold Schneider, 1998). In 2000, Saso Dzeroski, Tomaz Erjavec and Jakub Zavrel compared the accuracy of tagsets formed by decreasing the cardinality of the tagset: omitting certain attributes of the tagset, or omitting almost all attributes except certain ones. Accuracies were computed using a black-box combiner (Halteren, Dzeroski). In the same year, Hervé Déjean presented two kinds of tagset evaluation, a global and a local one. The first evaluates the initial grammar generated by ALLiS. The second uses the notion of reliability: the reliability of an element is the ratio between its frequency in the structure and its total frequency in the corpus. For Indian languages, Madhav Gopal, Diwakar Mishra and Devi Priyanka Singh (2010) discussed the evaluation of several tagsets: the ILMT tagset, the JNU-Sanskrit tagset, the LDCIL tagset and the Sanskrit consortium tagset.

Vietnamese is an isolating language, and word order is an important source of syntactic information. To evaluate Vietnamese tagsets, this chapter introduces a simple method that uses internal and external criteria. The internal criteria use frequent frames and purity to check whether tags are assigned accurately. The external criteria examine reductions in the cardinality of the tagset to check whether information quality is retained. Indeed, a number of evaluations have shown that many tagging errors are caused by distinctions within major categories that are sometimes too fine (Eugenie Giesbrecht, 2008).

2.1.2 Tagset

A POS is a set of words with some grammatical characteristic(s) in common, and each POS differs in grammatical characteristics from every other POS. For example, nouns have different properties from verbs, which have different properties from adjectives, and so on.

A tagset is a set of POS tags built up according to the criteria in section 1.2. Tagsets therefore vary in the number of tags and are used in various applications.

Properties of a tagset: a tagset needs to guarantee the following properties: retaining linguistic features, reflecting syntactic structure, allowing accurate tagging, and reducing the number of ambiguous words during tagging.

2.1.3 A method for evaluating distributional properties of tagsets

2.1.3.1 Internal criterion

Among the properties of tagsets, we place a high value on the possibility of assigning tags accurately in a corpus; this is the internal criterion. We examine this criterion through the notion of a frame and a purity formula. The frame represents the local context under consideration and indicates which tags can appear in it. The purity formula then assesses how strongly the tags converge in that local context.

Discussion about purity

As mentioned previously, we use the purity formula, an external evaluation criterion for cluster quality, to evaluate tagsets (Stanford natural language processing, 357). Purity is a simple and transparent evaluation measure that is widely used to assess cluster quality. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N. Formally:

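\[ \mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \left| \omega_k \cap c_j \right| \qquad (1) \]

Where:

$\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters

$C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes

We interpret $\omega_k$ as the set of documents in $\omega_k$ and $c_j$ as the set of documents in $c_j$ in equation (1).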

High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each document gets its own cluster. For example:

Figure 2 Purity as external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.

Frame notion

The notion of a frame was introduced by Mintz in 2006. In 2010, Dickinson and Jochim redefined it as follows: in a local context, a frame consists of three words, in which the two words surrounding a target word lead to the target’s categorization. We use frames to test the quality of distributional mappings. In English, for example, the frame “you_it” generally predicts a verbal category for the target (i.e. the target word may be hit, beat, eat or kiss). In Vietnamese, the frame “mẹ_là” leads to target words that are pronouns (Pp), e.g. “tôi, anh, chị, bác”. To obtain more reliable results, we use frequency and the notion of a frequent frame. Frequent frames supply category information in child language corpora. Not all frames play the same role in a corpus: the more often a frame appears, the more linguistic information it carries. We set the frequency threshold at 0.03% of the total number of frames. For example, if a corpus has 10000 frames, the threshold is 3 (10000 × 0.03%), so a frame that appears more than 3 times is considered a frequent frame.
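The frame and frequent-frame notions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus is a flat list of (word, tag) pairs, sentence boundaries are ignored, and the threshold is rounded up to an integer, which the text does not specify. The function names and the toy tokens (including the tags N and V) are hypothetical.

```python
from collections import Counter


def extract_frames(tokens):
    """Count frames 'left_right' over a list of (word, tag) pairs.

    Returns a dict mapping each frame to a Counter of the tags observed
    on the target word in the middle position.
    """
    frames = {}
    for i in range(1, len(tokens) - 1):
        left, right = tokens[i - 1][0], tokens[i + 1][0]
        target_tag = tokens[i][1]
        frames.setdefault(f"{left}_{right}", Counter())[target_tag] += 1
    return frames


def frequency_threshold(total_frames):
    """0.03% of the frame total, rounded up (the rounding is an assumption)."""
    return -(-total_frames * 3 // 10000)  # integer ceiling of total * 0.0003


def frequent_frames(frames, threshold):
    """Keep the frames that occur more often than the threshold."""
    return {f: tags for f, tags in frames.items() if sum(tags.values()) > threshold}


# Toy token sequence, made up for illustration.
tokens = [("mẹ", "N"), ("tôi", "Pp"), ("là", "V"),
          ("mẹ", "N"), ("anh", "Pp"), ("là", "V")]
frames = extract_frames(tokens)
kept = frequent_frames(frames, threshold=1)
```

The threshold function takes the frame total as a parameter because the text does not say whether the total counts distinct frames or frame occurrences.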

Next, the purity formula is applied to measure how tags are distributed within a frame, since each tag accounts for a different share of the occurrences of a frame. To calculate the purity value, we take the highest tag frequency in each frame, add these maxima, and divide by the total number of occurrences of all tags. The higher the purity value, the more likely it is that words can be tagged accurately. For instance, suppose we have two frames, Tôi_ở and mẹ_bảo. The first frame appears 4 times in a corpus, and its target word takes two tags, Vits and Vitn (1 time and 3 times, respectively). The second frame appears 8 times; the POS of the target word is Np 7 times and Pp once. Following the procedure above, the purity value is (3 + 7) / (4 + 8) = 10/12 ≈ 0.83.
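The purity computation over frames can be written as a short function. The sketch below follows the procedure described in the text (sum of the per-frame majority counts divided by the total count) and applies it to the two example frames; the function name `frame_purity` and the dictionary layout are illustrative choices, not the thesis implementation.

```python
def frame_purity(frames):
    """Purity over frames: sum of per-frame majority-tag counts / all counts.

    `frames` maps a frame name to a dict {tag: number of target words}.
    """
    majority = sum(max(tags.values()) for tags in frames.values())
    total = sum(sum(tags.values()) for tags in frames.values())
    if total == 0:
        raise ValueError("no frame occurrences")
    return majority / total


# The two frames from the example in the text.
example = {
    "Tôi_ở": {"Vits": 1, "Vitn": 3},   # 4 occurrences, majority tag Vitn (3)
    "mẹ_bảo": {"Np": 7, "Pp": 1},      # 8 occurrences, majority tag Np (7)
}
purity = frame_purity(example)  # (3 + 7) / (4 + 8)
```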

2.1.3.2 External benchmark

Normally, to evaluate a tagset, linguists map it onto a reduced tagset, because this helps check which linguistic features are retained. A reduced tagset is built by merging tags; but how should tags be merged? This is a difficult question that we need to solve.

Hervé Déjean (Universität Tübingen, 2000) discussed the theoretically minimal tagset, arguing that the quality of a tagset does not depend on the number of tags. That work builds the minimum tagset necessary to parse sentences whatever the domain. It starts from a tagset with one tag per structure (NP-VP) and then estimates that a tagset of about 20 tags is enough to parse a sentence into PS and clause structures.

Indeed, there are many ways to merge labels, which is why tagsets with different numbers of tags still exist. English is a morphologically marked language, so it is rather easy to identify tags that can be merged, such as the base form verb (VB) and the present tense verb (non-third person singular, VBP); this is not the case for Vietnamese. The tagsets used in our thesis are of two kinds:

Firstly, tagsets built by previous NLP researchers, for instance VnQtag and VietTreeBank.

Secondly, tagsets that we derived ourselves by merging labels based on features of Vietnamese. VnQtag has the largest number of tags, so we use it as the source tagset from which the other tagsets are generated.

2.1.3.3 Algorithm

2 Count the frames in the corpus, then calculate the frequency threshold from the total number of frames.

5 Finally, calculate the new purity value under each new tagset and count the lost ambiguous words.

Preparing data:

We carried out this method on a corpus annotated with the VnQtag tagset.

2.1.4 Result of tagset evaluation

The experiments were performed on the VnQtag corpus, using four VnQtag-annotated documents. We then merged some tags in VnQtag to form new tagsets, tagset 3 and tagset 4. We therefore have: VietTreeBank (18 tags), basic tagset 2 (8 tags), tagset 3 (25 tags) and tagset 4 (40 tags) (see appendix). To merge tags, we relied on the book Ngữ pháp tiếng Việt (Diệp Quang Ban), which organizes the Vietnamese POS system into two groups:

The book subdivides each major category further; for example, nouns have two main kinds, proper nouns and common nouns. Common nouns comprise synthetic and non-synthetic nouns, both of which are further divided into countable and uncountable nouns, and so on.

To obtain the 25-tag and 40-tag tagsets, we merged some noun and verb tags. Nouns and verbs are basic categories and have the largest numbers of words in Vietnamese. In the VnQtag tagset, nouns are subdivided into 8 tags and verbs into 10 finer tags. We applied the four tagsets to four annotated VnQtag documents to obtain the results in table 4 and table 5.

Table 4 Some frames found in the corpus

of the target word and its number of occurrences. In particular, the first cell contains the frame “mẹ_là”. This frame occurs 4 times in the corpus, and in all of them the target word is tagged Pp (pronoun).

Table 5 Result of tagset evaluation method

Document                  Words   Mapping        Tags   Frequent frames, total frames, threshold   Purity   Lost ambs
Những bài học nông thôn   8247    VietTreeBank   18     242, 3845, 2                               88.34%   172
                                  Basic tagset   8                                                 88.83%   362

Table 5 shows that the first document has 31520 words and a total of exactly 15706 frames. The threshold is calculated as 0.03% of the total number of frames, so here it is approximately 5. Running the experiment on the first document with the different tagsets, we obtained purity values of 60.69%, 82.86%, 82.06%, 69.71% and 79.09%, and 0, 331, 397, 55 and 137 ambiguous words, respectively. The share of ambiguous words relative to the total number of words in the document is small: 0%, 1.05%, 1.26%, 0.17% and 0.43%. The other documents can be read in the same way.
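The arithmetic in this paragraph can be checked directly. The short Python snippet below recomputes the threshold and the percentages from the figures quoted in the text (31520 words, 15706 frames, and the five ambiguous-word counts); the variable names are illustrative.

```python
total_words = 31520
total_frames = 15706
lost_ambiguous = [0, 331, 397, 55, 137]

# 0.03% of the frame total; the text quotes this as "approximately 5".
raw_threshold = total_frames * 3 / 10000

# Share of ambiguous words in the document, in percent, to two decimals.
percentages = [round(100 * n / total_words, 2) for n in lost_ambiguous]
```

The recomputed percentages agree with the values reported above, and the raw threshold of about 4.71 is consistent with the rounded value of 5.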

In conclusion, three tagsets, namely VietTreeBank, the basic tagset and tagset 4, are rated higher than the others, because their purity is high and the number of lost ambiguous words is quite small.

2.2 Possibility of Tagsets convertibility

The existence of different tagsets for the same language gives linguists more options. For English, there are several tagsets: the Brown tagset from 1967 (87 tags), the Susanne tagset from 1987 (353 wordtags), the Penn Treebank tagset from 1991 (36 tags) and the IBM Lancaster tagset from 1993 (132 tags). To choose correctly among them, researchers have studied the relationships between tagsets as well as their specific applications.

For Vietnamese, there are three notable tagsets: VnQtag (59 tags), VnPos (15 tags) and VietTreeBank (18 tags). Some Vietnamese linguists advocate a minimal tagset, that is, they prefer smaller tagsets. With a small tagset, tagging is easier and costs less time and money. We therefore want to test conversion from a large tagset into a small one. Of course, the reverse direction is always true. In the first direction, however, some words lose distinctions between their tags and become ambiguous, which is not a good sign. If their number is small, we can simply add some contextual or syntactic information to resolve them. For instance, Daniel Zeman (2008) used Interset (a tagset driver) to convert a source tagset into a target one, and Bartosz Zaborowski and Adam Przepiórkowski (2012) used a set of rules for converting particular tags.

In our thesis, we focus on the ability to convert from one tagset to another. What we want to find out is whether any large tagset can be converted easily to a small tagset with a minimal number of ambiguous words. Ambiguous words are words that lose, in the target tagset, a distinction made by the finer tags. In particular, we proceeded as follows:

• Identifying the tagsets that we want to check

• Identifying the annotated corpus as well as the tagger

• Calculating the number of words belonging to each POS tag of the tagset

o The number of ambiguous tokens in the corpus (when we convert a large tagset into a small one, some tags of the large tagset are merged to correspond to tags of the small tagset)

• Computing the percentage of ambiguous tokens and word types
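The steps above can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: it assumes the corpus is a list of (word, fine tag) pairs and that the conversion is given as a dictionary from fine tags to coarse tags. A word type counts as ambiguous when two or more of its fine tags fall into the same coarse tag, and a token counts as ambiguous when its word type is ambiguous; the latter is an assumption, since the exact token counting rule is not spelled out. The function names, the toy corpus and the mapping to a single tag V are hypothetical; only the tag names come from the VnQtag verb tags.

```python
from collections import defaultdict


def lost_distinctions(tagged_tokens, mapping):
    """Find word types whose fine tags collapse under a tag mapping.

    tagged_tokens: iterable of (word, fine_tag) pairs from the source tagset.
    mapping: dict fine_tag -> coarse_tag; unmapped tags are kept unchanged.
    Returns {word: {coarse_tag: sorted list of merged fine tags}}.
    """
    fine_tags = defaultdict(set)
    for word, tag in tagged_tokens:
        fine_tags[word].add(tag)

    lost = {}
    for word, tags in fine_tags.items():
        by_coarse = defaultdict(set)
        for tag in tags:
            by_coarse[mapping.get(tag, tag)].add(tag)
        merged = {c: sorted(ts) for c, ts in by_coarse.items() if len(ts) > 1}
        if merged:
            lost[word] = merged
    return lost


def ambiguity_rates(tagged_tokens, mapping):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tokens = list(tagged_tokens)
    lost = lost_distinctions(tokens, mapping)
    types = {word for word, _ in tokens}
    ambiguous_tokens = sum(1 for word, _ in tokens if word in lost)
    return len(lost) / len(types), ambiguous_tokens / len(tokens)


# Toy data: the word/tag pairings and the mapping are made up for illustration.
mapping = {"Vitm": "V", "Vits": "V", "Vtm": "V"}
corpus = [("đứng", "Vitm"), ("đứng", "Vits"), ("nằm", "Vits"), ("tôi", "Pp")]
lost = lost_distinctions(corpus, mapping)
type_rate, token_rate = ambiguity_rates(corpus, mapping)
```

Grouping the lost distinctions by coarse tag makes it easy to report which merged groups (for example Vitm/Vits) account for most of the ambiguity.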


Result of tagset convertibility

The input to this method is two tagsets, VnQtag and VietTreeBank. We used the probabilistic Qtag tagger and Vn Tagger, respectively, to tag a folder containing 7 documents (Hoàng tử bé, Chuyện tình, Lược sử thời gian, Những bài học nông thôn, Chiến tranh cục bộ, Muối của rừng, An Dương Vương) with the two tagsets.

We then compared the outputs to draw our conclusions.

Table 6. Some properties in tagset convertibility method in Hoangtube

Note that a word in the table above is a word type, not a token; that is, each word is counted only once. Moreover, the experiment was performed on a single document (hoangtube), so the percentage of ambiguity is quite small. The number of ambiguous words is sometimes large, so the table lists only some of them.

Table 7. Statistics of ambiguous word types in the VnQtag corpus

POS   Ambiguity   Total of word types   Percentage

V1 tag: Vitb, Vits, Vitc, Vitm, Vitim

V2 tag: Vta, Vtb, Vtc, Vtv, Vtim, Vto, Vts, Vtm, Vtv

Table 8. Statistics of ambiguous tokens in the VnQtag corpus

POS   Ambiguity   Total of tokens   Percentage

Table 9. Detailed statistics of ambiguous word types in the VnQtag corpus

Number of word types

ban, cuộc đời, công nguyên, khoảnh khắc, một khi, phép,

tương lai, tết

phiên bài học, chư hầu, chứng cứ, công việc, cảnh vật, của, cử chỉ, dân chúng, gia cảnh, gia đình, giải phóng, huyện đội, hình

mầu sắc, ngày tháng, nhà nước, quân đội, sinh vật, thiên nhiên, thân hình, triều đại, vật chất, xã hội, đám ma, đường nét, đất nước, đồ

bước, bộ, chừng, câu, cõi, cú, cấp, dòng, gợn, hàng, hình,

nụ, phần, thằng, tiếng, tiền, trận, tên, tụi, vì, vòng, vòng tròn,

vị, vở, vụ, điều, điệu, đàn, đám, đơn vị, đạo

trước đây, đông, đầu, đầu tiên

bề mặt, chân trời, căn cứ, góc, không gian, lòng, mặt, nguồn

tầm, ven, vùng, vương quốc, đất


Ja/Jd 1 không


nằm, quỳ, rời, sập, thấm, vượt, úp, đổ, đứng

It is worth noting that the percentage of ambiguity in table 8 is quite high (mostly above 15%). At this step, we can therefore conclude that it is hard to convert the original tagset into another tagset along V, V1, V2 and V3. A deeper analysis, however, reveals some important nuances. Within one POS, ambiguous words usually concentrate in certain groups: in the V category, for example, the (Vitm/Vtm) group includes 16 words and the (Vitm/Vits) group 18 words, whereas the (Vits/Vtc) and (Vitm/Vtc) groups have only 1 word each. In other words, the percentage we calculated varies across groups. Table 9 shows which groups have more ambiguous words than others. For example, among nouns, the (Na/Ng/Nu) group has 8 ambiguous words, compared with 4 in the (Nn/Nu/Nt) group.
