VIETNAM NATIONAL UNIVERSITY HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
THI-THANH-TAM DO
TAGSET EVALUATION AND AUTOMATICAL ERROR VERIFICATION
IN POS TAGGED CORPUS
MASTER THESIS
(Natural language processing)
Ha Noi - 2012
VIETNAM NATIONAL UNIVERSITY HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
THI-THANH-TAM DO
TAGSET EVALUATION AND AUTOMATICAL ERROR VERIFICATION
IN POS TAGGED CORPUS
Branch of knowledge: Information technology
Major: Computer science
Code: 60 48 01
MASTER THESIS
Supervisor: Dr Nguyen Phuong Thai
Ha Noi - 2012
to completely focus on this thesis.
While I was working on the thesis, my husband, parents and friends were always by my side. My husband's comments made my program code clearer and shorter. My parents were a great source of encouragement, and my friends shared useful information.
Besides, it would be a mistake not to mention project QGTĐ.12.21, which partly supported this thesis.
Finally, I would like to thank Dr Huyen Nguyen-Thi for the data that she sent me. It was a big help.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
ABSTRACT
CHAPTER 1: INTRODUCTION AND MOTIVATION
1.1 Characteristics of Vietnamese language
1.2 Vietnamese part of speech
1.2.1 Criteria to classify
1.2.2 The ways to build up tagset
1.3 Corpora
1.3.1 VietTreeBank
1.3.2 VnQtag
1.4 Motivation
1.5 Organization of the thesis
CHAPTER 2: EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE
2.1 Tagset evaluation
2.1.1 Introduction
2.1.2 Tagset
2.1.3 A method for evaluating distributional properties of tagsets
2.1.3.1 Internal criterion
2.1.3.2 External benchmark
2.1.3.3 Algorithm
2.1.4 Result of tagset evaluation
2.2 Possibility of tagset convertibility
Result of tagset convertibility
CHAPTER 3: AUTOMATIC ERROR VERIFICATION OF POS-TAGGED CORPUS
3.1 Concept related to variation n-gram method
3.2 Types of Vietnamese tagging error
3.3 An algorithm for detecting errors
3.4 Classifying variations
3.5 Result of detecting errors in POS tagging
3.6 Word segmentation
3.6.1 Word in Vietnamese
3.6.2 N-gram in word segmentation
3.6.3 Result of detecting errors in word segmentation
CHAPTER 4: CONCLUSION AND SUMMARY
BIBLIOGRAPHY
APPENDIX
A.1 The Vietnamese treebank tagset
A.2 Vietnamese Tagset (VietTreeBank)
A.3 Tagset 3 (25 tags)
A.4 Tagset 4 (40 tags)
A.5 Syntax function tags in VTB
A.6 Adverbial classification tag of verb in VTB
A.7 Phrase tagset in VTB
A.8 Clause tagset in VTB
LIST OF FIGURES
Figure 1. The features of Vietnamese type
Figure 2. Purity as external evaluation criterion for cluster quality
Figure 3. N-gram and variation nuclei in VTB corpus with n up to 29
LIST OF TABLES
Table 1. The expression of grammatical meaning in Vietnamese
Table 2. Corpus with VnQtag tagset annotation
Table 3. Principal differences between Vietnamese and English
Table 4. Some frames found in the corpus
Table 5. Result of the tagset evaluation method
Table 6. Some properties in tagset convertibility method in Hoangtube
Table 7. Statistics of ambiguous word types in the VnQtag corpus
Table 8. Statistics of ambiguous tokens in the VnQtag corpus
Table 9. Detailed statistics of ambiguous word types in the VnQtag corpus
Table 10. Statistics of errors in the corpus
Table 11. Details of n-grams in the tagged corpus
Table 12. Statistics of errors and ambiguities in the word segmentation algorithm
Table 13. Details of contexts and variations in the VTB corpus
NOTATIONS/ABBREVIATIONS
ALLiS Architecture for Learning Linguistic Structures
CRFs Conditional Random Fields
HMM Hidden Markov Model
ME Maximum Entropy
NLP Natural language processing
POS Part of Speech
WFST Weighted Finite State Transducer
WSJ Wall Street Journal
Signed
Thanh-Tam Do Thi
We, the supervisor and the reviewer, confirm that the thesis has been revised carefully. It takes into account the supervisor's and the reviewers' comments and is ready as the final version of the Master of Computer Science thesis at the University of Engineering and Technology, Vietnam National University, Hanoi.
After that, we continue to improve part-of-speech tagged corpora by detecting errors. We use the n-gram notion, in which an n-gram identifies a local context of n contiguous words. In the variation n-gram method, the longer the shared context, the more likely a variation nucleus is an error. The idea is simple but very useful (Dickinson 2005). Unlike in English, the space is not a word delimiter in Vietnamese. Word segmentation is therefore a difficult problem, and NLP researchers have applied different methods to improve its precision. In this thesis, we devote a part to detecting inconsistent word segmentation errors in a segmented corpus, based on the variation n-gram notion.
In the end, we achieved the following results. First, VietTreeBank and the basic tagset (but not tagset 4) have less ambiguity and a high purity value, so they are rated higher than the other tagsets. Second, we counted the ambiguous items produced when converting between tagsets from several angles: by token and by word type, over whole major categories or over subcategories defined by semantic and syntactic criteria. Third, we found 0.01% word segmentation errors and 0.034% ambiguous cases in the segmented VTB corpus. Lastly, we detected 0.107% POS tagging errors in the corpus annotation.
CHAPTER 1 INTRODUCTION AND MOTIVATION
1.1 Characteristics of Vietnamese language
Every language has its own features, and Vietnamese is no exception. To understand Vietnamese better, we list some of its prominent features and compare it with other languages such as Chinese and English.
According to M. Ferlus and other Vietnamese and international researchers, Vietnamese is an indigenous language that belongs to the South Asian (Austroasiatic) languages, Mon-Khmer family, and is closely related to the Muong language. Vietnamese is also an isolating language with three prominent features. First, the syllable is the basic unit for forming words and sentences. A syllable may be a single word or a component of a complex word, a compound word or a reduplicative word. Second, Vietnamese words are not inflected. In particular, there is no difference between singular and plural nouns; for example, “hai cuốn sách” (two books) and “một cuốn sách” (one book). Third, grammatical meaning is expressed mainly through word order and expletives (function words). Given expletives such as “sẽ”, “đã”, “không” and the sentence “Tôi ra ngoài”, we can form three sentences with different meanings: “Tôi sẽ ra ngoài”, “Tôi đã ra ngoài” and “Tôi không ra ngoài”.
Figure 1. The features of Vietnamese type (characteristics of Vietnamese: grammatical meaning is expressed mainly through word order and expletives)
Some other languages, such as Chinese and Thai, are also isolating languages, whereas English, French and Russian are inflectional languages. The two types therefore differ in some features, as a comparison of Vietnamese, Chinese and English sentences shows.
Table 1. The expression of grammatical meaning in Vietnamese

Method     | Vietnamese           | Chinese     | English
Word order | Tôi yêu anh ấy       | Wo ai ta    | I love him
           | Anh ấy yêu tôi       | Ta ai wo    | He loves me
Expletive  | Tôi không yêu anh ấy | Wo bu ai ta | I do not love him
Unlike in Vietnamese and Chinese, when the word order changes in the English sentences above, the object pronoun changes into the subject pronoun (him → he).
1.2 Vietnamese part of speech
1.2.1 Criteria to classify
a. General meaning: “The meaning of a POS is the general meaning of a group of words, which is based on lexical generalization to form a generalized common grammatical category (lexical-grammatical category)”. POSs are suited to being defined as classification categories. They are groups containing a very large number of words, and each group has a classifying feature: object, quality, action or state, and so on. For example, nhà, bàn, chim, học sinh, con, quyển, sự, and so on are classified as nouns because their lexical meaning is generalized and abstracted as objects; their grammatical category is the noun.
b. Combination ability: Given their general meaning, words can take part in a meaningful combination: some words can replace each other in a certain position of the combination, while the rest of the combination provides the context in which the replacement is possible. For example, nhà, bàn, chim, cát, and so on can appear in and replace each other in combinations such as nhà này, chim này, cát này, and are therefore classified as nouns.
c. Syntactic function: When words take part in forming a sentence, those that can stand in one or several particular positions, can replace each other in those positions, and express the same syntactic relation with the other parts of the sentence can be classified into one POS. For instance, words such as nhà, bàn, chim, cát are nouns. They can be the subjects of sentences, and the subject function is a syntactic function used to classify them as nouns.
1.2.2 The ways to build up tagset
At present, two kinds of POS tagsets have been developed, of which the first has received much more attention from linguistic researchers.
The first kind is based on 8 basic POS tags that are widely used in dictionaries and linguistic materials: noun, verb, adjective, pronoun, adverb, conjunction, interjection and emotive word. From these 8 basic tags, finer sets of POS tags are built. Each researcher relies on certain criteria to refine the tagset (the criteria are discussed in section 1.2.1). Notably, the tagset of Tran Thi Oanh contains 14 tags, VietTreeBank consists of 17 tags, and VnQtag has 59 tags (see the appendix).
The second kind is built by mapping a tagset from another language to Vietnamese, based on the association between words of the two languages (Dinh Dien and Hoang Kiem 2003).
1.3 Corpora
Annotated corpora are large bodies of text with linguistically informative mark-up. They play an important role in current work in computational linguistics, so great attention has gone into developing them. Many countries have their own corpora. Well-known corpora include the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997) and the Lancaster Corpus of Mandarin Chinese (Tony McEnery and Richard Xiao, 2005). In Vietnam, the notable corpora are VnQtag, VnPos and VTB.
To build a corpus, some obligatory criteria must be met (McEnery and Wilson, 2001, p.29):
• Sampling and representativeness: the elements of a corpus must be general, diversified and plentiful. A sample is representative if what we find for the sample also holds for the general population.
• Finite size: the bigger a corpus is, the more highly it is valued, but its size is still finite.
• Machine-readable form
• Standard reference
We must admit that building a large corpus manually takes much time and requires extensive linguistic knowledge. Even so, a manually built large corpus is not guaranteed to be of good quality. This thesis therefore looks for errors in such corpora in order to improve them.
The two corpora used in our experiments are VietTreeBank and VnQtag. Below we discuss how each of them was built.
1.3.1 VietTreeBank
VietTreeBank is the result of the national project VLSP and was developed by the VTB group (Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen and annotators). The corpus includes 142 documents on political and social topics from the Youth newspaper, corresponding to 10,000 Vietnamese sentences with syntactic annotation (word segmentation, POS tagging, syntactic structure). The group used MEM and CRF machine learning models to assign POS tags; the precision of the models is over 93%. VTB was developed to support building tools for word segmentation, POS tagging, syntactic parsing, and so on. The VTB group chose two criteria to classify POS: combination ability and syntactic function. For instance, a noun acts as the subject or object of a sentence. A noun can also combine with a numeral (three, four) and an attribute (each, every).
A POS tag can contain information about the basic class of a word (noun, verb, adjective, and so on), morphological information (countable or uncountable), subcategory (verb that takes a noun, verb that takes a clause, etc.), semantic information or other syntactic information. The VTB group built its tagset from the basic word classes only, without other information such as morphological information or subcategory (see the tagset in the appendix).
In addition to POS information, the group describes basic syntactic elements such as phrases and clauses. Syntactic tags are the most fundamental information in a syntax tree; they form the spine of the tree. A7 and A8 in the appendix list the phrase and clause tagsets, respectively.
The function tag of a syntactic element expresses its role in the higher-level syntactic element. Function tags are assigned to the main elements of the sentence, such as subject, predicate and object. They provide information that helps us identify the basic grammatical relations.
1.3.2 VnQtag
The VnQtag tagset was built within the KC01 national project by a development group including Nguyen Thi Minh Huyen, Vu Xuan Luong and Le Hong Phuong. The group based its work on a printed dictionary (the Vietnamese dictionary of the Institute of Linguistics, 2000). First, they segmented sentences into words with a syllable automaton and a lexical automaton. Then they used the Qtag tagger to assign POS labels to Vietnamese words. There are 59 POS labels (see the appendix). In addition to grammatical information, the group added semantic information (the general meaning of a word) to arrive at the 59 word class labels. For example, words are considered verbs if they express the general meaning of a process. Process meaning is expressed directly in the action of an object; this is action meaning. State meaning is generalized from the relationship with the action of an object in time and space (Vietnamese grammar, Diep Quang Ban and Hoang Van Thung). The automatic tagging experiment was carried out on the 7 documents listed in Table 2. The annotated corpus plays an important role in NLP: it is a database of high-quality linguistic resources, and it follows international standards for data representation.
The resulting corpus has the following format: each lexical unit and its POS stand on one line, with a space between the syllables of a word and a tab between the word and its POS. Punctuation marks and other symbols in the text are treated as lexical units and labelled with the corresponding punctuation tag. The corpus includes 7 documents of different types, such as stories, novels, science and press texts. It gathers common words used in daily life and in the press, as well as words usually seen in literary works and scientific and technical terms.
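The line format described above (syllables separated by spaces, word and POS separated by a tab) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_tagged_lines` and the sample tags are hypothetical and are not taken from the corpus or from the VnQtag tools.

```python
from typing import List, Tuple

def parse_tagged_lines(text: str) -> List[Tuple[str, str]]:
    """Parse lines of the form 'lexical unit<TAB>POS' into (word, tag) pairs."""
    pairs = []
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        # Split on the last tab: syllables inside a word are separated by spaces.
        word, tag = line.rsplit("\t", 1)
        pairs.append((word.strip(), tag.strip()))
    return pairs

# Hypothetical sample; the tags are placeholders chosen for illustration.
sample = "học sinh\tNn\nđi\tVt\n.\tPUNC\n"
pairs = parse_tagged_lines(sample)
```

Splitting on the last tab keeps multi-syllable words such as "học sinh" intact as one lexical unit.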
Table 2. Corpus with VnQtag tagset annotation

No. | Document                 | Type  | Number of lexical units | Number of processing units (including punctuation)
2   | Chuyen tinh ke truoc luc |       |                         |
3   | Chuyen tinh ke truoc luc |       |                         |
6   | Nhung bai hoc nong thon  | Story | 6682                    | 8244
7   | Cong nghe va he thong    |       |                         |
1.4 Motivation
Natural language processing is done in five stages:
• Morphological and lexical analysis: The lexicon of a language is its vocabulary, including its words and expressions. Morphology is the identification, analysis and description of the structure of words. Words are generally accepted as the smallest units of syntax, and syntax refers to the rules and principles that govern the sentence structure of a language. The aim of lexical analysis is to divide the text into paragraphs, sentences and words. Lexical analysis cannot be performed in isolation from morphological and syntactic analysis.
• Syntactic analysis: The words in a sentence are analyzed to find the grammatical structure of the sentence. The words are transformed into structures that show how they relate to each other. Some word sequences may be rejected if they violate the rules of the language for how words may be combined.
• Semantic analysis: It derives the meaning of a sentence from its context, that is, it determines the possible meanings of a sentence in a context.
• Discourse integration: The meaning of an individual sentence may depend on the sentences that precede it and may influence the meaning of the sentences that follow it.
• Pragmatic analysis: It derives knowledge from external commonsense information. It means understanding the purposeful use of language in situations, particularly those aspects of language that require world knowledge. For example, the sentence "Do you know what time it is?" should be interpreted as a request.
Our thesis concentrates on the first stage of natural language processing (i.e. morphological analysis). It is a very important preprocessing step for later stages such as syntactic analysis and semantic analysis.
The thesis addresses two major problems and two minor ones. The major problems are evaluating tagsets and detecting tagging errors automatically; the minor ones are checking the convertibility of tagsets and detecting segmentation errors automatically.
a. Evaluating tagsets and their convertibility
In the previous section, we mentioned tagsets such as VietTreeBank (17 tags), VNPOS (15 tags) and VNQTag (59 tags). Such inconsistent tagsets raise several questions: which tagset is better? Which methods can evaluate these tagsets? How can we choose the right set of POS tags for a given application? The first part of this thesis focuses on answering these questions.
Another aspect we discuss is tagset conversion. The choice of tagset strongly affects the difficulty of POS tagging. A big tagset increases the difficulty, while a small one may not be sufficient for a given purpose. It is therefore necessary to balance quality and quantity in a tagset, that is:
• Information quality: the information should be as clear as possible (i.e. words are classified into more parts of speech based on concrete meaning).
• Possibility of tagging: the number of POS tags should be as small as possible.
To balance the two, we carry out experiments on a source tagset (ST) and a target tagset (TT), count the number of ambiguous words produced by the conversion, and draw conclusions from these counts.
b. Detecting POS tagging and word segmentation errors
If each word belonged to only one label, a finite dictionary of words and their labels would solve the POS tagging problem completely. In fact, however, one word can belong to more than one label, which leads to ambiguity and errors in POS tagging. Fixing these errors manually costs much time and money, so we want a method that detects errors automatically to reduce this cost.
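To illustrate why a plain dictionary lookup is not enough, the following minimal sketch collects the set of tags seen for each word in a tagged corpus and lists the words that carry more than one tag. The function names and the toy data are assumptions made for illustration; they are not part of the thesis software.

```python
from collections import defaultdict

def tags_per_word(tagged):
    """Build a lexicon mapping each word to the set of tags it occurs with."""
    lexicon = defaultdict(set)
    for word, tag in tagged:
        lexicon[word].add(tag)
    return dict(lexicon)

def ambiguous_words(lexicon):
    """Words with more than one tag cannot be resolved by dictionary lookup alone."""
    return {word: tags for word, tags in lexicon.items() if len(tags) > 1}

# Hypothetical toy corpus; the tag assignments are illustrative only.
tagged = [("tôi", "Pp"), ("có", "Vtf"), ("người", "Nn"), ("có", "Cc"), ("tôi", "Pp")]
lexicon = tags_per_word(tagged)
amb = ambiguous_words(lexicon)
```

In this toy corpus only "có" is ambiguous, so a dictionary would tag "tôi" and "người" correctly but would need context to decide the tag of "có".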
Vietnamese word segmentation is also a thorny issue. One sentence may have several different segmentations. For example, the sentence chiếc xe đạp nặng quá can be segmented in two ways. Way 1: chiếc/ xe/ đạp/ nặng/ quá. Way 2: chiếc/ xe đạp/ nặng/ quá. Here “/” separates words. Both ways are acceptable because each segmentation yields a meaningful sentence.
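The ambiguity in this example can be reproduced by enumerating every way of grouping adjacent syllables into words that appear in a lexicon. The sketch below assumes a tiny hypothetical lexicon containing just the words of the example; it is an illustration of the problem, not the segmentation method used for the corpora.

```python
from typing import List, Set

def segmentations(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Return every segmentation of the syllable sequence into lexicon words."""
    if not syllables:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])  # candidate word made of the first n syllables
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results

# Hypothetical lexicon limited to the words of the example sentence.
lexicon = {"chiếc", "xe", "đạp", "xe đạp", "nặng", "quá"}
ways = segmentations("chiếc xe đạp nặng quá".split(), lexicon)
```

With this lexicon the function returns exactly the two segmentations listed above, which is why an inconsistency check on a segmented corpus has to look at context rather than at the lexicon alone.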
One of the reasons for this difference is listed in the following table. The last problem addressed in this thesis is therefore word segmentation.
Table 3. Principal differences between Vietnamese and English

Feature          | Vietnamese                                   | English
Boundary of word | Context: meaningful combination of syllables | Blank or delimiters

All of the above reasons motivated me to look for answers to these problems.
1.5 Organization of the thesis
The thesis is organized into four main chapters, with the following content:
Chapter 1: Introduction and motivation
Chapter 1 gives a general picture of Vietnamese, including its features and its parts of speech. It also discusses the reasons for choosing the topic of the thesis.
Chapter 2: Evaluating distributional properties and conversion possibility of tagsets in Vietnamese
Chapter 2 looks more closely at tagsets, for instance how a tagset is built and how labels are merged, and introduces the basic notions needed to evaluate the properties of tagsets.
Chapter 3: Automatic error verification of POS-tagged corpus
This chapter introduces the notions related to the error detection method, then presents the algorithm and discusses how variations are classified as errors or ambiguities.
Chapter 4: Summary and conclusion
This chapter discusses three issues: the contributions of the thesis in theory, its contributions in experiments, and further directions. It sums up the results we obtained and discusses the work that remains to be done in the future.
CHAPTER 2 EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE
2.1 Tagset evaluation
2.1.1 Introduction
Tagset evaluation has received much attention from NLP researchers for over 20 years. It allows us to test and assess the impact of tagset modifications on results, by using different versions of a given tagset on the same texts (Martin Volk and Gerold Schneider, 1998). In 2000, Saso Dzeroski, Tomaz Erjavec and Jakub Zavrel compared the accuracy of tagset designs obtained by decreasing the cardinality of the tagset: omitting certain attributes of the tagset, or omitting almost all attributes except certain ones. Accuracies were computed using a black-box combiner (Halteren, Dzeroski). In the same year, Hervé Déjean presented two kinds of tagset evaluation, a global one and a local one. The first consists of evaluating the initial grammar generated by ALLiS. The second uses the notion of reliability: the reliability of an element is the ratio of its frequency in the structure to its total frequency in the corpus. For Indian languages, Madhav Gopal, Diwakar Mishra and Devi Priyanka Singh (2010) discussed and evaluated several tagsets: the ILMT tagset, the JNU-Sanskrit tagset, the LDCIL tagset and the Sanskrit consortium tagset.
Vietnamese is an isolating language in which word order is an important source of syntactic information. To evaluate Vietnamese tagsets, this chapter introduces a simple method that uses an internal criterion and an external criterion. The internal criterion uses frequent frames and purity to check whether tags are assigned accurately. The external criterion reduces the cardinality of the tagset and checks whether the information quality is retained. Indeed, a number of evaluations have shown that many tagging errors are caused by overly fine distinctions within major categories (Eugenie Giesbrecht, 2008).
2.1.2 Tagset
A POS is a set of words with some grammatical characteristic(s) in common, and each POS differs in grammatical characteristics from every other POS. For example, nouns have different properties from verbs, which have different properties from adjectives, and so on.
A tagset is a set of POS tags built according to the criteria in section 1.2. Tagsets therefore usually vary in the number of tags and are used in various applications.
Properties of a tagset: a tagset needs to guarantee the following properties: retaining linguistic features, reflecting syntactic structure, allowing accurate tagging, and reducing the number of ambiguous words during tagging.
2.1.3 A method for evaluating distributional properties of tagsets
2.1.3.1 Internal criterion
Among the properties of tagsets, we value most highly the possibility that tags are assigned accurately in the corpus. This is the internal criterion. We examine it through the notion of a frame and a purity formula. The frame represents the local context under review and indicates which tags can appear in it. The purity formula then assesses how strongly the tags converge in that local context.
Discussion about purity
As mentioned previously, we use the purity formula, an external evaluation criterion for cluster quality, to evaluate tagsets (Stanford natural language processing, 357). Purity is a widely used measure of cluster quality. It is a simple and transparent evaluation measure. To compute purity, each cluster is assigned to the class that is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N.
Formally:
$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert \qquad (1)$$
Where:
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters
$C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes
We interpret $\omega_k$ as the set of documents in $\omega_k$ and $c_j$ as the set of documents in $c_j$ in equation (1).
High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each document gets its own cluster.
For example:
Figure 2. Purity as external evaluation criterion for cluster quality. The majority class and the number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (5 + 4 + 3)/N, where N is the total number of points in the figure.
Frame notion
The frame notion was introduced in 2006 by Mintz. In 2010, Dickinson and Jochim redefined it as follows: in a local context, a frame consists of three words, in which the two words surrounding a target word lead to the categorization of the target. We use frames to test the quality of distributional mappings. In English, for example, the frame “you_it” generally predicts a verbal category for the target (i.e. the target word may be hit, beat, eat or kiss). In Vietnamese, the frame “mẹ_là” predicts that the target word is a pronoun (Pp), e.g. “tôi, anh, chị, bác”. To obtain more exact results, we use frequency and the notion of a frequent frame. Frequent frames supply category information in child language corpora. Not all frames play the same role in a corpus: the more often a frame appears, the more linguistic information it carries. We set the frequency threshold at 0.03% of the total number of frames. For example, if a corpus has 10000 frames, the threshold is 3 (10000 * 0.03%), so a frame that appears more than 3 times is considered a frequent frame.
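The frame extraction and the frequency threshold described above can be sketched as follows. Several details are assumptions made for illustration: the corpus is treated as one flat list of (word, tag) pairs with no sentence boundaries, the "total number of frames" is taken to be the number of frame occurrences, and the function names and sample tags are hypothetical.

```python
from collections import Counter, defaultdict

def collect_frames(tagged):
    """Count, for each frame (left word, right word), the tags of the target word."""
    frames = defaultdict(Counter)
    for i in range(1, len(tagged) - 1):
        left_word = tagged[i - 1][0]
        right_word = tagged[i + 1][0]
        target_tag = tagged[i][1]
        frames[(left_word, right_word)][target_tag] += 1
    return frames

def frequent_frames(frames, ratio=0.0003):
    """Keep frames occurring more often than ratio * total frame occurrences."""
    total = sum(sum(tags.values()) for tags in frames.values())
    threshold = total * ratio  # 0.03% of the frame total by default
    kept = {f: tags for f, tags in frames.items() if sum(tags.values()) > threshold}
    return kept, threshold

# Toy corpus; a large ratio is used so the threshold is meaningful at this size.
tagged = [("mẹ", "Nn"), ("tôi", "Pp"), ("là", "Vc"),
          ("mẹ", "Nn"), ("anh", "Pp"), ("là", "Vc")]
frames = collect_frames(tagged)
kept, threshold = frequent_frames(frames, ratio=0.25)
```

In the toy corpus the frame ("mẹ", "là") occurs twice with a Pp target and is the only frame above the threshold; on a real corpus the default ratio of 0.0003 corresponds to the 0.03% rule in the text.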
Next, the purity formula is applied to measure how the tags are distributed within a frame, since each tag accounts for a different percentage of the occurrences of a frame. To calculate the purity value, we consider only the most frequent tag in each frame. We add up these highest frequencies and divide the sum by the total number of occurrences of all tags. The higher the purity value, the more likely words are to be tagged accurately. For instance, suppose we have two frames: Tôi_ở and mẹ_bảo. The first frame appears 4 times in a corpus, and its target word takes two tags, Vits and Vitn (1 time and 3 times, respectively). The second frame appears 8 times; the POS of its target word is Np 7 times and Pp once. The purity value is then
$$\mathrm{purity} = \frac{3 + 7}{4 + 8} = \frac{10}{12} \approx 0.83$$
2.1.3.2 External benchmark
Normally, to evaluate a tagset, linguists map it to a reduced tagset, because this helps check which linguistic features are retained. A reduced tagset is built by merging tags, but how should the tags be merged? This is a difficult question that we need to answer.
Hervé Déjean (2000) discussed the theoretically minimal tagset. He affirmed that the quality of a tagset does not depend on the number of tags, and built the minimum tagset necessary to parse a sentence whatever the domain is. Originally, he used a tagset with one tag per structure (NP-VP). He then estimated that a tagset of about 20 tags is enough to parse a sentence into PS and clause structures.
Indeed, there are many ways to merge labels, so tagsets with various numbers of tags still exist. English is an inflectional language, so it is rather easy to identify tags that can be merged, such as the base form verb (VB) and the present tense verb (non-third person singular, VBP), but Vietnamese is not. The tagsets used in our thesis are of two kinds:
First, we used tagsets built by earlier NLP researchers, for instance VnQtag and VietTreeBank.
Second, we merged some labels ourselves, based on the features of Vietnamese. VnQtag has the largest number of tags, so we use it as the source tagset from which the other tagsets are generated.
2.1.3.3 Algorithm
To make the above theory concrete, we introduce an algorithm of 5 steps on a tagged corpus:
1. Identify all the words and their POS tags in the corpus, and store them with their positions.
2. Count the frames in the corpus, then compute the frequency threshold from the total number of frames.
3. Find the frequent frames and compute the purity value.
4. Map the original tagset to the new reduced tagsets.
5. Compute the new purity value for each new tagset and count the lost ambiguous words.
Preparing data:
We carried out this method on a corpus annotated with the VnQtag tagset.
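Steps 4 and 5 can be sketched as follows, starting from per-frame tag counts such as those in the Tôi_ở and mẹ_bảo example. This is a hedged illustration, not the thesis implementation: the mapping from Vits and Vitn to a single tag "V" is hypothetical, and a "lost ambiguous word" is assumed here to mean a word that has several tags in the source tagset but only one after the mapping.

```python
from collections import Counter, defaultdict

def frame_purity(frames):
    """Sum of the highest tag count per frame divided by all tag occurrences."""
    majority = sum(max(tags.values()) for tags in frames.values())
    total = sum(sum(tags.values()) for tags in frames.values())
    return majority / total

def map_frames(frames, mapping):
    """Re-count target tags per frame after mapping source tags to reduced tags."""
    mapped = {}
    for frame, tags in frames.items():
        merged = Counter()
        for tag, count in tags.items():
            merged[mapping.get(tag, tag)] += count  # unmapped tags stay unchanged
        mapped[frame] = merged
    return mapped

def lost_ambiguity(tagged, mapping):
    """Words that are ambiguous in the source tagset but not in the reduced one."""
    before, after = defaultdict(set), defaultdict(set)
    for word, tag in tagged:
        before[word].add(tag)
        after[word].add(mapping.get(tag, tag))
    return sorted(w for w in before if len(before[w]) > 1 and len(after[w]) == 1)

mapping = {"Vits": "V", "Vitn": "V"}  # hypothetical merge of two verb tags
frames = {("Tôi", "ở"): Counter({"Vits": 1, "Vitn": 3}),
          ("mẹ", "bảo"): Counter({"Np": 7, "Pp": 1})}
purity_before = frame_purity(frames)                      # (3 + 7) / 12
purity_after = frame_purity(map_frames(frames, mapping))  # (4 + 7) / 12
```

Merging tags can only keep or raise the purity of a frame, so the count of lost ambiguous words is the complementary signal that shows how much distinction the reduced tagset gives up.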
2.1.4 Result of tagset evaluation
The experiments were performed on the VnQtag corpus, using four VnQtag-annotated documents. We then merged some tags of VnQtag to form new tagsets: tagset 3 and tagset 4. We thus have: VietTreeBank (18 tags), basic tagset 2 (8 tags), tagset 3 (25 tags) and tagset 4 (40 tags) (see the appendix). To merge tags we relied on the book Ngữ pháp tiếng Việt (Diệp Quang Ban), which organizes the Vietnamese POS system into two groups:
Group 1: Noun, Verb, Adjective, and so on
To obtain the 25-POS and 30-POS tagsets, we merged some noun and verb tags. Nouns and verbs are basic categories and have the largest numbers of words in Vietnamese. In the VnQtag tagset, the noun is finely classified into 8 tags and the verb into 10 tags. We used four annotated VnQtag documents and four tagsets to obtain the results in Table 4 and Table 5.
Table 4. Some frames found in the corpus

Frame (Frequency)     | POS (Frequency)
Còn_sinh (3)          | Pp (3)
ba_Phúc (2)           | Nh (2)
với_đứa (2)           | Nn (2)
sinh_nông thôn (3)    | Cm (3)
Con_nhỏ (2)           | No (2)
dăm_trẻ (2)           | Nu (2)
đứa_dâng (2)          | Nh (2)
trẻ_đào (2)           | Vta (2)
có_người (3)          | Aa (2), Vtf (1)
bố_Lâm (3)            | Nh (3)
tôi_có (3)            | An (1), Vtf (1), Cc (2)
bố_bảo (9)            |

The frame “mẹ_là” occurs 4 times in the corpus, and in all of them the target word is tagged Pp (pronoun).
Table 5. Result of the tagset evaluation method (column header: frequent frames, total number of frames, threshold)
Table 5 shows that the first document has 31520 words and a total of exactly 15706 frames. The threshold is calculated as 0.03% of the total number of frames, so here it is approximately 5. Running the experiment on the first document with the different tagsets, we obtained purity values of 60.69%, 82.86%, 82.06%, 69.71% and 79.09%, and 0, 331, 397, 55 and 137 ambiguous words, respectively. The percentage of ambiguous words relative to the total number of words in the document is small: 0%, 1.05%, 1.26%, 0.17% and 0.43%. The results for the other documents can be read in the same way.
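The threshold and the percentages quoted for the first document follow directly from the reported counts. The short check below only reproduces that arithmetic; the variable names are chosen for illustration.

```python
words = 31520          # words in the first document
total_frames = 15706   # frames found in the first document

# Threshold: 0.03% of the total number of frames, about 4.71, reported as roughly 5.
threshold = total_frames * 0.0003

# Ambiguous words per tagset, in the order given in the text.
ambiguous = [0, 331, 397, 55, 137]
percentages = [round(100 * count / words, 2) for count in ambiguous]
```

Rounding each ratio to two decimals gives the percentages stated in the paragraph above.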
In conclusion, three tagsets, namely VietTreeBank, the basic tagset and tagset 4, are rated higher than the other tagsets, because their purity values are high and the number of lost ambiguous words is quite small.