DSpace at VNU: Vietnamese treebank construction and entropy-based error detection


ORIGINAL PAPER

Vietnamese treebank construction and entropy-based error detection

Phuong-Thai Nguyen¹ · Anh-Cuong Le¹ · Tu-Bao Ho² · Van-Hiep Nguyen³

Abstract Treebanks, especially the Penn treebank for natural language processing (NLP) in English, play an essential role in both research into and the application of NLP. However, many languages still lack treebanks, and building a treebank can be very complicated and difficult. This work has a twofold objective. Our first objective is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis. Major steps in the treebank construction process are described with particular regard to specific Vietnamese properties such as the lack of a word delimiter and isolation. Those properties make sentences highly syntactically ambiguous, and therefore it is difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were employed not only to define annotations but also to deal with ambiguities systematically. Annotators were supported by automatic labelling tools, which are based on statistical machine learning methods, for sentence pre-processing, and by a tree editor for manual annotation. As a result, an annotation agreement of around 90 % was achieved. Our second objective is to present our method for automatically finding errors and inconsistencies in



treebank corpora and its application to the construction of the VTB. This method employs the Shannon entropy measure in such a way that the more the entropy is reduced, the more errors in a treebank have been corrected. The method ranks error candidates by using a scoring function based on conditional entropy. Our experiments showed that this method detected high-error-density subsets of the original error candidate sets, and that the corpus entropy was significantly reduced after error correction. The size of these subsets was only about one third of the whole set, while these subsets contained 80–90 % of the total errors. This method can also be applied to languages similar to Vietnamese.

Keywords Treebank · Error detection · Entropy

1 Introduction

Thanks to the development of powerful machine learning methods, natural language processing (NLP) research is currently dominated by corpus-based approaches. Treebanks are used for training word segmenters, part-of-speech taggers, and syntactic parsers, among others. These systems can then be used for applications such as information extraction, machine translation, question answering, and text summarization. Treebanks are also useful for linguistic studies, such as the extraction of lexical-syntactic patterns or the investigation of linguistic phenomena. Treebank construction is a complicated task; moreover, developing a treebank for a language that has not been the subject of extensive NLP research, such as Vietnamese, raises a number of questions concerning the nature of the approach, linguistic issues, and consistency.

Why is linguistic annotation difficult? Linguistic annotation of human languages is difficult because of grammatical complexity and frequently encountered ambiguities. Table 1 shows two examples, one in English part-of-speech tagging (sentences 1–2) and the other in Vietnamese word segmentation (sentences 3–4). In the first example, the word ‘can’ is an auxiliary in sentence 1 but a noun in sentence 2, and thus there are variations in the way ‘can’ is tagged. In the second example, the syllable sequence ‘sắc đẹp’ is a word in sentence 3 but not a word in sentence 4, and thus there are also variations in the way ‘sắc đẹp’ is segmented. Therefore, building annotated corpora is a costly and labour-intensive task that depends on different levels of annotation such as word segmentation, part-of-speech tagging and syntactic analysis. There are errors even in released data, as shown by the fact that complex data such as treebanks are often released in several versions.¹ In order to speed up annotation and increase the reliability of labelled corpora, various kinds of software tools have been built for format conversion, automatic annotation, and tree

1 Multi-version treebank publishing has several purposes: error correction, annotation scheme modification, and data addition. For example, major changes in the Penn English Treebank (PTB) (Marcus and Marcinkiewicz 1993) upgrade from version I to version II include POS tagging error correction and predicate-argument structure labelling. In the PTB upgrade from version II to version III, more data is appended.


editing (Pajas and Stepanek 2008). In this paper, we focus on methods for checking errors and inconsistencies in annotated treebanks.

1.1 Previous studies

1.1.1 Treebank construction

The Penn treebank (PTB) for English (Marcus and Marcinkiewicz 1993) is the first large syntactically annotated corpus constructed with a good methodology, process, and evaluation, which results in reliable data. Such treebanks provide rich syntactic information about part of speech, phrase structure, and functional and discontinuous constituency (deep structure). Though the PTB part-of-speech tag set is less detailed than the tag sets of previous POS-tagged corpora such as the Brown Corpus and the LOB Corpus, end users can convert the PTB tag set into a much richer tag set owing to the recoverability property (Marcus and Marcinkiewicz 1993). Many syntactic parsing studies using various formalisms such as phrase structure grammars (Collins 1999), dependency grammars (Yamada and Matsumoto 2003), and head-driven phrase structure grammars (Miyao and Tsujii 2008) have been carried out successfully using the PTB. The PTB phrase structure annotation scheme has been applied to languages such as Korean and Chinese. Treebank development for those languages has contributed to establishing the methodology of the PTB.

The Korean treebank (KTB) was developed and evaluated in Han et al. (2002). Korean is an agglutinative language with a very productive inflectional system. POS tags are a combination of a content tag and functional tags. Note that in the PTB, only phrasal tags follow this method. Complements and adjuncts are structurally distinguished: if YP is an argument of X, then YP is a sister of X (part (a) in Fig. 1), and if YP is an adjunct of X, then YP is represented as in part (b) in Fig. 1. The KTB also uses a number of simple methods to correct POS and constituency errors based on dictionary words and regular expressions.

The Chinese treebank (CTB) (Xue et al. 2005) contributes word segmentation annotation and consistency assurance techniques to the construction of treebanks for an isolating language. For word segmentation, the authors conducted an experiment in manual word segmentation which showed that inter-annotator agreement was not high. However, according to their analyses, much of the disagreement was caused by human error and was not critical. In response, they designed word-hood tests for the word segmentation task. These tests were based on frequency, combination

Table 1 Examples of annotation ambiguities


ability, a number of transformations, and the number of syllables. The fact that Chinese words are not marked with tense, case, or gender meant that there were often two choices of POS criteria: meaning based and distribution based. The authors chose distribution criteria, since they comply with principles in contemporary linguistic theories such as X-bar theory and GB theory. They took pragmatic approaches to quality control and to important development phases such as guideline preparation and annotation. For example, in guideline preparation for syntactic bracketing, they tackled ba-construction and bei-construction issues by: (1) studying the linguistic literature, (2) attending Chinese linguistics conferences, (3) conducting discussions with linguistic colleagues, (4) studying and testing their analyses of relevant sentences contained in their corpus, and (5) using special tags to mark crucial elements in these constructions. The CTB makes a clearer distinction between constituency and functional tags. Some tags in the PTB, such as WHNP and WHPP, are split in the CTB.

There have been a number of published works on Vietnamese word segmentation and POS tagging. These works have often used small, private “home made” corpora. vnQTAG (Nguyen et al. 2003), a shared corpus, is one example. This data set, containing 74,756 words, was annotated with word boundaries and POS tags. As with other Vietnamese corpora, there was little description of this corpus. Also, vnQTAG’s POS tag set was taken from a book on Vietnamese syntax. The design of this tag set was based on both meaning and distribution criteria.

Most treebank annotation schemas try not to be tied to specific linguistic theories. However, two main groups of annotation schemas can be recognized: schemas that annotate the phrase structure, as presented above, and schemas that annotate the dependency structure. The latter focus on dependency relations between words. Dependency schemes are more suitable for languages with relatively free word order like Czech and Japanese, since grammatical functions can be indicated without many indications of movement. Recently, Rambow (2010) provided an excellent discussion of dependency representations and phrase structure representations for syntax.


1.1.2 Treebank error detection

Dickinson and Meurers (2003) proposed three techniques to detect part-of-speech tagging errors. The main idea of their first technique was to consider variation n-grams, which occur more than once in the corpus and include at least one difference in their annotation. For example, “centennial year” is a variation bi-gram which occurs in the Wall Street Journal (WSJ) part of the Penn treebank corpus (Marcus and Marcinkiewicz 1993) with two possible taggings: “centennial/JJ year/NN” and “centennial/NN year/NN”. Of these, the second tagging is correct. Dickinson found that a large percentage of variation n-grams in the WSJ have at least one instance (occurrence) of an incorrect label. However, using this variation n-gram method, linguists have to check all instances of variation n-grams to find errors. The other two techniques take into account more linguistic information, including tagging-guide patterns and functional words.
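The variation n-gram idea can be illustrated with a short sketch. The code below is a minimal illustration and not Dickinson and Meurers' implementation: it assumes a corpus represented as lists of (word, tag) pairs, and the function name `variation_ngrams` and the toy sentences are hypothetical.

```python
from collections import defaultdict

def variation_ngrams(tagged_sentences, n=2):
    """Return word n-grams that occur with more than one tag sequence.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    The result maps each varying n-gram to a dict {tag sequence: count}.
    """
    seen = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            words = tuple(w.lower() for w, _ in window)
            tags = tuple(t for _, t in window)
            seen[words][tags] += 1
    # Keep only n-grams whose annotation differs between occurrences.
    return {words: dict(taggings)
            for words, taggings in seen.items()
            if len(taggings) > 1}

# Toy corpus (hypothetical sentences built around the WSJ example above).
corpus = [
    [("the", "DT"), ("centennial", "JJ"), ("year", "NN")],
    [("a", "DT"), ("centennial", "NN"), ("year", "NN")],
    [("the", "DT"), ("new", "JJ"), ("year", "NN")],
]

for words, taggings in variation_ngrams(corpus).items():
    print(words, taggings)
```

In this toy corpus only the bigram "centennial year" is reported, together with both of its taggings. Deciding which occurrence is the erroneous one still requires inspecting every instance, which is the limitation noted above.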

Dickinson (2006) presented an error correction method employing off-the-shelf POS taggers. The method includes three steps: firstly, training the tagger on the entire corpus; secondly, running the trained tagger over the same corpus; thirdly, for the positions that the variation n-gram detection method (Dickinson and Meurers 2003) flags as potentially erroneous, choosing the label output by the tagger. Dickinson’s paper also presented a treebank transformation method to improve POS tagging accuracy, which resulted in improvements in error correction. The method converts original POS tags into ambiguity tags in order to reduce ambiguity in the original data. Treebank transformation techniques have been used for both POS tagging, as mentioned in Dickinson’s paper, and syntactic parsing (Johnson 1998; Klein and Manning 2003). Treebank transformation is often carried out as a preprocessing step for different tagging and parsing methods.

Dickinson (2008) reported a method to detect ad hoc treebank structures. He used a number of linguistically motivated heuristics to group context-free grammar (CFG) rules into equivalence classes by comparing the right-hand sides (RHS) of rules. For example, one heuristic suggests that CFG rules of the same category should have the same head tag and similar modifiers, but can differ in the number of modifiers they have. By applying these heuristics, the RHS sequences “ADVP RB ADVP” and “ADVP, RB ADVP” can be grouped into the same class. Classes with only one rule, or rules which do not belong to any class, are problematic. Dickinson evaluated the proposed method by analysing several types of errors in the Penn treebank (Marcus and Marcinkiewicz 1993). However, in a similar way to Dickinson and Meurers (2003), this study proposed a method to determine candidates of problematic patterns (ad hoc CFG rules instead of variation n-grams) but not problematic instances of those patterns.

Yates et al. (2006) produced a study on detecting parser errors using semantic filters. Firstly, the syntactic trees (the output of a parser) are converted into an


intermediate representation known as relational conjunction (RC). Then, using the Web as a corpus, RCs are checked using various techniques including point-wise mutual information, verb sampling tests, TextRunner filters, and question answering (QA) filters. For evaluation, error rate reductions of 20 and 67 % were reported when tested on the PTB and TREC, respectively. The interesting point of their paper was that information from the Web was utilized to check for errors.

Novak and Razimova (2009) used the association rule mining algorithm Apriori to find annotation rules, and then to search for violations of these rules in corpora. They found that violations are often annotation errors. They reported an evaluation of this technique performed on the Prague Dependency Treebank 2.0, presenting an error analysis which showed that in the first 100 detected nodes, 20 contained an annotation error. However, this was not an intensive evaluation.

1.2 A summary of our work

1.2.1 Vietnamese treebank construction

There are a number of important characteristics of the Vietnamese language that greatly affect treebank construction. First, the smallest unit in the formation of Vietnamese words is the syllable. Words can have just one syllable (for example […] Thai, there is no word delimiter in Vietnamese. The space is a syllable delimiter but not a word delimiter, so a Vietnamese sentence can often be segmented in many ways. Second, Vietnamese is an isolating language in which words do not change their forms according to their grammatical function in a sentence. Table 2 shows an example. The Vietnamese words ‘[…]’ and ‘ra_come’ function as the subject and the main verb respectively in sentence 1, while they function as the complements of ‘bảo_ask’ in sentence 2. However, in both sentences, these words do not change their forms, while the English translations (sentences 1e–2e) require different word forms (‘he’–‘him’ and ‘comes’–‘to come’). Third, Vietnamese syntax conforms to the subject-verb-object (SVO) word order, as illustrated in the examples considered so far (Tables 1, 2).
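To make the segmentation ambiguity concrete, the sketch below enumerates every way to split a syllable sequence into dictionary words. It is only an illustration: the function `segmentations`, the three-entry lexicon, and the example sentence are assumptions made here and are not taken from the VTB tools or from Table 1. The glosses assumed are ‘học’ (to study), ‘học sinh’ (pupil), and ‘sinh học’ (biology).

```python
def segmentations(syllables, lexicon, max_len=4):
    """Enumerate all ways to split a syllable list into lexicon words.

    syllables: list of space-delimited syllables.
    lexicon:   set of words, each written as syllables joined by spaces.
    max_len:   longest word length (in syllables) that is tried.
    """
    if not syllables:
        return [[]]
    results = []
    for k in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:k])
        if word in lexicon:
            # Keep this word and segment the remaining syllables recursively.
            for rest in segmentations(syllables[k:], lexicon, max_len):
                results.append([word] + rest)
    return results

# Hypothetical toy lexicon and sentence.
lexicon = {"học", "học sinh", "sinh học"}
sentence = "học sinh học sinh học".split()

for seg in segmentations(sentence, lexicon):
    print(" | ".join(seg))
```

Even with this tiny lexicon the five-syllable sequence has several candidate segmentations, and only context can select the intended one. The exhaustive recursion is chosen for clarity; it would be too slow for long sentences with a full dictionary.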

Since Vietnamese has a relatively restrictive word order and often relies on the order of constituents to convey important grammatical information, we chose to use a constituency representation of syntactic structures. For languages with a freer word order such as Japanese or Czech, a dependency representation is more suitable. We applied the annotation scheme proposed by Marcus and

Table 2 An example about isolating property of the Vietnamese language


Marcinkiewicz (1993). This approach has been successfully applied to a number of languages such as English, Chinese, and Arabic. For Vietnamese, there are three annotation levels: word segmentation, POS tagging, and syntactic labeling. Our main goal was to build a corpus of 70,000 word-segmented sentences, 20,000 POS-tagged sentences, and 10,000 syntactic trees.⁶ Treebank construction is a very complicated task in which the major phases include investigation, guideline preparation, tool building, raw text collection, and annotation. In practice, this is an iterative process involving three phases: annotation, guideline revision, and tool upgrade. We drew our raw texts from the news domain, using the Youth (Tuổi Trẻ), an online daily newspaper, as our source and focusing on social and political topics.

In order to deal with ambiguities occurring at various levels of annotation, we systematically applied linguistic analysis tests such as deletion, insertion, substitution, questioning, and transformation (Nguyen 2009). Notions for these techniques were described in the guideline documents with examples, arguments and alternatives. These techniques originated in the literature or were proposed by members of our group. For automatic labeling tools, we used advanced machine learning methods such as conditional random fields (CRFs) for POS tagging and lexicalized probabilistic context-free grammars (LPCFGs) for syntactic parsing. These tools helped us speed up the annotation process. We also used a tree editor to support manual annotation.

Our treebank project is a branch of a national project which aims to develop basic resources and tools for Vietnamese language and speech processing (VLSP). In addition to a treebank, the VLSP project also develops other text-processing resources and tools including a Vietnamese machine-readable dictionary, an English-Vietnamese parallel corpus, a word segmenter, a POS tagger, a chunker, and a syntactic parser. During the annotation process, tools are trained using treebank data, and then are used to support treebank construction as a preprocessing step.

By the end of the treebank project, we had achieved our goal in terms of corpus size, annotation agreement, and usability for text-processing tools. Since 2010, the Vietnamese treebank (VTB) and other resources and tools developed by the VLSP project have been shared on the VLSP web page.⁷ Sections 2.5 and 2.6 will give more analysis of the treebank status.

1.2.2 Treebank error detection

In this paper, we introduce a learning method based on conditional entropy for detecting errors in treebanks. Our method, using ranking, can detect erroneous instances of variation n-grams⁸ in treebank data (Fig. 2). This method is based on the entropy of labels, given their contexts. Our experiments showed that conditional

6 Steedman et al. (2003) showed that a training set size of around 10,000 syntactic trees was good for English parsing since, when using a larger training set, the improvement in parsing performance was small (as tested on Collins’ parser).

7 http://vlsp.vietlp.org:8080/demo/

8 This term has the same meaning as the term ‘variation nuclei’ in Dickinson and Meurers (2003). In our paper, a variation n-gram is an n-gram which varies in how it is labelled because of ambiguity or annotation error. Contextual information, such as surrounding words, is not included in an n-gram.


entropy was reduced after error correction, and that by using ranking, the number of checked instances could be reduced drastically. We used the Vietnamese treebank (Nguyen et al. 2009) for the experiments.
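The link between label consistency and conditional entropy can be sketched numerically. The snippet below is an illustration under simple assumptions and is not the scoring function used in this paper: it treats each annotated instance as a (context, label) pair, estimates H(label | context) from relative frequencies, and ranks instances by the surprisal of their label given the context, a plain stand-in heuristic. The names `conditional_entropy` and `rank_candidates` and the toy data are hypothetical.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(label | context) in bits, estimated from (context, label) pairs."""
    total = len(pairs)
    joint = Counter(pairs)
    context_counts = Counter(context for context, _ in pairs)
    entropy = 0.0
    for (context, _), count in joint.items():
        p_joint = count / total
        p_label_given_context = count / context_counts[context]
        entropy -= p_joint * math.log2(p_label_given_context)
    return entropy

def rank_candidates(pairs):
    """Sort instance indices by surprisal -log2 P(label | context), highest first."""
    joint = Counter(pairs)
    context_counts = Counter(context for context, _ in pairs)

    def surprisal(i):
        context, label = pairs[i]
        return -math.log2(joint[(context, label)] / context_counts[context])

    return sorted(range(len(pairs)), key=surprisal, reverse=True)

# Toy data: one n-gram labelled four times, once inconsistently.
data = [("centennial year", "NN NN")] * 3 + [("centennial year", "JJ NN")]
corrected = [("centennial year", "NN NN")] * 4

print(conditional_entropy(data))       # entropy before correction
print(rank_candidates(data)[0])        # index of the most suspicious instance
print(conditional_entropy(corrected))  # entropy after correction
```

In this toy case the inconsistent instance is ranked first, and relabelling it brings the conditional entropy down to zero, which mirrors the observation that entropy decreases as errors are corrected.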

Our method inherits the idea of variation n-grams/nuclei from the work of Dickinson and Meurers (2003), although it improves the capability of detecting erroneous instances. Our work differs from Dickinson (2006) in that we do not require an available POS tagger. Instead, we sort error candidates and employ entropy for error detection, and we experiment on not only POS-tagged data but also word-segmented data sets, which shows the effectiveness of the entropy-based method.

1.3 Organization of the paper

The rest of this paper is organized as follows. In Sect. 2, we present the main aspects of Vietnamese treebank construction, including annotation schemes, guideline preparation for three annotation levels, tools, the annotation process, and preliminary results on treebank and tool distribution. In Sect. 3, we present a mathematical relationship between entropy and annotation errors, an entropy-based error detection method, and experimental results for error detection with discussion. Finally, conclusions are drawn and future work is proposed in Sect. 4.

In this paper, Vietnamese examples are annotated with English words as subscripts, except for proper nouns and numbers. Since Vietnamese is an isolating language, English subscripts are often in base form. There are several special subscripts expressing grammatical information, including ‘past’, ‘continuous’, ‘future’, and ‘passive’. In the reference section, there are selected Vietnamese books and journal papers, of which only two are in English⁹ (Nguyen 2009; Thompson 1987); the others are in Vietnamese.

Fig. 2 Conceptual sets. S1: the whole treebank data; S2: data set of variation n-grams; S3: error set (supposed to be the region with highest entropy)

9 Online versions at: http://ir.library.osaka-u.ac.jp/metadb/up/LIBRIWLK01/riwl_001_019.pdf; http://www.sealang.net/archives/mks/THOMPSONLaurenceC.htm


2 Vietnamese treebank construction

2.1 Word segmentation

2.1.1 Word types

With regard to their structure, Vietnamese words can be divided into a number of types including single-syllable words, coordinated compound words, subordinated compound words, reduplicative words, and accidental compound words. As shown in Table 3, single-syllable words cover only a small proportion, while two-syllable words account for the largest proportion of the whole vocabulary. That vocabulary is formed from a set of 7729 syllables, which is higher than the number of single-syllable words. The syllables which are not single words are bound morphemes,¹⁰ which can only be used as part of a word but not as a word on their own. Coordinated compound words, specific to Vietnamese, are words whose parts (each of which can be a single or compound word) are parallel in the sense that their meanings are similar and their order can be reversed. The meaning of a coordinated compound is often more abstract than the meanings of its parts. This kind of word accounts for about 10 % of the compound words according to the statistics in the Vietlex dictionary. Reduplicative words (such as ‘[…]’, ‘làm lụng_work’) are compounds whose parts have a phonetic relationship. This kind of word is specific to Vietnamese, although its proportion is small. The identification of reduplicative words is normally deterministic and not ambiguous. Accidental compounds are non-syntactic compounds containing at least two meaningless […] words (SCWs) are the most problematic. An SCW can be considered as having two parts, a head and a modifier. Normally, the head goes first, followed by the modifiers. SCWs make up the largest proportion in the Vietnamese dictionary. Generally, discrimination between an SCW and a phrase is problematic because an SCW’s (syntactic) structure is similar to that of a phrase. This is a classical but persistent problem in Vietnamese linguistics.

In addition to the word types mentioned above, we consider the following types in the word segmentation phase: idioms, proper names, date/time and number expressions, foreign words, and abbreviations. Note that sentences are segmented into word sequences in which words are not labeled with type information. However, in our annotation guidelines, word segmentation rules are organized by word type.

2.1.2 Word definition

There are many approaches to word definition, such as those based on morphology, syntax, meaning or linguistic comparison. Since Vietnamese words are not marked with respect to number, case or tense, the morphology-based approach is not very applicable. We mostly rely on an approach based on the syntactic role and

10 They may have a meaning (‘[…]’, ‘hàn_cold’) or not (‘lẽo’, ‘nhánh’).


combination ability of words, so that we consider words to be syntactic atoms (Sciullo and Williams 1987) in the sense that it is impossible to analyze the word structure using syntactic rules (except for subordinated compounds), or that words are the smallest units which are syntactically independent. We do not use meaning as a criterion for word definition, but we make use of the non-compositionality property of a large proportion of compound words.

From the application point of view, the word definition should support applications as much as possible. For example, machine translation researchers may prefer a good match between the Vietnamese vocabulary and the vocabularies of foreign languages. The problem is that there are many foreign languages, which differ in terms of linguistic properties and word characteristics. Lexicographers (dictionary makers) may want to extract from texts candidate collocations and new words whose meaning needs to be explained. For such applications, syntactic parsers can be used since they can identify and extract phrases. These application considerations are important. However, at this stage of the resource development of Vietnamese NLP, we have concentrated on word segmentation for other fundamental tasks such as POS tagging, chunking, and syntactic parsing rather than on other applications.

2.1.3 Word segmentation guidelines

In the annotation phase, we used dictionaries as a reference. In fact, dictionary words can be considered to be candidates for word segmentation, and the right segmentation is chosen based on context. This is not a very difficult task for humans. We also applied techniques to identify new (compound) words. For repeated words, there are linguistic rules (Nguyen 2004) which well-trained annotators can apply without much difficulty. For coordinated and subordinated compound words, we used word-hood tests which have been discussed in various Vietnamese linguistic studies:

Tests for word-hood verification (without loss of generality, considering a sequence of two syllables AB):

Table 3 Word length statistics from a popular Vietnamese dictionary, made by the Vietnam Lexicography Center (Vietlex)


– Stress: in pronunciation of AB, if A or B is stressed while the other is not, then […]

– Parallel: if the meaning of A and the meaning of B are similar, and A and B can be reordered, then AB is likely to be a coordinated compound word.

– Transformation 1 (insertion): if we can insert C, C’, … between A and B, then AB is not likely to be a word. The more productive the transformation is, the less likely it is that AB is a word.

– Transformation 2 (substitution): if A (or B) can be substituted by A’, A”, … (or B’, B”, …) of the same type, then AB is not likely to be a word. The more productive the transformation is, the less likely it is that AB is a word.

In fact, our word segmentation guidelines are much more specific than the preceding list of tests. However, as shown by a prior study of Chinese treebank construction (Xue et al. 2005), the specification of such general word-hood tests can help annotators understand word identification criteria systematically, and therefore improve inter-annotator word segmentation agreement. These word-hood tests can be used directly, or indirectly in case there are more specific tests (guidelines) of the same type.

In practice, in verifying whether a syllable compound is a word or not, annotators often have to use multiple tests. The satisfaction of one test reflects only one aspect (phonetic, structural, or syntactic transformation possibilities) of a word. There […] ‘tạp chí_magazine’ (words). There are also sequences that do not satisfy any test, such […] poles, there are sequences that satisfy one, two, or several tests only. Such sequences, which are often SCWs, form a source of inconsistency. We try to maintain the consistency of annotation as much as possible. For example, if […] in this example, the transformation can be considered not productive since it results in two possible compounds only.

2.2 Part-of-speech tagging

2.2.1 Part-of-speech tag set and annotation guidelines

In Vietnamese syntactic studies, there are two common approaches to classifying words into POSs. The first approach is based on the combination ability and syntactic functions of words (in other words, distribution), while the other relies on word meaning. In fact, these approaches are often combined (Diep 2005). We


choose the first view, combination ability and syntactic function, for our POS tag set design since words with different meanings can have the same syntactic function. Therefore, our POS tags do not contain semantic information. Note that in NLP, POS tagging studies often make use of local lexical information such as surrounding words and POSs, rather than phrase-structure information. In practice, pipeline processing, or the incremental approach, is quite popular when building sentence analysis systems. For example, in order to parse a sentence, the necessary processing steps include word segmentation, POS tagging, and syntactic parsing. The process operates sequentially, with the output of one step providing the input to the next step. Our POS tags do not contain sub-categorization information (e.g. transitive/intransitive verbs, verbs followed by clauses, etc.). Where “extra” information such as semantic and sub-categorization information is necessary, higher levels of analysis such as word sense disambiguation and syntactic parsing are required. For these reasons, we choose a medium level of tag detail. Later, we will discuss the refinable property of a number of tags.

Table 4 shows our POS tag set. Vietnamese parts of speech do not necessarily correspond directly with English parts of speech of the same name. The class of adverbs in Vietnamese is a closed class (or a class of function words), while in English the class of adverbs is an open class (or a class of content words) […] as ‘[…]’, ‘hơi_rather’), and negation (such as ‘không_not’). Therefore, the number of adverbs in Vietnamese is much smaller than that in English. Other words that change or qualify the meaning of verbs are classified in Vietnamese as adjectives.

There are current controversies about how some Vietnamese words should be tagged with POS. Classifier words such as ‘cái’ and ‘con’ are examples. Vietnamese countable nouns often must be preceded by these classifiers when these nouns are being counted or specified (e.g. ‘cái bàn_table’, ‘con gà_chicken’). Some argue that this group of words should be considered an independent part of speech or a subclass of nouns. Recent studies have shown that these words can be considered as (classifier) nouns (Cao 2007). More specifically, these words can serve as the head of a noun phrase. Another example concerns the verb and adjective parts of speech. Many […] ‘sẽ_future’, to identify verbs, and ‘[…]’, ‘hơi_rather’, ‘quá_extremely’, ‘[…]’, to identify adjectives. However, Cao (2007) showed that many verbs, such as emotional ones, can co-occur with ‘rất_very’, ‘hơi_rather’, ‘quá_extremely’, ‘[…]’, […], while hundreds of adjectives (such as extreme adjectives) cannot co-occur with that group of adverbs. Cao states that we should merge the verb and adjective parts of speech into a predicative part of speech. Although we do not follow his opinion, his argument helps us understand some of the limitations of the lexical evidence techniques we used.

In the POS annotation guidelines, we list ambiguous cases and describe tests and examples for POS disambiguation. For example, the set of directional words ‘ra_out’, ‘vào_in’, ‘lên_up’, ‘[…]’, […] are ambiguous between verb, preposition, and


adverb POSs. For instance, ‘ra’ is a verb in ‘[…]’ (insertion of […] ‘[…]’, ‘[…]’, ‘sẽ_future’, […]), an adverb in ‘Tôi_I nghĩ_find ra_out giải pháp_solution’ (showing result), and a preposition in ‘Tôi_I đi_go ra_to Hà Nội’ (having a noun complement).

2.2.2 Refinable properties

Based on lexical information, a number of tags such as pronoun, adverb, conjunction, and particle can be easily (with less ambiguity) split into more specific sub-tags. This is similar to the recoverable¹³ property mentioned by Marcus and Marcinkiewicz (1993). For example, the pronoun tag P can be split into the vocative pronoun (such as ‘tôi_I;me’, ‘chúng tôi_we;us’), the deterministic pronoun (such as ‘đây_this’, ‘đó_that’), and the interrogative pronoun (such as ‘ai_who’, ‘gì_what’). Another example is the adverb tag R. This tag can be split into three subtypes reflecting the relative position of an adverb in a sentence: at the beginning of the sentence (such as ‘thỉnh thoảng_sometimes’, ‘bỗng dưng_suddenly’), preceding the modified verb (such as ‘cũng_also’, ‘sẽ_future’), or following the verb (such as ‘rồi_already’, ‘nữa_again’). This property can be useful in cases in which linguists want to investigate the behaviour of words belonging to these subtypes.
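As a rough illustration of this refinable property, the sketch below splits the pronoun tag P into sub-tags using a lexical lookup. The sub-tag names (P-VOC, P-DET, P-INT), the function `refine_tag`, the verb tag V, and the three-word example are hypothetical; only the pronoun examples and the tag P come from the text.

```python
# Hypothetical sub-tag lexicon built from the pronoun examples in the text.
PRONOUN_SUBTAGS = {
    "tôi": "P-VOC", "chúng tôi": "P-VOC",  # vocative pronouns
    "đây": "P-DET", "đó": "P-DET",         # deterministic pronouns
    "ai": "P-INT", "gì": "P-INT",          # interrogative pronouns
}

def refine_tag(word, tag):
    """Return a finer sub-tag for known pronouns; leave everything else unchanged."""
    if tag == "P":
        return PRONOUN_SUBTAGS.get(word.lower(), tag)
    return tag

# Toy tagged sequence (the tag V for verbs is an assumption).
tagged = [("Tôi", "P"), ("hỏi", "V"), ("ai", "P")]
print([(word, refine_tag(word, tag)) for word, tag in tagged])
```

Because the refinement depends only on the word form, it can be applied after annotation without consulting the annotators again, which is what makes such tags cheap to refine.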

13 This term came from the fact that the design for the Penn Treebank tag set was based on the simplification of the Brown Corpus tag set.

Table 4 Vietnamese treebank POS tag set

In designing this tag set, we did not make distinctions that depend on syntactic structure. For example, we do not distinguish among noun-modifier adjectives, predicative adjectives, and verb-modifier adjectives. Such distinctions can be recovered from the adjective’s position in the parse tree in the parsed version of the corpus (e.g. if the parent node’s tag is NP, then the adjective is a noun modifier).
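As a rough illustration of recovering an adjective's role from the parse tree, the sketch below walks a toy tree and looks at the tag of each adjective's parent. Several things here are assumptions rather than VTB specifics: the `Node` class, the adjective tag `A`, the `VP` entry in the mapping, and the toy phrase ‘nhà đẹp’ (beautiful house). Only the NP rule comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    tag: str
    word: Optional[str] = None          # set on leaves only
    children: List["Node"] = field(default_factory=list)


# NP -> noun modifier is stated in the text; the VP entry is an assumption.
ROLE_BY_PARENT = {"NP": "noun modifier", "VP": "verb modifier"}


def adjective_roles(node: Node, parent_tag: Optional[str] = None) -> List[Tuple[str, str]]:
    """Collect (word, role) for every adjective leaf, based on its parent's tag."""
    if not node.children:
        if node.tag == "A":  # "A" as the adjective tag is assumed here
            return [(node.word, ROLE_BY_PARENT.get(parent_tag, "undetermined"))]
        return []
    roles = []
    for child in node.children:
        roles.extend(adjective_roles(child, node.tag))
    return roles


# Toy tree: (S (NP (N nhà) (A đẹp)))
tree = Node("S", children=[Node("NP", children=[Node("N", "nhà"), Node("A", "đẹp")])])
```

Calling `adjective_roles(tree)` labels ‘đẹp’ as a noun modifier because its parent is an NP. Adjectives under any parent not listed in the mapping are reported as undetermined rather than guessed.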

2.3 Syntactic annotation

2.3.1 Syntactic tag set

Our syntactic tag set contains three types of tags: constituency (Table 5), function (Table 6), and null element (Table 7). The design of the constituency tags is less controversial than that of the POS tag set. Each phrase tag XP often has a corresponding POS tag X, as the POS of the XP’s head is X. Tags with a WH prefix, such as WHNP and WHAP, are used for labelling phrases that contain interrogative words in question sentences. Another design option is to represent WH as a functional tag, as in the Chinese treebank. There are several clausal tags representing statement sentences, question sentences, and subordinate clauses. Each functional tag represents a specific kind of complement or adjunct. Considering that head identification is important, we use the tag H to label the head of a phrase. If a phrase has more than one head connected by coordinating conjunctions or commas, then all heads are labelled with the H tag. Since other treebanks such as PTB and CTB often do not use a head tag, researchers in syntactic parsing such as Collins (1999) and Klein and Manning (2003) use heuristic rules to determine the head of CFG rules. Machine learning methods such as the expectation maximization (EM) algorithm can also be used to recover the ‘hidden’ head in this case (Chiang and Bikel 2002). Null element tags are often used for representing deep structures of adjective clauses, ellipsis, passive voice, and topic.
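The difference between an explicit head tag and heuristic head finding can be illustrated with a small sketch. It assumes that function tags are attached to a label with a hyphen (for example `N-H`), which is a convention chosen here for illustration. The fallback is a deliberately simple stand-in based on the XP/X correspondence mentioned above, not the head rules of Collins (1999) or Klein and Manning (2003). The function name `find_heads` and the conjunction tag `C` are also hypothetical.

```python
from typing import List


def find_heads(phrase_label: str, child_labels: List[str]) -> List[int]:
    """Return the indices of the head children of a phrase.

    Children explicitly marked with the function tag H win. If none is
    marked, fall back to the first child whose category is X for a phrase
    labelled XP (a toy heuristic, used only when H is absent).
    """
    marked = [i for i, label in enumerate(child_labels)
              if "H" in label.split("-")[1:]]
    if marked:
        return marked

    category = phrase_label.split("-")[0]
    if len(category) > 1 and category.endswith("P"):
        target = category[:-1]
        for i, label in enumerate(child_labels):
            if label.split("-")[0] == target:
                return [i]
    return []
```

For a coordinated phrase such as `find_heads("NP", ["N-H", "C", "N-H"])`, both conjuncts are returned as heads, which mirrors the rule that all coordinated heads carry the H tag. With an unmarked rule such as `find_heads("VP", ["R", "V", "NP"])`, the fallback has to guess, which is the situation that the H tag is meant to avoid.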

Table 5 Vietnamese treebank constituency tags

2.3.2 Sentence and phrase analysis techniques

In Vietnamese, words belonging to a number of classes, such as verb, adjective, noun, and preposition, can be the predicate of a sentence. Additionally, the isolating property makes Vietnamese sentences more structurally ambiguous. A kind of ‘morphological’ property of Vietnamese is that there are functional words expressing number (before nouns), tense (before verbs), capability (before adjectives), etc. However, the use of these words is often not a morpho-syntactic constraint: speakers and writers use them only when they are important for expressing the meaning of a sentence (i.e. to avoid misunderstanding). Therefore, in sentence analysis, the test of inserting functional words is important.

Table 6 Vietnamese treebank functional tags

Table 7 Vietnamese treebank null-element tags

The annotation of real texts relies on various techniques because ambiguity may occur at any step of phrase structure analysis, such as determining the head element, discriminating between possible syntactic patterns (especially subcategorization frames), and discriminating between complements and adjuncts. Important sentence analysis techniques include deletion, substitution, insertion, transformation, and question formation. These techniques exploit combination ability, word order, and functional words in order to disambiguate between possible structures. Table 8 shows some examples of such techniques.

As shown in the examples in Table 8, to identify linguistic units (e.g. an adverbial phrase), syntactic behaviors (e.g. being reorderable) can be verified by transformation. If a transformed sentence is correct, then the corresponding analysis is chosen. Otherwise, it is rejected. In practice, a transformed sentence may be incorrect for reasons at the syntactic level, the semantic level, or the pragmatic level. It is not always easy to separate these levels.
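The accept-or-reject logic of a transformation test can be sketched as follows. The sketch covers only the reordering case, and everything in it is hypothetical: the function names, the English toy sentence, and in particular `is_acceptable`, which stands in for the annotator's judgement and cannot be computed automatically.

```python
from typing import Callable, List


def move_to_front(tokens: List[str], start: int, end: int) -> List[str]:
    """Reordering transformation: move tokens[start:end] to the sentence start."""
    return tokens[start:end] + tokens[:start] + tokens[end:]


def passes_reorder_test(tokens: List[str], start: int, end: int,
                        is_acceptable: Callable[[List[str]], bool]) -> bool:
    """Keep the 'reorderable phrase' analysis only if the moved version is judged correct."""
    return is_acceptable(move_to_front(tokens, start, end))


# English toy example; the judgement is mocked with a fixed set of accepted sentences.
sentence = ["I", "met", "her", "yesterday"]
accepted = {("yesterday", "I", "met", "her")}
judge = lambda toks: tuple(toks) in accepted
```

Here `passes_reorder_test(sentence, 3, 4, judge)` succeeds, which supports analysing the last word as a reorderable adverbial, while moving ‘her’ to the front is rejected by the mocked judge. As the text notes, a rejection does not say whether the cause is syntactic, semantic, or pragmatic.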

2.3.3 Existential and passive sentences

The subject identification of existential sentences like ‘Trên (on) bàn (table) đặt (put) một (a) lo

Such sentences can be composed mechanically by omitting the first argument (e.g. agent) of the predicate and moving one of the other arguments (e.g. recipient) toward the beginning of the sentence. The frequency of existential sentences is a consequence of the topic-sensitive property of the Vietnamese language. Recent syntactic studies (Nguyen 2009) showed that ‘trên bàn’ (on table) or ‘nhà’ (house) can be considered the subject of such sentences, while a number of previous studies recognized ‘trên bàn’ as an adverbial phrase, or ‘nhà’ as a moved object, with no subject present. To ensure the consistency of our subject-predicate approach, we label those phrases functionally as subject. We use null element tags with trace indices to indicate that, logically, such phrases are the moved argument of the predicate.

Table 8 Examples of disambiguation tests
