
Building a Treebank for Vietnamese Dependency Parsing

Luong Nguyen Thi
Dalat University, Information Technology
Lamdong, Vietnam
Email: luongnt@dlu.edu.vn

Linh Ha My, Hung Nguyen Viet, Huyen Nguyen Thi Minh, Phuong Le Hong
VNU University of Science
Hanoi, Vietnam
Email: halinh.hus@gmail.com, hungnguyenviet@vnu.edu.vn, huyenntm@vnu.edu.vn, phuonglh@vnu.edu.vn

Abstract: The problem of Vietnamese syntactic parsing, especially constituency parsing, has recently been tackled by several research groups. A common effort of the Vietnamese language processing community has led to the creation of VietTreebank, a reference parsed corpus containing about 10,000 sentences for the constituency parsing task. In this paper, we present our work on building a reference treebank, based on VietTreebank, for the dependency parsing task, which has not been well studied for Vietnamese. First, we define a dependency label set by adapting the dependency schema developed by the NLP group at Stanford University and taking into account the particularities of Vietnamese grammar. Then we propose an algorithm to convert a constituency treebank into a dependency treebank. The algorithm is tested on a set of 100 sentences from the VietTreebank corpus and gives very good results. Finally, we carry out an experiment on Vietnamese dependency parsing using the MaltParser tool and the dependency treebank converted from VietTreebank.

I. INTRODUCTION

Dependency parsing has been an active approach to syntactic parsing in recent years. The basic idea of dependency parsing is to find a syntactic structure consisting of lexical items linked by binary asymmetric relations called dependencies. There have been many studies on dependency parsing, and many tools have been developed to solve this problem. In particular, methods based on machine learning give highly accurate parsing results for languages such as English, Chinese, and Swedish.

For Vietnamese, most studies have centered on constituency parsing, such as [?], [?]. The Vietnamese treebank reported in [?] consists of about 10,000 sentences in Penn Treebank format. For dependency parsing, there exist only two works: one by Nguyễn Lê Minh et al. [?], which uses the MST parser on a corpus of 450 sentences, and one by Lê Hồng Phương et al. [?], which uses a lexicalized tree-adjoining grammar parser trained on a subset of the Vietnamese treebank.

In this paper, we report our work on building a large corpus for Vietnamese dependency parsing. We first develop algorithms for converting constituency structures to dependency structures. We then use the resulting dependency treebank to train and evaluate MaltParser, a language-independent dependency parser [?], and report the parsing results.

This paper is organized as follows. The next section introduces dependency parsing, giving basic concepts and some existing works. The following section presents the construction of a Vietnamese dependency treebank. Finally, the last section reports experimental results on Vietnamese dependency parsing with MaltParser.

II. DEPENDENCY PARSING

A. Definition

Syntax is studied by two research communities: linguists and computer scientists. For linguists, natural language is the object of study, and formal syntax is one of the language levels to be described. Computer scientists develop models and algorithms that allow computers to analyze formal syntax in order to build natural language processing applications.

Dependency syntax describes syntactic structures made of lexical items, or tokens, connected by binary asymmetric relations called dependencies. A dependency relation between two tokens can be named to clarify the relationship between them. A dependency structure is determined by the relationship between a central token (the head) and a token that depends on it (the dependent), denoted by an arrow. By convention, the arrow starts at the head and points to the dependent. In comparison to constituency structure, dependency structure is more appropriate for representing the syntax of free word order languages, such as Czech or Turkish.

In dependency parsing, each syntactic parse of a sentence can be represented by a dependency graph. A dependency graph is a graph in which each node is a token of the sentence. The arcs (edges) of the graph represent dependency relations between two nodes, and the name of an arc is the dependency label between those nodes.

For example, consider the English sentence "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas". Its dependency graph contains 13 nodes corresponding to the 13 words and 12 relations connecting the words. The relations in the sentence include prep(Bills, on) and pobj(on, ports) [?].

Also by convention, there is a special node that does not correspond to any token in the sentence and always represents the root of the dependency graph.
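The graph described above can be sketched as a small data structure. The following Python fragment is only an illustration: the names Arc and is_well_formed are hypothetical, tokens are assumed to be numbered from 1 with index 0 reserved for the special root node, and the three-token fragment with its "root" arc is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    head: int       # index of the head token (0 is the artificial root node)
    dependent: int  # index of the dependent token (tokens are numbered from 1)
    label: str      # dependency label, e.g. "prep" or "pobj"


def is_well_formed(n_tokens, arcs):
    """Check that every token has exactly one head and is reachable from node 0."""
    heads = {}
    for arc in arcs:
        if not (1 <= arc.dependent <= n_tokens and 0 <= arc.head <= n_tokens):
            return False
        if arc.dependent in heads:      # a token may not have two heads
            return False
        heads[arc.dependent] = arc.head
    if len(heads) != n_tokens:          # some token has no head
        return False
    for token in range(1, n_tokens + 1):
        seen = set()
        node = token
        while node != 0:                # walk up towards the root
            if node in seen:            # a cycle never reaches the root
                return False
            seen.add(node)
            node = heads[node]
    return True


# Fragment "Bills on ports": prep(Bills, on) and pobj(on, ports) come from the text;
# attaching "Bills" to the root is an assumption made for this fragment only.
fragment = [Arc(0, 1, "root"), Arc(1, 2, "prep"), Arc(2, 3, "pobj")]
print(is_well_formed(3, fragment))
```

A structure that passes this check is a tree rooted at the special node, which is the form that parsers such as those discussed below are trained to produce.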

Dependency parsing is the problem of constructing the most probable dependency graph for a given input sentence. The input to a dependency parser is a tokenized and part-of-speech tagged sentence. Most studies on dependency parsing employ machine learning techniques. To build a supervised dependency parser for a language, we need a large dependency treebank of that language.

[Figure 1: dependency graph of the example sentence. Arc labels shown with each token: Bills (nsubjpass), on (prep), ports (pobj), and (cc), immigration (conj), were (auxpass), by (prep), Brownback (pobj), Senator (nn), Republican (appos), of (prep), Kansas (pobj).]

Fig. 1. Dependency graph of an English sentence.

B. Related Work

Dependency parsing has recently received the attention of many research groups. There have been many studies and software tools for dependency parsing, including MaltParser, Stanford Parser, and MSTParser. Most dependency parsing tools achieve high accuracy and are suitable for many languages, such as English, Chinese, German, and Czech. The accuracy of a parser is evaluated using two measures: the unlabeled attachment score (ASU), which is the proportion of tokens assigned the correct head, and the labeled attachment score (ASL), which is the proportion of tokens assigned both the correct head and the correct dependency type.
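The two scores can be computed directly from their definitions. The sketch below is a minimal illustration, not the evaluation script used in the experiments; the function name attachment_scores and the representation of each token as a (head index, label) pair are assumptions.

```python
def attachment_scores(gold, predicted):
    """Return (ASU, ASL) in percent.

    gold, predicted: one (head_index, label) pair per token, in sentence order.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    # ASU counts tokens whose head is correct, regardless of the label.
    correct_heads = sum(1 for (g_head, _), (p_head, _) in zip(gold, predicted)
                        if g_head == p_head)
    # ASL counts tokens whose head and label are both correct.
    correct_both = sum(1 for g, p in zip(gold, predicted) if g == p)
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


# Toy four-token sentence: one wrong label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "vmod"), (2, "amod")]
print(attachment_scores(gold, pred))  # (75.0, 50.0)
```

Because a correct label only counts when the head is also correct, ASL can never exceed ASU, which is visible in all the figures quoted in this section.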

1) MSTParser: MSTParser was developed by Ryan McDonald et al. [?]. It has two phases: training and parsing. In training, MSTParser uses online learning algorithms [?]. In parsing, it uses a graph-based algorithm. The accuracy of MSTParser on a variety of languages is quite high: ASU = 92.8%, ASL = 90.7% for Japanese; ASU = 91.1%, ASL = 85.9% for Chinese; and ASU = 90.4%, ASL = 87.3% for German.¹

2) Stanford Parser: Stanford Parser was developed by the NLP group at Stanford University. It defines 53 dependency types for English based on the Penn Treebank [?]. The accuracy of the parser is quite high; in particular, for English, ASU = 87.2% and ASL = 84.2%. This parser has been extended to parse languages other than English, such as Chinese, German, French, and Arabic.²

¹ http://sourceforge.net/projects/mstparser/

² http://nlp.stanford.edu/software/lex-parser.shtml

3) MaltParser: MaltParser was developed by Johan Hall et al. It is among the most effective dependency parsing tools, with high accuracy for more than 20 languages. MaltParser has two phases: training and parsing. In training, MaltParser uses support vector machines. In parsing, it uses a transition-based algorithm. The accuracy of the tool is high, for example ASU = 88.1%, ASL = 86.3% for English and ASU = 88.1%, ASL = 83.4% for German.³

All of the above tools are trained using supervised machine learning algorithms and require a large corpus for each language concerned. No such dependency corpus exists for Vietnamese. The most important step in developing a dependency parser for Vietnamese is therefore to build a dependency corpus. In the next section, we present our work on constructing a Vietnamese dependency corpus.

III. BUILDING VIETNAMESE DEPENDENCY TREEBANK

The original constituency treebank is a corpus containing about 10,000 sentences in Penn Treebank format. An example sentence is (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))) ( .)), where

• S, NP, PP are the labels of phrases and clauses;
• Nc, N, R are the labels of tokens;
• SUB, H, DOB are the functional syntactic labels of phrases, clauses, or tokens.

We design an algorithm to convert this constituency treebank to a dependency treebank. The algorithm has two steps: (1) determining all the dependencies in the sentence and (2) labeling the dependency relations. The first step is solved by determining the central element (head) of every grammatical phrase and clause using head rules. The second step is done using a dependency label set and rules for labeling dependencies.

A. Dependency Schema

Different dependency labels represent different types of relationships between pairs of tokens in a sentence. Typically, the set of dependency labels depends on the particular language. Nevertheless, many languages share an important subset of dependency labels.

The dependency schema developed by the NLP group at Stanford University defines 53 dependency types for English. All of them are binary relations, each defining a relation between a head and its dependent. We adapt and extend this schema to build a dependency schema for Vietnamese which takes into account the particularities of Vietnamese grammar [?]. This schema consists of 48 labels, all of which are explicitly defined and consistent with Vietnamese syntax. The most common dependency labels are given below:

• vmod: verb modifier, for example vmod(đi, qua) in (VP (V-H đi) (V qua));
• rmod: adverb modifier, for example rmod(Xa_xa, nữa) in (AP (A-H Xa_xa) (R nữa));
• dobj: direct object of a verbal phrase, for example dobj(còn, người) in (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo)));
• pobj: object of a preposition, for example pobj(bằng, cùi_tay) in (PP-MNR (E-H bằng) (NP (M hai) (N-H cùi_tay) (A cụt_lủn))).

³ http://www.maltparser.org/

Fig. 2. An example of dependency parsing in Vietnamese.

Figure 2 shows a dependency parse which includes the following dependency relations:

ncdep(Mảnh-1, đất-2)
prepc(Mảnh-1, của-3)
nsubj(còn-7, Mảnh-1)
pobj(của-3, đạn-4)
nn(đạn-4, bom-5)
neg(còn-7, không-6)
root(ROOT-0, còn-7)
dobj(còn-7, người-8)
amod(người-8, nghèo-9)
punct(còn-7, .-10)

B. Head Rules

In order to determine the head element of each phrase, we build a head rule table. This table constitutes an important part of our work. Our head rules follow those presented in [?]. Two rules from the table are:

SBAR → -H; SBAR; S; VP; AP; NP; .*
WHNP → -H; WHNP; NP; Nc; Nu; Np; N; P; .*

For example, the rule

S → -H; S; VP; AP; NP; .*

can be understood as follows: to find the head of a sentence S, we scan its elements from left to right to find the first element marked -H; if there is such an element, it is the head of the sentence. If not, we look for an S element to be the head; if no S is found, we look for a VP, and so on. If no such element exists, we take the first element from the left as the head (".*").
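The reading of a head rule given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contains just the three rules quoted in this section, the names HEAD_RULES, split_label and find_head are hypothetical, and the fallback used for phrase types without a listed rule ("-H", then leftmost) is an assumption rather than part of the head rule table.

```python
# The three rules quoted in the text; the full table is not reproduced here.
HEAD_RULES = {
    "S": ["-H", "S", "VP", "AP", "NP", ".*"],
    "SBAR": ["-H", "SBAR", "S", "VP", "AP", "NP", ".*"],
    "WHNP": ["-H", "WHNP", "NP", "Nc", "Nu", "Np", "N", "P", ".*"],
}


def split_label(label):
    """Split 'NP-SUB' into ('NP', ['SUB']) and 'V-H' into ('V', ['H'])."""
    category, *functions = label.split("-")
    return category, functions


def find_head(phrase_label, child_labels, rules=HEAD_RULES):
    """Return the index of the head child of a phrase.

    Rule items are tried in priority order; for each item the children are
    scanned from left to right and the first match wins.
    """
    if not child_labels:
        raise ValueError("a phrase must have at least one child")
    category, _ = split_label(phrase_label)
    priorities = rules.get(category, ["-H", ".*"])  # assumed fallback rule
    for wanted in priorities:
        for index, child in enumerate(child_labels):
            child_category, child_functions = split_label(child)
            if wanted == "-H" and "H" in child_functions:
                return index
            if wanted == ".*" or wanted == child_category:
                return index
    return 0


print(find_head("S-TTL", ["NP-SUB", "VP", "."]))   # 1: no -H and no S, so VP is chosen
print(find_head("WHNP", ["P", "N"]))               # 1: N precedes P in the rule
```

The second call shows why the outer loop runs over rule items rather than over children: a child further to the right wins when its category has higher priority in the rule.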

C. Conversion Algorithm

The conversion algorithm has two stages. In the first stage, a constituency parse is constructed from the bracket format of each sentence of the treebank. For example, the parsed sentence (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))) ( .)) has the constituency parse shown in Figure 3.

[Figure 3: tree diagram of the bracketed sentence above, rooted at S-TTL.]

Fig. 3. A constituency parse of a sentence in the Vietnamese treebank.

In the second stage, the constituency parse is converted to a dependency parse. This stage has three steps. First, find the head of each phrase in the sentence using the head rule table (see Algorithm 1). Second, find a label for each dependency (head, dependent) (see Algorithm 2). Finally, build all the labeled dependencies using a recursive routine that calls the two previous steps (see Algorithm 3).
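The two stages can be sketched end to end for the example sentence. The Python code below is a simplified illustration, not the implementation evaluated in this paper: it produces unlabeled dependencies only, it uses the -H mark plus the single S rule quoted in Section III-B instead of the full head rule table, and it assumes that the final punctuation is written as (. .). The names parse_brackets, head_child and convert are hypothetical.

```python
import re

EXAMPLE = ("(S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H bom)))) "
           "(VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))) (. .))")

# Category priorities used only when no child carries the -H mark.
# Only the S rule quoted in the text is included (an intentional simplification).
PRIORITY = {"S": ["S", "VP", "AP", "NP"]}


def parse_brackets(text):
    """Stage 1: turn a bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                              # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # the word under a POS tag
                pos += 1
        pos += 1                              # skip ")"
        return (label, children)

    return node()


def head_child(label, children):
    """Index of the head child: first -H child, then category priority, then leftmost."""
    for i, (child_label, _) in enumerate(children):
        if "H" in child_label.split("-")[1:]:
            return i
    for wanted in PRIORITY.get(label.split("-")[0], []):
        for i, (child_label, _) in enumerate(children):
            if child_label.split("-")[0] == wanted:
                return i
    return 0


def convert(tree, words, arcs):
    """Stage 2: return the 1-based index of the lexical head of `tree` and
    append unlabeled (head, dependent) index pairs to `arcs`."""
    label, children = tree
    if isinstance(children[0], str):          # a POS tag over a single word
        words.append(children[0])
        return len(words)
    heads = [convert(child, words, arcs) for child in children]
    h = heads[head_child(label, children)]
    arcs.extend((h, d) for d in heads if d != h)
    return h


words, arcs = [], []
root = convert(parse_brackets(EXAMPLE), words, arcs)
arcs.append((0, root))                        # attach the sentence head to the root node
for h, d in sorted(arcs, key=lambda arc: arc[1]):
    print(f"{'ROOT' if h == 0 else words[h - 1]}-{h} -> {words[d - 1]}-{d}")
```

Each phrase contributes one arc from its head word to the head word of every other child, so a sentence of n tokens yields exactly n arcs once the root arc is added.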

D. Results

To evaluate the accuracy of the conversion algorithm, we first select a subset of 100 sentences from the Vietnamese treebank and manually annotate them with dependency relations. We then run the conversion algorithm presented above on these sentences to get dependency parses and compare them to the manual annotation. The result is very good: the unlabeled attachment score is 99.6%, and the labeled attachment score is perfect on matched attachments.

As an example, from the constituency parse (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))) ( .)), the automatic conversion algorithm produces a dependency parse of the sentence; the dependency relations of this sentence are listed with Figure 2.

Algorithm 1 FindHeadP(P, lstHeadRules, lstElements)
Require: P: a phrase; lstElements: list of elements in P; lstHeadRules: list of head rules
Ensure: head of P
  for headRule ∈ lstHeadRules do
    if headRule.Phrase = P then
      hr ← headRule
      break
    end if
  end for
  lstRightHR ← hr.Right
  for rightEle ∈ lstRightHR do        // rule items, in priority order
    for element ∈ lstElements do      // elements of P, from left to right
      if element.Phrase = rightEle or element.Pos = rightEle then
        return element
      end if
    end for
  end for
  return first element of lstElements  // default case ".*"

Algorithm 2 GetDependencyLabel(h, d, lstLabels)
Require: (h, d), where h is a head and d is its dependent; lstLabels: list of labels
Ensure: a dependency label l for the arc h → d
  l ← null
  for labelEle ∈ lstLabels do
    left ← GetInformation(h, labelEle.Left)
    right ← GetInformation(d, labelEle.Right)
    center ← GetCenterInformation(h, d, labelEle.Center)
    if IsLabel(left, right, center) then
      l ← labelEle.Label
      break
    end if
  end for
  return l
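The labeling step can be illustrated with a small rule table. The sketch below does not reproduce the labeling rules of the treebank, whose exact conditions (GetInformation, GetCenterInformation, IsLabel) are not spelled out here. The four rules are assumptions read off the examples given in Section III-A, the Node type and get_dependency_label name are hypothetical, and the generic fallback label "dep" is also an assumption.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    word: str
    pos: str                        # POS tag of the word, e.g. "V", "N", "E"
    function: Optional[str] = None  # function tag of the phrase it heads, e.g. "DOB"


# (head POS, dependent POS, dependent function tag, label); None matches anything.
# These rules are hypothetical and derived only from the four examples in the text.
LABEL_RULES = [
    (None, None, "DOB", "dobj"),   # (NP-DOB ...) under a verb
    ("E", "N", None, "pobj"),      # noun governed by a preposition
    ("V", "V", None, "vmod"),      # verb modifying a verb
    ("A", "R", None, "rmod"),      # adverb modifying an adjective
]


def get_dependency_label(head, dependent, rules=LABEL_RULES, default="dep"):
    """Return the label of the first rule whose conditions all hold."""
    for head_pos, dep_pos, dep_function, label in rules:
        if head_pos is not None and head.pos != head_pos:
            continue
        if dep_pos is not None and dependent.pos != dep_pos:
            continue
        if dep_function is not None and dependent.function != dep_function:
            continue
        return label
    return default


print(get_dependency_label(Node("đi", "V"), Node("qua", "V")))           # vmod
print(get_dependency_label(Node("còn", "V"), Node("người", "N", "DOB")))  # dobj
```

As in Algorithm 2, the rules are tried in order and the first match decides the label, so more specific rules have to be listed before more general ones.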

Table I shows the percentage of the most common labels assigned to dependencies over the whole Vietnamese treebank of about 10,000 sentences.

Algorithm 3 ConvertToDP(Root, lstHeadRules, lstLabels, dpTree)
Require: Root: root node of the constituency tree; lstHeadRules: list of head rules; lstLabels: list of dependency labels; dpTree: saved dependency tree
Ensure: head of the sentence
  if Root = null then
    return null
  end if
  if IsLeaf(Root) then
    lstElements ← Word(Root)
    return FindHeadP(Phrase(Root), lstHeadRules, lstElements)
  end if
  if AllChildIsLeaf(Root) then
    lstElements ← empty list
    for child ∈ Root do
      append Word(child) to lstElements
    end for
    h ← FindHeadP(Phrase(Root), lstHeadRules, lstElements)
    for child ∈ Root do
      if child ≠ h then
        label ← GetDependencyLabel(h, child, lstLabels)
        add (h, child, label) to dpTree
      end if
    end for
    return h
  end if
  lstHeadChilds ← empty list
  for child ∈ Root do
    append ConvertToDP(child, lstHeadRules, lstLabels, dpTree) to lstHeadChilds
  end for
  h ← FindHeadP(Phrase(Root), lstHeadRules, lstHeadChilds)
  for headChild ∈ lstHeadChilds do
    if headChild ≠ h then
      label ← GetDependencyLabel(h, headChild, lstLabels)
      add (h, headChild, label) to dpTree
    end if
  end for
  return h

TABLE I
PERCENTAGE OF COMMON DEPENDENCY LABELS ON THE VIETNAMESE TREEBANK

No   Label   %
1    vmod    9.95
2    rmod    6.36
3    nsubj   5.81
4    dobj    5.7
5    pobj    5.6
7    conj    4.67

IV. EXPERIMENTS WITH MALTPARSER

In this section, we present parsing experiments on the Vietnamese dependency treebank constructed in the previous section. We use MaltParser to train and test dependency parsing models on the treebank using cross-validation. Ten data sets for training and testing are created. In each round, 500 sentences are randomly selected as the test set and the rest are used to train MaltParser. The configuration of the parser that we use is as follows:

• Transition system: Arc-Eager
• Parser configuration: Nivre with allow_root=true and allow_reduce=false
• Feature model: NivreEager.xml
• Learner: liblinear
• Oracle: Arc-Eager
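The train/test protocol of one round can be sketched as follows. This is an illustration under assumptions: Table II identifies each test set by a sentence range such as 1001-1500, so the sketch takes the range as a parameter and assumes 1-based, inclusive sentence numbering. The function name split_round is hypothetical, and the list of 10,000 integers merely stands in for the parsed sentences.

```python
def split_round(sentences, first, last):
    """Return (train, test), where test holds sentences first..last
    (1-based, inclusive) and train holds all remaining sentences."""
    if not (1 <= first <= last <= len(sentences)):
        raise ValueError("test range lies outside the corpus")
    test = sentences[first - 1:last]
    train = sentences[:first - 1] + sentences[last:]
    return train, test


corpus = list(range(1, 10001))       # placeholder ids instead of real parsed sentences
train, test = split_round(corpus, 1001, 1500)
print(len(train), len(test))         # 9500 500
```

Each round therefore trains on all sentences outside the chosen range, so no test sentence is ever seen during training in that round.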

The experimental results are described in Table II.

TABLE II
DEPENDENCY PARSING ACCURACY WITH MALTPARSER

No        Test (500 sentences)   ASU (%)   ASL (%)
2         1001-1500              75.58     68.40
3         2001-2500              72.37     65.12
4         3001-3500              74.16     66.58
5         4001-4500              69.69     63.47
6         5001-5500              74.10     67.42
7         6001-6500              73.49     67.27
8         7001-7500              72.76     65.91
9         8001-8500              69.04     63.16
10        9001-9500              72.82     65.74
Average                          73.03     66.35

The average ASU is 73.03% and the average ASL is 66.35%. In these experiments, MaltParser was not optimized for Vietnamese, so the accuracy is not high. The accuracy can be improved by fixing some errors in the dependency treebank, such as wrongly determined roots in sentences with many clauses and wrong dependencies for special tokens. The guidelines for dependency annotation also need to be defined more clearly to improve the quality of dependency identification.

V. CONCLUSION

There have been several works on constituency parsing for Vietnamese but few on dependency parsing, as little data exists for training dependency parsers. However, dependency parsing provides more useful information for natural language processing than constituency parsing. Our work aims to automatically build a Vietnamese dependency treebank from constituency treebanks, which are more readily available. The dependency label set is defined based on Vietnamese grammar in a way that allows us to compare our labels directly with English dependency labels. To do this, the English dependency label set developed by the NLP group at Stanford University is used as a reference.

Having converted VietTreebank into a Vietnamese dependency treebank of about 10,000 sentences, we carried out experiments on Vietnamese dependency parsing using MaltParser. The evaluation gives an average ASU of 73.03% and an average ASL of 66.35%. As a first step, these experimental results help to reveal some errors in the reference data. As a next step, we will revise the corpus and carry out experiments with different parsers to find the best methods for Vietnamese dependency parsing.

REFERENCES

[1] L. T. Hương, P. H. Quang, and N. T. Thủy, “Một cách tiếp cận trong việc tự động phân tích cú pháp văn bản tiếng Việt,” Tạp chí Tin học và Điều khiển học, vol. 15, no. 4, 2000.

[2] P. T. Nguyen, L. V. Xuan, T. M. H. Nguyen, V. H. Nguyen, and P. Le-Hong, “Building a large syntactically-annotated corpus of Vietnamese,” in Proceedings of the 3rd Linguistic Annotation Workshop, ACL-IJCNLP, Singapore, 2009.

[3] N. L. Minh, H. T. Điệp, and T. M. Kế, “Nghiên cứu luật hiệu chỉnh kết quả dùng phương pháp MST phân tích cú pháp phụ thuộc tiếng Việt,” in ICT-rda 8, Hanoi, Vietnam, 2008, pp. 258–267.

[4] P. Le-Hong, T. M. H. Nguyen, and R. Azim, “Vietnamese parsing with an automatically extracted tree-adjoining grammar,” in Proceedings of the IEEE International Conference in Computer Science: Research, Innovation and Vision of the Future, RIVF, HCMC, Vietnam, 2012.

[5] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kubler, S. Marinov, and E. Marsi, “Maltparser: A language-independent system for data-driven dependency parsing,” Natural Language Engineering, vol. 13, no. 2, pp. 95–135, 2007.

[6] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, “Generating typed dependency parses from phrase structure parses,” in Proceedings of LREC 2006, Genoa, Italy, 2006.

[7] R. McDonald, K. Lerman, and F. Pereira, “Multilingual dependency parsing with a two-stage discriminative parser,” in Proceedings of the Tenth Conference on Computational Natural Language Learning, 2006.

[8] R. McDonald, K. Crammer, and F. Pereira, “Online large-margin training of dependency parsers,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005.

[9] Q. B. Diệp and V. T. Hoàng, Ngữ pháp Tiếng Việt (Vietnamese Grammar). NXB Giáo dục, Hà Nội, Việt Nam, 1999.

[10] P. Le-Hong, T. M. H. Nguyen, P. T. Nguyen, and A. Roussanaly, “Automated extraction of tree adjoining grammars from a treebank for Vietnamese,” in Proceedings of The Tenth International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+10), Yale University, New Haven, CT, USA, 2010.
