Pham Thi Minh Thu
Faculty of Information Technology Hanoi University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Doctor Le Anh Cuong
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
June, 2010
Table of Contents
Acknowledgements
1 Introduction
1.1 What is syntactic parsing?
1.2 Current Studies in Parsing
Earley algorithm
2.3 Probabilistic context-free grammar (PCFGs)
3.2.2 Bracketing
3.3 Viet Treebank
3.3.2 The POS tagset and Syntax tagset for
3.4 Our approach in building a Vietnamese parser
3.4.1 Adapting Bikel's parser for Vietnamese
3.4.2 Analyse errors and propose heuristic rules
4 Experiments and Discussion
4.1 Data
4.2 Bikel's parsing tool
4.3 Adapting Bikel's tool to Vietnamese
4.3.1 Investigate different configurations
4.4 Experimental results on using heuristic rules
5 Conclusions and Future Work
List of Figures
The parse tree of the sentence "I go to school"
A parse tree in Vietnamese
The parse tree of the Vietnamese sentence "Mèo bắt chuột"
Two derivations of the sentence "Tôi hiểu Lan hơn Nga"
A parse tree of Vietnamese in LPCFG
A tree with the "-C" suffix used to identify complements
Set of tags in Penn Treebank
A sample of labeled data in Penn Treebank before manual treatment
A sample of labeled data in Penn Treebank after manual treatment
A sample of complete data in English and Vietnamese
Bikel's system overview
Result of testing standard Collins' model 2 with the training data's size changing
from 60% to 100% of the full data, where series 1 and series 2 stand for
testing on sentences with length less than or equal to 40 and 100 respectively
List of Tables
Analysis table with CYK algorithm
POS tagset in Viet Treebank
Clause tagset in Viet Treebank
Syntax function tagset in Viet Treebank
The initial results on Viet Treebank with different configurations. Key: CB
= average crossing brackets, 0CB = zero crossing brackets, ≤2CB = ≤ 2
crossing brackets. All results are percentages, except for those in the CB
column
Number of sentences for training
The results with the change of the training data size
The error rate. We use 520 sentences for development testing, then filter the
sentences whose F-score is less than 70%. As the result, we collect
147 sentences into the set of error sentences. The percentage of an error
is calculated as the number of sentences committing this error divided by 147.
Because a sentence may contain several errors, the total percentage may exceed
100%
The obtained results after applying some proposed rules to correct some
wrong syntactic parses
Chapter 1
Introduction
For a long time, human beings have always dreamed of an intelligent machine which can listen to, understand and carry out humans' requirements. Many scientists have tried to realize that dream and have devoted many achievements to the science of artificial intelligence.
In artificial intelligence, natural language processing (NLP) is a field which studies how to understand and generate human language automatically. NLP has many practical applications such as machine translation, information extraction, discourse analysis, and text summarization. These applications share the same basic problems, such as lexical analysis, syntactic parsing and semantic analysis, in which syntactic parsing plays the central role; it is also the goal of this thesis.
1.1 What is syntactic parsing?
Syntactic parsing (parsing or syntactic analysis) is the process of analyzing a given sequence of tokens (i.e. a sentence) to identify its grammatical structure with respect to a given grammar. The grammatical structure is often represented in a form which visually displays the dependence of components as a tree (called a parse tree or syntactic tree). In other words, parsing is the problem of taking a given sequence of words as input and producing as output the parse trees corresponding to that sequence.
Figure 1.1 shows examples of parse trees: a) an English parse tree in the usual form and b) a Vietnamese tree in another form.
Parsing is the major module of a grammar checking system. In order to check grammar, we need to parse input sentences, then examine the correctness of the structures in the output. Furthermore, a sentence which cannot be parsed may have grammatical errors.
Figure 1.2: A parse tree in Vietnamese
Parsing is also the important intermediate stage of representation for semantic analysis, and thus plays an important role in applications like machine translation, question answering, and information extraction. For example, in transfer-based machine translation, the system will analyze the source sentence to output a parse tree and then construct the equivalent parse tree in the target language. The output sentence will be generated mainly based on this equivalent parse tree. It is easy to understand that in a question answering system we need parsing to find out which is the subject, object, or action. It is also interesting that parsing can help speech processing: it supports correcting the faults of the speech recognition process. On the other hand, in speech synthesis, parsing helps put stress on the correct position in the sentence.
Through the above examples, we can see that constructing an accurate and effective parser will bring great benefits to many applications of natural language processing.
1.2 Current Studies in Parsing
As one of the basic and central problems of NLP, parsing attracts many studies. They belong to one of two approaches: rule-based and statistics-based.
In conventional parsing systems, a grammar is hand-crafted, often involving a large amount of lexically specific information in the form of sub-categorization information. There, ambiguity, a major problem in parsing, is solved through selectional restrictions. For example, a lexicon might specify that "eat" must take an object with the feature "+food". In (Collins, 1999), the author showed several problems with selectional restrictions, such as the increasing volume of information required when the vocabulary size becomes large. In other words, the biggest challenge is that, with a large vocabulary, both selectional restrictions and structural preferences should be encoded as soft preferences instead of hard constraints.
To overcome these obstacles, researchers began to explore machine-learning approaches to the parsing problem, primarily through statistical models. In these approaches, a set of example pairs of sentences and the corresponding syntactic trees is annotated by hand and used to train parsing models. A set of trees is called a "treebank". Several parts of the treebank are reserved as test data for evaluating the model's accuracy. Early works investigated the use of probabilistic context-free grammar (PCFG). Using PCFG is considered the next generation of parsing and is also a beginning step in statistical parsing. In a PCFG, each grammar rule is associated with a probability. The probability of a parse tree is the product of the probabilities of all rules used in that tree. In this case, parsing is essentially the process of searching for the tree that has the maximum probability. However, a simple PCFG often fails due to its lack of sensitivity to lexical information and structural preferences. Some solutions were then proposed to resolve this problem. Several directions were listed in (Collins, 1999), such as: moving towards probabilistic versions of lexicalized grammars; using supervised training algorithms; constructing models with increased structural sensitivity; and looking into history-based models. Among them, lexicalized probabilistic context-free grammar (LPCFG) is a promising approach. It can solve many ambiguity phenomena in parsing. Some works based on this approach achieved high performance, such as (Collins, 1997). After this research, Daniel M. Bikel and his coworkers developed Collins' models and designed a parser for multiple languages. It has been applied successfully for languages such as English, Chinese and Arabic (Bikel, 2004). The concrete results of the parser for these languages are reported in (Bikel, 2004): for English, the F-measure is 90.01%; for Chinese, the F-measure is 81.2%; and for Arabic, the F-measure is 73.7%. According to these results and the comparison between current parsers (e.g., the Charniak parser, the Berkeley parser, the Stanford parser), Bikel's parser is still rated one of the best parsers at present.
Recently, this approach of using LPCFG has continued to be applied to many languages. Moreover, a number of new strategies have been proposed to improve the accuracy of parsers. In several researches, using semi-supervised training methods has become a promising approach. Their experimental results show that this approach outperforms the supervised one without much additional computational cost. Some other studies have integrated semantic information into parsing in order to fully exploit the benefits of lexical resources and upgrade the parser, such as (Xiong et al., 2005) and (Agirre & Baldwin, 2008). (Xiong et al., 2005) described the way of incorporating semantic knowledge as follows: firstly, they used two Chinese electronic semantics dictionaries and heuristic rules in order to extract semantic categories. Then they built a selection preference sub-model based on the extracted semantic categories. Similarly, in (Agirre & Baldwin, 2008), sense information was added to parsing by substituting the original words with semantic tags which correspond to their semantic classes; for example, knife and scissors belong to the TOOL class, and cake and pork are assigned to the FOOD class. In addition, some other suggested tactics have enhanced the performance of parsing, such as a powerful learning technique (sample selection) for reducing the amount of human-labeled training data (Carreras et al., 2008), or a strategy for utilizing POS tag resources to annotate parser input in (Watson et al., 2007).
Through the review of approaches in parsing, and especially some recent studies, we found that LPCFG appears in all state-of-the-art parsing systems. Therefore, in our opinion, LPCFG is a good choice for Vietnamese parsing.
1.3 Vietnamese syntactic parsing
In Vietnam, works in natural language processing (i.e. computational linguistics) in general and in parsing in particular have only been motivated very recently. A few parsers which follow the knowledge-based approach have been constructed with manual
grammar rules. Since the construction of grammar rules is manual, the accuracy of the parser is not high. It only analyzes a limited number of sentences generated by the grammar. The approach using statistics has also been studied, but only briefly and with no experimental results. For example, (Quoc-The & Thanh-Huong, 2008) presented LPCFG but surprisingly did not provide any experiments; only some examples were provided to illustrate the syntactic ambiguity of Vietnamese. With such restricted results, no Vietnamese parser has been published widely.
It can be said that, while many countries in the world have gone a long way in parsing, Vietnam has just been at the starting stage. The precondition for deploying these models for Vietnamese is a corpus containing parsed sentences, which is the crucial resource for statistical parsing. For lack of such a corpus, the previous works on Vietnamese parsing have not had significant experimental results. Fortunately, at present there is a standard Vietnamese parsed corpus, called Viet Treebank, which was developed in a project supported by the Vietnamese government. This corpus involves about 10,000 parsed sentences following the Penn Treebank format, and therefore we can apply Bikel's tool to it.
As mentioned above, lexicalized models have been applied successfully to multiple languages. Among these languages, Chinese obtained an 81.2% F-score on average, as Bikel showed in (Bikel, 2004). This result also motivated our study, since the syntactic structure of Chinese is similar to the syntactic structure of Vietnamese.
1.4 Objective of the Thesis
This thesis focuses on building a syntactic parser for Vietnamese using the LPCFG approach. We will use Viet Treebank as the parsed corpus and adapt Bikel's parsing tool for Vietnamese.
In summary, in this study we try to carry out the following tasks:
- Study the basic techniques and methods in parsing, focusing on lexicalized statistical approaches;
- Try to build and publish a Vietnamese parsing tool which is useful for many tasks of Vietnamese processing;
- Investigate different parsing models and different linguistic features to discover the best configuration for Vietnamese;
- Analyze grammatical errors from a development test set and find out a solution to improve the accuracy of the parser.
1.5 Thesis structure
The rest of this thesis is organized as follows:
Chapter 2 introduces basic parsing approaches, from classical methods such as the top-down and bottom-up strategies to statistics-based methods like probabilistic context-free grammar (PCFG) and lexicalized probabilistic context-free grammar (LPCFG). In this chapter, we also introduce the important parsing algorithms, including CYK, Earley and Chart parsing.
Chapter 3 presents Vietnamese parsing and our approach. Characteristics of Vietnamese and Viet Treebank will be introduced in comparison with Penn Treebank. Chapter 4 describes our experiments and discussions. After the introduction of the Bikel parsing tool, we will describe the process of applying and developing it for Vietnamese: from adapting the tool for Vietnamese and investigating it in order to find out the best configuration, to finally handling several grammatical errors to reduce the error rate and enhance the parser performance.
Chapter 5 summarizes the obtained results, gives some conclusions of our work, and shows our plan for future work.
Chapter 2
Parsing approaches
In the previous chapter, we introduced the concept of parsing and its role in natural language processing. This chapter will first present context-free grammar, and then present common parsing methods, including two classical parsing strategies (top-down and bottom-up), the CYK algorithm (Cocke-Younger-Kasami), Chart parsing and the Earley algorithm. At the end of this chapter, we will present the Probabilistic Context-Free Grammar and the lexicalized statistical model for parsing.
2.1 Context Free Grammar (CFG)
To analyze syntax for a language, we first need to represent the language in a form which a computer can understand. The most popular formal representation for the grammar of a language is the context-free grammar (invented by Chomsky). A language is defined as a set of strings, where each string is generated from a finite, nonempty set of elements called the alphabet, such as the Vietnamese alphabet for the Vietnamese language.
A context-free grammar (CFG) consists of four components. The grammar is denoted G = (Σ, N, S, R), where:
- Σ ≠ ∅ is a finite set of elements called terminals (the lexicon). The set of terminals is the alphabet (or the words) of the language defined by the grammar.
- N ≠ ∅ is a finite set of non-terminal symbols or variables. Note that N ∩ Σ = ∅. They represent different types of phrases or clauses in the sentence.
- S is one of the non-terminals (S ∈ N), called the start variable (or start symbol), that is used to represent the whole sentence.
- R is a finite set of rules or productions of the grammar. Each rule in R has the form X → α, where X is a non-terminal and α is a sequence of terminals and non-terminals.
A grammar G generates a language L.
In parsing, CFGs or their variations are used to represent the grammar of a language, which is the grammatical basis for constructing methods to solve the parsing problem. The next section presents these methods.
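As an illustration (not from the thesis), the four-component definition above maps directly onto plain data structures; the toy grammar below is a hypothetical fragment, not the thesis's actual grammar:

```python
# A minimal sketch of a CFG G = (Sigma, N, S, R) as plain Python data.
# The toy rules here are illustrative only.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"]],
    "N": [["mèo"], ["chuột"]],   # terminals: "cat", "mouse"
    "V": [["bắt"]],              # terminal: "catch"
}
START = "S"
TERMINALS = {"mèo", "chuột", "bắt"}

def is_terminal(symbol: str) -> bool:
    return symbol in TERMINALS

def generate(symbol: str) -> list[str]:
    """Expand a symbol using the first production each time (one derivation)."""
    if is_terminal(symbol):
        return [symbol]
    first_rule = GRAMMAR[symbol][0]
    out = []
    for child in first_rule:
        out.extend(generate(child))
    return out

print(" ".join(generate(START)))  # one sentence the grammar derives
```

Here the non-terminals N are the dictionary keys, Σ is the terminal set, and each key-value pair encodes the productions R with left-hand side X.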
2.2 Parsing Algorithms
2.2.1 Top-down parsing
Top-down parsing is a strategy of analyzing the given sentence in the following way: beginning with the start symbol, at each step expand one of the remaining non-terminals (from left to right) by replacing it with the right side of one of its productions in the grammar, until the desired string is achieved. In other words, a parse tree is generated by a top-down parser by constructing the tree beginning with the start symbol (the root of the tree) and using rules in the grammar to grow the tree from the root (start symbol) to the leaves (words or lexicons).
In top-down parsing, for rules having the same left-hand side, the rule selection can be simplified based on the size of the right-hand side string (a comparison between the right-hand side strings) or just the order of the symbols in the right-hand side of the rules. In case the top-down analysis is not finished, we backtrack to search for an appropriate rule for the parse construction.
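The expand-and-backtrack strategy described above can be sketched roughly as follows (a simplified illustration with a hypothetical toy grammar, not the thesis's implementation):

```python
# Naive backtracking top-down recognizer (illustrative sketch only;
# it assumes the grammar has no left recursion, otherwise it would loop).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["tôi"]],            # "I"
    "VP": [["hát"]],            # "sing"
}

def parse(symbols, words):
    """Return True if the symbol list can derive exactly `words`."""
    if not symbols:
        return not words                     # both exhausted -> success
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:                  # terminal: must match next word
        return bool(words) and words[0] == head and parse(rest, words[1:])
    # non-terminal: try each production in turn, backtracking on failure
    return any(parse(rhs + rest, words) for rhs in GRAMMAR[head])

print(parse(["S"], ["tôi", "hát"]))  # True
```

The `any(...)` call is the backtracking step: when one production fails to match the input, the next production for the same left-hand side is tried.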
2.2.2 Bottom-up parsing
Contrary to the top-down strategy, bottom-up parsing (also known as shift-reduce parsing) begins with the input sentence, using two main actions (shift and reduce) to reduce the input string back to the start symbol (the root of the parse tree). Using a stack, the words in the input string are pushed onto the stack from left to right (shift), and if the top of the stack contains the right-hand side of a rule, it can be replaced by the left-hand side of that rule (reduce).
Similar to the top-down strategy, in bottom-up parsing, when errors occur or no further analysis is possible, we perform backtracking to continue with a different rule. This process continues until we cannot backtrack anymore; at this point, if the stack has not been reduced to the start symbol, the bottom-up parser cannot analyze the input string.
2.2.3 Comparison between top-down parsing and bottom-up parsing
Both methods have their advantages and disadvantages.
The top-down strategy does not waste time examining trees that do not begin with the start symbol (S); that is, it never visits subtrees without root S. However, this approach has weaknesses. While not wasting time on trees not starting with S, the top-down parser wastes resources on trees that do not match the input string. This weakness is a consequence of generating the parse tree before examining the input string.
Conversely, in bottom-up parsing, although parse trees may not be reducible to the start symbol S, it is always ensured that the generated parse trees agree with the input string.
In short, each strategy has advantages and disadvantages. Thus, by combining the two approaches, we obtain a good method.
2.2.4 CYK algorithm (Cocke-Younger-Kasami)
The CYK algorithm, sometimes known as the CKY algorithm, identifies whether an input string can be generated by a given CFG and, if so, how it can be generated. This algorithm is a form of bottom-up parsing using dynamic programming. The CYK algorithm operates on a CFG in Chomsky Normal Form (CNF). A CFG in CNF is a CFG in which every rule has one of the forms:
R = {A → BC, A → a | A, B, C ∈ N, a ∈ Σ}
- Pseudo-code of the CYK algorithm:
Let the input be a string S consisting of n characters: a1 ... an.
Let the grammar contain r non-terminal symbols R1 ... Rr, with a subset Rs of start symbols.
Let P[n,n,r] be an array of booleans; initialize all elements of P to false.
For each i = 1 to n:
    For each unit production Rj → ai, set P[i,1,j] = true.
For each i = 2 to n:                      (length of span)
    For each j = 1 to n - i + 1:          (start of span)
        For each k = 1 to i - 1:          (partition of span)
            For each production RA → RB RC:
                If P[j,k,B] and P[j+k,i-k,C], then set P[j,i,A] = true.
If any of P[1,n,x] is true (x iterates over the set s, where s are all the indices for Rs),
    then S is a member of the language;
    else S is not a member of the language.
It is easy to see that the time complexity of this algorithm is O(n³ · |G|), where |G| is the size of the grammar.
A table data structure is used to keep track of the parsing process in the CYK algorithm.
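The pseudocode above can be turned into a short runnable recognizer. The following sketch uses a toy CNF grammar of our own (neither the grammar nor the code comes from the thesis):

```python
# CYK recognizer for a toy CNF grammar (illustrative sketch).
# Unit productions map words to non-terminals; BINARY holds A -> B C rules.
UNIT = {"mèo": {"NP"}, "chuột": {"NP"}, "bắt": {"V"}}
BINARY = [("S", "NP", "VP"), ("VP", "V", "NP")]

def cyk(words, start="S"):
    n = len(words)
    # table[i][l] = set of non-terminals deriving words[i : i + l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][1] = set(UNIT.get(w, ()))
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            for k in range(1, length):        # split point
                for a, b, c in BINARY:
                    if b in table[i][k] and c in table[i + k][length - k]:
                        table[i][length].add(a)
    return start in table[0][n]

print(cyk(["mèo", "bắt", "chuột"]))  # True
```

Indexing here is zero-based where the pseudocode is one-based, but the three nested loops (span length, span start, split point) correspond one-to-one.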
The Earley algorithm is a top-down parsing algorithm using a dynamic programming technique.
It carries the typical feature of dynamic programming, that is, to reduce the running time from exponential to polynomial by removing the duplicate solutions generated through backtracking. In this case, the dynamic programming algorithm achieves a running time of O(N³), where N is the length of the input sequence. However, it does not need grammars given in CNF, and so overcomes the main disadvantage of the CKY approach.
The main idea of the Earley algorithm is to travel from left to right and create a chart including N + 1 entries. For each word in the sentence, the chart contains a list of states which represent components of the generated tree. When a sentence is completely parsed, the chart marks the end of the analysis of the input sentence. Each subtree is built only once and can be reused by the parser.
Each separate state is a chart entry including three parameters: a subtree corresponding to a grammar rule, information about the progress of that rule, and the location of the subtree relative to the input string. We place a dot notation (.) on the right side of a grammar rule to describe how much of it (the right-hand side of the rule) has already been parsed. This structure is called the dotted rule. The position of the state is expressed by two factors: the start position of the state and the position of the dot notation.
Using the grammar in section 2.2.4, we have, for example, the following dotted rule:
S → • NP VP [0,0]
The basic principle of the Earley parser is to develop the parse tree through the set of N + 1 state sets in the chart from left to right, processing each state in each set. At each
4V — "chuột
Trang 17step one of three stages described below will be operated to each state of the law In each case, the result was to add a new state based on the current state or the next one
in the chart. The algorithm always develops by adding new information to the chart; states are never canceled and the algorithm cannot return to a previous chart entry. A state S → α • [0,N] in the last chart entry shows that the input string is parsed successfully.
The three main operators of the Earley algorithm are PREDICTOR, COMPLETER and SCANNER. These operators take a state as input and return new states. PREDICTOR and COMPLETER add states to the current chart entry, and SCANNER adds states to the next one.
• Predictor
As its name suggests, the Predictor is responsible for creating new states representing the expectations which occur during the analysis process. The Predictor is applied to any state in which a non-terminal is to the right of the dot notation and that non-terminal is not in the part-of-speech group. The result of this operator is a new state for each possible expansion of that non-terminal in the grammar.
• Scanner
When a state has a part-of-speech category to the right of the dot notation, the Scanner is called to check the input; if the next word matches that category, a new state with the dot notation advanced over it is added to the next chart entry. Here, the Earley parser uses the input like a top-down parser to avoid ambiguity: only terminals (categories) predicted by existing states will be entered into the chart.
• Completer
The Completer operator is applied to a state that already has the dot notation at the end of its rule. It is easy to realize that such a state represents the successful analysis of a constituent. The purpose of this operator is to search the chart for earlier states that were waiting for this constituent at its start position. New states are generated by taking those old states, shifting their dots over the completed category, and adding the new states into the current chart entry.
With the grammar below, we will analyze the sentence "Tôi hát" ("I'm singing" in English) with the Earley algorithm:
In Chart[2], there is the state S → NP VP • [0,2], and the length of the input string equals 2; thus the process of analysis is complete and successful.
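A recognizer following the PREDICTOR/SCANNER/COMPLETER scheme above can be sketched as follows (an illustrative implementation with a hypothetical toy grammar and lexicon, not the thesis's code):

```python
# Compact Earley recognizer (illustrative sketch).
# A state is (lhs, rhs, dot, origin): a dotted rule plus its start position.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["PRO"]],
    "VP": [["V"]],
}
LEXICON = {"tôi": "PRO", "hát": "V"}   # POS categories for the two words

def earley(words, start="S"):
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    for prod in GRAMMAR[start]:
        chart[0].add((start, tuple(prod), 0, 0))
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs) and rhs[dot] in GRAMMAR:        # PREDICTOR
                for prod in GRAMMAR[rhs[dot]]:
                    st = (rhs[dot], tuple(prod), 0, i)
                    if st not in chart[i]:
                        chart[i].add(st); agenda.append(st)
            elif dot < len(rhs):                              # SCANNER
                if i < n and LEXICON.get(words[i]) == rhs[dot]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                             # COMPLETER
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        st = (l2, r2, d2 + 1, o2)
                        if st not in chart[i]:
                            chart[i].add(st); agenda.append(st)
    # accept if a completed start rule spanning [0, n] is in the last chart
    return any(l == start and d == len(r) and o == 0
               for l, r, d, o in chart[n])

print(earley(["tôi", "hát"]))  # True
```

For "Tôi hát", the final chart entry contains the completed state S → NP VP • with origin 0, exactly the acceptance condition described in the text.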
2.3 Probabilistic context-free grammar (PCFGs)
Statistical parsing treats parsing as a problem of machine learning. Through a training process, it aims to construct a probabilistic model, which is then used to produce the best parse tree for a test sentence. In this section we introduce the Probabilistic Context-Free Grammar (PCFG).
In a PCFG, each rule has a probability. The product of the probabilities of the rules used in a tree is the probability of that parse tree, P(T|S). The parser itself is an algorithm which searches for the tree, T_best, that maximizes P(T|S). A generative model uses the observation that maximizing P(T,S) is equivalent to maximizing P(T|S):

T_best = argmax_T P(T|S) = argmax_T P(T,S) / P(S) = argmax_T P(T,S)

Example: for the following PCFG, we are required to compute the probability of the parse tree (Figure 2.1) for the sentence "Mèo bắt chuột" ("Cats catch mice" in English).
Figure 2.1: The parse tree of the Vietnamese sentence "Mèo bắt chuột"
We consider an example, the sentence "Tôi hiểu Lan hơn Nga", in order to see that PCFGs lack sensitivity to lexical information. This sentence can be parsed into two parse trees, as shown in Figure 2.2.
If the two parse trees have the same probability, we must use lexical context information to distinguish them. Thus, adding lexical information to PCFG will bring many benefits.
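To make the product-of-rules definition concrete, here is a small sketch that scores a parse tree; the rule probabilities are made-up numbers for illustration, not trained values from the thesis:

```python
# Toy PCFG: each rule maps to a probability (hypothetical numbers).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("NP", ("mèo",)): 0.3,
    ("NP", ("chuột",)): 0.3,
    ("V", ("bắt",)): 1.0,
}

# A parse tree as nested tuples: (label, child, child, ...).
TREE = ("S", ("NP", "mèo"), ("VP", ("V", "bắt"), ("NP", "chuột")))

def tree_prob(tree):
    """P(T) = product of the probabilities of all rules used in the tree."""
    if isinstance(tree, str):          # a bare word contributes no rule
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

print(tree_prob(TREE))  # 1.0 * 0.3 * 0.6 * 1.0 * 0.3 ≈ 0.054
```

Scoring both trees of an ambiguous sentence this way and keeping the larger value is exactly the argmax search described above.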
2.4 Lexical Probabilistic Context Free Grammar (LPCFGs)
In the previous section, we introduced the probabilistic model for the parsing problem. However, this model still retains some drawbacks. M. Collins proposed a new approach for parsing: the Lexical Probabilistic Context Free Grammar (LPCFG). In 1996, Collins introduced three models under this approach in his paper. In the three models, Collins put into the PCFG a new structure called the head.
Trang 21« "Quyển" - front auxiliary constituent
« "Sách" the central constituent
« "Hay" - behind auxiliary canstituent
‘The word "sch" here is a central constituent af the phrase If we take it away, the phrase will he "quyển hay" that is nonsensical But if we give up ane of the two auxiliary con- stituents, and even temave hath af twa auxiliary components, the phrase is still meaning (quyển sách, sách hay or sách) The central constituent is called the head
Thus, a phrase has its head word, and a sentence also has a head word. The head structure is the basic characteristic of the Lexical Probabilistic Context Free Grammar (LPCFG).
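Head words are typically found by percolating them up the tree with per-label head rules. The sketch below is purely illustrative: the rules are hypothetical, not the head-rule table of Viet Treebank or of Bikel's parser:

```python
# Illustrative head-finding sketch: percolate a head word up the tree
# using simple per-label head rules (hypothetical rules).
HEAD_RULES = {
    "S": "VP",   # the head child of S is its VP
    "VP": "V",   # the head child of VP is its V
    "NP": "N",   # the head child of NP is its N
}

def head_word(tree):
    """tree = (label, children...), with a bare string as the word."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                   # preterminal: the word itself
    wanted = HEAD_RULES[label]
    for child in children:
        if child[0] == wanted:
            return head_word(child)
    return head_word(children[-1])           # fallback: rightmost child

TREE = ("S", ("NP", "mèo"), ("VP", ("V", "bắt"), ("NP", "chuột")))
print(head_word(TREE))  # "bắt"
```

Real head-rule tables are considerably larger (one ordered priority list per label), but the percolation mechanism is the same.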
2.4.2 The concept of Lexical Probabilistic Context Free Grammar (LPCFGs)
In a PCFG, for a tree derived by n applications of context-free rewrite rules LHS_i → RHS_i, 1 ≤ i ≤ n:

P(T|S) = ∏_{i=1..n} P(RHS_i | LHS_i)
Trang 222,4, Lexical Probabilistic Context Free Grammar (LPCEGs) 17
As presented in (Collins, 1997), a PCFG can be lexicalized by associating a word w and a part-of-speech (POS) tag t with each nonterminal X in the tree. A nonterminal can be written as X(x), where x = (w, t) and X is a constituent label. Then each rule has the form:

P(h) → L_n(l_n) ... L_1(l_1) H(h) R_1(r_1) ... R_m(r_m)

where H is the head child of the phrase, which inherits the head word h from its parent P; L_1(l_1) ... L_n(l_n) and R_1(r_1) ... R_m(r_m) are left and right modifiers of H. Figure 2.3 shows a Vietnamese parse tree in LPCFG in which each node is associated with its head word. In this example, the word "giấu" is the head word of the sentence. The translation of this sentence into English is "I hide the book in the bookcase", in which the corresponding translations are "Tôi"/"I", "giấu"/"hides", "quyển sách"/"the book", "vào"/"in", "tủ"/"the bookcase".
Tôi giấu quyển sách trong tủ
Figure 2.3: A parse tree of Vietnamese in LPCFG
In short, an LPCFG is a PCFG in which every non-terminal in a parse tree is lexicalized by associating it with its head word. That also means every nonterminal label in every tree is augmented to include a unique head word (and possibly that head word's part of speech). The head of a nonterminal is determined based on the head of its children. LPCFG partially overcomes the disadvantages of PCFG. Many models following this approach have achieved high performance. Among them, the three models of Collins are well known. They brought new performance benchmarks on parsing the Penn Treebank and served as the basis of important work on parser selection.
Trang 232.4.3 Three models of Collins
Three models, called Model 1, Model 2 and Model 3, were proposed in (Collins, 1997). Model 1 is essentially a generative version of the model described in (Collins, 1996). Model 2 makes the complement/adjunct distinction by adding probabilities over sub-categorization frames for head-words. Model 3 gives a probabilistic treatment of wh-movement. In Model 1, Collins proposes an estimation of a rule based on independence assumptions between the modifiers. In addition, the appearance of each modifier is assumed to depend only on the head and the left-hand side of the rule. For example, the probability of the rule VP(giấu) → V(giấu) NP(quyển sách) PP(trong) is estimated as:
PV VP,’ gidu") « P,LN P(“quyển sách" ;[V P, V, " giấu” ]
+P.(PP(trong")|[VP,V," giá") + Pị(STOP|VP, V," giấu")
+B(STOP|VP, V,"giấu")
More generally, the probability of an entire rule can be expressed as follows:
• Generate the head of the phrase H(h) with probability P_h(H | P, h).
• Generate modifiers to the left of the head with total probability
∏_{i=1..n+1} P_l(L_i(l_i) | P, H, h)
such that L_{n+1}(l_{n+1}) = STOP.
• Generate modifiers to the right of the head with total probability
∏_{i=1..m+1} P_r(R_i(r_i) | P, H, h)
such that R_{m+1}(r_{m+1}) = STOP.
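The head/left/right decomposition with STOP events can be sketched as a simple product of table lookups. Note that all the conditional probabilities below are made-up illustrative numbers, not parameters estimated from a treebank:

```python
# Sketch of a Collins Model 1 style rule probability (illustrative only:
# the probability tables are hypothetical, not trained values).
STOP = "STOP"

P_head = {("V", "VP", "giấu"): 1.0}
P_left = {(STOP, "VP", "V", "giấu"): 1.0}
P_right = {
    ("NP(quyển sách)", "VP", "V", "giấu"): 0.4,
    ("PP(trong)", "VP", "V", "giấu"): 0.3,
    (STOP, "VP", "V", "giấu"): 0.2,
}

def rule_prob(parent, head, head_word, lefts, rights):
    """P(rule) = P_h * prod of P_l(... + STOP) * prod of P_r(... + STOP)."""
    p = P_head[(head, parent, head_word)]
    for mod in lefts + [STOP]:               # left modifiers, then STOP
        p *= P_left[(mod, parent, head, head_word)]
    for mod in rights + [STOP]:              # right modifiers, then STOP
        p *= P_right[(mod, parent, head, head_word)]
    return p

# VP(giấu) -> V(giấu) NP(quyển sách) PP(trong)
p = rule_prob("VP", "V", "giấu", [], ["NP(quyển sách)", "PP(trong)"])
print(p)  # 1.0 * 1.0 * 0.4 * 0.3 * 0.2 ≈ 0.024
```

The STOP events are what let the model decide probabilistically how many modifiers a head takes, instead of memorizing whole right-hand sides.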
It is useful to know additional information about phrases, such as the subject or object status of a noun phrase. Model 2 makes an adjunct/complement distinction while parsing, which helps it be more accurate. Figure 2.4 gives an example; the translation of this example in English is "last week he bought a motorbike", in which the corresponding translations are "tuần trước"/"last week", "nó"/"he", "mua"/"bought", "xe máy"/"a motorbike". As you can see in the tree in Figure 2.4, "nó" and "xe máy" are in subject and object position respectively, thus they are attached with "-C". Meanwhile, "tuần trước" is an adjunct, so it is not attached with "-C".
Figure 2.4: A tree with the "-C" suffix used to identify complements
Model 1 can be trained on treebank data with the enhanced set of non-terminals, and it can learn the lexical properties which distinguish complements and adjuncts. However, it would still suffer from the bad independence assumptions. To solve these kinds of problems, the generative process is extended to include a probabilistic choice of left and right sub-categorization frames. For example, the rule for the tree in Figure 2.4 is S(mua) → NP(tuần trước) NP-C(nó) VP(mua).
In (Collins, 1999), Collins has proposed several ways to automatically identify complements and adjuncts. Fortunately, Viet Treebank has annotated some kinds of complements.
Model 3 aims to overcome the obstacle of identifying predicate-argument structure from parse trees with wh-movement (in English). Noun phrases are most often moved from subject position, object position, or within PPs. NP extraction is handled by adding a gap feature to each non-terminal in the tree, and propagating gaps through the tree until they are finally discharged as a trace complement. In the Penn Treebank, a TRACE is co-indexed with the WHNP head of the SBAR, so it is straightforward to add this information to trees in training data. However, this phenomenon seems not to be similar in Vietnamese, and we will show the effects of this model in the experiments.
Chapter 3
Vietnamese parsing and our approach
Through the previous chapters, we can consider parsing as a problem in machine learning. The training data containing parsed sentences (usually called a treebank) plays an important role for the task; it is the crucial resource for statistical parsing. At present, there are several well-known treebanks in the world, such as the Penn Treebank for English and the Chinese Treebank for Chinese. This kind of corpus for Vietnamese has been under construction. In this chapter, we firstly describe the characteristics of Vietnamese and the characteristics of Viet Treebank. Then, we introduce our proposal to apply and develop lexicalized statistical parsing models for Vietnamese.
Vietnamese writing is monosyllabic in nature. Every "syllable" is written as though it were a separate dictation-unit with a space before and after. In other words, the smallest unit in the construction of words is the syllable. Words can be single (identified by one syllable) or compound (combined from two or more syllables). Thus, different from English, besides POS tagging, chunking, and syntactic tagging, Vietnamese has one more annotation level, that is, word segmentation.
In terms of typology, Vietnamese is an isolating language: words have no inflection, and word formation is a combination of isolated syllables. This feature dominates the other grammatical features. All syntactic aspects are represented by grammatical particles. For example: "mua" - "buy"; "đã mua" - "bought"; "sẽ mua" - "will buy".
On the other hand, when combining Vietnamese words into structures such as phrases and sentences, word order and particles are very important. The arrangement of words in a certain order is the key to indicating syntactic relationships. In Vietnamese, saying "Anh ta lại đến" is different from "Lại đến anh ta". When words of the same type associate together in a principal-accessory relation, the front word keeps the main role and the behind word plays the secondary role. By the order of the words, "con gà" is different from "gà con", and "tình cảm" is different from "cảm tình". The order in which subjects stand before predicates is common in Vietnamese sentence structure. Furthermore, the use of particles is the key grammatical method in Vietnamese. Due to particles, the phrase "anh của em" is different from "anh và em" and "anh vì em". Apart from word order and particles, Vietnamese also uses the intonation method. Intonation expresses the syntactic relations of the elements in sentences. In writing, intonation is usually indicated by punctuation. We can compare the following two sentences to see the difference in content: "Đêm hôm qua, cầu gãy" - "Đêm hôm, qua cầu gãy".
Through the number of distinctive features that we have just mentioned above, we can visualize somewhat the character and potential of Vietnamese. And as claimed in (Phuong-Thai & Xuan-Luong, 2009), using a constituency representation of syntactic structures is suitable for Vietnamese. Thus, this representation is also used for building Viet Treebank (Phuong-Thai & Xuan-Luong, 2009).
3.2 Penn Treebank
A treebank is a corpus in which each sentence has a grammatical structure in the form of a parse tree. A treebank is often constructed based on a labeled corpus; sometimes information about the language or semantics is also added to the syntactic structures to improve the quality of the treebank. The process of constructing a treebank can be implemented by hand or semi-automatically with a parser; after the analysis finishes, the parse tree which has just been obtained needs to be checked and occasionally completed. This work may be extended annually. The Penn Treebank was developed by the University of Pennsylvania, containing approximately 4.5 million words of American English. In the three years from 1989 to 1992, POS tagging for the sentences was carried out. This corpus can be found
on the website http://www.ldc.upenn.edu/. The next section presents some types of labels in the Penn Treebank and the task of arranging components together to get a syntax tree.