Lexicalized statistical parsing for Vietnamese



Pham Thi Minh Thu

Faculty of Information Technology, Hanoi University of Engineering and Technology

Vietnam National University, Hanoi

Supervised by Doctor Le Anh Cuong

A thesis submitted in fulfillment of the requirements for the degree of

Master of Computer Science

June, 2010


Table of Contents

1 Introduction
1.1 What is syntactic parsing?
1.2 Current Studies in Parsing
1.3 Vietnamese syntactic parsing
1.4 Objective of the Thesis
1.5 Thesis structure
2 Parsing approaches
2.1 Context Free Grammar (CFG)
2.2 Parsing Algorithms
2.2.1 Top-down parsing
2.2.2 Bottom-up parsing
2.2.3 Comparison between top-down parsing and bottom-up parsing
2.2.4 CYK algorithm (Cocke-Younger-Kasami)
2.2.5 Earley algorithm
2.3 Probabilistic context-free grammar (PCFG)
2.3.1 The concept of PCFG
2.3.2 Disadvantages of PCFGs
2.4 Lexical Probabilistic Context Free Grammar (LPCFGs)
2.4.1 Head structure
2.4.2 The concept of Lexical Probabilistic Context Free Grammar (LPCFGs)
2.4.3 Three models of Collins
3 Vietnamese parsing and our approach
3.1 Vietnamese characteristics
3.2 Penn Treebank
3.2.1 POS tagging
3.2.2 Bracketing
3.3 Viet Treebank
3.3.1 Objectives
3.3.2 The POS tagset and syntax tagset for Vietnamese
3.4 Our approach in building a Vietnamese parser
3.4.1 Adapting Bikel's parser for Vietnamese
3.4.2 Error analysis and proposed heuristic rules
4 Experiments and Discussion
4.1 Data
4.2 Bikel's parsing tool
4.3 Adapting Bikel's tool to Vietnamese
4.3.1 Investigating different configurations
4.3.2 Training
4.3.3 Parsing
4.3.4 Evaluation of the parser
4.3.5 Results
4.4 Experimental results on using heuristic rules
5 Conclusions and Future Work
5.1 Summary
5.2 Contribution
5.3 Future work

List of Figures

1.1 The parse tree of the sentence "I go to school"
1.2 A parse tree in Vietnamese
2.1 The parse tree of the Vietnamese sentence "mèo bắt chuột"
2.2 Two derivations of the sentence "Tôi hiểu Lan hơn Nga"
2.3 A parse tree of Vietnamese in LPCFG
2.4 A tree with the "-C" suffix used to identify complements
3.1 Set of tags in the Penn Treebank
3.2 A sample of labeled data in the Penn Treebank before manual treatment
3.3 A sample of labeled data in the Penn Treebank after manual treatment
3.4 Tagset of the Penn Treebank
3.5 A sample of complete data in English and Vietnamese
4.1 Overview of Bikel's system
4.2 Results of testing the standard Collins Model 2 with the training-data size varying from 60% to 100% of the full data, where series 1 and series 2 stand for testing on sentences of length ≤ 40 and ≤ 100, respectively


List of Tables

2.1 Analysis table for the CYK algorithm
3.1 POS tagset in the Viet Treebank
3.2 Phrase tagset in the Viet Treebank
3.3 Clause tagset in the Viet Treebank
3.4 Syntactic function tagset in the Viet Treebank
4.1 Initial results on the Viet Treebank with different configurations. Key: CB = average crossing brackets, 0CB = zero crossing brackets, ≤2CB = at most 2 crossing brackets. All results are percentages, except those in the CB column
4.2 Number of sentences for training
4.3 Results with changes to the size of the training data set
4.4 The error rate. We use 520 sentences for development testing, then filter out the sentences whose F-score is below 70%. As a result, we collect 147 sentences into the set of error sentences. The percentage of an error is the number of sentences committing that error divided by 147. Because a sentence may contain several errors, the total percentage may exceed 100
4.5 Results obtained after applying the proposed rules to correct some wrong syntactic parses


Chapter 1

Introduction

For a long time, human beings have dreamed of an intelligent machine that can listen to, understand, and carry out human requests. Many scientists have pursued that dream and contributed many achievements to the science of artificial intelligence.

In artificial intelligence, natural language processing (NLP) is a field that studies how to automatically understand and generate human language. NLP has many practical applications, such as machine translation, information extraction, discourse analysis, and text summarization. These applications share the same basic problems, such as lexical analysis, syntactic parsing, and semantic analysis. Among these, syntactic parsing plays the central role, and it is also the goal of this thesis.

1.1 What is syntactic parsing?

Syntactic parsing (also called parsing or syntactic analysis) is the process of analyzing a given sequence of tokens (i.e., a sentence) to identify its grammatical structure with respect to a given grammar. The grammatical structure is often represented in a form that visually displays the dependence of components as a tree (called a parse tree or syntactic tree). In other words, parsing is the problem of taking a given sequence of words as input and producing the parse trees corresponding to that sequence as output.

Figure 1.1 shows examples of parse trees: (a) an English parse tree in the usual form and (b) a Vietnamese tree in another form.

Parsing is a major module of a grammar checking system. In order to check grammar, we need to parse the input sentences and then examine the correctness of the structures in the output. Furthermore, a sentence which cannot be parsed may have grammatical errors.


Figure 1.1: The parse tree of the sentence "I go to school"

Figure 1.2: A parse tree in Vietnamese

Parsing is also an important intermediate stage of representation for semantic analysis, and thus plays an important role in applications like machine translation, question answering, and information extraction. For example, in transfer-based machine translation the system analyzes the source sentence to produce a parse tree and then constructs the equivalent parse tree in the target language; the output sentence is generated mainly based on this equivalent parse tree. Similarly, in a question answering system we need parsing to find out which constituent is the subject, object, or action. It is also interesting that parsing can help speech processing: it supports correcting faults of the speech recognition process. On the other hand, in speech synthesis, parsing helps put stress on the correct positions in the sentence.



Through the above examples, we can see that constructing an accurate and effective parser will bring great benefits to many applications of natural language processing.

1.2 Current Studies in Parsing

As one of the basic and central problems of NLP, parsing has attracted many studies. They belong to one of two approaches: rule-based and statistics-based.

In conventional parsing systems, a grammar is hand-crafted and often involves a large amount of lexically specific information in the form of subcategorization information. There, ambiguity, a major problem in parsing, is solved through selectional restrictions. For example, a lexicon might specify that "eat" must take an object with the feature +"food". In (Collins, 1999), the author showed several problems with selectional restrictions, such as the growing volume of information required as the vocabulary becomes very large. In other words, the biggest challenge is that with a large vocabulary, both selectional restrictions and structural preferences should be encoded as soft preferences instead of hard constraints.

To overcome these obstacles, researchers began to explore machine-learning approaches to the parsing problem, primarily through statistical models. In these approaches, a set of example pairs of sentences and their corresponding syntactic trees is annotated by hand and used to train parsing models. Such a set of trees is called a "treebank". Several parts of the treebank are reserved as test data for evaluating the model's accuracy. Early works investigated the use of probabilistic context-free grammar (PCFG). Using PCFG is considered the next generation of parsing and also a first step in statistical parsing. In a PCFG, each grammar rule is associated with a probability, and the probability of a parse tree is the product of the probabilities of all rules used in that tree. In this case, parsing is essentially the process of searching for the tree that has the maximum probability. However, a simple PCFG often fails due to its lack of sensitivity to lexical information and structural preferences. Several solutions were then proposed to resolve this problem. Several directions were listed in (Collins, 1999), such as: moving towards probabilistic versions of lexicalized grammars; using supervised training algorithms; constructing models with increased structural sensitivity; and looking into history-based models. Among them, lexicalized probabilistic context-free grammar (LPCFG) is a promising approach. It can solve many ambiguity phenomena in parsing, and some works based on it achieved high performance, such as (Collins, 1997). Following this research, Daniel M. Bikel and his coworkers developed Collins' models and designed a parser for multiple languages. It has been applied successfully to several languages, including English, Chinese, and Arabic (Bikel, 2004). The concrete results of the parser for these languages are reported in (Bikel, 2004): for English, the F-measure is 90.01%; for Chinese, 81.2%; and for Arabic, 75.7%. According to these results and comparisons among current parsers (e.g., the Charniak parser, the Berkeley parser, the Stanford parser), Bikel's parser is still rated one of the best parsers at present.

Recently, the LPCFG approach has continued to be applied to many languages. Moreover, a number of new strategies have been proposed to improve the accuracy of parsers.

In several studies, semi-supervised training methods have become a promising approach. Their experimental results show that this approach outperforms the supervised one, without much additional computational cost. Other studies have integrated semantic information into parsing in order to fully exploit the benefits of lexical resources and upgrade the parser, such as (Xiong et al., 2005) and (Agirre & Baldwin, 2008). (Xiong et al., 2005) described a way of incorporating semantic knowledge as follows: first, they used two Chinese electronic semantic dictionaries and heuristic rules to extract semantic categories; then they built a selectional preference sub-model based on the extracted semantic categories. Similarly, in (Agirre & Baldwin, 2008), sense information was added to parsing by substituting the original words with semantic tags corresponding to their semantic classes; for example, knife and scissors belong to the TOOL class, while cake and pork are assigned to the FOOD class. In addition, other suggested tactics have enhanced parsing performance, such as a powerful learning technique (sample selection) for reducing the amount of human-labeled training data (Carreras et al., 2008), or a strategy for utilizing POS tag resources to annotate parser input (Watson et al., 2007).

Through this review of parsing approaches, and especially of recent studies, we found that LPCFG appears in all state-of-the-art parsing systems. Therefore, in our opinion, LPCFG is a good choice for Vietnamese parsing.

1.3 Vietnamese syntactic parsing

In Vietnam, work in natural language processing (i.e., computational linguistics) in general, and in parsing in particular, has been motivated only very recently. A few parsers following the knowledge-based approach have been constructed with manual grammar rules. Since the construction of grammar rules is manual, the accuracy of these parsers is not high; they analyze only a limited number of sentences generated by the grammar. The statistical approach has also been studied, but only briefly and with no experimental results. For example, (Quoc-The & Thanh-Huong, 2008) presented LPCFG but, surprisingly, provided no experiments; only some examples were given to illustrate the syntactic ambiguity of Vietnamese. With such restricted results, no Vietnamese parser has been published widely.

It can be said that while many countries have come a long way in parsing, Vietnam is just getting started. The precondition for deploying these models for Vietnamese is a corpus of parsed sentences, which is the crucial resource for statistical parsing. Because of this lack of a corpus, previous works on Vietnamese parsing have not produced significant experimental results. Fortunately, there is now a standard Vietnamese parsed corpus, called the Viet Treebank, developed in a project supported by the Vietnamese government. This corpus contains about 10,000 parsed sentences following the Penn Treebank format, and therefore we can apply it to Bikel's tool.

As mentioned above, lexicalized models have been applied successfully to multiple languages. Among these, Chinese obtained an 81.2% F-score on average, as Bikel showed in (Bikel, 2004). This result also motivated our study, since the syntactic structure of Chinese is similar to that of Vietnamese.

1.4 Objective of the Thesis

This thesis focuses on building a syntactic parser for Vietnamese using the LPCFG approach.

We will use the Viet Treebank as the parsed corpus and adapt Bikel's parsing tool for Vietnamese. We will then investigate some common errors appearing in the parses and propose a solution to improve the accuracy of the parser.

To achieve this objective, this study has to answer the following questions: How can Bikel's system be adapted for Vietnamese? What is the initial result? Which model/configuration is appropriate for Vietnamese? How can the performance of the system be improved based on error analysis?

ques-In summary, in this study we try to carry out the following tasks

- Study the basic techniques and methods in parsing, focusing on lexicalized statisticalapproaches;


- Analyze and adapt Bikel's parser (Bikel, 2004) for Vietnamese; with this aim, we try to build and publish a Vietnamese parsing tool useful for many tasks of Vietnamese processing;

- Investigate different parsing models and different linguistic features to discover the best configuration for Vietnamese;

- Analyze grammatical errors on a development test set and find a solution to improve the accuracy of the parser.

1.5 Thesis structure

The rest of this thesis is organized as follows:

Chapter 2 introduces basic parsing approaches, from classical methods such as the top-down and bottom-up strategies to statistics-based methods like probabilistic context-free grammar (PCFG) and lexicalized probabilistic context-free grammar (LPCFG). In this chapter, we also introduce the important parsing algorithms, including CYK, Earley, and chart parsing.

Chapter 3 presents Vietnamese parsing and our approach. Characteristics of Vietnamese and the Viet Treebank will be introduced in comparison with the Penn Treebank.

Chapter 4 describes our experiments and discussion. After introducing the Bikel parsing tool, we describe the process of applying and developing it for Vietnamese: from adapting the tool and investigating it in order to find the best configuration, to finally handling several grammatical errors to reduce the error rate and enhance the parser's performance.

Chapter 5 summarizes the obtained results, gives some conclusions of our work, and shows our plan for future work.


Chapter 2

Parsing approaches

In the previous chapter, we introduced the concept of parsing and its role in natural language processing. This chapter first presents context-free grammar, and then presents common parsing methods, including the two classical parsing strategies (top-down and bottom-up), the CYK (Cocke-Younger-Kasami) algorithm, chart parsing, and the Earley algorithm. At the end of this chapter, we present probabilistic context-free grammar and the lexicalized statistical model for parsing.

2.1 Context Free Grammar (CFG)

To analyze the syntax of a language, we first need to represent the language in a form that a computer can understand. The most popular formal representation for the grammar of a language is the context-free grammar (introduced by Chomsky). A language is defined as a set of strings, where each string is generated from a finite, nonempty set of elements called the alphabet, such as the Vietnamese alphabet for the Vietnamese language.

A context-free grammar (CFG) is a 4-tuple, denoted G = ⟨T, N, S, R⟩, where:

- T ≠ ∅ is a finite set of terminal symbols (the lexicon). The set of terminals is the alphabet (or the words) of the language defined by the grammar.

- N ≠ ∅ is a finite set of non-terminal symbols (variables), with N ∩ T = ∅. They represent the different types of phrases or clauses in a sentence.

- S ∈ N is the start variable (or start symbol), used to represent the whole sentence.

- R is a finite set of rules (productions) of the grammar. Each rule in R has the form X → α, where X is a non-terminal and α is a sequence of terminals and non-terminals.

A grammar G generates a language L.

In parsing, CFGs or their variants are used to represent the grammar of a language, which is the grammatical basis for constructing methods to solve the parsing problem. The next section presents these methods.
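As a concrete illustration of the four components ⟨T, N, S, R⟩, here is a minimal sketch in Python. The toy grammar is a hypothetical example chosen for illustration, not a grammar from this thesis:

```python
# A CFG as plain data: R maps each non-terminal to its right-hand sides.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"], ["V"]],
    "N":  [["mèo"], ["chuột"]],   # preterminals expand to terminals
    "V":  [["bắt"]],
}
START = "S"                        # S: the start symbol
NONTERMINALS = set(RULES)          # N
TERMINALS = {sym for rhss in RULES.values()
             for rhs in rhss for sym in rhs} - NONTERMINALS  # T

def generate_all(symbol, depth=4):
    """Enumerate the strings derivable from `symbol`, up to a depth bound."""
    if symbol in TERMINALS:
        return [[symbol]]
    if depth == 0:
        return []
    results = []
    for rhs in RULES[symbol]:
        # expand each symbol of the right-hand side and combine the pieces
        partials = [[]]
        for sym in rhs:
            partials = [p + s for p in partials
                        for s in generate_all(sym, depth - 1)]
        results.extend(partials)
    return results

sentences = {" ".join(s) for s in generate_all(START)}
print(sentences)  # includes 'mèo bắt chuột'
```

The depth bound is only there to keep the enumeration finite; a real grammar with recursion generates an infinite language.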

2.2 Parsing Algorithms

2.2.1 Top-down parsing

Top-down parsing is a strategy that analyzes a given sentence as follows: begin with the start symbol, and at each step expand one of the remaining non-terminals (from left to right) by replacing it with the right-hand side of one of its productions in the grammar, until the desired string is achieved. In other words, a top-down parser generates a parse tree by constructing the tree from the start symbol (the root of the tree), using the rules of the grammar to grow the tree from the root down to the leaves (words or lexical items).

In top-down parsing, when several rules share the same left-hand side, the choice among them can be simplified based on the size of the right-hand-side strings (a comparison between the right-hand-side strings) or simply on the order of the symbols in the right-hand sides. If the top-down analysis does not finish, we backtrack to search for another applicable rule for the parse construction.

2.2.2 Bottom-up parsing

Contrary to the top-down strategy, bottom-up parsing (also known as shift-reduce parsing) begins with the input sentence and uses two main actions (shift and reduce) to rewrite the input string back into the start symbol (the root of the parse tree). Using a stack, the words of the input string are pushed onto the stack from left to right (shift), and whenever the top of the stack contains the right-hand side of a rule, it can be replaced by the left-hand side of that rule (reduce).

As in the top-down strategy, when errors occur or no further analysis is possible, we backtrack and try a different rule. This process continues until we cannot backtrack anymore; at that point, if the stack has not been reduced to the start symbol, the bottom-up parser cannot analyze the input string.
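The shift and reduce actions with backtracking can be sketched as a naive recognizer. This is an illustrative sketch with a hypothetical toy grammar, exponential in the worst case, not an efficient parser:

```python
# Naive backtracking shift-reduce recognizer: shift words onto a stack,
# and whenever the top of the stack matches a rule's right-hand side,
# optionally reduce it to the left-hand side.
RULES = [
    ("S",  ("NP", "VP")),
    ("NP", ("N",)),
    ("VP", ("V", "NP")),
    ("N",  ("mèo",)),
    ("N",  ("chuột",)),
    ("V",  ("bắt",)),
]

def recognize(stack, words):
    """True if (stack, remaining words) can be rewritten to just ['S']."""
    if not words and stack == ["S"]:
        return True
    # try every applicable reduce action
    for lhs, rhs in RULES:
        n = len(rhs)
        if tuple(stack[-n:]) == rhs:
            if recognize(stack[:-n] + [lhs], words):
                return True
    # otherwise shift the next word onto the stack
    if words and recognize(stack + [words[0]], words[1:]):
        return True
    return False  # dead end: backtracking happens via the recursion

print(recognize([], "mèo bắt chuột".split()))  # True
```

Each failed recursive call is exactly the backtracking step described above: the parser abandons that choice of reduce (or shift) and tries the next one.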


2.2.3 Comparison between top-down parsing and bottom-up parsing

Both methods have their advantages and disadvantages

The top-down strategy does not waste time examining trees that do not begin with the start symbol S: it never visits sub-trees whose root is not S. However, this approach also has a weakness. While not wasting time on trees not rooted in S, a top-down parser wastes resources on trees that do not match the input string. This weakness is a consequence of generating the parse tree before examining the input string.

Conversely, in bottom-up parsing, although parse trees may be generated that cannot be derived from the start symbol S, the generated parse trees are always guaranteed to agree with the input string.

In short, each strategy has advantages and disadvantages; combining the two approaches yields a good method.

2.2.4 CYK algorithm (Cocke-Younger-Kasami)

The CYK algorithm, sometimes known as the CKY algorithm, determines whether an input string can be generated by a given CFG and, if so, how it can be generated. This algorithm is a form of bottom-up parsing using dynamic programming. The CYK algorithm operates on a CFG in Chomsky Normal Form (CNF), i.e., a CFG in which every rule has one of the forms:

R = {A → BC, A → a | A, B, C ∈ N, a ∈ T}

The pseudo-code of the CYK algorithm is as follows:

Let the input be a string S consisting of n characters: a1 … an.
Let the grammar contain r nonterminal symbols R1 … Rr, including the subset Rs of start symbols.
Let P[n, n, r] be an array of booleans; initialize all elements of P to false.
For each i = 1 to n:
    For each unit production Rj → ai, set P[i, 1, j] = true.
For each i = 2 to n (length of span):
    For each j = 1 to n − i + 1 (start of span):
        For each k = 1 to i − 1 (partition of span):
            For each production RA → RB RC:
                If P[j, k, B] and P[j + k, i − k, C], then set P[j, i, A] = true.
If P[1, n, x] is true for any x such that Rx is in Rs,
    then S is a member of the language;
    else S is not a member of the language.

It is easy to see that the time complexity of this algorithm is O(n³).

A table data structure is used to keep track of the parsing process in the CYK algorithm.
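A direct Python transcription of the pseudocode above, 0-indexed and using sets of non-terminals instead of a boolean array. The tiny CNF grammar is a hypothetical example:

```python
# CYK recognizer for a grammar in Chomsky Normal Form.
# BINARY holds rules A -> B C ; UNIT holds rules A -> terminal.
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}
UNIT = {"mèo": {"NP"}, "chuột": {"NP"}, "bắt": {"V"}}
START = "S"

def cyk(words):
    n = len(words)
    # table[i][l] = non-terminals deriving the span of length l+1 starting at i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][0] = set(UNIT.get(w, ()))
    for length in range(2, n + 1):           # length of span
        for start in range(n - length + 1):  # start of span
            for split in range(1, length):   # partition of span
                for b in table[start][split - 1]:
                    for c in table[start + split][length - split - 1]:
                        table[start][length - 1] |= BINARY.get((b, c), set())
    return START in table[0][n - 1]

print(cyk("mèo bắt chuột".split()))  # True
```

The three nested loops over span length, span start, and partition point mirror the pseudocode exactly and give the O(n³) behavior noted above.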

2.2.5 Earley algorithm

The Earley algorithm is a top-down parsing algorithm using dynamic programming. It carries the typical feature of dynamic programming: reducing the running time from exponential to polynomial by avoiding the re-generation of partial solutions during backtracking. In this case, dynamic programming makes the running time O(N³), where N is the length of the input sequence. Moreover, it does not require the grammar to be in CNF, and so it overcomes the main disadvantage of the CKY approach.

The main idea of the Earley algorithm is to travel through the input from left to right and build a chart of N + 1 entries. For each word position in the sentence, the chart contains a list of states, each representing a partial constituent of the tree being generated. When a sentence is completely parsed, the chart marks the end of the analysis of the input sentence. Each sub-tree is constructed only once and can be reused by the parser.

Each state in a chart entry consists of three parts: a sub-tree corresponding to a grammar rule, information about how far the rule has been completed, and the location of the sub-tree relative to the input string. We write a dot (•) in the right-hand side of a grammar rule to mark how much of that right-hand side has already been parsed; this structure is called a dotted rule. The position of a state is expressed by two indices: the start position of the state and the position of the dot. Using the grammar in Section 2.2.4, we have several examples of dotted rules as follows:


At each step, one of the three operations described below is applied to each state. In each case, the result is to add a new state, based on the current one, to the current chart entry or to the next one. The algorithm always progresses by adding new information to the chart: states are never removed, and the parser never returns to a previous chart entry. A state S → α •, [0, N] in the last chart entry shows that the input string has been parsed successfully.

The three main operators of the Earley algorithm are PREDICTOR, SCANNER, and COMPLETER. These operators take a state as input and produce new states. PREDICTOR and COMPLETER add states to the current chart entry, while SCANNER adds states to the next one.

+ Predictor

As its name suggests, the Predictor is responsible for creating new states, representing the expectations that arise during the analysis. The Predictor is applied to any state whose dot is followed by a non-terminal that is not a part-of-speech category. The result of this operator is one new state for each alternative expansion of that non-terminal in the grammar.

+ Scanner

When a state has a part-of-speech category to the right of the dot, the Scanner is called to check the input and, if the next word matches, to add a corresponding state to the next chart entry with the dot advanced over the predicted category. Here the Earley parser uses its top-down predictions to avoid ambiguity: only terminals predicted by some state are entered into the chart.

+ Completer

The Completer operator is applied to states whose dot has reached the end of the rule; such a state represents a successfully recognized constituent. The purpose of this operator is to find all earlier states that were waiting for this constituent and advance them. A new state is generated by taking each such older state, shifting its dot over the completed constituent, and adding the new state to the current chart entry.

With the grammar below, we will analyze the sentence "Tôi hát" ("I'm singing" in English) with the Earley algorithm:


In Chart[2], there is the state S → NP VP •, [0, 2], and the length of the input string equals 2; thus the process of analysis is complete and successful.
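The three operators can be sketched as a compact Earley recognizer. The toy grammar and lexicon for "Tôi hát" are reconstructed by assumption, since the thesis's grammar figure is not reproduced here:

```python
# Compact Earley recognizer. A state is (lhs, rhs, dot, origin).
GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("N",)],
    "VP": [("V",)],
}
LEXICON = {"Tôi": "N", "hát": "V"}   # word -> part-of-speech category
POS = set(LEXICON.values())

def earley(words, start="S"):
    chart = [set() for _ in range(len(words) + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in GRAMMAR[start]}
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs) and rhs[dot] in GRAMMAR:       # PREDICTOR
                for exp in GRAMMAR[rhs[dot]]:
                    new = (rhs[dot], exp, 0, i)
                    if new not in chart[i]:
                        chart[i].add(new); agenda.append(new)
            elif dot < len(rhs) and rhs[dot] in POS:         # SCANNER
                if i < len(words) and LEXICON.get(words[i]) == rhs[dot]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            elif dot == len(rhs):                            # COMPLETER
                for (l2, r2, d2, o2) in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
    return any(l == start and d == len(r) and o == 0
               for (l, r, d, o) in chart[len(words)])

print(earley(["Tôi", "hát"]))  # True
```

A completed start-symbol state with origin 0 in the last chart entry corresponds exactly to S → NP VP •, [0, 2] in the worked example above.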

2.3 Probabilistic context-free grammar (PCFGs)

2.3.1 The concept of PCFG

As mentioned in Chapter 1, conventional parsing systems used CFGs and followed the rule-based approach, which encountered several obstacles with selectional restrictions. A new approach was then proposed to overcome the disadvantages of CFGs. In this approach, the parsing problem is treated as a problem in machine learning: through a training process, the aim is to construct a probabilistic model, which is then used to produce the best parse tree for a test sentence. In this section we introduce the Probabilistic Context Free Grammar (PCFG).

In a PCFG, each rule has a probability. The probability of a parse tree, P(T|S), is the product of the probabilities of the rules used in that tree. The parser itself is an algorithm which searches for the tree Tbest that maximizes P(T|S). A generative model uses the observation that maximizing P(T, S) is equivalent to maximizing P(T|S):

Tbest = arg max_T P(T|S) = arg max_T P(T, S)/P(S) = arg max_T P(T, S)

Example: for the following PCFG, we are required to compute the probability of the parse tree (Figure 2.1) for the sentence "Mèo bắt chuột" ("Cats catch mice" in English).


Figure 2.1: The parse tree of the Vietnamese sentence "mèo bắt chuột"
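The product of rule probabilities can be sketched in Python. This is a hypothetical toy PCFG for the same sentence; the rule probabilities are illustrative, not those of the thesis's example:

```python
# Probability of a parse tree under a PCFG: the product of the
# probabilities of all rules used in the tree.
RULE_PROB = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("mèo",)):     0.5,
    ("NP", ("chuột",)):   0.5,
    ("VP", ("V", "NP")):  1.0,
    ("V",  ("bắt",)):     1.0,
}

def tree_prob(tree):
    """tree = (label, children); a leaf child is a plain word string."""
    label, children = tree
    # the rule used at this node: label -> child labels / words
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)   # multiply in the subtrees' rule probabilities
    return p

t = ("S", [("NP", ["mèo"]),
           ("VP", [("V", ["bắt"]), ("NP", ["chuột"])])])
print(tree_prob(t))  # 0.25  (= 1.0 * 0.5 * 1.0 * 1.0 * 0.5)
```

A full parser would enumerate (or dynamically program over) all trees for the sentence and return the one maximizing this product, as in the arg max formula above.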

Consider the sentence "Tôi hiểu Lan hơn Nga" as an example of how PCFGs lack sensitivity to lexical information. This sentence can be assigned two parse trees, as shown in Figure 2.2. Since the two derivations have the same probability, we must bring in lexical context information to distinguish them. Thus, adding lexical information to a PCFG brings many benefits.

2.4 Lexical Probabilistic Context Free Grammar (LPCFGs)

In the previous section, we introduced the probabilistic model for the parsing problem. However, this model still has some drawbacks. M. Collins proposed a new approach for parsing: the Lexicalized Probabilistic Context Free Grammar (LPCFG). In (Collins, 1997), Collins introduced three models under this approach. In these three models, Collins added to the PCFG a new structure called the head.


Figure 2.2: Two derivations of the sentence "Tôi hiểu Lan hơn Nga"

2.4.1 Head structure

Suppose we have a phrase such as "quyển sách hay" (a good book). We can split this phrase into three parts as follows:

• "quyển" - the front auxiliary constituent

• "sách" - the central constituent

• "hay" - the rear auxiliary constituent

The word "sách" here is the central constituent of the phrase. If we take it away, the remaining phrase "quyển hay" is nonsensical. But if we drop one of the two auxiliary constituents, or even remove both of them, the phrase is still meaningful ("quyển sách", "sách hay", or "sách"). The central constituent is called the head.

Thus, every phrase has a head word, and a sentence also has a head word. The head structure is the basic characteristic of the Lexicalized Probabilistic Context Free Grammar (LPCFG).

2.4.2 The concept of Lexicalized Probabilistic Context Free Grammar (LPCFGs)

In a PCFG, for a tree derived by n applications of context-free rewrite rules LHSi → RHSi, 1 ≤ i ≤ n,

P(T|S) = Π(i=1..n) P(RHSi | LHSi)


As presented in (Collins, 1997), a PCFG can be lexicalized by associating a word w and a part-of-speech (POS) tag t with each nonterminal X in the tree. A nonterminal can then be written as X(x), where x = (w, t) and X is a constituent label. Each rule has the form:

P(h) → Ln(ln) … L1(l1) H(h) R1(r1) … Rm(rm)

where H is the head-child of the phrase, which shares the head-word h with its parent P, and L1 … Ln and R1 … Rm are the left and right modifiers of H. Figure 2.3 shows a Vietnamese parse tree in LPCFG in which each node is associated with its head word. In this example, the word "giấu" is the head word of the sentence. The English translation of this sentence is "I hide the book in the bookcase", in which the corresponding translations are "Tôi"/"I", "giấu"/"hide", "quyển sách"/"the book", "vào"/"in", "tủ"/"the bookcase".

Figure 2.3: A parse tree of Vietnamese in LPCFG

In short, an LPCFG is a PCFG in which each non-terminal in a parse tree is lexicalized by associating it with its head word. That is, every nonterminal label in every tree is augmented to include a unique head word (and possibly that head word's part of speech). The head of a nonterminal is determined from the head of one of its children. LPCFG partially overcomes the disadvantages of PCFG, and many models following this approach have achieved high performance. Among them, the three models of Collins are well known: they set new performance benchmarks for parsing the Penn Treebank and served as the basis for important later work on parsing.
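Head-word propagation can be sketched as follows. The head-child table and tree encoding here are simplifying assumptions for illustration; real systems such as Bikel's parser use per-language head-finding rules:

```python
# Lexicalize a tree bottom-up: each node inherits the head word of its
# head child, chosen by a toy table mapping a parent label to the child
# label that acts as its head.
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}

def lexicalize(tree):
    """tree = (label, children); a preterminal is (POS, word-string).

    Returns (label, head word, lexicalized children)."""
    label, children = tree
    if isinstance(children, str):            # preterminal: word is its own head
        return (label, children, [])
    lex_children = [lexicalize(c) for c in children]
    # pick the designated head child; fall back to the first child
    head = next((c for c in lex_children if c[0] == HEAD_CHILD.get(label)),
                lex_children[0])
    return (label, head[1], lex_children)

t = ("S", [("NP", [("N", "Tôi")]),
           ("VP", [("V", "giấu"), ("NP", [("N", "sách")])])])
print(lexicalize(t)[1])  # giấu
```

This mirrors the example above: "giấu" is the head of VP (via V), and VP is the head child of S, so "giấu" becomes the head word of the whole sentence.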


In the next section, we will review the three Collins models in the context of Vietnamese parsing.

2.4.3 Three models of Collins

Three models, called Model 1, Model 2, and Model 3, were proposed in (Collins, 1997). Model 1 is essentially a generative version of the model described in (Collins, 1996). Model 2 makes the complement/adjunct distinction by adding probabilities over subcategorization frames for head words. Model 3 gives a probabilistic treatment of wh-movement. In Model 1, Collins proposes estimating a rule's probability under independence assumptions between the modifiers: the appearance of each modifier is assumed to depend only on the head and on the left-hand side of the rule. For example, the probability of the rule VP(giấu) → V(giấu) NP(quyển sách) PP(trong) is estimated as:

Ph(V | VP, giấu) × Pl(STOP | VP, V, giấu) × Pr(NP(quyển sách) | VP, V, giấu) × Pr(PP(trong) | VP, V, giấu) × Pr(STOP | VP, V, giấu)

More generally, the probability of an entire rule can be expressed as follows:

• Generate the head of the phrase H(h) with probability Ph(H(h) | P, h).

• Generate the modifiers to the left of the head with total probability Π(i=1..n+1) Pl(Li(li) | P, H, h), where Ln+1(ln+1) = STOP.

• Generate the modifiers to the right of the head with total probability Π(i=1..m+1) Pr(Ri(ri) | P, H, h), where Rm+1(rm+1) = STOP.



Model 2 distinguishes complements from adjuncts by attaching the suffix "-C" to complement non-terminals. An example in English is "last week he bought a motorbike", in which the corresponding translations are "tuần trước"/"last week", "nó"/"he", "mua"/"bought", "xe máy"/"a motorbike". As can be seen in the tree in figure 2.4, "nó" and "xe máy" are in subject and object position respectively, so they are attached with "-C". Meanwhile, "tuần trước" is an adjunct, so it is not attached with "-C".

Figure 2.4: A tree with the "-C" suffix used to identify complements

Model 1 can be trained on treebank data with the enhanced set of non-terminals, and it can learn the lexical properties which distinguish complements and adjuncts. However, it would still suffer from the bad independence assumptions. To solve these kinds of problems, the generative process is extended to include a probabilistic choice of left and right subcategorization frames. For example, the rule in figure 2.4, S(mua) → NP(tuần trước) NP-C(nó) VP(mua), is estimated as:

Ph(VP | S, "mua") ∗ Plc({NP-C} | S, VP, "mua") ∗ Prc({} | S, VP, "mua")

∗ Pl(NP-C("nó") | S, VP, "mua", {NP-C}) ∗ Pl(NP("tuần trước") | S, VP, "mua", {})

∗ Pl(STOP | S, VP, "mua", {}) ∗ Pr(STOP | S, VP, "mua", {})
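The effect of the subcategorization frame can be sketched on one side of the head: the frame lists the complements still owed, each generated complement removes itself from the frame, and STOP is impossible while the frame is non-empty. This is a deliberate simplification (conditioning on the parent and head word is dropped), and the probability table holds made-up numbers.

```python
# Sketch of Model 2's subcat constraint on one side of the head: modifiers
# whose label ends in "-C" (complements) must exhaust the chosen frame
# before STOP can be generated.

def side_probability(mods, frame, P):
    """mods: modifier sequence nearest-to-head first; frame: required
    complement labels; P: P(modifier | remaining frame), made-up here."""
    remaining = set(frame)
    p = 1.0
    for mod in list(mods) + ["STOP"]:
        if mod == "STOP" and remaining:        # unmet complements: impossible
            return 0.0
        p *= P[(mod, frozenset(remaining))]
        label = mod.split("(")[0]
        if label.endswith("-C"):
            remaining.discard(label)           # complement fills its slot
    return p

# Left side of S(mua) -> NP(tuần trước) NP-C(nó) VP(mua): frame is {NP-C}
P_left = {("NP-C(nó)", frozenset({"NP-C"})): 0.4,
          ("NP(tuần trước)", frozenset()): 0.1,
          ("STOP", frozenset()): 0.6}

p = side_probability(["NP-C(nó)", "NP(tuần trước)"], {"NP-C"}, P_left)
```

Stopping before the complement "NP-C(nó)" has been generated receives probability zero, which is exactly how the frame repairs Model 1's bad independence assumption.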


Note that information about complements is not marked in the English Penn Treebank, so in (Collins, 1999) Collins proposed several ways to automatically identify complements and adjuncts. Fortunately, the Viet Treebank has annotated some kinds of complements. Model 3 aims to overcome an obstacle to identifying predicate-argument structure from parse trees: wh-movement (in English). Noun phrases are most often moved from subject position, object position, or from within PPs. NP extraction is handled by adding a gap feature to each non-terminal in the tree, and propagating gaps through the tree until they are finally discharged as a trace complement. In the Penn Treebank, a TRACE is co-indexed with the WHNP head of the SBAR, so it is straightforward to add this information to trees in the training data. However, this phenomenon does not seem to be similar in Vietnamese, and we will show the effects of this model in the experiments.


Chapter 3

Vietnamese parsing and our approach

Through the previous chapters, we can see that parsing can be considered as a problem in machine learning. The training data containing parsed sentences (usually called a Treebank) plays an important role in this task: it is the crucial resource for statistical parsing. At present there are several well-known Treebanks in the world, such as the Penn Treebank for English and the Chinese Treebank for Chinese. This kind of corpus for Vietnamese is being constructed. In this chapter we first describe the characteristics of Vietnamese and the characteristics of the Viet Treebank. Then, we introduce our proposal to apply and develop lexicalized statistical parsing models for Vietnamese.

3.1 Vietnamese characteristics

Vietnamese writing is monosyllabic in nature. Every "syllable" is written as though it were a separate dictation unit, with a space before and after. In other words, the smallest unit in the construction of words is the syllable. Words can be single (formed by one syllable) or compound (formed by two or more syllables). Thus, unlike English, besides POS tagging, chunking, and syntactic tagging, Vietnamese requires one more annotation level: word segmentation.
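The extra word-segmentation level can be illustrated with a toy greedy longest-match segmenter over a dictionary of multi-syllable words. The dictionary below is invented for the example; segmenters actually used in practice are statistical and far more robust.

```python
# Toy greedy longest-match word segmentation for Vietnamese: group
# space-separated syllables into dictionary words, preferring the longest
# match, and fall back to a single syllable for unknown material.

DICTIONARY = {"tôi", "giấu", "quyển sách", "vào", "tủ"}
MAX_LEN = max(len(w.split()) for w in DICTIONARY)   # in syllables

def segment(sentence):
    syllables = sentence.lower().split()
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(MAX_LEN, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if candidate in DICTIONARY or n == 1:   # unknown: one syllable
                words.append(candidate)
                i += n
                break
    return words
```

On "Tôi giấu quyển sách vào tủ" this yields five words, with "quyển sách" kept together as a single compound word, which is the unit the parser must work with.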

In terms of typology, Vietnamese is an isolating language: words have no inflection, and word formation is a combination of isolated syllables. This feature dominates the other grammatical features, and syntactic aspects are represented by grammatical particles. For example: "mua" → "buy"; "đã mua" → "bought"; "sẽ mua" → "will buy".

On the other hand, when combining Vietnamese words into structures such as phrases and sentences, word order and particles are very important. The arrangement of words in a



certain order is the key to indicating syntactic relationships. In Vietnamese, saying "Anh ta lại đến" is different from "Lại đến anh ta". When words of the same type combine in a principal and accessory relation, the front word keeps the main role and the following word plays the secondary role: by word order, "con gà" is different from "gà con", and "tình cảm" is different from "cảm tình". The order in which subjects stand before predicates is common in Vietnamese sentence structure. Furthermore, the use of particles is a key grammatical method in Vietnamese: due to particles, the phrase "anh của em" is different from "anh và em" and "anh vì em". Apart from word order and particles, Vietnamese also uses intonation. Intonation expresses the syntactic relations of the elements in a sentence; in writing, it is usually indicated by punctuation. Compare the following two sentences to see the difference in content: "Đêm hôm qua, cầu gãy" and "Đêm hôm, qua cầu gãy".

Through the distinctive features we have just mentioned, we can visualize somewhat the character and potential of Vietnamese. As claimed in (Phuong-Thai & Xuan-Luong, 2009), a constituency representation of syntactic structures is suitable for Vietnamese; thus, this representation is also used for building the Viet Treebank (Phuong-Thai & Xuan-Luong, 2009).

3.2 Penn Treebank

A Treebank is a corpus in which each sentence is given a grammatical structure in the form of a parse tree. A Treebank is often constructed based on a labeled corpus; sometimes additional linguistic or semantic information is also added to the syntactic structures to improve the quality of the Treebank. The construction of a Treebank can be carried out by hand or semi-automatically with a parser; after the analysis is finished, the resulting parse trees need to be checked and occasionally completed, and this work may be extended annually. The Penn Treebank was developed at the University of Pennsylvania and contains approximately 4.5 million words of American English. POS tagging of the sentences was carried out in the three years from 1989 to 1992. This corpus can be found on the website http://www.ldc.upenn.edu/. The next section presents some of the label types in the Penn Treebank and the task of arranging components together to obtain a syntax tree.
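Treebank trees are stored as bracketed strings, e.g. "(S (NP (P Tôi)) ...)". The minimal reader below turns such a string into nested (label, children) tuples; the Penn Treebank's actual format has further conventions (traces, functional tags, co-indexation) that a production reader must also handle.

```python
# Minimal reader for bracketed treebank trees. Returns (label, children)
# tuples; leaf words appear as plain strings inside their POS node.

import re

def read_tree(s):
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())       # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                               # consume ")"
        return (label, children)

    return parse()
```

For example, read_tree("(S (NP (P Tôi)) (VP (V giấu) (NP (N sách))))") produces an S node whose two children are the NP and VP constituents, ready for head-finding or counting rule frequencies.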
