Pham Thi Minh Thu
Faculty of Information Technology Hanoi University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Doctor Le Anh Cuong
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
June, 2010
Table of Contents
Acknowledgements
1 Introduction
1.1 What is syntactic parsing?
1.2 Current Studies in Parsing
Earley algorithm
2.3 Probabilistic context-free grammar (PCFGs)
3.2.2 Bracketing
3.3 Viet Treebank
3.3.2 The POS tagset and Syntax tagset for
3.4 Our approach in building a Vietnamese parser
3.4.1 Adapting Bikel's parser for Vietnamese
3.4.2 Analyse errors and propose heuristic rules
4 Experiments and Discussion
4.1 Data
4.2 Bikel's parsing tool
4.3 Adapting Bikel's tool to Vietnamese
4.3.1 Investigate different configurations
4.4 Experimental results on using heuristic rules
5 Conclusions and Future Work
List of Figures
The parse tree of the sentence "I go to school"
A parse tree in Vietnamese
The parse tree of the Vietnamese sentence "Mèo bắt chuột"
Two derivations of the sentence "Tôi hiểu Lan hơn Nga"
A parse tree of Vietnamese in LPCFG
A tree with the "-C" suffix used to identify complements
Set of tags in Penn Treebank
A sample of labeled data in Penn Treebank before manual treatment
A sample of labeled data in Penn Treebank after manual treatment
A sample of complete data in English and Vietnamese
Bikel's system overview
Result of testing standard Collins' model 2 with the training data's size changing
from 60% to 100% of the full data, where series 1 and series 2 stand for
testing on sentences with length less than or equal to 40 and 100 respectively
List of Tables
Analysis table with CYK algorithm
POS tagset in Viet Treebank
Clause tagset in Viet Treebank
Syntax function tagset in Viet Treebank
The initial results on Viet Treebank with different configurations. Key: CB
= average crossing brackets, 0CB = zero crossing brackets, ≤2CB = ≤ 2
crossing brackets. All results are percentages, except for those in the CB
column
Number of sentences for training
The results with the change of the training data size
The error rate. We use 520 sentences for development testing, then filter the
sentences whose F-score is less than 70%. As the result, we collect
147 sentences into the set of error sentences. The percentage of an error
is calculated as the number of sentences committing this error divided by 147.
Because a sentence may contain several errors, the total percentage may exceed
100%
The obtained results after applying some proposed rules to correct some
wrong syntactic parses
Chapter 1
Introduction
For a long time, human beings have always dreamed of an intelligent machine which can listen to, understand and carry out humans' requirements. Many scientists have tried to realize that dream and have devoted many achievements to the science of artificial intelligence.
In artificial intelligence, natural language processing (NLP) is a field which studies how to understand and generate human language automatically. NLP has many practical applications such as machine translation, information extraction, discourse analysis, and text summarization. These applications share the same basic problems, such as lexical analysis, syntactic parsing and semantic analysis, in which syntactic parsing plays the central role; it is also the goal of this thesis.
1.1 What is syntactic parsing?
Syntactic parsing (parsing or syntactic analysis) is the process of analyzing a given sequence of tokens (i.e. a sentence) to identify its grammatical structure with respect to a given grammar. The grammatical structure is often represented in a form which visually displays the dependence of components as a tree (called a parse tree or syntactic tree). In other words, parsing is the problem of taking a given sequence of words as input and producing as output the parse trees corresponding to that sequence.
Figure 1.1 shows examples of parse trees: a) an English parse tree in the usual form and b) a Vietnamese tree in another form.
Parsing is the major module of a grammar checking system. In order to check grammar, we need to parse input sentences, then examine the correctness of the structures in the output. Furthermore, a sentence which cannot be parsed may have grammatical errors.
Figure 1.2: A parse tree in Vietnamese
Parsing is also the important intermediate stage of representation for semantic analysis, and thus plays an important role in applications like machine translation, question answering, and information extraction. For example, in transfer-based machine translation, the system will analyze the source sentence to output a parse tree and then construct the equivalent parse tree in the target language. The output sentence will be generated mainly based on this equivalent parse tree. It is easy to understand that in a question answering system we need parsing to find out which is the subject, object, or action. It is also interesting that parsing can help speech processing: it supports correcting the faults of the speech recognition process. On the other hand, in speech synthesis, parsing helps put stress on the correct position in the sentence.
Through the above examples, we can see that constructing an accurate and effective parser will bring great benefits to many applications of natural language processing.
1.2 Current Studies in Parsing
As one of the basic and central problems of NLP, parsing attracts many studies. They belong to one of two approaches: rule-based and statistics-based.
In conventional parsing systems, a grammar is hand-crafted, often involving a large amount of lexically specific information in the form of sub-categorization information. There, ambiguity, a major problem in parsing, is solved through selectional restrictions. For example, a lexicon might specify that "eat" must take an object with the feature "+food". In (Collins, 1999), the author showed several problems with selectional restrictions, such as the increasing volume of information required when the vocabulary size becomes large. In other words, the biggest challenge is that, with a large vocabulary, both selectional restrictions and structural preferences should be encoded as soft preferences instead of hard constraints.
To overcome these obstacles, researchers began to explore machine-learning approaches to the parsing problem, primarily through statistical models. In these approaches, a set of example pairs of sentences and the corresponding syntactic trees is annotated by hand and used to train parsing models. A set of trees is called a "treebank". Several parts of the treebank are reserved as test data for evaluating the model's accuracy. Early works investigated the use of probabilistic context-free grammar (PCFG). Using PCFG is considered the next generation of parsing and is also a beginning step in statistical parsing. In a PCFG, each grammar rule is associated with a probability. The probability of a parse tree is the product of the probabilities of all rules used in that tree. In this case, parsing is essentially the process of searching for the tree that has the maximum probability. However, a simple PCFG often fails due to its lack of sensitivity to lexical information and structural preferences. Some solutions were then proposed to resolve this problem. Several directions were listed in (Collins, 1999), such as: moving towards probabilistic versions of lexicalized grammars; using supervised training algorithms; constructing models with increased structural sensitivity; and looking into history-based models. Among them, lexicalized probabilistic context-free grammar (LPCFG) is a promising approach. It can solve many ambiguity phenomena in parsing. Some works based on this approach achieved high performance, such as (Collins, 1997). After this research, Daniel M. Bikel and his coworkers developed Collins' models and designed a parser for multiple languages. It has been applied successfully for languages such as English, Chinese and Arabic (Bikel, 2004). The concrete results of the parser for these languages are reported in (Bikel, 2004): for English, the F-measure is 90.01%; for Chinese, the F-measure is 81.2%; and for Arabic, the F-measure is 73.7%. According to these results and the comparison between current parsers (e.g., the Charniak parser, the Berkeley parser, the Stanford parser), Bikel's parser is still rated one of the best parsers at present.
Recently, this approach of using LPCFG has continued to be applied to many languages. Moreover, a number of new strategies have been proposed to improve the accuracy of parsers. In several researches, using semi-supervised training methods has become a promising approach. Their experimental results show that this approach outperforms the supervised one without much additional computational cost. Some other studies have integrated semantic information into parsing in order to fully exploit the benefits of lexical resources and upgrade the parser, such as (Xiong et al., 2005) and (Agirre & Baldwin, 2008). (Xiong et al., 2005) described the way of incorporating semantic knowledge as follows: firstly, they used two Chinese electronic semantics dictionaries and heuristic rules in order to extract semantic categories. Then they built a selection preference sub-model based on the extracted semantic categories. Similarly, in (Agirre & Baldwin, 2008), sense information was added to parsing by substituting the original words with semantic tags which correspond to their semantic classes; for example, knife and scissors belong to the TOOL class, and cake and pork are assigned to the FOOD class. In addition, some other suggested tactics have enhanced the performance of parsing, such as a powerful learning technique (sample selection) for reducing the amount of human-labeled training data (Carreras et al., 2008), or a strategy for utilizing POS tag resources to annotate parser input in (Watson et al., 2007).
Through the review of approaches in parsing, and especially some recent studies, we found that LPCFG appears in all state-of-the-art parsing systems. Therefore, in our opinion, LPCFG is a good choice for Vietnamese parsing.
1.3 Vietnamese syntactic parsing
In Vietnam, works in natural language processing (i.e. computational linguistics) in general and in parsing in particular have only been motivated very recently. A few parsers which follow the knowledge-based approach have been constructed with manual
grammar rules. Since the construction of grammar rules is manual, the accuracy of the parser is not high. It only analyzes a limited number of sentences generated by the grammar. The approach using statistics has also been studied, but only briefly and with no experimental results. For example, (Quoc-The & Thanh-Huong, 2008) presented LPCFG but surprisingly did not provide any experiments; only some examples were provided to illustrate the syntactic ambiguity of Vietnamese. With such restricted results, no Vietnamese parser has been published widely.
It can be said that, while many countries in the world have gone a long way in parsing, Vietnam has just been at the starting stage. The precondition for deploying these models for Vietnamese is a corpus containing parsed sentences, which is the crucial resource for statistical parsing. For lack of such a corpus, the previous works on Vietnamese parsing have not had significant experimental results. Fortunately, at present there is a standard Vietnamese parsed corpus, called Viet Treebank, which was developed in a project supported by the Vietnamese government. This corpus involves about 10,000 parsed sentences following the Penn Treebank format, and therefore we can apply Bikel's tool to it.
As mentioned above, lexicalized models have been applied successfully to multiple languages. Among these languages, Chinese obtained an 81.2% F-score on average, as Bikel showed in (Bikel, 2004). This result also motivated our study, since the syntactic structure of Chinese is similar to the syntactic structure of Vietnamese.
1.4 Objective of the Thesis
This thesis focuses on building a syntactic parser for Vietnamese using the LPCFG approach. We will use Viet Treebank as the parsed corpus and adapt Bikel's parsing tool for Vietnamese.
In summary, in this study we try to carry out the following tasks:
- Study the basic techniques and methods in parsing, focusing on lexicalized statistical approaches;
- Try to build and publish a Vietnamese parsing tool which is useful for many tasks of Vietnamese processing;
- Investigate different parsing models and different linguistic features to discover the best configuration for Vietnamese;
- Analyze grammatical errors from a development test set and find out a solution to improve the accuracy of the parser.
1.5 Thesis structure
The rest of this thesis is organized as follows:
Chapter 2 introduces basic parsing approaches, from classical methods such as the top-down and bottom-up strategies to statistics-based methods like probabilistic context-free grammar (PCFG) and lexicalized probabilistic context-free grammar (LPCFG). In this chapter, we also introduce the important parsing algorithms, including CYK, Earley and Chart parsing.
Chapter 3 presents Vietnamese parsing and our approach. Characteristics of Vietnamese and Viet Treebank will be introduced in comparison with Penn Treebank. Chapter 4 describes our experiments and discussions. After the introduction of the Bikel parsing tool, we will describe the process of applying and developing it for Vietnamese: from adapting the tool for Vietnamese and investigating it in order to find out the best configuration, to finally handling several grammatical errors to reduce the error rate and enhance the parser performance.
Chapter 5 summarizes the obtained results, gives some conclusions of our work, and shows our plan for future work.
Chapter 2
Parsing approaches
In the previous chapter, we introduced the concept of parsing and its role in natural language processing. This chapter will first present context-free grammar, and then present common parsing methods, including two classical parsing strategies (top-down and bottom-up), the CYK algorithm (Cocke-Younger-Kasami), Chart parsing and the Earley algorithm. At the end of this chapter, we will present the Probabilistic Context-Free Grammar and the lexicalized statistical model for parsing.
2.1 Context Free Grammar (CFG)
To analyze syntax for a language, we first need to represent the language in a form which a computer can understand. The most popular formal representation for the grammar of a language is the context-free grammar (invented by Chomsky). A language is defined as a set of strings, where each string is generated from a finite, nonempty set of elements called the alphabet, such as the Vietnamese alphabet for the Vietnamese language.
A context-free grammar (CFG) consists of four components. The grammar is denoted G = (Σ, N, S, R), where:
- Σ ≠ ∅ is a finite set of elements called terminals (the lexicon). The set of terminals is the alphabet (or the words) of the language defined by the grammar.
- N ≠ ∅ is a finite set of non-terminal symbols or variables. Note that N ∩ Σ = ∅. They represent different types of phrases or clauses in the sentence.
- S is one of the non-terminals (S ∈ N), called the start variable (or start symbol), that is used to represent the whole sentence.
- R is a finite set of rules or productions of the grammar. Each rule in R has the form X → α, where X is a non-terminal and α is a sequence of terminals and non-terminals.
A grammar G generates a language L.
In parsing, CFGs or their variations are used to represent the grammar of a language, which is the grammatical basis for constructing methods to solve the parsing problem. The next section presents these methods.
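As an illustration (not from the thesis), the four-component definition above maps directly onto plain data structures; the toy grammar below is a hypothetical fragment, not the thesis's actual grammar:

```python
# A minimal sketch of a CFG G = (Sigma, N, S, R) as plain Python data.
# The toy rules here are illustrative only.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"]],
    "N": [["mèo"], ["chuột"]],   # terminals: "cat", "mouse"
    "V": [["bắt"]],              # terminal: "catch"
}
START = "S"
TERMINALS = {"mèo", "chuột", "bắt"}

def is_terminal(symbol: str) -> bool:
    return symbol in TERMINALS

def generate(symbol: str) -> list[str]:
    """Expand a symbol using the first production each time (one derivation)."""
    if is_terminal(symbol):
        return [symbol]
    first_rule = GRAMMAR[symbol][0]
    out = []
    for child in first_rule:
        out.extend(generate(child))
    return out

print(" ".join(generate(START)))  # one sentence the grammar derives
```

Here the non-terminals N are the dictionary keys, Σ is the terminal set, and each key-value pair encodes the productions R with left-hand side X.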
2.2 Parsing Algorithms
2.2.1 Top-down parsing
Top-down parsing is a strategy of analyzing the given sentence in the following way: beginning with the start symbol, at each step expand one of the remaining non-terminals (from left to right) by replacing it with the right side of one of its productions in the grammar, until the desired string is achieved. In other words, a parse tree is generated by a top-down parser by constructing the tree beginning with the start symbol (the root of the tree) and using rules in the grammar to grow the tree from the root (start symbol) to the leaves (words or lexicons).
In top-down parsing, for rules having the same left-hand side, the rule selection can be simplified based on the size of the right-hand side string (a comparison between the right-hand side strings) or just the order of the symbols in the right-hand side of the rules. In case the top-down analysis is not finished, we backtrack to search for an appropriate rule for the parse construction.
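The expand-and-backtrack strategy described above can be sketched roughly as follows (a simplified illustration with a hypothetical toy grammar, not the thesis's implementation):

```python
# Naive backtracking top-down recognizer (illustrative sketch only;
# it assumes the grammar has no left recursion, otherwise it would loop).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["tôi"]],            # "I"
    "VP": [["hát"]],            # "sing"
}

def parse(symbols, words):
    """Return True if the symbol list can derive exactly `words`."""
    if not symbols:
        return not words                     # both exhausted -> success
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:                  # terminal: must match next word
        return bool(words) and words[0] == head and parse(rest, words[1:])
    # non-terminal: try each production in turn, backtracking on failure
    return any(parse(rhs + rest, words) for rhs in GRAMMAR[head])

print(parse(["S"], ["tôi", "hát"]))  # True
```

The `any(...)` call is the backtracking step: when one production fails to match the input, the next production for the same left-hand side is tried.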
2.2.2 Bottom-up parsing
Contrary to the top-down strategy, bottom-up parsing (also known as shift-reduce parsing) begins with the input sentence, using two main actions (shift and reduce) to reduce the input string back to the start symbol (the root of the parse tree). Using a stack, the words in the input string are pushed onto the stack from left to right (shift), and if the top of the stack contains the right-hand side of a rule, it can be replaced by the left-hand side of that rule (reduce).
Similar to the top-down strategy, in bottom-up parsing, when errors occur or no further analysis is possible, we perform backtracking to continue with a different rule. This process continues until we cannot backtrack anymore; at this point, if the stack has not been reduced to the start symbol, the bottom-up parser cannot analyze the input string.
2.2.3 Comparison between top-down parsing and bottom-up parsing
Both methods have their advantages and disadvantages.
The top-down strategy does not waste time examining trees that do not begin with the start symbol (S); that is, it never visits subtrees without root S. However, this approach has weaknesses. While not wasting time on trees not starting with S, the top-down parser wastes resources on trees that do not match the input string. This weakness is a consequence of generating the parse tree before examining the input string.
Conversely, in bottom-up parsing, although parse trees may not be reducible to the start symbol S, it is always ensured that the generated parse trees agree with the input string.
In short, each strategy has advantages and disadvantages. Thus, by combining the two approaches, we obtain a good method.
2.2.4 CYK algorithm (Cocke-Younger-Kasami)
The CYK algorithm, sometimes known as the CKY algorithm, identifies whether an input string can be generated by a given CFG and, if so, how it can be generated. This algorithm is a form of bottom-up parsing using dynamic programming. The CYK algorithm operates on a CFG in Chomsky Normal Form (CNF). A CFG in CNF is a CFG in which every rule has one of the forms:
R = {A → BC, A → a | A, B, C ∈ N, a ∈ Σ}
- Pseudo-code of the CYK algorithm:
Let the input be a string S consisting of n characters: a1 ... an.
Let the grammar contain r non-terminal symbols R1 ... Rr, with a subset Rs of start symbols.
Let P[n,n,r] be an array of booleans; initialize all elements of P to false.
For each i = 1 to n:
    For each unit production Rj → ai, set P[i,1,j] = true.
For each i = 2 to n:                      (length of span)
    For each j = 1 to n - i + 1:          (start of span)
        For each k = 1 to i - 1:          (partition of span)
            For each production RA → RB RC:
                If P[j,k,B] and P[j+k,i-k,C], then set P[j,i,A] = true.
If any of P[1,n,x] is true (x iterates over the set s, where s are all the indices for Rs),
    then S is a member of the language;
    else S is not a member of the language.
It is easy to see that the time complexity of this algorithm is O(n³ · |G|), where |G| is the size of the grammar.
A table data structure is used to keep track of the parsing process in the CYK algorithm.
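The pseudocode above can be turned into a short runnable recognizer. The following sketch uses a toy CNF grammar of our own (neither the grammar nor the code comes from the thesis):

```python
# CYK recognizer for a toy CNF grammar (illustrative sketch).
# Unit productions map words to non-terminals; BINARY holds A -> B C rules.
UNIT = {"mèo": {"NP"}, "chuột": {"NP"}, "bắt": {"V"}}
BINARY = [("S", "NP", "VP"), ("VP", "V", "NP")]

def cyk(words, start="S"):
    n = len(words)
    # table[i][l] = set of non-terminals deriving words[i : i + l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][1] = set(UNIT.get(w, ()))
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            for k in range(1, length):        # split point
                for a, b, c in BINARY:
                    if b in table[i][k] and c in table[i + k][length - k]:
                        table[i][length].add(a)
    return start in table[0][n]

print(cyk(["mèo", "bắt", "chuột"]))  # True
```

Indexing here is zero-based where the pseudocode is one-based, but the three nested loops (span length, span start, split point) correspond one-to-one.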
The Earley algorithm is a top-down parsing algorithm using a dynamic programming technique.
It carries the typical feature of dynamic programming, that is, to reduce the running time from exponential to polynomial by removing the duplicate solutions generated through backtracking. In this case, the dynamic programming algorithm achieves a running time of O(N³), where N is the length of the input sequence. However, it does not need grammars given in CNF, and so overcomes the main disadvantage of the CKY approach.
The main idea of the Earley algorithm is to travel from left to right and create a chart including N + 1 entries. For each word in the sentence, the chart contains a list of states which represent components of the generated tree. When a sentence is completely parsed, the chart marks the end of the analysis of the input sentence. Each subtree is built only once and can be reused by the parser.
Each separate state is a chart entry including three parameters: a subtree corresponding to a grammar rule, information about the progress of that rule, and the location of the subtree relative to the input string. We place a dot notation (.) on the right side of a grammar rule to describe how much of it (the right-hand side of the rule) has already been parsed. This structure is called the dotted rule. The position of the state is expressed by two factors: the start position of the state and the position of the dot notation.
Using the grammar in section 2.2.4, we have, for example, the following dotted rule:
S → • NP VP [0,0]
The basic principle of the Earley parser is to develop the parse tree through the set of N + 1 state sets in the chart from left to right, processing each state in each set. At each
4V — "chuột
Trang 17step one of three stages described below will be operated to each state of the law In each case, the result was to add a new state based on the current state or the next one
in the chart. The algorithm always develops by adding new information to the chart; states are never canceled and the algorithm cannot return to a previous chart entry. A state S → α • [0,N] in the last chart entry shows that the input string is parsed successfully.
The three main operators of the Earley algorithm are PREDICTOR, COMPLETER and SCANNER. These operators take a state as input and return new states. PREDICTOR and COMPLETER add states to the current chart entry, and SCANNER adds states to the next one.
• Predictor
As its name suggests, the Predictor is responsible for creating new states representing the expectations which occur during the analysis process. The Predictor is applied to any state in which a non-terminal is to the right of the dot notation and that non-terminal is not in the part-of-speech group. The result of this operator is a new state for each possible expansion of that non-terminal in the grammar.
• Scanner
When a state has a part-of-speech category to the right of the dot notation, the Scanner is called to check the input; if the next word matches that category, a new state with the dot notation advanced over it is added to the next chart entry. Here, the Earley parser uses the input like a top-down parser to avoid ambiguity: only terminals (categories) predicted by existing states will be entered into the chart.
• Completer
The Completer operator is applied to a state that already has the dot notation at the end of its rule. It is easy to realize that such a state represents the successful analysis of a constituent. The purpose of this operator is to search the chart for earlier states that were waiting for this constituent at its start position. New states are generated by taking those old states, shifting their dots over the completed category, and adding the new states into the current chart entry.
With the grammar below, we will analyze the sentence "Tôi hát" ("I'm singing" in English) with the Earley algorithm:
In Chart[2], there is the state S → NP VP • [0,2], and the length of the input string equals 2; thus the process of analysis is complete and successful.
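A recognizer following the PREDICTOR/SCANNER/COMPLETER scheme above can be sketched as follows (an illustrative implementation with a hypothetical toy grammar and lexicon, not the thesis's code):

```python
# Compact Earley recognizer (illustrative sketch).
# A state is (lhs, rhs, dot, origin): a dotted rule plus its start position.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["PRO"]],
    "VP": [["V"]],
}
LEXICON = {"tôi": "PRO", "hát": "V"}   # POS categories for the two words

def earley(words, start="S"):
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    for prod in GRAMMAR[start]:
        chart[0].add((start, tuple(prod), 0, 0))
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs) and rhs[dot] in GRAMMAR:        # PREDICTOR
                for prod in GRAMMAR[rhs[dot]]:
                    st = (rhs[dot], tuple(prod), 0, i)
                    if st not in chart[i]:
                        chart[i].add(st); agenda.append(st)
            elif dot < len(rhs):                              # SCANNER
                if i < n and LEXICON.get(words[i]) == rhs[dot]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                             # COMPLETER
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        st = (l2, r2, d2 + 1, o2)
                        if st not in chart[i]:
                            chart[i].add(st); agenda.append(st)
    # accept if a completed start rule spanning [0, n] is in the last chart
    return any(l == start and d == len(r) and o == 0
               for l, r, d, o in chart[n])

print(earley(["tôi", "hát"]))  # True
```

For "Tôi hát", the final chart entry contains the completed state S → NP VP • with origin 0, exactly the acceptance condition described in the text.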
2.3 Probabilistic context-free grammar (PCFGs)
Statistical parsing treats parsing as a problem of machine learning. Through a training process, it aims to construct a probabilistic model, which is then used to produce the best parse tree for a test sentence. In this section we introduce the Probabilistic Context-Free Grammar (PCFG).
In a PCFG, each rule has a probability. The product of the probabilities of the rules used in a tree is the probability of that parse tree, P(T|S). The parser itself is an algorithm which searches for the tree, T_best, that maximizes P(T|S). A generative model uses the observation that maximizing P(T,S) is equivalent to maximizing P(T|S):

T_best = argmax_T P(T|S) = argmax_T P(T,S) / P(S) = argmax_T P(T,S)

Example: for the following PCFG, we are required to compute the probability of the parse tree (Figure 2.1) for the sentence "Mèo bắt chuột" ("Cats catch mice" in English).
Figure 2.1: The parse tree of the Vietnamese sentence "Mèo bắt chuột"
We consider an example, the sentence "Tôi hiểu Lan hơn Nga", in order to see that PCFGs lack sensitivity to lexical information. This sentence can be parsed into two parse trees, as shown in Figure 2.2.
If the two parse trees have the same probability, we must use lexical context information to distinguish them. Thus, adding lexical information to PCFG will bring many benefits.
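To make the product-of-rules definition concrete, here is a small sketch that scores a parse tree; the rule probabilities are made-up numbers for illustration, not trained values from the thesis:

```python
# Toy PCFG: each rule maps to a probability (hypothetical numbers).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("NP", ("mèo",)): 0.3,
    ("NP", ("chuột",)): 0.3,
    ("V", ("bắt",)): 1.0,
}

# A parse tree as nested tuples: (label, child, child, ...).
TREE = ("S", ("NP", "mèo"), ("VP", ("V", "bắt"), ("NP", "chuột")))

def tree_prob(tree):
    """P(T) = product of the probabilities of all rules used in the tree."""
    if isinstance(tree, str):          # a bare word contributes no rule
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

print(tree_prob(TREE))  # 1.0 * 0.3 * 0.6 * 1.0 * 0.3 ≈ 0.054
```

Scoring both trees of an ambiguous sentence this way and keeping the larger value is exactly the argmax search described above.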
2.4 Lexical Probabilistic Context Free Grammar (LPCFGs)
In the previous section, we introduced the probabilistic model for the parsing problem. However, this model still retains some drawbacks. M. Collins proposed a new approach for parsing: the Lexical Probabilistic Context Free Grammar (LPCFG). In 1996, Collins introduced three models under this approach in his paper. In the three models, Collins put into the PCFG a new structure called the head.
Trang 21« "Quyển" - front auxiliary constituent
« "Sách" the central constituent
« "Hay" - behind auxiliary canstituent
‘The word "sch" here is a central constituent af the phrase If we take it away, the phrase will he "quyển hay" that is nonsensical But if we give up ane of the two auxiliary con- stituents, and even temave hath af twa auxiliary components, the phrase is still meaning (quyển sách, sách hay or sách) The central constituent is called the head
Thus, a phrase has its head word, and a sentence also has a head word. The head structure is the basic characteristic of the Lexical Probabilistic Context Free Grammar (LPCFG).
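Head words are typically found by percolating them up the tree with per-label head rules. The sketch below is purely illustrative: the rules are hypothetical, not the head-rule table of Viet Treebank or of Bikel's parser:

```python
# Illustrative head-finding sketch: percolate a head word up the tree
# using simple per-label head rules (hypothetical rules).
HEAD_RULES = {
    "S": "VP",   # the head child of S is its VP
    "VP": "V",   # the head child of VP is its V
    "NP": "N",   # the head child of NP is its N
}

def head_word(tree):
    """tree = (label, children...), with a bare string as the word."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                   # preterminal: the word itself
    wanted = HEAD_RULES[label]
    for child in children:
        if child[0] == wanted:
            return head_word(child)
    return head_word(children[-1])           # fallback: rightmost child

TREE = ("S", ("NP", "mèo"), ("VP", ("V", "bắt"), ("NP", "chuột")))
print(head_word(TREE))  # "bắt"
```

Real head-rule tables are considerably larger (one ordered priority list per label), but the percolation mechanism is the same.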
2.4.2 The concept of Lexical Probabilistic Context Free Grammar (LPCFGs)
In a PCFG, for a tree derived by n applications of context-free rewrite rules LHS_i → RHS_i, 1 ≤ i ≤ n:

P(T|S) = ∏_{i=1..n} P(RHS_i | LHS_i)
Trang 222,4, Lexical Probabilistic Context Free Grammar (LPCEGs) 17
As presented in (Collins, 1997), a PCFG can be lexicalized by associating a word w and a part-of-speech (POS) tag t with each nonterminal X in the tree. A nonterminal can be written as X(x), where x = (w, t) and X is a constituent label. Then each rule has the form:

P(h) → L_n(l_n) ... L_1(l_1) H(h) R_1(r_1) ... R_m(r_m)

where H is the head child of the phrase, which inherits the head word h from its parent P; L_1(l_1) ... L_n(l_n) and R_1(r_1) ... R_m(r_m) are left and right modifiers of H. Figure 2.3 shows a Vietnamese parse tree in LPCFG in which each node is associated with its head word. In this example, the word "giấu" is the head word of the sentence. The translation of this sentence into English is "I hide the book in the bookcase", in which the corresponding translations are "Tôi"/"I", "giấu"/"hides", "quyển sách"/"the book", "vào"/"in", "tủ"/"the bookcase".
Tôi giấu quyển sách trong tủ
Figure 2.3: A parse tree of Vietnamese in LPCFG
In short, an LPCFG is a PCFG in which every non-terminal in a parse tree is lexicalized by associating it with its head word. That also means every nonterminal label in every tree is augmented to include a unique head word (and possibly that head word's part of speech). The head of a nonterminal is determined based on the head of its children. LPCFG partially overcomes the disadvantages of PCFG. Many models following this approach have achieved high performance. Among them, the three models of Collins are well known. They brought new performance benchmarks on parsing the Penn Treebank and served as the basis of important work on parser selection.
Trang 232.4.3 Three models of Collins
Three models, called Model 1, Model 2 and Model 3, were proposed in (Collins, 1997). Model 1 is essentially a generative version of the model described in (Collins, 1996). Model 2 makes the complement/adjunct distinction by adding probabilities over sub-categorization frames for head-words. Model 3 gives a probabilistic treatment of wh-movement. In Model 1, Collins proposes an estimation of a rule based on independence assumptions between the modifiers. In addition, the appearance of each modifier is assumed to depend only on the head and the left-hand side of the rule. For example, the probability of the rule VP(giấu) → V(giấu) NP(quyển sách) PP(trong) is estimated as:
PV VP,’ gidu") « P,LN P(“quyển sách" ;[V P, V, " giấu” ]
+P.(PP(trong")|[VP,V," giá") + Pị(STOP|VP, V," giấu")
+B(STOP|VP, V,"giấu")
More generally, the probability of an entire rule can be expressed as follows:
• Generate the head of the phrase H(h) with probability P_h(H | P, h).
• Generate modifiers to the left of the head with total probability
∏_{i=1..n+1} P_l(L_i(l_i) | P, H, h)
such that L_{n+1}(l_{n+1}) = STOP.
• Generate modifiers to the right of the head with total probability
∏_{i=1..m+1} P_r(R_i(r_i) | P, H, h)
such that R_{m+1}(r_{m+1}) = STOP.
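The head/left/right decomposition with STOP events can be sketched as a simple product of table lookups. Note that all the conditional probabilities below are made-up illustrative numbers, not parameters estimated from a treebank:

```python
# Sketch of a Collins Model 1 style rule probability (illustrative only:
# the probability tables are hypothetical, not trained values).
STOP = "STOP"

P_head = {("V", "VP", "giấu"): 1.0}
P_left = {(STOP, "VP", "V", "giấu"): 1.0}
P_right = {
    ("NP(quyển sách)", "VP", "V", "giấu"): 0.4,
    ("PP(trong)", "VP", "V", "giấu"): 0.3,
    (STOP, "VP", "V", "giấu"): 0.2,
}

def rule_prob(parent, head, head_word, lefts, rights):
    """P(rule) = P_h * prod of P_l(... + STOP) * prod of P_r(... + STOP)."""
    p = P_head[(head, parent, head_word)]
    for mod in lefts + [STOP]:               # left modifiers, then STOP
        p *= P_left[(mod, parent, head, head_word)]
    for mod in rights + [STOP]:              # right modifiers, then STOP
        p *= P_right[(mod, parent, head, head_word)]
    return p

# VP(giấu) -> V(giấu) NP(quyển sách) PP(trong)
p = rule_prob("VP", "V", "giấu", [], ["NP(quyển sách)", "PP(trong)"])
print(p)  # 1.0 * 1.0 * 0.4 * 0.3 * 0.2 ≈ 0.024
```

The STOP events are what let the model decide probabilistically how many modifiers a head takes, instead of memorizing whole right-hand sides.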
It is useful to know additional information about phrases, such as the subject or object status of a noun phrase. Model 2 makes an adjunct/complement distinction while parsing, which helps it be more accurate. Figure 2.4 gives an example; the translation of this example in English is "last week he bought a motorbike", in which the corresponding translations are "tuần trước"/"last week", "nó"/"he", "mua"/"bought", "xe máy"/"a motorbike". As you can see in the tree in Figure 2.4, "nó" and "xe máy" are in subject and object position respectively, thus they are attached with "-C". Meanwhile, "tuần trước" is an adjunct, so it is not attached with "-C".
Figure 2.4: A tree with the "-C" suffix used to identify complements
Model 1 can be trained on treebank data with the enhanced set of non-terminals, and it can learn the lexical properties which distinguish complements and adjuncts. However, it would still suffer from the bad independence assumptions. To solve these kinds of problems, the generative process is extended to include a probabilistic choice of left and right sub-categorization frames. For example, the rule for the tree in Figure 2.4 is S(mua) → NP(tuần trước) NP-C(nó) VP(mua).
In (Collins, 1999), Collins has proposed several ways to automatically identify complements and adjuncts. Fortunately, Viet Treebank has annotated some kinds of complements.
Model 3 aims to overcome the obstacle of identifying predicate-argument structure from parse trees with wh-movement (in English). Noun phrases are most often moved from subject position, object position, or within PPs. NP extraction is handled by adding a gap feature to each non-terminal in the tree, and propagating gaps through the tree until they are finally discharged as a trace complement. In the Penn Treebank, a TRACE is co-indexed with the WHNP head of the SBAR, so it is straightforward to add this information to trees in training data. However, this phenomenon seems not to be similar in Vietnamese, and we will show the effects of this model in the experiments.
Chapter 3
Vietnamese parsing and our approach
Through the previous chapters, we can consider parsing as a problem in machine learning. The training data containing parsed sentences (usually called a treebank) plays an important role for the task; it is the crucial resource for statistical parsing. At present, there are several well-known treebanks in the world, such as the Penn Treebank for English and the Chinese Treebank for Chinese. This kind of corpus for Vietnamese has been under construction. In this chapter, we firstly describe the characteristics of Vietnamese and the characteristics of Viet Treebank. Then, we introduce our proposal to apply and develop lexicalized statistical parsing models for Vietnamese.
Vietnamese writing is monosyllabic in nature. Every "syllable" is written as though it were a separate dictation-unit with a space before and after. In other words, the smallest unit in the construction of words is the syllable. Words can be single (identified by one syllable) or compound (combined from two or more syllables). Thus, different from English, besides POS tagging, chunking, and syntactic tagging, Vietnamese has one more annotation level, that is, word segmentation.
In terms of typology, Vietnamese is an isolating language: words have no inflection, and word formation is a combination of isolated syllables. This feature dominates the other grammatical features. All syntactic aspects are represented by grammatical particles. For example: "mua" - "buy"; "đã mua" - "bought"; "sẽ mua" - "will buy".
On the other hand, when combining Vietnamese words into structures such as phrases and sentences, word order and particles are very important. The arrangement of words in a certain order is the key to indicating syntactic relationships. In Vietnamese, saying "Anh ta lại đến" is different from "Lại đến anh ta". When words of the same type associate together in a principal-accessory relation, the front word keeps the main role and the behind word plays the secondary role. By the order of the words, "con gà" is different from "gà con", and "tình cảm" is different from "cảm tình". The order in which subjects stand before predicates is common in Vietnamese sentence structure. Furthermore, the use of particles is the key grammatical method in Vietnamese. Due to particles, the phrase "anh của em" is different from "anh và em" and "anh vì em". Apart from word order and particles, Vietnamese also uses the intonation method. Intonation expresses the syntactic relations of the elements in sentences. In writing, intonation is usually indicated by punctuation. We can compare the following two sentences to see the difference in content: "Đêm hôm qua, cầu gãy" - "Đêm hôm, qua cầu gãy".
Through the number of distinctive features that we have just mentioned above, we can visualize somewhat the character and potential of Vietnamese. And as claimed in (Phuong-Thai & Xuan-Luong, 2009), using a constituency representation of syntactic structures is suitable for Vietnamese. Thus, this representation is also used for building Viet Treebank (Phuong-Thai & Xuan-Luong, 2009).
3.2 Penn Treebank
A treebank is a corpus in which each sentence has a grammatical structure in the form of a parse tree. A treebank is often constructed based on a labeled corpus; sometimes information about the language or semantics is also added to the syntactic structures to improve the quality of the treebank. The process of constructing a treebank can be implemented by hand or semi-automatically with a parser; after the analysis finishes, the parse tree which has just been obtained needs to be checked and occasionally completed. This work may be extended annually. The Penn Treebank was developed by the University of Pennsylvania, containing approximately 4.5 million words of American English. In the three years from 1989 to 1992, POS tagging for the sentences was carried out. This corpus can be found
on the website http://www.ldc.upenn.edu/. The next section presents some types of labels in the Penn Treebank and the task of arranging components together to get a syntax tree.