VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Contents

1 Introduction
  1.1 Overview
    1.1.1 A Short Comparison Between English and Vietnamese
  1.2 Machine Translation Approaches
    1.2.1 Interlingua
    1.2.2 Transfer-based Machine Translation
    1.2.3 Direct Translation
  1.3 The Reordering Problem and Motivations
  1.4 Main Contributions of this Thesis
  1.5 Thesis Organization
2 Related Works
  2.1 Phrase-based Translation Models
  2.2 Types of Phrase Orientation
    2.2.1 The Distance-Based Reordering Model
  2.3 The Lexical Reordering Model
  2.4 The Preprocessing Approaches
  2.5 Translation Evaluation
    2.5.1 Automatic Metrics
    2.5.2 NIST Scores
    2.5.3 Other Scores
    2.5.4 Human Evaluation Metrics
  2.6 Moses Decoder
3 Shallow Processing for SMT
  3.1 Our Proposed Model
  3.2 The Shallow Syntax
    3.2.1 Definition of the Shallow Syntax
    3.2.2 How to Build the Shallow Syntax
  3.3 The Transformation Rule
  3.4 Applying the Transformation Rule to the Shallow Syntax Tree
4 Experiments
  4.1 The Bilingual Corpus
  4.2 Implementation and Experimental Setup
  4.3 BLEU Score and Discussion
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Appendix A: Handwritten Transformation Rules
Appendix B: Script to Train the Baseline Model
List of Tables

1 Corpus statistics
2 Details of our experiments; AR denotes using automatic rules, MR denotes using handwritten rules
3 Size of phrase tables
4 Translation performance for the English-Vietnamese task
List of Figures

1 The machine translation pyramid
2 The concept architecture of the Moses Decoder
3 An overview of preprocessing before training and decoding
4 A pair of source and target language sentences
5 The training process
6 The decoding process
7 A shallow syntax tree
8 The building of the shallow syntax
9 The building of the shallow syntax
Chapter 1
Introduction
In this chapter, we give a brief overview of Statistical Machine Translation (SMT), the problem we address, the motivations of our work, and the main contributions of this thesis. Firstly, we introduce Machine Translation (MT), one of the major applications of Natural Language Processing (NLP), and the statistical approach to solving it. Then, we introduce the main problem of this thesis and our research motivations. The next section describes the main contributions of this thesis. Finally, the content of the thesis is outlined.
1.1 Overview

In the field of NLP, MT is a major application that automatically translates a sentence from one language to another. MT is very useful in real life: it helps us read websites in foreign languages we do not understand, or understand the content of an advertising board on the street. However, high-quality MT is still a challenge for researchers. Firstly, the difficulty comes from the ambiguity of natural language at various levels. At the lexical level, we have problems with the morphology of words, such as word tense, or with word segmentation in languages such as Vietnamese, Japanese, Chinese, or Thai, where there is no symbol separating two words. For example, in Vietnamese we have the sentence "học sinh học sinh học.": "học" is a verb meaning "study" in English, "học sinh" is a noun meaning a pupil or student, and "sinh học" is a noun meaning the subject of biology. At the syntax level, we have structural ambiguity, such as attachment ambiguity. For example, in the sentence "the man saw the girl with the telescope", we can understand either that the man used the telescope to see the girl, or that the girl who has the telescope is seen by the man. The ambiguity is even more difficult at the semantic level.
Secondly, Jurafsky and Martin (2009) show that there are differences between any pair of languages, such as differences in structure and lexicon, which make MT challenging.
In particular, one of the differences between two languages, which we aim at in this thesis, is the order of words in each language. For example, English is a Subject-Verb-Object (SVO) language: the subject comes first, the verb follows the subject, and the object ends the sentence. In the sentence "I go to school", "I" is the subject, the verb is "go to", and the object is "school". Unlike English, Japanese is an SOV language, and Classical Arabic is a VSO language.
In the past, rule-based methods were favored. MT systems were built with manually created rules, so that in a closed domain or restricted area the quality of a rule-based system is very high. However, with the growth of the internet and social networks, we need broad-coverage MT systems, for which the rule-based method is not suitable. Therefore, a new approach was needed, and statistics was applied to the field of MT, as it had already been applied in many other areas such as automatic speech recognition. Nowadays, some MT systems based on statistical methods, such as Google Translate1, can be compared with human translation.
1.1.1 A Short Comparison Between English and Vietnamese
English and Vietnamese have some similarities: both are based on the Latin script and both have the SVO structure. However, the two languages differ in the order of words inside a phrase. The reordering of words can also be seen in wh-questions:

en: what is your job?
vn: công_việc của anh là gì ?

In this example, the English word "what" corresponds to "gì" in Vietnamese. The difference in the positions of these two words can easily be seen, because English word order follows the S-structure while Vietnamese follows the D-structure.
1.2 Machine Translation Approaches

In this section, we give a short survey of approaches in the field of machine translation. We begin with the most complex method (interlingua) and end with the simplest one (the direct method). From a source sentence, we use some analysis methods to obtain complex structures, and then generate the corresponding structures or sentences in the target language. The most complex structure is the interlingua (Figure 1).

1 http://translate.google.com
Figure 1: The machine translation pyramid
1.2.1 Interlingua
The interlingua systems (Farwell and Wilks, 1991; Mitamura, 1999) are based on the idea of finding a language, called the interlingua, that represents the source language and is easy enough to generate sentences in other languages. Figure 1 shows the process of this approach. The analysis step is the understanding process: from the source sentence, we use NLP techniques to map the source sentence to a data structure in the interlingua, and then retrieve the target sentence in the generation process. The problem is how complex the interlingua should be. If the interlingua is simple, we get too many translation options. On the other hand, the more complex the interlingua is, the more costly the analysis and generation become.
1.2.2 Transfer-based Machine Translation

Another approach is to analyze a complex structure (simpler than the interlingua structure), then use transfer rules to obtain the corresponding structure in the target language, and finally generate the target sentence. In this model, MT involves three phases: analysis, transfer, and generation. Normally we use all three phases, but sometimes only two of them, for example transferring from the source sentence directly to a structure in the target language and then generating the target sentence. As an example, here is a simple transfer rule2 for translating a source sentence into a target sentence:

[Nominal → Adj Noun]source language ⇒ [Nominal → Noun Adj]target language

2 This example is taken from Jurafsky and Martin (2009)
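As an illustration of how such a transfer rule might be applied, consider the following sketch. It is not part of the thesis: the tag names (`ADJ`, `NOUN`) and the list-of-pairs phrase encoding are assumptions made purely for illustration.

```python
# Illustrative sketch: applying the transfer rule
# [Nominal -> Adj Noun] => [Nominal -> Noun Adj] to a tagged phrase.

def transfer_nominal(tagged_phrase):
    """Swap each (Adj, Noun) pair into (Noun, Adj) order."""
    out = list(tagged_phrase)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out

# "blue book" -> "book blue" (the Noun-Adj order used in Vietnamese)
source = [("blue", "ADJ"), ("book", "NOUN")]
print(transfer_nominal(source))  # [('book', 'NOUN'), ('blue', 'ADJ')]
```

A real transfer component would operate on parse trees rather than flat tag sequences, but the rule-matching idea is the same.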
1.2.3 Direct Translation
1.2.3.1 Example-based Machine Translation
Example-based machine translation was first introduced by Nagao (1984). It uses a bilingual corpus with parallel texts as its main knowledge base at run time. The idea behind it is to find patterns in the bilingual corpus and combine them with the parallel text to generate a new target sentence. This method is similar to the translation process in the human brain. The main problems of example-based machine translation come from the matching criteria, the length of the fragments, etc.
1.2.3.2 Statistical Machine Translation
Extending the idea of using statistics for speech recognition, Brown et al. (1990, 1993) introduced a statistical method, a version of the noisy channel model, to MT. Applying the noisy channel to machine translation, the target sentence is transformed into the source sentence by the noisy channel. We can represent the MT problem as three tasks of the noisy channel:

• forward task: compute the fluency of the target sentence
• learning task: from the parallel corpus, find the conditional probability between the target sentence and the source sentence
• decoding task: find the best target sentence for a given source sentence
The decoding task can be represented by this formula:

ê = argmax_e Pr(e|f)

Applying the Bayes rule, we have:

ê = argmax_e Pr(f|e) Pr(e)

where Pr(e) models the fluency of the target sentence and Pr(f|e) models the faithfulness of the translation. We use the alignment model to compute this value based on the translation unit of the SMT system. Based on the definition of the translation unit, we have several approaches:
• word based: using word as a translation unit (Brown et al., 1993)
• phrase based: using phrase as a translation unit (Koehn et al., 2003)
• syntax based: using a syntax as a translation unit (Yamada and Knight, 2001)
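The noisy-channel decision rule above can be sketched in a few lines. This toy example is not from the thesis: the candidate set and probability tables are invented for illustration, and a real decoder searches a huge hypothesis space instead of enumerating candidates.

```python
import math

# Toy sketch of the noisy-channel decision rule e* = argmax_e Pr(f|e) * Pr(e).

def decode(source, candidates, tm_prob, lm_prob):
    """Pick the target sentence maximizing log Pr(f|e) + log Pr(e)."""
    return max(
        candidates,
        key=lambda e: math.log(tm_prob[(source, e)]) + math.log(lm_prob[e]),
    )

# Invented translation-model and language-model probabilities.
tm_prob = {("je parle", "i speak"): 0.6, ("je parle", "i talk"): 0.4}
lm_prob = {"i speak": 0.02, "i talk": 0.01}

best = decode("je parle", ["i speak", "i talk"], tm_prob, lm_prob)
print(best)  # i speak
```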
1.3 The Reordering Problem and Motivations
In the field of MT, the reordering problem is the task of reordering the words in the target language to get the best target sentence. The reordering model is sometimes called the distortion model.

Phrase-based Statistical Machine Translation (PBSMT), introduced by Koehn et al. (2003) and Och and Ney (2004), is currently the state-of-the-art model in word choice and local word reordering. Its translation unit is a sequence of words without linguistic information. Therefore, in this thesis, we integrate linguistic information such as chunking, a shallow syntax tree, and transformation rules, with a special aim at solving the global reordering problem.
There are some studies on integrating syntactic resources within SMT. Chiang (2005) showed significant improvement by keeping the strengths of phrases while incorporating syntax into SMT: he built a kind of syntax tree based on a synchronous Context-Free Grammar (CFG), known as hierarchical phrases, used a log-linear model to determine the weights of the extracted rules, and developed a variant of the CYK algorithm for decoding. The reordering of phrases is thus defined by the synchronous CFG.

Some approaches have been applied at the word level (Collins et al., 2005). They are particularly useful for languages with rich morphology, to reduce data sparseness. Other kinds of syntax-based reordering methods require parse trees, such as the work of Quirk et al. (2005), Collins et al. (2005), and Huang and Mi (2010). A parse tree is more powerful in capturing sentence structure. However, it is expensive to create tree structures, and building a good-quality parser is also a hard task. All the above approaches require much decoding time, which is expensive.
The approach we are interested in here is to balance translation quality with decoding time. Reordering as a preprocessing step (Xia and McCord, 2004; Xu et al., 2009; Talbot et al., 2011; Katz-Brown et al., 2011) is very effective: it achieves significant improvements over state-of-the-art phrase-based and hierarchical machine translation systems, and it allows a separate quality evaluation of the reordering models.

Inspired by this preprocessing approach, we propose a combined approach which preserves the strength of phrase-based SMT in local reordering and decoding time, as well as the strength of integrating syntax in reordering. As a result, we use an intermediate syntax between Part-of-Speech (POS) tags and a full parse tree: shallow parsing. Firstly, we use shallow parsing as preprocessing for both training and testing. Secondly, we apply a series of transformation rules to the shallow tree. We have two sets of transformation rules: the first set is written by hand, and the other is extracted automatically from the bilingual corpus. The experimental results on the English-Vietnamese pair show that our approach achieves significant improvements over MOSES, the state-of-the-art phrase-based system.
Chapter 2
Related works
In this chapter, we give some background knowledge and a short review of MT. According to the description in chapter 1, the target translation is the result of maximizing the product of the faithfulness (Pr(f|e)) and the fluency of the target sentence (Pr(e)). The fluency of the target sentence, known as the language model, can be modeled by n-grams. The faithfulness can be modeled by a translation unit with an alignment model between the two languages. The alignment model can be extracted automatically from the bilingual corpus (Brown et al., 1990, 1993; Och and Ney, 2003). Depending on the kind of translation unit, we have several methods:

• Word-based translation models: using the word as the translation unit
• Phrase-based translation models: using the phrase as the translation unit
• Syntax-based translation models: using syntactic structures as the translation unit

Passing over the word-based translation models, we describe PBSMT in section 2.1. Then section 2.2 gives some basic kinds of phrase movements in the reordering problem. One famous lexical reordering model, integrated in the decoding process, is described in section 2.3. Section 2.4 gives a brief review of methods that treat reordering as a preprocessing task in training and decoding, using transformation rules and syntax trees. Finally, we introduce the Moses Decoder (Koehn et al., 2007), which is used to train and decode our models.
2.1 Phrase-based Translation Models

PBSMT (Koehn et al., 2003; Och and Ney, 2004) extends the word-based SMT model. In this model, the faithfulness (Pr(f|e)) is extracted from a bilingual corpus by applying the alignment model (the IBM models (Brown et al., 1993)) to contiguous sequences of words. Koehn et al. (2003) and Jurafsky and Martin (2009) describe the general process of phrase-based translation in three steps. First, the words in the source sentence are grouped into phrases ē1, ē2, ..., ēI. Next, we translate each phrase ēi into a target phrase f̄j. Finally, each target phrase f̄j is (optionally) reordered. Koehn et al. (2003) use the IBM models to build the translation model. Two phrases are aligned if each word in the source phrase is aligned only with words in the target phrase; in particular, no word outside the phrase may be aligned with a word inside the phrase, and vice versa. The probability of a phrase translation can then be computed by the relative-frequency formula:

Pr(f̄|ē) = count(f̄, ē) / count(ē)
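The relative-frequency estimate above can be sketched as follows. The phrase pairs here are toy data, not from the thesis corpus.

```python
from collections import Counter

# Sketch of the relative-frequency estimate Pr(f|e) = count(f, e) / count(e)
# over a list of extracted (source phrase, target phrase) pairs.

def phrase_table(pairs):
    pair_count = Counter(pairs)                 # count(f, e)
    e_count = Counter(e for _, e in pairs)      # count(e)
    return {(f, e): c / e_count[e] for (f, e), c in pair_count.items()}

pairs = [("ngôi nhà", "the house"), ("ngôi nhà", "the house"),
         ("căn nhà", "the house"), ("tốt", "good")]
table = phrase_table(pairs)
print(table[("ngôi nhà", "the house")])  # 0.666... (2 of 3 extractions)
```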
Some alternative methods to extract phrases and learn phrase translation tables have been proposed (Marcu and Wong, 2002; Venugopal et al., 2003) and are compared in Koehn et al. (2003).
Another phrase-based model, introduced by Och and Ney (2004), uses a discriminative model. In this variant, the phrase translation probability and other features are integrated into a log-linear model following the formula:

Pr(e|f) = exp(Σi λi hi(e, f)) / Σe' exp(Σi λi hi(e', f))

where hi(e, f) is a feature function based on the source and target sentences, and λi is the weight of feature hi(e, f). Och (2003) introduced a method to estimate these weights using Minimum Error Rate Training (MERT). These studies also use some basic features such as:
• bidirectional phrase models (models that score phrase translation)
• bidirectional lexical models (models that consider the appearance of entries from a conventional translation lexicon in the phrase translation)
• language model of the target language (usually an n-gram model, i.e. an (n−1)-th order Markov model, trained on a monolingual corpus1)
• distortion model
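A minimal sketch of log-linear scoring over such features follows. The feature functions and weights here are invented placeholders; in practice the hi are log probabilities (phrase models, lexical models, language model, distortion model) and the λi are tuned with MERT.

```python
import math

# Sketch of the log-linear score: score(e, f) = sum_i lambda_i * h_i(e, f).
# Since the denominator of the log-linear model does not depend on e,
# picking the best candidate only needs the numerator's exponent.

def loglinear_score(e, f, features, weights):
    return sum(w * h(e, f) for h, w in zip(features, weights))

def rerank(f, candidates, features, weights):
    """Pick the candidate with the highest log-linear score."""
    return max(candidates, key=lambda e: loglinear_score(e, f, features, weights))

# Two toy features: a fake translation-model score and a fake LM score.
tm = {"i speak": math.log(0.6), "i talk": math.log(0.1)}
lm = {"i speak": math.log(0.02), "i talk": math.log(0.05)}
features = [lambda e, f: tm[e], lambda e, f: lm[e]]

print(rerank("je parle", ["i speak", "i talk"], features, [1.0, 1.0]))  # i speak
```

Changing the weights changes the winner: with weights [0.0, 1.0] only the language model matters and "i talk" is chosen, which is exactly the behavior MERT exploits when tuning.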
After building the translation model, we need a method to decode an input sentence into an output sentence. Normally, a kind of beam search or the A* method is used. Koehn (2004) introduced an effective method called stack decoding. The basic idea of this method is to use a set of limited stacks, one for each number of translated source words. For example, during translation, when three words have been translated, the resulting hypotheses are stored in the stack for coverage three. The pseudo code of stack decoding is described as follows:
2.2 Types of Phrase Orientation

Tillmann (2004) and Galley and Manning (2008) give the basic types of phrase reordering in a sentence. There are three basic types:

• monotone: a continuous phrase in the source language is also a continuous phrase in the target language
• swap: the phrase following a given phrase in the target language is aligned with the phrase preceding the corresponding phrase in the source language

1 A monolingual corpus is usually either the target side of the corpus used to learn the phrase table, or a large separate set of target sentences, independent of the corpora used for training and testing
Algorithm 1 A stack decoding algorithm, used in the Pharaoh system, to get the target sentence from a source sentence
Require: a sentence, which we would like to translate
Require: a phrase base model
Require: a language model
initialize hypothesisStack[0..nf ]
for all i in range(0, nf - 1) do
for all hyp in hypothesisStack[i] do
for all new hyp can be derived from hyp do
nf [new hyp] ← number of foreign words covered by new hyp
add new hyp to hypothesisStack[nf [new hyp]]
prune hypothesisStack[nf [new hyp]]
end for
end for
end for
find the best hypothesis best hyp in hypothesisStack[nf ]
return the best path leading to best hyp via backtrace
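For illustration, here is a much-simplified, monotone Python rendering of the stack-decoding loop of Algorithm 1. It is a sketch, not the Pharaoh implementation: the phrase table is toy data, and a real decoder also handles reordering, language-model context, hypothesis recombination, and future-cost estimation.

```python
import math

# Hypotheses are grouped in stacks by the number of source words covered,
# extended left to right with phrase translations, and pruned to a beam.

def stack_decode(source_words, phrase_table, beam=10):
    nf = len(source_words)
    stacks = [[] for _ in range(nf + 1)]
    stacks[0] = [(0.0, "")]  # (log score, partial translation)
    for i in range(nf):
        for score, partial in stacks[i]:
            # try every source phrase starting at position i
            for j in range(i + 1, nf + 1):
                src = " ".join(source_words[i:j])
                for tgt, p in phrase_table.get(src, []):
                    hyp = (score + math.log(p), (partial + " " + tgt).strip())
                    stacks[j].append(hyp)
        # prune the later stacks to the beam size
        for k in range(i + 1, nf + 1):
            stacks[k] = sorted(stacks[k], reverse=True)[:beam]
    return max(stacks[nf])[1]

table = {"je": [("i", 0.9)],
         "parle": [("speak", 0.7), ("talk", 0.3)],
         "je parle": [("i speak", 0.8)]}
print(stack_decode("je parle".split(), table))  # i speak
```

Note how the whole-phrase translation "je parle" → "i speak" (log 0.8) outscores the word-by-word path (log 0.9 + log 0.7), which is the usual argument for phrase-based over word-based models.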
• discontinuous: two consecutive phrases in one language are aligned with two discontinuous phrases in the other language
For example, consider this pair of sentences:

en: tom 's two blue books are good
vn: hai cuốn_sách màu_xanh của tôm là tốt

In this example, "are good" and "là tốt" represent the monotone case. For a swap instance, see the two phrases "blue books" and "cuốn_sách màu_xanh". Finally, "two blue" and "hai màu_xanh" are an example of discontinuous phrases.
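The three orientation types can be sketched as a small classification function over the source spans of two consecutive target phrases. The (start, end) span encoding is an assumption made for illustration.

```python
# Classify the orientation of the current target phrase relative to the
# previous one, given the (start, end) source spans they are aligned to.

def orientation(prev_src, cur_src):
    if cur_src[0] == prev_src[1] + 1:
        return "monotone"       # source continues right after the previous phrase
    if cur_src[1] == prev_src[0] - 1:
        return "swap"           # source phrase sits right before the previous one
    return "discontinuous"

# "cuốn_sách" (books, source position 4) followed by "màu_xanh" (blue, 3):
print(orientation((4, 4), (3, 3)))  # swap
# "là" (are, 6) followed by "tốt" (good, 7):
print(orientation((6, 6), (7, 7)))  # monotone
```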
2.2.1 The Distance Based Reordering Model
Koehn et al. (2003) introduced, alongside the phrase translation model, a simple distortion model based on an exponential penalty α, following the formula:

d(ei, ei−1) = α^|fi − fi−1 − 1|

so the distance between two consecutive target phrases is based on the difference of the positions of the corresponding phrases in the source language. For example, if the phrase translated previously ends at source position 1 and the current phrase starts at source position 3, the distance is d = α^|3−1−1| = α^1.
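A quick sketch of this distortion penalty; the positions and the value of α are illustrative only.

```python
# Distance-based distortion penalty d = alpha ** |f_i - f_{i-1} - 1|,
# where f_{i-1} is the end position of the previously translated source
# phrase and f_i is the start position of the current one (0 < alpha < 1).

def distortion(prev_end, cur_start, alpha=0.5):
    return alpha ** abs(cur_start - prev_end - 1)

print(distortion(3, 4))  # 1.0  (monotone step: no gap, no penalty)
print(distortion(1, 3))  # 0.5  (one source word skipped)
```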
2.3 The Lexical Reordering Model

Galley and Manning (2008), based on the log-linear model, introduced a new reordering model as a feature of the translation model. The features are parameterized as follows: given a source sentence f, a sequence of target phrases e = (ē1, ē2, ..., ēn), and a phrase alignment a = (a1, a2, ..., an) that defines a source phrase f̄ai for each translated phrase ēi, these models estimate the probability of a sequence of orientations o = (o1, o2, ..., on) as:

log p(o|e, f) = Σ_{i=1}^{n} log p(oi | ēi, f̄ai)

where each orientation oi takes one of the values monotone (M), swap (S), or discontinuous (D).
2.4 The Preprocessing Approaches

As mentioned in chapter 1, some approaches use syntactic information to solve the reordering problem. One of them applies syntactic parsing of the source language and reordering rules as a preprocessing step. The main idea is to transform the source sentences so that their word order is as close as possible to the target sentences; EM training then becomes much easier and word alignment quality becomes better. There are several studies on the reordering problem, such as Xia and McCord (2004); Collins et al. (2005); Nguyen and Shimazu (2006); Wang et al. (2007); Habash (2007); Xu et al. (2009). They all performed reordering during a preprocessing step based on parsing the source tree, combined either with automatically extracted syntactic rules (Xia and McCord, 2004; Nguyen and Shimazu, 2006; Habash, 2007) or with manually written rules (Collins et al., 2005; Wang et al., 2007; Xu et al., 2009).
Xu et al. (2009) described a method using a dependency parse tree and flexible rules to perform the reordering of subject, object, etc. These rules were written by hand, but Xu et al. (2009) showed that an automatic rule learner can also be used.
Collins et al. (2005) developed clause detection and used some handwritten rules to reorder words within the clause. In contrast, Xia and McCord (2004) and Habash (2007) built automatically extracted syntactic rules.
Compared with these approaches, our work has a few differences. Firstly, we aim to develop a phrase-based translation model to translate from English to Vietnamese. Secondly, we build a shallow tree by chunking recursively (chunks of chunks). Thirdly, we use not only automatic rules but also some handwritten rules to transform the source sentence. As Xia and McCord (2004) and Habash (2007) did, we also apply preprocessing at both training and decoding time.
Other approaches use syntactic parsing to provide multiple source sentence ordering options through word (or phrase) lattices (Zhang et al., 2007; Nguyen et al., 2007). Nguyen et al. (2007) applied transformation rules, learnt automatically from a bilingual corpus, to reorder words within a chunk. A crucial difference between their methods and ours is that they do not perform reordering during training, while our method does, by using a more complex structure: a shallow syntax tree (chunks of chunks), which is more efficient.
2.5 Translation Evaluation

In the previous sections, we introduced methods that help a computer translate sentences automatically from one language to another. Now the question is how good the results are. To answer it, this section introduces some methods to evaluate MT. In general, there are two ways to evaluate an MT system: by human judges, or automatically, by building a metric that simulates human judgment.
2.5.1 Automatic Metrics
2.5.1.1 BLEU Scores
Today, the BLEU score is one of the most popular methods to evaluate MT. This measure was introduced by IBM (Papineni et al., 2002) and is a kind of n-gram precision metric. The simple unigram precision metric is based on the frequency of co-occurrence of words in both the output and the reference sentence. Similarly, for the BLEU score we compute the n-gram co-occurrences between the output and the reference translation, and then take the weighted geometric mean. The BLEU score can be computed as follows:

BLEU_n = exp( (Σ_{i=1}^{n} blue_i) / n + length_penalty )

where blue_i and length_penalty are computed by counting over each sentence pair in the whole test and reference sets:

blue_n = log( N_matched_n / N_test_n )
length_penalty = min(0, 1 − shortest_ref_length / N_test_1)
From each pair of output and reference translations, we can compute N_matched_i, N_test_i, and shortest_ref_length as follows:

N_matched_i = Σ_{n=1}^{N} Σ_{ngr ∈ S} min( N(test_n, ngr), max_r N(ref_{n,r}, ngr) )
N_test_i = Σ_{n=1}^{N} Σ_{ngr ∈ S} N(test_n, ngr)
shortest_ref_length = Σ_{n=1}^{N} min_r length(ref_{n,r})

where S is the set of i-grams in the sentence test_n (the n-th sentence in the test set), N(sent, ngr) is the number of occurrences of the i-gram ngr in the sentence sent, N is the number of sentences in the test set, and ref_{n,r} is the r-th reference of the n-th test sentence. For example, for the n-th sentence in the test set, we have R sentences as its references.
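For illustration, here is a toy, single-reference version of this n-gram precision computation. It is simplified on purpose: one reference, a plain geometric mean, and sentence-level counts, whereas the real BLEU/mteval implementation uses multiple references and corpus-level statistics.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngr, ref_ngr = ngrams(cand, n), ngrams(ref, n)
        # clip each candidate n-gram count by its count in the reference
        matched = sum(min(c, ref_ngr[g]) for g, c in cand_ngr.items())
        total = max(sum(cand_ngr.values()), 1)
        log_precisions.append(math.log(max(matched, 1e-9) / total))
    # brevity penalty: only punish candidates shorter than the reference
    brevity = min(0.0, 1.0 - len(ref) / len(cand))
    return math.exp(sum(log_precisions) / max_n + brevity)

ref = "the cat is on the mat"
print(round(bleu("the cat is on the mat", ref), 3))  # 1.0
print(bleu("the cat", ref) < 1.0)  # True: short output is penalized
```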
2.5.2 NIST Scores
The NIST evaluation metric, introduced by Doddington (2002), is based on the BLEU metric, but with some differences. The NIST score calculates how informative a particular n-gram is: the rarer a correct n-gram is, the more weight it is given. The information weight of an n-gram is computed from its counts over the reference corpus:

Info(w_1 ... w_n) = log2( count(w_1 ... w_{n−1}) / count(w_1 ... w_n) )

2.5.3 Other scores
There are other evaluation metrics such as the Word Error Rate (mWER), the Position-Independent Error Rate (mPER), etc.
2.5.4 Human Evaluation Metrics
Human evaluation is said to be the most accurate method. To evaluate the translation of a bilingual sentence, some people who understand both languages rate the translation in many aspects. The weakness of this method is its cost and time: it needs many people to take part in evaluating the translations. Nevertheless, this method is used in some shared tasks.

In general, the fluency of the translation is used to rate its quality. To measure the fluency of the translation, we rely on criteria such as how intelligible, how clear, how readable, or how natural the MT output is (Jurafsky and Martin, 2009). One method is to give the raters, who evaluate the MT output, a scale and ask them to rate each output. For example, the scale can range from one (totally unintelligible) to five (totally intelligible), and similar scales can cover other aspects of fluency such as clarity, naturalness, or style.

Another aspect is the fidelity of the sentence. Two common aspects of fidelity are measured: adequacy and informativeness. The adequacy metric determines whether the output contains the information that exists in the source sentence. It also uses a scale from one to five (how much of the information in the source was preserved in the target) for raters to rate. This method is useful if we only have monolingual raters, native in one language. The other aspect is the informativeness of the translation: whether the MT output contains sufficient information to perform some task. This metric can be measured by the percentage of correct answers to questions answered by the raters.
2.6 Moses Decoder

Moses2 supports several language model toolkits, including:

• SRILM, introduced by Stolcke (2002); the limitation of this toolkit is the size of the monolingual corpus that can be loaded into it
• IRSTLM, introduced by Federico et al. (2008), which uses a quantization method to scale the language model to big data

2 http://statmt.org/moses
3 This figure is taken from an old version of the manual of the Moses Decoder