VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Contents

1 Introduction
  1.1 Overview
    1.1.1 A Short Comparison Between English and Vietnamese
  1.2 Machine Translation Approaches
    1.2.1 Interlingua
    1.2.2 Transfer-based Machine Translation
    1.2.3 Direct Translation
  1.3 The Reordering Problem and Motivations
  1.4 Main Contributions of this Thesis
  1.5 Thesis Organization
2 Related Works
  2.1 Phrase-based Translation Models
  2.2 Types of Phrase Orientation
    2.2.1 The Distance-Based Reordering Model
  2.3 The Lexical Reordering Model
  2.4 The Preprocessing Approaches
  2.5 Translation Evaluation
    2.5.1 Automatic Metrics
    2.5.2 NIST Scores
    2.5.3 Other Scores
    2.5.4 Human Evaluation Metrics
  2.6 Moses Decoder
3 Shallow Processing for SMT
  3.1 Our Proposed Model
  3.2 The Shallow Syntax
    3.2.1 Definition of the Shallow Syntax
    3.2.2 How to Build the Shallow Syntax
  3.3 The Transformation Rule
  3.4 Applying the Transformation Rule to the Shallow Syntax Tree
4 Experiments
  4.1 The Bilingual Corpus
  4.2 Implementation and Experimental Setup
  4.3 BLEU Score and Discussion
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Appendix A: Handwritten Transformation Rules
Appendix B: Script to Train the Baseline Model
List of Tables

1 Corpus statistics
2 Details of our experiments; AR denotes using automatic rules, MR denotes using handwritten rules
3 Size of phrase tables
4 Translation performance for the English-Vietnamese task
List of Figures

1 The machine translation pyramid
2 The concept architecture of the Moses Decoder
3 An overview of preprocessing before training and decoding
4 A pair of source and target language sentences
5 The training process
6 The decoding process
7 A shallow syntax tree
8 The building of the shallow syntax
9 The building of the shallow syntax
Chapter 1
Introduction
In this chapter, we give a brief overview of Statistical Machine Translation (SMT), the problem we address, the motivations of our work, and the main contributions of this thesis. Firstly, we introduce Machine Translation (MT), one of the major applications of Natural Language Processing (NLP), and the statistical approach to solving it. Then, we introduce the main problem of this thesis and our research motivations. The next section describes the main contributions of this thesis. Finally, the content of the thesis is outlined.
1.1 Overview

In the field of NLP, MT is a major application that automatically translates a sentence from one language to another. MT is very useful in real life: it helps us read websites in foreign languages we do not understand, or understand the content of an advertising board on the street. However, high-quality MT is still a challenge for researchers. Firstly, the difficulty comes from the ambiguity of natural language at various levels. At the lexical level, we have problems with the morphology of words, such as word tense, or with word segmentation in languages such as Vietnamese, Japanese, Chinese, or Thai, where there is no symbol separating two words. For example, in Vietnamese we have the sentence "học sinh học sinh học.": "học" is a verb meaning "study" in English, "học sinh" is a noun meaning a pupil or student, and "sinh học" is a noun meaning the subject of biology. At the syntax level, we have structural ambiguity, such as attachment ambiguity. For example, in the sentence "the man saw the girl with the telescope", we can understand either that the man used the telescope to see the girl, or that the girl who has the telescope is seen by the man. The ambiguity is even more difficult at the semantic level.
Secondly, Jurafsky and Martin (2009) show that there are differences between any pair of languages, such as differences in structure and lexicon, which make MT challenging.
In particular, one of the differences between two languages, which we aim at in this thesis, is the order of words in each language. For example, English is a Subject-Verb-Object (SVO) language: the subject comes first, the verb follows the subject, and the object ends the sentence. In the sentence "I go to school", "I" is the subject, the verb is "go to", and the object is "school". Unlike English, Japanese is an SOV language, and Classical Arabic is a VSO language.
In the past, rule-based methods were favored. MT systems were built with manually created rules, so that in a closed domain or restricted area the quality of a rule-based system is very high. However, with the growth of the internet and social networks, we need broad-coverage MT systems, for which the rule-based method is not suitable. Therefore, a new approach was needed, and statistics was applied to the field of MT, as it had already been applied in many other areas such as automatic speech recognition. Nowadays, some MT systems based on statistical methods, such as Google Translate1, can be compared with human translation.
1.1.1 A Short Comparison Between English and Vietnamese
English and Vietnamese have some similarities: both are based on the Latin script and both have the SVO structure. However, the two languages differ in the order of words inside a phrase. The reordering of words can also be seen in wh-questions:

en: what is your job?
vn: công_việc của anh là gì ?

In this example, the English word "what" corresponds to "gì" in Vietnamese. The difference in the positions of these two words can easily be seen, because English word order follows the S-structure while Vietnamese follows the D-structure.
1.2 Machine Translation Approaches

In this section, we give a short survey of approaches in the field of machine translation. We begin with the most complex method (interlingua) and end with the simplest one (the direct method). From a source sentence, we use some analysis methods to obtain complex structures, and then generate the corresponding structures or sentences in the target language. The most complex structure is the interlingua (Figure 1).

1 http://translate.google.com
Figure 1: The machine translation pyramid
1.2.1 Interlingua
The interlingua systems (Farwell and Wilks, 1991; Mitamura, 1999) are based on the idea of finding a language, called the interlingua, that represents the source language and is easy enough to generate sentences in other languages. Figure 1 shows the process of this approach. The analysis step is the understanding process: from the source sentence, we use NLP techniques to map the source sentence to a data structure in the interlingua, and then retrieve the target sentence in the generation process. The problem is how complex the interlingua should be. If the interlingua is simple, we get too many translation options. On the other hand, the more complex the interlingua is, the more costly the analysis and generation become.
1.2.2 Transfer-based Machine Translation

Another approach is to analyze a complex structure (simpler than the interlingua structure), then use transfer rules to obtain the corresponding structure in the target language, and finally generate the target sentence. In this model, MT involves three phases: analysis, transfer, and generation. Normally we use all three phases, but sometimes only two of them, for example transferring from the source sentence directly to a structure in the target language and then generating the target sentence. As an example, here is a simple transfer rule2 for translating a source sentence into a target sentence:

[Nominal → Adj Noun]source language ⇒ [Nominal → Noun Adj]target language

2 This example is taken from Jurafsky and Martin (2009)
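As an illustration of how such a transfer rule might be applied, consider the following sketch. It is not part of the thesis: the tag names (`ADJ`, `NOUN`) and the list-of-pairs phrase encoding are assumptions made purely for illustration.

```python
# Illustrative sketch: applying the transfer rule
# [Nominal -> Adj Noun] => [Nominal -> Noun Adj] to a tagged phrase.

def transfer_nominal(tagged_phrase):
    """Swap each (Adj, Noun) pair into (Noun, Adj) order."""
    out = list(tagged_phrase)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out

# "blue book" -> "book blue" (the Noun-Adj order used in Vietnamese)
source = [("blue", "ADJ"), ("book", "NOUN")]
print(transfer_nominal(source))  # [('book', 'NOUN'), ('blue', 'ADJ')]
```

A real transfer component would operate on parse trees rather than flat tag sequences, but the rule-matching idea is the same.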
1.2.3 Direct Translation
1.2.3.1 Example-based Machine Translation
Example-based machine translation was first introduced by Nagao (1984). It uses a bilingual corpus with parallel texts as its main knowledge base at run time. The idea behind it is to find patterns in the bilingual corpus and combine them with the parallel text to generate a new target sentence. This method is similar to the translation process in the human brain. The main problems of example-based machine translation come from the matching criteria, the length of the fragments, etc.
1.2.3.2 Statistical Machine Translation
Extending the idea of using statistics for speech recognition, Brown et al. (1990, 1993) introduced a statistical method, a version of the noisy channel model, to MT. Applying the noisy channel to machine translation, the target sentence is transformed into the source sentence by the noisy channel. We can represent the MT problem as three tasks of the noisy channel:

• forward task: compute the fluency of the target sentence
• learning task: from the parallel corpus, find the conditional probability between the target sentence and the source sentence
• decoding task: find the best target sentence for a given source sentence
The decoding task can be represented by this formula:

ê = argmax_e Pr(e|f)

Applying the Bayes rule, we have:

ê = argmax_e Pr(f|e) Pr(e)

where Pr(e) models the fluency of the target sentence and Pr(f|e) models the faithfulness of the translation. We use the alignment model to compute this value based on the translation unit of the SMT system. Based on the definition of the translation unit, we have several approaches:
• word based: using word as a translation unit (Brown et al., 1993)
• phrase based: using phrase as a translation unit (Koehn et al., 2003)
• syntax based: using a syntax as a translation unit (Yamada and Knight, 2001)
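The noisy-channel decision rule above can be sketched in a few lines. This toy example is not from the thesis: the candidate set and probability tables are invented for illustration, and a real decoder searches a huge hypothesis space instead of enumerating candidates.

```python
import math

# Toy sketch of the noisy-channel decision rule e* = argmax_e Pr(f|e) * Pr(e).

def decode(source, candidates, tm_prob, lm_prob):
    """Pick the target sentence maximizing log Pr(f|e) + log Pr(e)."""
    return max(
        candidates,
        key=lambda e: math.log(tm_prob[(source, e)]) + math.log(lm_prob[e]),
    )

# Invented translation-model and language-model probabilities.
tm_prob = {("je parle", "i speak"): 0.6, ("je parle", "i talk"): 0.4}
lm_prob = {"i speak": 0.02, "i talk": 0.01}

best = decode("je parle", ["i speak", "i talk"], tm_prob, lm_prob)
print(best)  # i speak
```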
1.3 The Reordering Problem and Motivations
In the field of MT, the reordering problem is the task of reordering the words in the target language to get the best target sentence. The reordering model is sometimes called the distortion model.

Phrase-based Statistical Machine Translation (PBSMT), introduced by Koehn et al. (2003) and Och and Ney (2004), is currently the state-of-the-art model in word choice and local word reordering. Its translation unit is a sequence of words without linguistic information. Therefore, in this thesis, we integrate linguistic information such as chunking, a shallow syntax tree, and transformation rules, with a special aim at solving the global reordering problem.
There are some studies on integrating syntactic resources within SMT. Chiang (2005) showed significant improvement by keeping the strengths of phrases while incorporating syntax into SMT: he built a kind of syntax tree based on a synchronous Context-Free Grammar (CFG), known as hierarchical phrases, used a log-linear model to determine the weights of the extracted rules, and developed a variant of the CYK algorithm for decoding. The reordering of phrases is thus defined by the synchronous CFG.

Some approaches have been applied at the word level (Collins et al., 2005). They are particularly useful for languages with rich morphology, to reduce data sparseness. Other kinds of syntax-based reordering methods require parse trees, such as the work of Quirk et al. (2005), Collins et al. (2005), and Huang and Mi (2010). A parse tree is more powerful in capturing sentence structure. However, it is expensive to create tree structures, and building a good-quality parser is also a hard task. All the above approaches require much decoding time, which is expensive.
The approach we are interested in here is to balance translation quality with decoding time. Reordering as a preprocessing step (Xia and McCord, 2004; Xu et al., 2009; Talbot et al., 2011; Katz-Brown et al., 2011) is very effective: it achieves significant improvements over state-of-the-art phrase-based and hierarchical machine translation systems, and it allows a separate quality evaluation of the reordering models.

Inspired by this preprocessing approach, we propose a combined approach which preserves the strength of phrase-based SMT in local reordering and decoding time, as well as the strength of integrating syntax in reordering. As a result, we use an intermediate syntax between Part-of-Speech (POS) tags and a full parse tree: shallow parsing. Firstly, we use shallow parsing as preprocessing for both training and testing. Secondly, we apply a series of transformation rules to the shallow tree. We have two sets of transformation rules: the first set is written by hand, and the other is extracted automatically from the bilingual corpus. The experimental results on the English-Vietnamese pair show that our approach achieves significant improvements over MOSES, the state-of-the-art phrase-based system.
Chapter 2
Related works
In this chapter, we give some background knowledge and a short review of MT. According to the description in chapter 1, the target translation is the result of maximizing the product of the faithfulness (Pr(f|e)) and the fluency of the target sentence (Pr(e)). The fluency of the target sentence, known as the language model, can be modeled by n-grams. The faithfulness can be modeled by a translation unit with an alignment model between the two languages. The alignment model can be extracted automatically from the bilingual corpus (Brown et al., 1990, 1993; Och and Ney, 2003). Depending on the kind of translation unit, we have several methods:

• Word-based translation models: using the word as the translation unit
• Phrase-based translation models: using the phrase as the translation unit
• Syntax-based translation models: using syntactic structures as the translation unit

Passing over the word-based translation models, we describe PBSMT in section 2.1. Then section 2.2 gives some basic kinds of phrase movements in the reordering problem. One famous lexical reordering model, integrated in the decoding process, is described in section 2.3. Section 2.4 gives a brief review of methods that treat reordering as a preprocessing task in training and decoding, using transformation rules and syntax trees. Finally, we introduce the Moses Decoder (Koehn et al., 2007), which is used to train and decode our models.
2.1 Phrase-based Translation Models

PBSMT (Koehn et al., 2003; Och and Ney, 2004) extends the word-based SMT model. In this model, the faithfulness (Pr(f|e)) is extracted from a bilingual corpus by applying the alignment model (the IBM models (Brown et al., 1993)) to contiguous sequences of words. Koehn et al. (2003) and Jurafsky and Martin (2009) describe the general process of phrase-based translation in three steps. First, the words in the source sentence are grouped into phrases ē1, ē2, ..., ēI. Next, we translate each phrase ēi into a target phrase f̄j. Finally, each target phrase f̄j is (optionally) reordered. Koehn et al. (2003) use the IBM models to build the translation model. Two phrases are aligned if each word in the source phrase is aligned only with words in the target phrase; in particular, no word outside the phrase may be aligned with a word inside the phrase, and vice versa. The probability of a phrase translation can then be computed by the relative-frequency formula:

Pr(f̄|ē) = count(f̄, ē) / count(ē)
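The relative-frequency estimate above can be sketched as follows. The phrase pairs here are toy data, not from the thesis corpus.

```python
from collections import Counter

# Sketch of the relative-frequency estimate Pr(f|e) = count(f, e) / count(e)
# over a list of extracted (source phrase, target phrase) pairs.

def phrase_table(pairs):
    pair_count = Counter(pairs)                 # count(f, e)
    e_count = Counter(e for _, e in pairs)      # count(e)
    return {(f, e): c / e_count[e] for (f, e), c in pair_count.items()}

pairs = [("ngôi nhà", "the house"), ("ngôi nhà", "the house"),
         ("căn nhà", "the house"), ("tốt", "good")]
table = phrase_table(pairs)
print(table[("ngôi nhà", "the house")])  # 0.666... (2 of 3 extractions)
```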
Some alternative methods to extract phrases and learn phrase translation tables have been proposed (Marcu and Wong, 2002; Venugopal et al., 2003) and are compared in Koehn et al. (2003).
Another phrase-based model, introduced by Och and Ney (2004), uses a discriminative model. In this variant, the phrase translation probability and other features are integrated into a log-linear model following the formula:

Pr(e|f) = exp(Σi λi hi(e, f)) / Σe' exp(Σi λi hi(e', f))

where hi(e, f) is a feature function based on the source and target sentences, and λi is the weight of feature hi(e, f). Och (2003) introduced a method to estimate these weights using Minimum Error Rate Training (MERT). These studies also use some basic features such as:
• bidirectional phrase models (models that score phrase translation)
• bidirectional lexical models (models that consider the appearance of entries from a conventional translation lexicon in the phrase translation)
• language model of the target language (usually an n-gram model, i.e. an (n−1)-th order Markov model, trained on a monolingual corpus1)
• distortion model
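A minimal sketch of log-linear scoring over such features follows. The feature functions and weights here are invented placeholders; in practice the hi are log probabilities (phrase models, lexical models, language model, distortion model) and the λi are tuned with MERT.

```python
import math

# Sketch of the log-linear score: score(e, f) = sum_i lambda_i * h_i(e, f).
# Since the denominator of the log-linear model does not depend on e,
# picking the best candidate only needs the numerator's exponent.

def loglinear_score(e, f, features, weights):
    return sum(w * h(e, f) for h, w in zip(features, weights))

def rerank(f, candidates, features, weights):
    """Pick the candidate with the highest log-linear score."""
    return max(candidates, key=lambda e: loglinear_score(e, f, features, weights))

# Two toy features: a fake translation-model score and a fake LM score.
tm = {"i speak": math.log(0.6), "i talk": math.log(0.1)}
lm = {"i speak": math.log(0.02), "i talk": math.log(0.05)}
features = [lambda e, f: tm[e], lambda e, f: lm[e]]

print(rerank("je parle", ["i speak", "i talk"], features, [1.0, 1.0]))  # i speak
```

Changing the weights changes the winner: with weights [0.0, 1.0] only the language model matters and "i talk" is chosen, which is exactly the behavior MERT exploits when tuning.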
After building the translation model, we need a method to decode an input sentence into an output sentence. Normally, a kind of beam search or the A* method is used. Koehn (2004) introduced an effective method called stack decoding. The basic idea of this method is to use a set of limited stacks, one for each number of translated source words. For example, during translation, when three words have been translated, the resulting hypotheses are stored in the stack for coverage three. The pseudo code of stack decoding is described as follows:
2.2 Types of Phrase Orientation

Tillmann (2004) and Galley and Manning (2008) give the basic types of phrase reordering in a sentence. There are three basic types:

• monotone: a continuous phrase in the source language is also a continuous phrase in the target language
• swap: the phrase following a given phrase in the target language is aligned with the phrase preceding the corresponding phrase in the source language

1 A monolingual corpus is usually either the target side of the corpus used to learn the phrase table, or a large separate set of target sentences, independent of the corpora used for training and testing
Algorithm 1 A stack decoding algorithm, used in the Pharaoh system, to get the target sentence from a source sentence
Require: a sentence, which we would like to translate
Require: a phrase base model
Require: a language model
initialize hypothesisStack[0..nf ]
for all i in range(0, nf - 1) do
for all hyp in hypothesisStack[i] do
for all new hyp can be derived from hyp do
nf [new hyp] ← number of foreign words covered by new hyp
add new hyp to hypothesisStack[nf [new hyp]]
prune hypothesisStack[nf [new hyp]]
end for
end for
end for
find the best hypothesis best hyp in hypothesisStack[nf ]
return the best path leading to best hyp via backtrace
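For illustration, here is a much-simplified, monotone Python rendering of the stack-decoding loop of Algorithm 1. It is a sketch, not the Pharaoh implementation: the phrase table is toy data, and a real decoder also handles reordering, language-model context, hypothesis recombination, and future-cost estimation.

```python
import math

# Hypotheses are grouped in stacks by the number of source words covered,
# extended left to right with phrase translations, and pruned to a beam.

def stack_decode(source_words, phrase_table, beam=10):
    nf = len(source_words)
    stacks = [[] for _ in range(nf + 1)]
    stacks[0] = [(0.0, "")]  # (log score, partial translation)
    for i in range(nf):
        for score, partial in stacks[i]:
            # try every source phrase starting at position i
            for j in range(i + 1, nf + 1):
                src = " ".join(source_words[i:j])
                for tgt, p in phrase_table.get(src, []):
                    hyp = (score + math.log(p), (partial + " " + tgt).strip())
                    stacks[j].append(hyp)
        # prune the later stacks to the beam size
        for k in range(i + 1, nf + 1):
            stacks[k] = sorted(stacks[k], reverse=True)[:beam]
    return max(stacks[nf])[1]

table = {"je": [("i", 0.9)],
         "parle": [("speak", 0.7), ("talk", 0.3)],
         "je parle": [("i speak", 0.8)]}
print(stack_decode("je parle".split(), table))  # i speak
```

Note how the whole-phrase translation "je parle" → "i speak" (log 0.8) outscores the word-by-word path (log 0.9 + log 0.7), which is the usual argument for phrase-based over word-based models.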
• discontinuous: two consecutive phrases in one language are aligned with two discontinuous phrases in the other language
For example, consider this pair of sentences:

en: tom 's two blue books are good
vn: hai cuốn_sách màu_xanh của tôm là tốt

In this example, "are good" and "là tốt" represent the monotone case. For a swap instance, see the two phrases "blue books" and "cuốn_sách màu_xanh". Finally, "two blue" and "hai màu_xanh" are an example of discontinuous phrases.
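The three orientation types can be sketched as a small classification function over the source spans of two consecutive target phrases. The (start, end) span encoding is an assumption made for illustration.

```python
# Classify the orientation of the current target phrase relative to the
# previous one, given the (start, end) source spans they are aligned to.

def orientation(prev_src, cur_src):
    if cur_src[0] == prev_src[1] + 1:
        return "monotone"       # source continues right after the previous phrase
    if cur_src[1] == prev_src[0] - 1:
        return "swap"           # source phrase sits right before the previous one
    return "discontinuous"

# "cuốn_sách" (books, source position 4) followed by "màu_xanh" (blue, 3):
print(orientation((4, 4), (3, 3)))  # swap
# "là" (are, 6) followed by "tốt" (good, 7):
print(orientation((6, 6), (7, 7)))  # monotone
```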
2.2.1 The Distance Based Reordering Model
Koehn et al. (2003) introduced, alongside the phrase translation model, a simple distortion model based on an exponential penalty α, following the formula:

d(ei, ei−1) = α^|fi − fi−1 − 1|

so the distance between two consecutive target phrases is based on the difference of the positions of the corresponding phrases in the source language. For example, if the phrase translated previously ends at source position 1 and the current phrase starts at source position 3, the distance is d = α^|3−1−1| = α^1.
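A quick sketch of this distortion penalty; the positions and the value of α are illustrative only.

```python
# Distance-based distortion penalty d = alpha ** |f_i - f_{i-1} - 1|,
# where f_{i-1} is the end position of the previously translated source
# phrase and f_i is the start position of the current one (0 < alpha < 1).

def distortion(prev_end, cur_start, alpha=0.5):
    return alpha ** abs(cur_start - prev_end - 1)

print(distortion(3, 4))  # 1.0  (monotone step: no gap, no penalty)
print(distortion(1, 3))  # 0.5  (one source word skipped)
```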
2.3 The Lexical Reordering Model

Galley and Manning (2008), based on the log-linear model, introduced a new reordering model as a feature of the translation model. The features are parameterized as follows: given a source sentence f, a sequence of target phrases e = (ē1, ē2, ..., ēn), and a phrase alignment a = (a1, a2, ..., an) that defines a source phrase f̄ai for each translated phrase ēi, these models estimate the probability of a sequence of orientations o = (o1, o2, ..., on) as:

log p(o|e, f) = Σ_{i=1}^{n} log p(oi | ēi, f̄ai)

where each orientation oi takes one of the values monotone (M), swap (S), or discontinuous (D).
2.4 The Preprocessing Approaches

As mentioned in chapter 1, some approaches use syntactic information to solve the reordering problem. One of them applies syntactic parsing of the source language and reordering rules as a preprocessing step. The main idea is to transform the source sentences so that their word order is as close as possible to the target sentences; EM training then becomes much easier and word alignment quality becomes better. There are several studies on the reordering problem, such as Xia and McCord (2004); Collins et al. (2005); Nguyen and Shimazu (2006); Wang et al. (2007); Habash (2007); Xu et al. (2009). They all performed reordering during a preprocessing step based on parsing the source tree, combined either with automatically extracted syntactic rules (Xia and McCord, 2004; Nguyen and Shimazu, 2006; Habash, 2007) or with manually written rules (Collins et al., 2005; Wang et al., 2007; Xu et al., 2009).
Xu et al. (2009) described a method using a dependency parse tree and flexible rules to perform the reordering of subject, object, etc. These rules were written by hand, but Xu et al. (2009) showed that an automatic rule learner can also be used.
Collins et al. (2005) developed clause detection and used some handwritten rules to reorder words within the clause. In contrast, Xia and McCord (2004) and Habash (2007) built automatically extracted syntactic rules.
Compared with these approaches, our work has a few differences. Firstly, we aim to develop a phrase-based translation model to translate from English to Vietnamese. Secondly, we build a shallow tree by chunking recursively (chunks of chunks). Thirdly, we use not only automatic rules but also some handwritten rules to transform the source sentence. As Xia and McCord (2004) and Habash (2007) did, we also apply preprocessing at both training and decoding time.
Other approaches use syntactic parsing to provide multiple source sentence ordering options through word (or phrase) lattices (Zhang et al., 2007; Nguyen et al., 2007). Nguyen et al. (2007) applied transformation rules, learnt automatically from a bilingual corpus, to reorder words within a chunk. A crucial difference between their methods and ours is that they do not perform reordering during training, while our method does, by using a more complex structure: a shallow syntax tree (chunks of chunks), which is more efficient.
2.5 Translation Evaluation

In the previous sections, we introduced methods that help a computer translate sentences automatically from one language to another. Now the question is how good the results are. To answer it, this section introduces some methods to evaluate MT. In general, there are two ways to evaluate an MT system: by human judges, or automatically, by building a metric that simulates human judgment.
2.5.1 Automatic Metrics
2.5.1.1 BLEU Scores
Today, the BLEU score is one of the most popular methods to evaluate MT. This measure was introduced by IBM (Papineni et al., 2002) and is a kind of n-gram precision metric. The simple unigram precision metric is based on the frequency of co-occurrence of words in both the output and the reference sentence. Similarly, for the BLEU score we compute the n-gram co-occurrences between the output and the reference translation, and then take the weighted geometric mean. The BLEU score can be computed as follows:

BLEU_n = exp( (Σ_{i=1}^{n} blue_i) / n + length_penalty )

where blue_i and length_penalty are computed by counting over each sentence pair in the whole test and reference sets:

blue_n = log( N_matched_n / N_test_n )
length_penalty = min(0, 1 − shortest_ref_length / N_test_1)
From each pair of output and reference translations, we can compute N_matched_i, N_test_i, and shortest_ref_length as follows:

N_matched_i = Σ_{n=1}^{N} Σ_{ngr ∈ S} min( N(test_n, ngr), max_r N(ref_{n,r}, ngr) )
N_test_i = Σ_{n=1}^{N} Σ_{ngr ∈ S} N(test_n, ngr)
shortest_ref_length = Σ_{n=1}^{N} min_r length(ref_{n,r})

where S is the set of i-grams in the sentence test_n (the n-th sentence in the test set), N(sent, ngr) is the number of occurrences of the i-gram ngr in the sentence sent, N is the number of sentences in the test set, and ref_{n,r} is the r-th reference of the n-th test sentence. For example, for the n-th sentence in the test set, we have R sentences as its references.
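For illustration, here is a toy, single-reference version of this n-gram precision computation. It is simplified on purpose: one reference, a plain geometric mean, and sentence-level counts, whereas the real BLEU/mteval implementation uses multiple references and corpus-level statistics.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngr, ref_ngr = ngrams(cand, n), ngrams(ref, n)
        # clip each candidate n-gram count by its count in the reference
        matched = sum(min(c, ref_ngr[g]) for g, c in cand_ngr.items())
        total = max(sum(cand_ngr.values()), 1)
        log_precisions.append(math.log(max(matched, 1e-9) / total))
    # brevity penalty: only punish candidates shorter than the reference
    brevity = min(0.0, 1.0 - len(ref) / len(cand))
    return math.exp(sum(log_precisions) / max_n + brevity)

ref = "the cat is on the mat"
print(round(bleu("the cat is on the mat", ref), 3))  # 1.0
print(bleu("the cat", ref) < 1.0)  # True: short output is penalized
```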
2.5.2 NIST Scores
The NIST evaluation metric, introduced by Doddington (2002), is based on the BLEU metric, but with some differences. The NIST score calculates how informative a particular n-gram is: the rarer a correct n-gram is, the more weight it is given. The information weight of an n-gram is computed from its counts over the reference corpus:

Info(w_1 ... w_n) = log2( count(w_1 ... w_{n−1}) / count(w_1 ... w_n) )

2.5.3 Other scores
There are other evaluation metrics such as the Word Error Rate (mWER), the Position-Independent Error Rate (mPER), etc.
2.5.4 Human Evaluation Metrics
Human evaluation is said to be the most accurate method. To evaluate the translation of a bilingual sentence, some people who understand both languages rate the translation in many aspects. The weakness of this method is its cost and time: it needs many people to take part in evaluating the translations. Nevertheless, this method is used in some shared tasks.

In general, the fluency of the translation is used to rate its quality. To measure the fluency of the translation, we rely on criteria such as how intelligible, how clear, how readable, or how natural the MT output is (Jurafsky and Martin, 2009). One method is to give the raters, who evaluate the MT output, a scale and ask them to rate each output. For example, the scale can range from one (totally unintelligible) to five (totally intelligible), and similar scales can cover other aspects of fluency such as clarity, naturalness, or style.

Another aspect is the fidelity of the sentence. Two common aspects of fidelity are measured: adequacy and informativeness. The adequacy metric determines whether the output contains the information that exists in the source sentence. It also uses a scale from one to five (how much of the information in the source was preserved in the target) for raters to rate. This method is useful if we only have monolingual raters, native in one language. The other aspect is the informativeness of the translation: whether the MT output contains sufficient information to perform some task. This metric can be measured by the percentage of correct answers to questions answered by the raters.
2.6 Moses Decoder

Moses2 supports several language model toolkits, including:

• SRILM, introduced by Stolcke (2002); the limitation of this toolkit is the size of the monolingual corpus that can be loaded into it
• IRSTLM, introduced by Federico et al. (2008), which uses a quantization method to scale the language model to big data

2 http://statmt.org/moses
3 This figure is taken from an old version of the manual of the Moses Decoder