1.1 A Short Comparison Between English and Vietnamese
3.2.1 Definition of the shallow syntax
3.2.2 How to build the shallow syntax
3.4 Applying the transformation rule into the shallow syntax tree
5.2 Future work
List of Tables
Details of the experiments, in which AR is named as using automatically extracted rules
Translation performance for the English-Vietnamese task
List of Figures

An overview of preprocessing before training and decoding
A pair of source and target language sentences
The training process
The decoding process
A shallow syntax tree
The building of the shallow syntax
The building of the shallow syntax
CHAPTER 1
Introduction
In this chapter, we would like to give a brief overview of Statistical Machine Translation (SMT), to address the problem, the motivations of our work, and the main contributions of this thesis. Firstly, we introduce Machine Translation (MT), which is one of the big applications in Natural Language Processing (NLP), and an approach to solve this problem by using statistics. Then, we introduce the main problem of this thesis and our research motivations. The next section will describe the main contributions of this thesis. Finally, the content of this thesis will be outlined.
In the field of NLP, MT is a big application that helps a user automatically translate a sentence from one language to another. MT is very useful in real life: it helps us surf websites in foreign languages which we do not understand, or helps us understand the content of an advertising board on the street. However, high-quality MT is still a challenge for researchers. Firstly, the reason comes from the ambiguity of natural language at various levels. At the lexical level, we have problems with the morphology of the word, such as the word tense, or with word segmentation in languages such as Vietnamese, Japanese, Chinese or Thai, in which there is no symbol to separate two words. For example, in Vietnamese, we have the sentence "học sinh học sinh học.": "học" is a verb, which means "study" in English; "học sinh" is a noun, which means a pupil or student in English; "sinh học" is a noun, which means a subject (biology) in English. At the syntax level, we have the ambiguity of coordination. For example, in the sentence "the man saw the girl with the telescope", we can understand either that the man used the telescope to see the girl, or that the girl with the telescope was seen by the man. Moreover, the ambiguity is even more difficult at the semantic level.
Secondly, Jurafsky and Martin (2009) show that there are differences between a pair of languages, such as differences in structure, lexicon, etc., which make MT a challenge.
Specially, one of the differences between two languages, which we want to aim at in this thesis, is the order of words in each language. For example, English is a Subject-Verb-Object (SVO) language, which means the subject comes first, the verb follows the subject, and the end of the sentence is the object. In the sentence "I go to school", "I" is the subject, the verb is "go to" and the object is "school". Different from English, Japanese is an SOV language, and Classical Arabic is a VSO language.
In the past, the rule-based methods were the favorite. MT systems were built with manual rules, which are created by humans, so that, in a closed domain or restricted area,
the quality of a rule-based system is very high. However, with the growth of the internet and social networks, we need broad-coverage MT systems, and the rule-based method is not suitable. So, we need a new way to help MT, and statistics were applied to the field of MT. At the same time, statistical methods were applied in many studies: automatic speech recognition, etc. So the idea of using statistics for MT came out. Nowadays, there are some MT systems in which a statistical method is used that can compare with human translation, such as GOOGLE¹.
1.1.1 A Short Comparison Between English and Vietnamese
English and Vietnamese have some similarities: they are both based on the Latin character set and are both of the SVO structure type. For example:

en: I go to school

vn: Tôi đi học
But the order of words in an English noun phrase is different from that in a Vietnamese one. For example:

en: a black hat

vn: một mũ màu đen

In the above English example, "hat" is the head of the noun phrase and it stands at the end of the phrase. In Vietnamese, "mũ" is also the head noun, but it is in the middle of the phrase. The reordering of words can be seen in wh-questions, too.
en: what is your job?
vn: công_việc của anh là gì ?
In this example, the word "what" means "gì" in Vietnamese. The difference in the position of these two words can be easily seen, because English follows the S-Structure and Vietnamese follows the D-Structure.
1.2 Machine Translation Approaches

In this section, we would like to give a short overview of approaches in the field of machine translation. We would like to begin with the most complex method (interlingua) and end with the simplest one (the direct method). From a source sentence, we use some analysis methods to get complex structures, and then generate the structures or sentences in the target language. The most complex structure is the interlingua language (figure 1).
¹ translate.google.com
Figure 1: The machine translation pyramid
1.2.1 Interlingua
The interlingua systems (Farwell and Wilks, 1991; Mitamura, 1999) are based on the idea of finding a language, called the interlingua language, which represents the source language and is easy enough to generate the sentence in the other language. In figure 1, we can see the process of this approach. The analysis method is the understanding process: in this step, from the source sentence, we can use some techniques in NLP to map the source sentence to a data structure in the interlingua, and then retrieve the target sentence by the generating process. The problem is how complex the interlingua is. If the interlingua is simple, we can get many translation options. On the other hand, the more complex the interlingua is, the more costly the analysis and the generation are.
1.2.2 Transfer-based Machine Translation
Another approach is analyzing a complex structure (simpler than the interlingua structure), then using some transfer rules to get a similar structure in the target language, and then generating the target sentence. In this model, MT involves three phases: analysis, transfer and generation. Normally, we use all three phases. However, we sometimes use only two of the three phases, such as transferring from the source sentence to a structure in the target language and then generating the target sentence. For example, we would like to introduce a simple transfer rule to translate a source sentence to the target sentence²:

[Nominal → Adj Noun]source language → [Nominal → Noun Adj]target language
² This example is taken from Jurafsky and Martin (2009).
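As an illustration of how such an Adj-Noun transfer rule might be applied, here is a minimal sketch; the tag names `ADJ`/`NOUN` and the function name are hypothetical, not from the thesis:

```python
def apply_transfer_rule(tagged_phrase):
    """Reorder an (Adj, Noun) pair in the source noun phrase into
    (Noun, Adj) order for the target language.

    tagged_phrase: list of (word, tag) pairs; tag names are assumptions."""
    out = list(tagged_phrase)
    for i in range(len(out) - 1):
        # Swap whenever an adjective directly precedes its head noun.
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

For the example above, `apply_transfer_rule([("black", "ADJ"), ("hat", "NOUN")])` yields the Vietnamese-like order `hat black` (mũ … đen).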
1.2.3 Direct Translation
1.2.3.1 Example-based Machine Translation
Example-based machine translation was first introduced by Nagao (1984); the author used a bilingual corpus with parallel texts as its main knowledge base at run time. The idea behind it is finding the patterns in the bilingual corpus and combining them with the parallel text to generate the new target sentence. This method is similar to the process in the human brain. Finally, the problems of example-based machine translation come from the matching criteria, the length of the fragments, etc.
1.2.3.2 Statistical Machine Translation
Extending the idea of using statistics for speech recognition, Brown et al. (1990, 1993) introduced a statistical method, a version of the noisy channel, for MT. Applying the noisy channel to machine translation, the target sentence is transformed to the source sentence by the noisy channel. We can represent the MT problem as three tasks of the noisy channel:

forward task: compute the fluency of the target sentence

learning task: from the parallel corpus, find the conditional probability between the target sentence and the source sentence

decoding task: find the best target sentence from the source sentence

So the decoding task can be represented by this formula:
ê = argmax_e Pr(e|f)

Applying the Bayes rule, we have:

ê = argmax_e Pr(f|e) · Pr(e) / Pr(f)

Because of the same denominator, we have:

ê = argmax_e Pr(f|e) · Pr(e)
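The final argmax can be sketched directly in code; the candidate list and the two scoring callbacks below are illustrative assumptions (a real decoder searches a far larger hypothesis space rather than a fixed list):

```python
def decode(source, candidates, tm_logprob, lm_logprob):
    """Pick e_hat = argmax_e Pr(f|e) * Pr(e), working in log space.

    candidates: list of candidate target sentences e.
    tm_logprob(f, e): log Pr(f|e)  (faithfulness / translation model)
    lm_logprob(e):    log Pr(e)    (fluency / language model)
    Both scoring functions are assumptions supplied by the caller."""
    # log Pr(f|e) + log Pr(e) ranks candidates identically to Pr(f|e)*Pr(e).
    return max(candidates, key=lambda e: tm_logprob(source, e) + lm_logprob(e))
```

With toy tables, `decode("tốt", ["good", "fine"], tm, lm)` returns whichever candidate maximizes the combined log score.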
Jurafsky and Martin (2000, 2009) define Pr(e) as the fluency of the target sentence, known as the language model. It is usually modeled by an n-gram or n-th order Markov model. Pr(f|e) is defined as the faithfulness between the source and target language. We use the alignment model to compute this value based on the translation unit of the SMT system. Based on the definition of the translation unit, we have some approaches:
• word-based: using a word as the translation unit (Brown et al., 1993)
• phrase-based: using a phrase as the translation unit (Koehn et al., 2003)
• syntax-based: using syntax as the translation unit (Yamada and Knight, 2001)
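The fluency component Pr(e) mentioned above is typically an n-gram model. A minimal bigram maximum-likelihood sketch (the `<s>`/`</s>` boundary markers and the toy corpus are illustrative assumptions; real systems also smooth the counts):

```python
from collections import Counter
import math

def train_bigram_lm(corpus):
    """Maximum-likelihood bigram model:
    Pr(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
    corpus: list of tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])              # history counts
        bigrams.update(zip(toks[:-1], toks[1:]))

    def logprob(sentence):
        toks = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(bigrams[(a, b)] / unigrams[a])
                   for a, b in zip(toks[:-1], toks[1:]))
    return logprob
```

For example, training on three toy sentences and scoring `["i", "go"]` returns the sum of the bigram log probabilities along the sentence.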
1.3 The Reordering Problem and Motivations
In the field of MT, the reordering problem is the task of reordering the words in the target language to get the best target sentence. Sometimes, we call the reordering model the distortion model.
Phrase-based Statistical Machine Translation (PBSMT), which was introduced by Koehn et al. (2003); Och and Ney (2004), is currently the state-of-the-art model in word choice and local word reordering. The translation unit is a sequence of words without linguistic information. Therefore, in this thesis, we would like to integrate some linguistic information, such as chunking, a shallow syntax tree or transformation rules, with a special aim at solving the global reordering problem.
There are some studies on integrating syntactic resources within SMT. Chiang (2005) shows significant improvement by keeping the strengths of phrases, while incorporating syntax into SMT. Chiang (2005) built a kind of syntax tree based on a synchronous Context-Free Grammar (CFG), known as hierarchical phrases, used a log-linear model to determine the weights of the extracted rules, and developed a variant of the CYK algorithm to implement decoding. So the reordering of phrases is defined by the synchronous CFG.
Some approaches have been applied at the word level (Collins et al., 2005). They are particularly useful for languages with rich morphology, for reducing data sparseness. Other kinds of syntax reordering methods require parse trees, such as the work in Quirk et al. (2005); Collins et al. (2005); Huang and Mi (2010). The parse tree is more powerful in capturing the sentence structure. However, it is expensive to create the tree structure, and building a good-quality parser is also a hard task. All the above approaches require much decoding time, which is expensive.
The approach we are interested in here is to balance the quality of translation with decoding time. Reordering approaches used as a preprocessing step (Xia and McCord, 2004; Xu et al., 2009; Talbot et al., 2011; Katz-Brown et al., 2011) are very effective (significant improvement over state-of-the-art phrase-based and hierarchical machine translation systems, and separate quality evaluation of reordering models).
Inspired by this preprocessing approach, we have proposed a combination approach which preserves the strength of phrase-based SMT in local reordering and decoding time, as well as the strength of integrating syntax in reordering. As a result, we use an intermediate syntax between the Part-of-Speech (POS) tags and the parse tree: shallow parsing. Firstly, we use shallow parsing for preprocessing in training and testing. Secondly, we apply a series of transformation rules to the shallow tree. We have two sets of transformation rules: the first set is written by hand, and the other is extracted automatically from the bilingual corpus. The experimental results on the English-Vietnamese pair showed that our approach achieves significant improvements over MOSES, which is the state-of-the-art phrase-based system.
Chapter 2 will describe the other studies related to our method. Our method will be introduced in chapter 3. In the next chapter (chapter 4), we will give the details of some experiments and the results to prove the effect of our method. The summary of our study will be seen in chapter 5. This chapter will also discuss the future work.
CHAPTER 2

Related Works

The fluency of the target sentence, known as the language model, can be modeled by n-grams. The faithfulness can be modeled by a translation unit with an alignment model between two languages. The alignment model can be extracted automatically from the bilingual corpus (Brown et al., 1990, 1993; Och and Ney, 2003). With many kinds of translation units, we have some methods:
• Word-based translation models: using a word as the translation unit
• Phrase-based translation models: using a phrase as the translation unit
• Syntax-based translation models: using syntax as the translation unit
Passing by the word-based translation models, we will describe PBSMT in section 2.1. Then section 2.2 gives some basic kinds of movements of phrases in the reordering problem, and one famous lexical reordering model, integrated in the decoding process, is seen in section 2.3. Section 2.4 gives a brief overview of methods treating reordering as a preprocessing task in training and decoding, using transformation rules and syntax trees. Finally, we would like to introduce the Moses decoder (Koehn et al., 2007), which is used to train and decode our models.
2.1 Phrase-based Statistical Machine Translation

PBSMT (Koehn et al., 2003; Och and Ney, 2004) is extended from the word-based SMT model. In this model, the faithfulness (Pr(f|e)) is extracted from the bilingual corpus by using the alignment model (the IBM models (Brown et al., 1993)) with contiguous sequences of words. Koehn et al. (2003); Jurafsky and Martin (2009) describe the general process of phrase-based translation in three steps. First, the words in the source sentence are grouped into phrases: ē₁, ē₂, …, ē_I. Next, we translate each phrase ē_i into a target phrase f̄_i. Finally, each target phrase f̄_i is (optionally) reordered. Koehn et al. (2003) use the IBM models to build the translation model. Two aligned phrases are phrases in which each word in the source phrase can be aligned with a word in the target phrase; specially, no word outside the phrase can be aligned with a word inside the phrase, and vice versa. And the probability of phrase translation can be computed by the formula:

Pr(f̄|ē) = count(f̄, ē) / count(ē)
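A minimal sketch of this relative-frequency estimate; the extracted phrase pairs below are toy data standing in for the output of a real phrase-extraction step:

```python
from collections import Counter

def phrase_table(extracted_pairs):
    """Estimate Pr(f_phrase | e_phrase) = count(f, e) / count(e) by
    relative frequency over extracted (f_phrase, e_phrase) tuples."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)   # marginal count(e)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}
```

For example, if "a hat" was extracted three times, twice paired with "một mũ", then Pr(một mũ | a hat) = 2/3.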
Some alternative methods to extract phrases and learn phrase translation tables have been proposed (Marcu and Wong, 2002; Venugopal et al., 2003) and compared in Koehn's publication (Koehn et al., 2003).
Another phrase-based model, introduced by Och and Ney (2004), uses a discriminative model. In this model, the phrase translation probability and other features are integrated into a log-linear model following the formula:

Pr(e|f) = exp(Σᵢ λᵢ hᵢ(e, f)) / Σ_{e′} exp(Σᵢ λᵢ hᵢ(e′, f))
where hᵢ(e, f) is a feature function that is based on the source and target sentences, and λᵢ is the weight of feature hᵢ(e, f). Och (2003) introduced a method to estimate these weights using Minimum Error Rate Training (MERT). In their studies, they also use some basic features such as:
• bidirectional phrase models (models that score phrase translations)
• bidirectional lexical models (models that consider the appearance of entries from a conventional translation lexicon in the phrase translation)
• a language model of the target language (usually an n-gram model or n-th order Markov model, which is trained from a mono corpus¹)
• a distortion model
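For ranking candidates at decoding time, the unnormalized score Σᵢ λᵢ hᵢ(e, f) suffices, because the denominator Σ_{e′} exp(…) is the same for every candidate given a fixed f. A minimal sketch (the feature functions and weights are illustrative, not the Moses feature set):

```python
def loglinear_score(e, f, features, weights):
    """Unnormalized log-linear score: sum_i lambda_i * h_i(e, f).

    features: list of feature functions h_i(e, f) -> float
    weights:  matching list of lambda_i"""
    return sum(w * h(e, f) for w, h in zip(weights, features))
```

Tuning the weights (e.g. with MERT) then amounts to choosing the `weights` vector that maximizes translation quality on a development set.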
After having the translation model, we need a method to decode an input sentence into an output sentence. Normally, a kind of beam search or the A* method is used. Koehn (2004) introduced an effective method, called stack decoding. The basic idea of this method is using a limited stack for hypotheses covering the same number of words. For example, in the process of translation, when three words are translated, the resulting hypothesis will be stored in the stack for length three. The pseudo code of stack decoding is described as follows:
Tillmann (2004); Galley and Manning (2008) give the basic types of phrase reordering in a sentence. There are three basic types:

• monotone: a continuous phrase in the source language is also a continuous phrase in the target language
• swap: the next phrase after one phrase in the target language is the previous phrase of the one aligned with it in the source language
• discontinuous: two phrases that are continuous in the source language are no longer adjacent in the target language
¹ A mono corpus is usually the same as the training corpus, which is used to learn the phrase table, or a big set of target sentences, independent of the corpus used to train and test.
2.2 The Lexical Reordering Model
Algorithm 1 A stack decoding algorithm, which is used in the Pharaoh system, to get the target sentence from the source sentence

Input:
  a sentence, which we would like to translate
  a phrase-based model
  a language model

initialize hypothesisStack[0..nf]
for all i in range(0, nf - 1) do
  for all hyp in hypothesisStack[i] do
    for all new_hyp that can be derived from hyp do
      nf[new_hyp] ← number of foreign words covered by new_hyp
      add new_hyp to hypothesisStack[nf[new_hyp]]
      prune hypothesisStack[nf[new_hyp]]
    end for
  end for
end for
find best hypothesis best_hyp in hypothesisStack[nf]
return the best path leading to best_hyp via backtrace
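The pseudo code above can be sketched as a runnable (and heavily simplified, monotone-only) Python decoder; the phrase-table format, the pruning limit, and the absence of reordering are all assumptions made to keep the sketch short:

```python
def stack_decode(source, phrase_table, lm_logprob, max_stack=10):
    """Simplified monotone stack decoding in the spirit of Algorithm 1.
    Hypotheses are grouped by the number of source words covered, and
    each stack is pruned to its best entries.

    phrase_table: dict mapping a tuple of source words to a list of
                  (target_phrase, log_prob) options."""
    nf = len(source)
    stacks = [[] for _ in range(nf + 1)]
    stacks[0].append((0.0, ()))                  # (score, target phrases so far)
    for i in range(nf):
        stacks[i] = sorted(stacks[i], reverse=True)[:max_stack]  # prune
        for score, target in stacks[i]:
            for j in range(i + 1, nf + 1):       # extend by translating source[i:j]
                for tgt, lp in phrase_table.get(tuple(source[i:j]), []):
                    stacks[j].append((score + lp + lm_logprob(tgt),
                                      target + (tgt,)))
    best_score, best_target = max(stacks[nf])    # best full-coverage hypothesis
    return " ".join(best_target), best_score
```

A real decoder additionally tracks coverage bitmaps so phrases can be translated out of order, and uses future-cost estimates when pruning.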
For example, we have a pair of sentences:

en: tom 's two blue books are good

vn: hai cuốn_sách màu_xanh của tom là tốt

In this example, are good and là tốt is represented as monotone. With a swap instance, we see the two phrases blue books and cuốn_sách màu_xanh. Finally, two blue and hai màu_xanh is the example of a discontinuous phrase.
2.2.1 The Distance Based Reordering Model
Koehn et al. (2003) introduced the phrase translation model with a simple distortion model based on the exponent of a penalty α, as in the formula:

d(aᵢ, bᵢ₋₁) = α^|aᵢ − bᵢ₋₁ − 1|

so that the distance of two continuous phrases is based on the difference of the correlative phrases in the target language. For example, the distance of the two phrases hai and màu_xanh is d(4, 3) = α^|4−3−1| = 1.
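A minimal sketch of this distance-based penalty over a sequence of source spans; the α value, the span format, and the b₀ = 0 convention are illustrative assumptions:

```python
def distortion_penalty(spans, alpha=0.5):
    """Distance-based reordering cost: the product over phrases of
    alpha^|a_i - b_{i-1} - 1|, where a_i is the start position of the
    source span covered by phrase i and b_{i-1} is the end position of
    the previous phrase's span (b_0 = 0 by convention).

    spans: list of (start, end) source positions, in target order."""
    cost, prev_end = 1.0, 0
    for start, end in spans:
        cost *= alpha ** abs(start - prev_end - 1)  # no penalty for monotone steps
        prev_end = end
    return cost
```

Monotone translation therefore incurs no penalty, while every skipped or revisited source position multiplies in another factor of α.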
Galley and Manning (2008), based on the log-linear model, introduced a new reordering model as a feature of the translation model. The features are parameterized as follows: with
a source sentence f, a sequence of phrases ē = (ē₁, ē₂, …, ē_n), and a phrase alignment a = (a₁, a₂, …, a_n) that defines a source phrase f_{aᵢ} for each translated phrase ēᵢ, those models estimate the probability of a sequence of orientations o = (o₁, o₂, …, o_n) as follows:
in which each oᵢ takes a value over the set of possible orientations A = {M, S, D}. We can define the three types as:

oᵢ = M if aᵢ − aᵢ₋₁ = 1
oᵢ = S if aᵢ − aᵢ₋₁ = −1
oᵢ = D if |aᵢ − aᵢ₋₁| ≠ 1
At decoding time, they define three feature functions such as:

f_M = Σᵢ log p(oᵢ = M | ·)
f_S = Σᵢ log p(oᵢ = S | ·)
f_D = Σᵢ log p(oᵢ = D | ·)
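Deriving the orientation sequence from the source positions a is mechanical; a minimal sketch (toy input; a real model then estimates the conditional probabilities p(oᵢ | ·) from such sequences over the whole corpus):

```python
def orientations(a):
    """Classify each phrase step as M(onotone), S(wap) or D(iscontinuous)
    from the source-position sequence a = (a_1, ..., a_n):
      M if a_i - a_{i-1} == 1
      S if a_i - a_{i-1} == -1
      D otherwise."""
    out = []
    for prev, cur in zip(a, a[1:]):
        d = cur - prev
        out.append("M" if d == 1 else "S" if d == -1 else "D")
    return out
```

For instance, the position sequence (1, 2, 4, 3) yields one monotone step, one discontinuous jump, and one swap.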
As mentioned in chapter 1, some approaches using syntactic information are applied to solve the reordering problem. One of the approaches is syntactic parsing of the source language with reordering rules as preprocessing steps. The main idea is transforming the source sentences to get target sentences as close in word order as possible, so EM training is much easier and word alignment quality becomes better. There are several studies to improve the reordering problem, such as Xia and McCord (2004); Collins et al. (2005); Nguyen and Shimazu (2006); Wang et al. (2007); Habash (2007); Xu et al. (2009). They all performed reordering during the preprocessing step based on source tree parsing, combining either automatically extracted syntactic rules (Xia and McCord, 2004; Nguyen and Shimazu, 2006; Habash, 2007) or manually written rules (Collins et al., 2005; Wang et al., 2007; Xu et al., 2009).
Xu et al. (2009) described a method using a dependency parse tree and flexible rules to perform the reordering of subjects and objects. These rules were written by hand, but Xu et al. (2009) showed that an automatic rule learner can be used.
Collins et al. (2005) developed a clause detection method and used some handwritten rules to reorder words in the clause. Partly, Xia and McCord (2004); Habash (2007) built automatically extracted syntactic rules.
Compared with these approaches, our work has a few differences. Firstly, we aim to develop the phrase-based translation model to translate from English to Vietnamese. Secondly, we build a shallow tree by chunking recursively (chunk of chunk). Thirdly, we
In the previous sections, we introduced the methods which help a computer automatically translate a sentence from one language to others. Now, the question is how good the results are. To answer this question, this section will introduce some methods to evaluate MT. In general, there are two ways to evaluate the MT task: evaluation by humans, or automatic evaluation by building a metric to simulate the human process.
2.5.1 Automatic Metrics
2.5.1.1 BLEU Scores
Today, the BLEU score is one of the most favored methods to evaluate MT. This measure was introduced by IBM (Papineni et al., 2002). This method is a kind of unigram precision metric. The simple unigram precision metric was based on the frequency of co-occurring words in both the output and the reference sentence. Similarly, in the BLEU score, we also compute the n-gram co-occurrence between the output and the reference translation, and then take the weighted geometric mean. So we can compute the BLEU score:
log precisionₙ = log((number of matched n-grams) / (number of n-grams in the output))

length penalty = min(0, 1 − (shortest reference length / output length))

log BLEU = length penalty + (1/N) Σ_{n=1}^{N} log precisionₙ
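A minimal single-reference sketch of this computation, with clipped n-gram counts and toy tokenized inputs (real evaluations use multiple references and corpus-level counts):

```python
from collections import Counter
import math

def bleu(output, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped n-gram
    precisions, times the brevity penalty exp(min(0, 1 - ref_len/out_len)).

    output, reference: lists of tokens."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        out_ngrams = Counter(tuple(output[i:i + n])
                             for i in range(len(output) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clip each output n-gram count by its count in the reference.
        matched = sum(min(c, ref_ngrams[g]) for g, c in out_ngrams.items())
        if matched == 0:
            return 0.0          # a zero precision zeroes the geometric mean
        log_prec_sum += math.log(matched / sum(out_ngrams.values()))
    brevity = min(0.0, 1.0 - len(reference) / len(output))
    return math.exp(brevity + log_prec_sum / max_n)
```

An output identical to its reference scores 1.0; any missing 4-gram overlap drives the score toward 0.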