1.1 A Short Comparison Between English and Vietnamese
3.2.1 Definition of the shallow syntax
3.2.2 How to build the shallow syntax
3.4 Applying the transformation rule into the shallow syntax tree
5.2 Future work
List of Tables
Details of the experiments, in which AR is named as using automatically extracted rules
Translation performance for the English-Vietnamese task
List of Figures

An overview of preprocessing before training and decoding
A pair of source and target language sentences
The training process
The decoding process
A shallow syntax tree
The building of the shallow syntax
The building of the shallow syntax
CHAPTER 1
Introduction
In this chapter, we would like to give a brief overview of Statistical Machine Translation (SMT), to address the problem, the motivations of our work, and the main contributions of this thesis. Firstly, we introduce Machine Translation (MT), which is one of the big applications in Natural Language Processing (NLP), and an approach to solve this problem by using statistics. Then, we introduce the main problem of this thesis and our research motivations. The next section will describe the main contributions of this thesis. Finally, the content of this thesis will be outlined.
In the field of NLP, MT is a big application that helps a user automatically translate a sentence from one language to another. MT is very useful in real life: it helps us surf websites in foreign languages which we do not understand, or helps us understand the content of an advertising board on the street. However, high-quality MT is still a challenge for researchers. Firstly, the reason comes from the ambiguity of natural language at various levels. At the lexical level, we have problems with the morphology of the word, such as the word tense, or with word segmentation in languages such as Vietnamese, Japanese, Chinese or Thai, in which there is no symbol to separate two words. For example, in Vietnamese, we have the sentence "học sinh học sinh học.": "học" is a verb, which means "study" in English; "học sinh" is a noun, which means a pupil or student in English; "sinh học" is a noun, which means a subject (biology) in English. At the syntax level, we have the ambiguity of coordination. For example, in the sentence "the man saw the girl with the telescope", we can understand either that the man used the telescope to see the girl, or that the girl with the telescope was seen by the man. Moreover, the ambiguity is even more difficult at the semantic level.
Secondly, Jurafsky and Martin (2009) show that there are differences between a pair of languages, such as differences in structure, lexicon, etc., which make MT a challenge.
Specially, one of the differences between two languages, which we want to aim at in this thesis, is the order of words in each language. For example, English is a Subject-Verb-Object (SVO) language, which means the subject comes first, the verb follows the subject, and the end of the sentence is the object. In the sentence "I go to school", "I" is the subject, the verb is "go to" and the object is "school". Different from English, Japanese is an SOV language, and Classical Arabic is a VSO language.
In the past, the rule-based methods were the favorite. MT systems were built with manual rules, which are created by humans, so that, in a closed domain or restricted area,
the quality of a rule-based system is very high. However, with the growth of the internet and social networks, we need broad-coverage MT systems, and the rule-based method is not suitable. So, we need a new way to help MT, and statistics were applied to the field of MT. At the same time, statistical methods were applied in many studies: automatic speech recognition, etc. So the idea of using statistics for MT came out. Nowadays, there are some MT systems in which a statistical method is used that can compare with human translation, such as GOOGLE¹.
1.1.1 A Short Comparison Between English and Vietnamese
English and Vietnamese have some similarities: they are both based on the Latin character set and are both of the SVO structure type. For example:

en: I go to school

vn: Tôi đi học
But the order of words in an English noun phrase is different from that in a Vietnamese one. For example:

en: a black hat

vn: một mũ màu đen

In the above English example, "hat" is the head of the noun phrase and it stands at the end of the phrase. In Vietnamese, "mũ" is also the head noun, but it is in the middle of the phrase. The reordering of words can be seen in wh-questions, too.
en: what is your job?
vn: công_việc của anh là gì ?
In this example, the word "what" means "gì" in Vietnamese. The difference in the position of these two words can be easily seen, because English follows the S-Structure and Vietnamese follows the D-Structure.
1.2 Machine Translation Approaches

In this section, we would like to give a short overview of approaches in the field of machine translation. We would like to begin with the most complex method (interlingua) and end with the simplest one (the direct method). From a source sentence, we use some analysis methods to get complex structures, and then generate the structures or sentences in the target language. The most complex structure is the interlingua language (figure 1).
¹ translate.google.com
Figure 1: The machine translation pyramid
1.2.1 Interlingua
The interlingua systems (Farwell and Wilks, 1991; Mitamura, 1999) are based on the idea of finding a language, called the interlingua language, which represents the source language and is easy enough to generate the sentence in the other language. In figure 1, we can see the process of this approach. The analysis method is the understanding process: in this step, from the source sentence, we can use some techniques in NLP to map the source sentence to a data structure in the interlingua, and then retrieve the target sentence by the generating process. The problem is how complex the interlingua is. If the interlingua is simple, we can get many translation options. On the other hand, the more complex the interlingua is, the more costly the analysis and the generation are.
1.2.2 Transfer-based Machine Translation
Another approach is analyzing a complex structure (simpler than the interlingua structure), then using some transfer rules to get a similar structure in the target language, and then generating the target sentence. In this model, MT involves three phases: analysis, transfer and generation. Normally, we use all three phases. However, we sometimes use only two of the three phases, such as transferring from the source sentence to a structure in the target language and then generating the target sentence. For example, we would like to introduce a simple transfer rule to translate a source sentence to the target sentence²:

[Nominal → Adj Noun]source language → [Nominal → Noun Adj]target language
² This example is taken from Jurafsky and Martin (2009).
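As an illustration of how such an Adj-Noun transfer rule might be applied, here is a minimal sketch; the tag names `ADJ`/`NOUN` and the function name are hypothetical, not from the thesis:

```python
def apply_transfer_rule(tagged_phrase):
    """Reorder an (Adj, Noun) pair in the source noun phrase into
    (Noun, Adj) order for the target language.

    tagged_phrase: list of (word, tag) pairs; tag names are assumptions."""
    out = list(tagged_phrase)
    for i in range(len(out) - 1):
        # Swap whenever an adjective directly precedes its head noun.
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

For the example above, `apply_transfer_rule([("black", "ADJ"), ("hat", "NOUN")])` yields the Vietnamese-like order `hat black` (mũ … đen).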
1.2.3 Direct Translation
1.2.3.1 Example-based Machine Translation
Example-based machine translation was first introduced by Nagao (1984); the author used a bilingual corpus with parallel texts as its main knowledge base at run time. The idea behind it is finding the patterns in the bilingual corpus and combining them with the parallel text to generate the new target sentence. This method is similar to the process in the human brain. Finally, the problems of example-based machine translation come from the matching criteria, the length of the fragments, etc.
1.2.3.2 Statistical Machine Translation
Extending the idea of using statistics for speech recognition, Brown et al. (1990, 1993) introduced a statistical method, a version of the noisy channel, for MT. Applying the noisy channel to machine translation, the target sentence is transformed to the source sentence by the noisy channel. We can represent the MT problem as three tasks of the noisy channel:

forward task: compute the fluency of the target sentence

learning task: from the parallel corpus, find the conditional probability between the target sentence and the source sentence

decoding task: find the best target sentence from the source sentence

So the decoding task can be represented by this formula:
ê = argmax_e Pr(e|f)

Applying the Bayes rule, we have:

ê = argmax_e Pr(f|e) · Pr(e) / Pr(f)

Because of the same denominator, we have:

ê = argmax_e Pr(f|e) · Pr(e)
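The final argmax can be sketched directly in code; the candidate list and the two scoring callbacks below are illustrative assumptions (a real decoder searches a far larger hypothesis space rather than a fixed list):

```python
def decode(source, candidates, tm_logprob, lm_logprob):
    """Pick e_hat = argmax_e Pr(f|e) * Pr(e), working in log space.

    candidates: list of candidate target sentences e.
    tm_logprob(f, e): log Pr(f|e)  (faithfulness / translation model)
    lm_logprob(e):    log Pr(e)    (fluency / language model)
    Both scoring functions are assumptions supplied by the caller."""
    # log Pr(f|e) + log Pr(e) ranks candidates identically to Pr(f|e)*Pr(e).
    return max(candidates, key=lambda e: tm_logprob(source, e) + lm_logprob(e))
```

With toy tables, `decode("tốt", ["good", "fine"], tm, lm)` returns whichever candidate maximizes the combined log score.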
Jurafsky and Martin (2000, 2009) define Pr(e) as the fluency of the target sentence, known as the language model. It is usually modeled by an n-gram or n-th order Markov model. Pr(f|e) is defined as the faithfulness between the source and target language. We use the alignment model to compute this value based on the translation unit of the SMT system. Based on the definition of the translation unit, we have some approaches:
• word-based: using a word as the translation unit (Brown et al., 1993)
• phrase-based: using a phrase as the translation unit (Koehn et al., 2003)
• syntax-based: using syntax as the translation unit (Yamada and Knight, 2001)
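The fluency component Pr(e) mentioned above is typically an n-gram model. A minimal bigram maximum-likelihood sketch (the `<s>`/`</s>` boundary markers and the toy corpus are illustrative assumptions; real systems also smooth the counts):

```python
from collections import Counter
import math

def train_bigram_lm(corpus):
    """Maximum-likelihood bigram model:
    Pr(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
    corpus: list of tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])              # history counts
        bigrams.update(zip(toks[:-1], toks[1:]))

    def logprob(sentence):
        toks = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(bigrams[(a, b)] / unigrams[a])
                   for a, b in zip(toks[:-1], toks[1:]))
    return logprob
```

For example, training on three toy sentences and scoring `["i", "go"]` returns the sum of the bigram log probabilities along the sentence.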
1.3 The Reordering Problem and Motivations
In the field of MT, the reordering problem is the task of reordering the words in the target language to get the best target sentence. Sometimes, we call the reordering model the distortion model.
Phrase-based Statistical Machine Translation (PBSMT), which was introduced by Koehn et al. (2003); Och and Ney (2004), is currently the state-of-the-art model in word choice and local word reordering. The translation unit is a sequence of words without linguistic information. Therefore, in this thesis, we would like to integrate some linguistic information, such as chunking, a shallow syntax tree or transformation rules, with a special aim at solving the global reordering problem.
There are some studies on integrating syntactic resources within SMT. Chiang (2005) shows significant improvement by keeping the strengths of phrases, while incorporating syntax into SMT. Chiang (2005) built a kind of syntax tree based on a synchronous Context-Free Grammar (CFG), known as hierarchical phrases, used a log-linear model to determine the weights of the extracted rules, and developed a variant of the CYK algorithm to implement decoding. So the reordering of phrases is defined by the synchronous CFG.
Some approaches have been applied at the word level (Collins et al., 2005). They are particularly useful for languages with rich morphology, for reducing data sparseness. Other kinds of syntax reordering methods require parse trees, such as the work in Quirk et al. (2005); Collins et al. (2005); Huang and Mi (2010). The parse tree is more powerful in capturing the sentence structure. However, it is expensive to create the tree structure, and building a good-quality parser is also a hard task. All the above approaches require much decoding time, which is expensive.
The approach we are interested in here is to balance the quality of translation with decoding time. Reordering approaches used as a preprocessing step (Xia and McCord, 2004; Xu et al., 2009; Talbot et al., 2011; Katz-Brown et al., 2011) are very effective (significant improvement over state-of-the-art phrase-based and hierarchical machine translation systems, and separate quality evaluation of reordering models).
Inspired by this preprocessing approach, we have proposed a combination approach which preserves the strength of phrase-based SMT in local reordering and decoding time, as well as the strength of integrating syntax in reordering. As a result, we use an intermediate syntax between the Part-of-Speech (POS) tags and the parse tree: shallow parsing. Firstly, we use shallow parsing for preprocessing in training and testing. Secondly, we apply a series of transformation rules to the shallow tree. We have two sets of transformation rules: the first set is written by hand, and the other is extracted automatically from the bilingual corpus. The experimental results on the English-Vietnamese pair showed that our approach achieves significant improvements over MOSES, which is the state-of-the-art phrase-based system.
Chapter 2 will describe the other studies related to our method. Our method will be introduced in chapter 3. In the next chapter (chapter 4), we will give the details of some experiments and the results to prove the effect of our method. The summary of our study will be seen in chapter 5. This chapter will also discuss the future work.
CHAPTER 2

Related Works

The fluency of the target sentence, known as the language model, can be modeled by n-grams. The faithfulness can be modeled by a translation unit with an alignment model between two languages. The alignment model can be extracted automatically from the bilingual corpus (Brown et al., 1990, 1993; Och and Ney, 2003). With many kinds of translation units, we have some methods:
• Word-based translation models: using a word as the translation unit
• Phrase-based translation models: using a phrase as the translation unit
• Syntax-based translation models: using syntax as the translation unit
Passing by the word-based translation models, we will describe PBSMT in section 2.1. Then section 2.2 gives some basic kinds of movements of phrases in the reordering problem, and one famous lexical reordering model, integrated in the decoding process, is seen in section 2.3. Section 2.4 gives a brief overview of methods treating reordering as a preprocessing task in training and decoding, using transformation rules and syntax trees. Finally, we would like to introduce the Moses decoder (Koehn et al., 2007), which is used to train and decode our models.
2.1 Phrase-based Statistical Machine Translation

PBSMT (Koehn et al., 2003; Och and Ney, 2004) is extended from the word-based SMT model. In this model, the faithfulness (Pr(f|e)) is extracted from the bilingual corpus by using the alignment model (the IBM models (Brown et al., 1993)) with contiguous sequences of words. Koehn et al. (2003); Jurafsky and Martin (2009) describe the general process of phrase-based translation in three steps. First, the words in the source sentence are grouped into phrases: ē₁, ē₂, …, ē_I. Next, we translate each phrase ē_i into a target phrase f̄_i. Finally, each target phrase f̄_i is (optionally) reordered. Koehn et al. (2003) use the IBM models to build the translation model. Two aligned phrases are phrases in which each word in the source phrase can be aligned with a word in the target phrase; specially, no word outside the phrase can be aligned with a word inside the phrase, and vice versa. And the probability of phrase translation can be computed by the formula:

Pr(f̄|ē) = count(f̄, ē) / count(ē)
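A minimal sketch of this relative-frequency estimate; the extracted phrase pairs below are toy data standing in for the output of a real phrase-extraction step:

```python
from collections import Counter

def phrase_table(extracted_pairs):
    """Estimate Pr(f_phrase | e_phrase) = count(f, e) / count(e) by
    relative frequency over extracted (f_phrase, e_phrase) tuples."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)   # marginal count(e)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}
```

For example, if "a hat" was extracted three times, twice paired with "một mũ", then Pr(một mũ | a hat) = 2/3.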
Some alternative methods to extract phrases and learn phrase translation tables have been proposed (Marcu and Wong, 2002; Venugopal et al., 2003) and compared in Koehn's publication (Koehn et al., 2003).
Another phrase-based model, introduced by Och and Ney (2004), uses a discriminative model. In this model, the phrase translation probability and other features are integrated into a log-linear model following the formula:

Pr(e|f) = exp(Σᵢ λᵢ hᵢ(e, f)) / Σ_{e′} exp(Σᵢ λᵢ hᵢ(e′, f))
where hᵢ(e, f) is a feature function that is based on the source and target sentences, and λᵢ is the weight of feature hᵢ(e, f). Och (2003) introduced a method to estimate these weights using Minimum Error Rate Training (MERT). In their studies, they also use some basic features such as:
• bidirectional phrase models (models that score phrase translations)
• bidirectional lexical models (models that consider the appearance of entries from a conventional translation lexicon in the phrase translation)
• a language model of the target language (usually an n-gram model or n-th order Markov model, which is trained from a mono corpus¹)
• a distortion model
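For ranking candidates at decoding time, the unnormalized score Σᵢ λᵢ hᵢ(e, f) suffices, because the denominator Σ_{e′} exp(…) is the same for every candidate given a fixed f. A minimal sketch (the feature functions and weights are illustrative, not the Moses feature set):

```python
def loglinear_score(e, f, features, weights):
    """Unnormalized log-linear score: sum_i lambda_i * h_i(e, f).

    features: list of feature functions h_i(e, f) -> float
    weights:  matching list of lambda_i"""
    return sum(w * h(e, f) for w, h in zip(weights, features))
```

Tuning the weights (e.g. with MERT) then amounts to choosing the `weights` vector that maximizes translation quality on a development set.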
After having the translation model, we need a method to decode an input sentence into an output sentence. Normally, a kind of beam search or the A* method is used. Koehn (2004) introduced an effective method, called stack decoding. The basic idea of this method is using a limited stack for hypotheses covering the same number of words. For example, in the process of translation, when three words are translated, the resulting hypothesis will be stored in the stack for length three. The pseudo code of stack decoding is described as follows:
Tillmann (2004); Galley and Manning (2008) give the basic types of phrase reordering in a sentence. There are three basic types:

• monotone: a continuous phrase in the source language is also a continuous phrase in the target language
• swap: the next phrase after one phrase in the target language is the previous phrase of the one aligned with it in the source language
• discontinuous: two phrases that are continuous in the source language are no longer adjacent in the target language
¹ A mono corpus is usually the same as the training corpus, which is used to learn the phrase table, or a big set of target sentences, independent of the corpus used to train and test.
2.2 The Lexical Reordering Model
Algorithm 1 A stack decoding algorithm, which is used in the Pharaoh system, to get the target sentence from the source sentence

Input:
  a sentence, which we would like to translate
  a phrase-based model
  a language model

initialize hypothesisStack[0..nf]
for all i in range(0, nf - 1) do
  for all hyp in hypothesisStack[i] do
    for all new_hyp that can be derived from hyp do
      nf[new_hyp] ← number of foreign words covered by new_hyp
      add new_hyp to hypothesisStack[nf[new_hyp]]
      prune hypothesisStack[nf[new_hyp]]
    end for
  end for
end for
find best hypothesis best_hyp in hypothesisStack[nf]
return the best path leading to best_hyp via backtrace
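The pseudo code above can be sketched as a runnable (and heavily simplified, monotone-only) Python decoder; the phrase-table format, the pruning limit, and the absence of reordering are all assumptions made to keep the sketch short:

```python
def stack_decode(source, phrase_table, lm_logprob, max_stack=10):
    """Simplified monotone stack decoding in the spirit of Algorithm 1.
    Hypotheses are grouped by the number of source words covered, and
    each stack is pruned to its best entries.

    phrase_table: dict mapping a tuple of source words to a list of
                  (target_phrase, log_prob) options."""
    nf = len(source)
    stacks = [[] for _ in range(nf + 1)]
    stacks[0].append((0.0, ()))                  # (score, target phrases so far)
    for i in range(nf):
        stacks[i] = sorted(stacks[i], reverse=True)[:max_stack]  # prune
        for score, target in stacks[i]:
            for j in range(i + 1, nf + 1):       # extend by translating source[i:j]
                for tgt, lp in phrase_table.get(tuple(source[i:j]), []):
                    stacks[j].append((score + lp + lm_logprob(tgt),
                                      target + (tgt,)))
    best_score, best_target = max(stacks[nf])    # best full-coverage hypothesis
    return " ".join(best_target), best_score
```

A real decoder additionally tracks coverage bitmaps so phrases can be translated out of order, and uses future-cost estimates when pruning.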
For example, we have a pair of sentences:

en: tom 's two blue books are good

vn: hai cuốn_sách màu_xanh của tom là tốt

In this example, are good and là tốt is represented as monotone. With a swap instance, we see the two phrases blue books and cuốn_sách màu_xanh. Finally, two blue and hai màu_xanh is the example of a discontinuous phrase.
2.2.1 The Distance Based Reordering Model
Koehn et al. (2003) introduced the phrase translation model with a simple distortion model based on the exponent of a penalty α, as in the formula:

d(aᵢ, bᵢ₋₁) = α^|aᵢ − bᵢ₋₁ − 1|

so that the distance of two continuous phrases is based on the difference of the correlative phrases in the target language. For example, the distance of the two phrases hai and màu_xanh is d(4, 3) = α^|4−3−1| = 1.
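A minimal sketch of this distance-based penalty over a sequence of source spans; the α value, the span format, and the b₀ = 0 convention are illustrative assumptions:

```python
def distortion_penalty(spans, alpha=0.5):
    """Distance-based reordering cost: the product over phrases of
    alpha^|a_i - b_{i-1} - 1|, where a_i is the start position of the
    source span covered by phrase i and b_{i-1} is the end position of
    the previous phrase's span (b_0 = 0 by convention).

    spans: list of (start, end) source positions, in target order."""
    cost, prev_end = 1.0, 0
    for start, end in spans:
        cost *= alpha ** abs(start - prev_end - 1)  # no penalty for monotone steps
        prev_end = end
    return cost
```

Monotone translation therefore incurs no penalty, while every skipped or revisited source position multiplies in another factor of α.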
Galley and Manning (2008), based on the log-linear model, introduced a new reordering model as a feature of the translation model. The features are parameterized as follows: with
a source sentence f, a sequence of phrases ē = (ē₁, ē₂, …, ē_n), and a phrase alignment a = (a₁, a₂, …, a_n) that defines a source phrase f_{aᵢ} for each translated phrase ēᵢ, those models estimate the probability of a sequence of orientations o = (o₁, o₂, …, o_n) as follows:
in which each oᵢ takes a value over the set of possible orientations A = {M, S, D}. We can define the three types as:

oᵢ = M if aᵢ − aᵢ₋₁ = 1
oᵢ = S if aᵢ − aᵢ₋₁ = −1
oᵢ = D if |aᵢ − aᵢ₋₁| ≠ 1
At decoding time, they define three feature functions such as:

f_M = Σᵢ log p(oᵢ = M | ·)
f_S = Σᵢ log p(oᵢ = S | ·)
f_D = Σᵢ log p(oᵢ = D | ·)
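Deriving the orientation sequence from the source positions a is mechanical; a minimal sketch (toy input; a real model then estimates the conditional probabilities p(oᵢ | ·) from such sequences over the whole corpus):

```python
def orientations(a):
    """Classify each phrase step as M(onotone), S(wap) or D(iscontinuous)
    from the source-position sequence a = (a_1, ..., a_n):
      M if a_i - a_{i-1} == 1
      S if a_i - a_{i-1} == -1
      D otherwise."""
    out = []
    for prev, cur in zip(a, a[1:]):
        d = cur - prev
        out.append("M" if d == 1 else "S" if d == -1 else "D")
    return out
```

For instance, the position sequence (1, 2, 4, 3) yields one monotone step, one discontinuous jump, and one swap.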
As mentioned in chapter 1, some approaches using syntactic information are applied to solve the reordering problem. One of the approaches is syntactic parsing of the source language with reordering rules as preprocessing steps. The main idea is transforming the source sentences to get target sentences as close in word order as possible, so EM training is much easier and word alignment quality becomes better. There are several studies to improve the reordering problem, such as Xia and McCord (2004); Collins et al. (2005); Nguyen and Shimazu (2006); Wang et al. (2007); Habash (2007); Xu et al. (2009). They all performed reordering during the preprocessing step based on source tree parsing, combining either automatically extracted syntactic rules (Xia and McCord, 2004; Nguyen and Shimazu, 2006; Habash, 2007) or manually written rules (Collins et al., 2005; Wang et al., 2007; Xu et al., 2009).
Xu et al. (2009) described a method using a dependency parse tree and flexible rules to perform the reordering of subjects and objects. These rules were written by hand, but Xu et al. (2009) showed that an automatic rule learner can be used.
Collins et al. (2005) developed a clause detection method and used some handwritten rules to reorder words in the clause. Partly, Xia and McCord (2004); Habash (2007) built automatically extracted syntactic rules.
Compared with these approaches, our work has a few differences. Firstly, we aim to develop the phrase-based translation model to translate from English to Vietnamese. Secondly, we build a shallow tree by chunking recursively (chunk of chunk). Thirdly, we
In the previous sections, we introduced the methods which help a computer automatically translate a sentence from one language to others. Now, the question is how good the results are. To answer this question, this section will introduce some methods to evaluate MT. In general, there are two ways to evaluate the MT task: evaluation by humans, or automatic evaluation by building a metric to simulate the human process.
2.5.1 Automatic Metrics
2.5.1.1 BLEU Scores
Today, the BLEU score is one of the most favored methods to evaluate MT. This measure was introduced by IBM (Papineni et al., 2002). This method is a kind of unigram precision metric. The simple unigram precision metric was based on the frequency of co-occurring words in both the output and the reference sentence. Similarly, in the BLEU score, we also compute the n-gram co-occurrence between the output and the reference translation, and then take the weighted geometric mean. So we can compute the BLEU score:
log precisionₙ = log((number of matched n-grams) / (number of n-grams in the output))

length penalty = min(0, 1 − (shortest reference length / output length))

log BLEU = length penalty + (1/N) Σ_{n=1}^{N} log precisionₙ
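A minimal single-reference sketch of this computation, with clipped n-gram counts and toy tokenized inputs (real evaluations use multiple references and corpus-level counts):

```python
from collections import Counter
import math

def bleu(output, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped n-gram
    precisions, times the brevity penalty exp(min(0, 1 - ref_len/out_len)).

    output, reference: lists of tokens."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        out_ngrams = Counter(tuple(output[i:i + n])
                             for i in range(len(output) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clip each output n-gram count by its count in the reference.
        matched = sum(min(c, ref_ngrams[g]) for g, c in out_ngrams.items())
        if matched == 0:
            return 0.0          # a zero precision zeroes the geometric mean
        log_prec_sum += math.log(matched / sum(out_ngrams.values()))
    brevity = min(0.0, 1.0 - len(reference) / len(output))
    return math.exp(brevity + log_prec_sum / max_n)
```

An output identical to its reference scores 1.0; any missing 4-gram overlap drives the score toward 0.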