
A Study of English-Vietnamese Statistical Machine Translation

Hoang Cuong

Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi

Supervised by Prof Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of

Master of Computer Science

December, 2012

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at Vietnam National University, Hanoi or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at Institute for INFOCOMM Research, Singapore (I2R), Vietnam Institute for Advanced Study in Mathematics, Hanoi (VIASM) or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed


I, the supervisor, hereby approve that the Thesis in its current form is ready as the final version at the University of Engineering and Technology, Vietnam National University, Hanoi.

Prof Pham Bao Son


Previous work from the Vietnamese statistical machine translation (SMT) research community has focused mainly on a few high-level topics of the field, and some of it is based on very simple ideas. We lack fundamental work on the core of the SMT system that would give a solid basis for statistical English-Vietnamese translation. We also lack large bilingual corpora of high quality. This work addresses these problems.

We present a fundamental study of English-Vietnamese statistical machine translation. We investigate the core of any SMT system: exploiting bilingual corpora, improving word alignment, and improving the quality of phrase translation modeling. We also develop a better evaluation metric for tuning SMT systems. Throughout, we try our best to produce fundamental and solid work on building and improving the overall performance of the English-Vietnamese SMT system.

Although we focus on the English-Vietnamese pair, in every aspect we also apply our research to the English-French pair and compare the results to gain a deeper view. We hope our work will be a solid basis for other studies on deploying and improving English-Vietnamese SMT systems.

Publications:

• Cuong Hoang, Cuong-Anh Le, Thai-Phuong Nguyen, Bao-Tu Ho. Exploiting Non-Parallel Corpora for Statistical Machine Translation. In Proceedings of the International Conference on Information and Communication Technologies (RIVF 2012). Best Student Paper Award.

• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. A Systematic Comparison Between Various Statistical Alignment Models for Statistical English-Vietnamese Phrase-Based Translation. In Proceedings of the 4th International Conference on Knowledge and Systems Engineering (KSE 2012).

• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. Refining Lexical Translation Training Scheme for Improving The Quality of Statistical Phrase-Based Translation. In Proceedings of the 3rd International Symposium on Information and Communication Technology (SoICT 2012).

• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. Improving the Quality of Word Alignment By Integrating Pearson’s Chi-square Test Information. In Proceedings of the International Conference on Asian Language Processing (IALP 2012).

Life is so valuable when we find something that is worth chasing.

First, I would like to express my deep gratitude to my supervisor - Prof Pham Bao Son - who has also been my iconic researcher in Vietnam since I was a freshman. I also want to thank Prof Le Anh Cuong for his very careful supervision, even though he did not register me with the school as his student. To both of them, I owe thanks for their patient guidance and support throughout the years.

I would like to give my honest appreciation to my other unofficial supervisors - Prof Ho Tu Bao (JAIST, Japan), Prof Zhang Min (I2R, Singapore), Prof Nguyen Xuan Long (Michigan, USA) - for their great support. From their diverse perspectives, they have helped my passion for Computer Science grow intensively.

I sincerely acknowledge Vietnam National University, Hanoi. I want to thank some of my best teachers - Dr Nguyen Van Vinh, Dr Nguyen Phuong Thai and Prof Nguyen Le Minh (JAIST, Japan) - with whom I had many useful discussions. In particular, I want to give my honest appreciation to the Statistical Language Processing Laboratory where I work at the Institute for Infocomm Research, Singapore (I2R), for the infrastructure and countless other support. I would like to thank my Chinese friends here - Yun Huang, Prof Yue Zhang, Jun SUN and Yanxia Qin. I wish I could work with them for much longer. I also want to express my appreciation to some of my best friends - Dinh Xuan Nhat, Nguyen Dao Thai - for their help in my work.

Finally, this thesis would not have been possible without the support and love of my Family - Dad, Mum, my Sister - Lan Ni and her small family. Without their support in so many ways, I am sure that I could not have finished my Master Degree in this way!

And

To love, My Mimosa ♥ !!!


Table of Contents

1.1 Statistical Machine Translation - An Overview

1.2 Literature Survey on English-Vietnamese Machine Translation

1.3 Our Work

1.4 Thesis Contents

1.4.1 Exploiting non-parallel corpora for statistical machine translation

1.4.2 Systematic comparison between various statistical alignment models for statistical English-Vietnamese phrase-based translation

1.4.3 Improving word alignment models

1.4.4 Improving phrase translation modeling

1.4.5 Developing an evaluation metric for SMT


List of Figures

1.1 The architecture of the translation approach based on source-channel models

1.2 An example of word alignments between the pair of English-French

1.3 An example of phrase alignments between the pair of English-German

1.4 The architecture of the translation approach based on log-linear models

List of Tables


List of Abbreviations

AFA Average Fraction of best Alignment

GIS Generalized Iterative Scaling

MERT Minimum Error Rate Training

MIRA Margin Infused Relaxed Algorithm

SMT Statistical Machine Translation

TER Translation Edit Rate

Chapter 1

Introduction

Machine translation (MT) is the automatic translation from one natural language into another using computers. It has long remained a key application in the field of natural language processing (NLP). Statistical machine translation (SMT) is a machine translation approach in which we treat the translation problem as a machine-learning problem. From the analysis of many “samples” of human-produced translation, the probabilistic parameters of the translation system are estimated on the basis of statistical models.

This means that we apply a learning algorithm to a large body of previously translated text, known variously as a parallel corpus, parallel text, bitext, or multitext. The learner is then able to translate previously unseen sentences. The quality of any SMT system depends crucially on the quantity, quality, and domain of the data. As a result, constructing bilingual corpora which contain millions of words is vital work in building a statistical machine translation system.

We are given a source sentence $f = f_1^J = f_1, \ldots, f_j, \ldots, f_J$, which is to be translated into a target sentence $e = e_1^I = e_1, \ldots, e_i, \ldots, e_I$. Among all possible target sentences, we will choose the sentence with the highest probability:

$$\hat{e}_1^I = \arg\max_{e_1^I} \Pr(e_1^I \mid f_1^J)$$

Applying Bayes' rule in the source-channel approach, we perform the following maximization:

$$\hat{e}_1^I = \arg\max_{e_1^I} \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I)$$

Figure 1.1: The architecture of the translation approach based on source-channel models

The two “features” in this model are the translation model $\Pr(f_1^J \mid e_1^I)$ and the language model $\Pr(e_1^I)$. The language model is the easier of the two to build, since it only models the generated (target) language, yet its role is very important. Building language models from large monolingual corpora can significantly improve the quality of an SMT system (Brants et al., 2007). Notably, syntactic language models are gaining attention in language model research (Charniak et al., 2003).
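The source-channel decision rule can be illustrated with a minimal sketch. The candidate sentences and all probability values below are made-up toy numbers chosen for illustration, and the function names are hypothetical; a real system searches a far larger space and estimates both models from corpora.

```python
import math

# Toy language model Pr(e) over three candidate target sentences.
# All values are illustrative assumptions, not estimates from data.
LANGUAGE_MODEL = {
    "the house is small": 0.020,
    "small the house is": 0.001,
    "the home is little": 0.005,
}

# Toy translation model Pr(f | e) for one fixed source sentence f.
TRANSLATION_MODEL = {
    "the house is small": 0.30,
    "small the house is": 0.35,
    "the home is little": 0.20,
}


def source_channel_score(candidate):
    """Return log Pr(e) + log Pr(f | e) for a candidate target sentence e."""
    return math.log(LANGUAGE_MODEL[candidate]) + math.log(TRANSLATION_MODEL[candidate])


def decode(candidates):
    """Pick the candidate e that maximizes Pr(e) * Pr(f | e)."""
    return max(candidates, key=source_channel_score)


best = decode(list(LANGUAGE_MODEL))
print(best)
```

In this toy setting the translation model alone slightly prefers the scrambled candidate, and the language model is what pushes the fluent sentence to the top, which is the role the text attributes to it.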

As for the translation model, statistical translation models were initially word based. The idea of word-based translation can be traced back to (Brown et al., 1990). They introduce the idea of an alignment between a pair of strings as an object indicating, for each word in the French string, the word in the English string from which it arose. We take an example from (Brown et al., 1993). Alignments are shown graphically, as in Figure 1.2, by drawing lines, which we call connections, from some of the English words to some of the French words. To estimate the word translation parameters, we use a word-based translation model such as IBM Models 1-5 (Brown et al., 1993), the Hidden Markov Model (HMM) (Vogel et al., 1996) or IBM Model 6 (Och & Ney, 2003).

Figure 1.2: An example of word alignments between the pair of English-French
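To make the estimation of word translation parameters concrete, here is a minimal sketch of EM training for IBM Model 1, the simplest of the models listed above. It is a simplification under stated assumptions: there is no NULL word, the three-sentence English-French corpus is a hand-made toy, and the function name is hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=10):
    """Estimate lexical probabilities t(f | e) with EM.

    corpus is a list of (english_tokens, foreign_tokens) pairs.
    Simplification: no NULL word on the English side.
    """
    foreign_vocab = {f for _, fs in corpus for f in fs}
    # Uniform initialization of t(f | e).
    t = defaultdict(lambda: 1.0 / len(foreign_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (f, e) link
        total = defaultdict(float)  # expected count of each e
        for es, fs in corpus:
            for f in fs:
                # E-step: distribute f over the English words that could generate it.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate t(f | e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Hand-made toy corpus (an assumption for illustration only).
toy_corpus = [
    ("the house".split(), "la maison".split()),
    ("the flower".split(), "la fleur".split()),
    ("a flower".split(), "une fleur".split()),
]
t = train_ibm_model1(toy_corpus)
print(t[("maison", "house")], t[("la", "house")])
```

Even on three sentence pairs, the co-occurrence of "the" with "la" in two sentences lets EM shift probability mass so that "house" prefers "maison" over "la".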

However, significant advances were made with the introduction of phrase-based models (Koehn et al., 2003). We take an example from (Koehn, 2010). Phrase alignments are shown graphically, as in Figure 1.3, by drawing lines, which we call connections, from groups of English words to groups of German words.

Figure 1.3: An example of phrase alignments between the pair of English-German

In fact, the best performing SMT systems are based in some way on phrases (groups of words). The basic idea of phrase-based translation is to learn to break a given source sentence into phrases, then translate each phrase, and finally compose the target sentence from these phrase translations (Koehn et al., 2003; Och & Ney, 2004). However, the phrase learning step, which is a vital component of a phrase-based SMT system, relies heavily on the alignments between words.
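The dependence of phrase learning on word alignments can be sketched as follows. This is a simplified version of the usual consistency-based phrase extraction: it keeps only the minimal target span for each source span and does not extend phrases over unaligned words. The sentence pair and its alignment are hand-written assumptions, and the function name is hypothetical.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Extract phrase pairs that are consistent with a word alignment.

    alignment is a set of (i, j) links between src[i] and tgt[j].
    A pair of spans is consistent if at least one link lies inside it
    and no link connects a word inside the pair to a word outside it.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 >= max_len:
                continue
            # Reject the span if a target word inside [j1, j2] is linked
            # to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Hand-written example with a local reordering (black cat -> chat noir).
src = "the black cat".split()
tgt = "le chat noir".split()
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrases(src, tgt, alignment)
print(sorted(pairs))
```

Note that "the black" yields no phrase pair here: its minimal target span would have to include "chat", which is aligned to a word outside the source span. A wrong link in the word alignment therefore directly changes which phrases can be learned.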

One of the main disadvantages of phrase-based translation is that phrase extraction and phrase translation modeling are unrelated to the lexical probability values estimated by the word-based translation models. Some studies try to extract phrases automatically without relying on the alignments (Marcu & Wong, 2002). However, it has been shown in practice that this does not give better results than the traditional approach (Koehn et al., 2003).

Another important disadvantage of the pure phrase-based translation model is the difficulty of integrating linguistic information. It is very hard to use linguistic information to improve phrase extraction (Koehn et al., 2003). Some studies focus on a syntactic transformation model in the pre-processing phase, which reorders the structure of the source sentence so that it is closer to the structure of the target sentence (Collins et al., 2005). The idea of pre-processing is simple, but it can yield a significant improvement.

Lately, the modern SMT paradigm has been moving from the source-channel approach to a more general model, the log-linear model. Instead of using only two features, we can combine many features (including both of them) in a single framework, which contains the source-channel approach as a special case. We combine $M$ feature functions $h_m(e_1^I, f_1^J)$, each weighted by $\lambda_m$ ($m = 1, \ldots, M$); the weighted features are summed in log space, which corresponds to multiplying the exponentiated terms together. We have the decision rule:

$$\hat{e}_1^I = \arg\max_{e_1^I} \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)$$

The architecture of this approach can be described in Fig. 1.4 (Och & Ney, 2002).
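As a worked check of the claim that the log-linear framework contains the source-channel approach as a special case, take $M = 2$ and assume the feature functions $h_1(e_1^I, f_1^J) = \log \Pr(e_1^I)$ and $h_2(e_1^I, f_1^J) = \log \Pr(f_1^J \mid e_1^I)$ with weights $\lambda_1 = \lambda_2 = 1$. This particular choice of features is an assumption made for illustration. Because the logarithm is monotonic, the maximization reduces to the product of language model and translation model:

```latex
\begin{align*}
\hat{e}_1^I
  &= \arg\max_{e_1^I} \sum_{m=1}^{2} \lambda_m h_m(e_1^I, f_1^J) \\
  &= \arg\max_{e_1^I} \left[ \log \Pr(e_1^I) + \log \Pr(f_1^J \mid e_1^I) \right] \\
  &= \arg\max_{e_1^I} \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I)
\end{align*}
```

Choosing weights other than 1 lets the two models be trusted to different degrees, which is the extra freedom the log-linear formulation adds.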

This approach was suggested by (Och & Ney, 2002; Och & Ney, 2004). By deploying the log-linear model, we can obtain a significantly better translation system. Until now, almost all state-of-the-art translation frameworks use this model (Och & Ney, 2004; Mariño et al., 2006; Chiang, 2007). Using more features helps us obtain a much better translation system. As a result, many studies focus on automatically finding appropriate features, whose number ranges from 10-15 (Och & Ney, 2002) to thousands (Chiang et al., 2009) or even millions of features.


Figure 1.4: The architecture of the translation approach based on log-linear models

Another big advantage of this approach is the easy integration of phrase-based translation with syntax-based translation models. Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words (Yamada, 2003; Galley et al., 2004). It has received wide attention following the advent of strong stochastic parsing techniques (Collins, 2003; Steedman, 2000). The idea behind that integration is hierarchical phrase-based translation (Chiang, 2005; Chiang, 2007). In practice, hierarchical phrase pairs improve translation accuracy significantly compared with a state-of-the-art phrase-based system.

Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. Hence, from the early days, the GIS (Generalized Iterative Scaling) algorithm (Darroch & Ratcliff, 1972) has been used. However, a general problem of this approach is that there is only a loose relation to the final translation quality on unseen text. Interestingly, (Och, 2003) presented alternative training criteria for log-linear statistical machine translation models which are directly related to translation quality and which give significantly better results.


Therefore, we encounter a very exciting problem: given a development corpus, how can we develop a metric that leads to better optimized weights? In the field, BLEU (Papineni et al., 2002) is widely considered the standard automatic SMT evaluation method. The original idea of BLEU is to compare SMT output with expert reference translations in terms of the statistics of short sequences of words (word n-grams, $1 \le n \le N$, where $N$ is fixed).
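The n-gram matching idea behind BLEU can be sketched in a few lines. The following is a simplified, single-reference, sentence-level variant with uniform weights and no smoothing; the original metric is defined over a whole test corpus and supports multiple references, and the function names here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


def bleu(hypothesis, reference, max_n=4):
    """Geometric mean of clipped precisions for n = 1..max_n, times a brevity penalty."""
    precisions = [modified_precision(hypothesis, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty punishes hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(log_avg)


reference = "the cat is on the mat".split()
hypothesis = "the the the the".split()
print(modified_precision(hypothesis, reference, 1))
```

The clipping step is what prevents a degenerate hypothesis that repeats a frequent word from scoring well: "the" occurs only twice in the reference, so only two of the four occurrences are credited.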

Following BLEU, researchers have also developed other variants, such as METEOR (Banerjee & Lavie, 2005) and NIST (Doddington, 2002). The n-gram matching idea is elegant in its simplicity, and it has frequently been reported as correlating well with human judgement (Burch et al., ). Integrating those evaluation metrics, we use MERT (Och, 2003) for small-scale tuning or MIRA (Margin Infused Relaxed Algorithm) (Crammer et al., 2006) for larger-scale tuning of the weights. However, the traditional n-gram approach has many problems, which we will examine later.

1.2 Literature Survey on English-Vietnamese Machine Translation

Since the 1990s, many studies have addressed English-Vietnamese machine translation from the statistical perspective. These studies mostly focus on high-level aspects of SMT, such as adapting syntax-based translation systems to our language pair or incorporating linguistic and syntactic information into the translation model. Others try to improve more basic components, for example by exploiting bilingual corpora or improving word alignment quality. We will take a brief look at the notable studies.

Some studies focus on a syntactic transformation model in the pre-processing step. That is, we reorder the structure of the source sentence so that it is closer to the structure of the target sentence. This idea is inspired by (Collins et al., 2005). Also in the pre-processing step, a dependency-based parser is used together with a set of additional “hand-crafted” rules to generate the transformation (Hoang et al., 2008). Besides, reordering at the chunk level and incorporating the global reordering model into the decoder can improve not only the quality but also the speed of the system (Nguyen et al., 2008b).

Another problem is how to incorporate linguistic and syntactic information directly into the translation model. From the above we know that syntactic information can be integrated into phrase-based SMT using syntax-based SMT approaches. It is expected that the combined models will outperform the best phrase-based systems in the near
