A Study of English-Vietnamese Statistical
Machine Translation
Hoang Cuong
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by Prof Pham Bao Son
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
December, 2012
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at Vietnam National University, Hanoi or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at Institute for INFOCOMM Research, Singapore (I2R), Vietnam Institute for Advanced Study in Mathematics, Hanoi (VIASM) or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’
Signed
I, the supervisor, hereby approve that the Thesis in its current form is ready as the final version at the University of Engineering and Technology, Vietnam National University, Hanoi.
Prof Pham Bao Son
Previous work in the Vietnamese statistical machine translation (SMT) research community focuses mainly on a few “top” research topics of the field, and some of it is based on very simple ideas. We lack fundamental work on the core of the SMT system that would provide a solid basis for statistical English-Vietnamese translation. We also lack large, high-quality bilingual corpora. This work addresses these problems.

We present a fundamental study of English-Vietnamese statistical machine translation. We investigate the core components of any SMT system: exploiting bilingual corpora and improving the quality of word alignment and phrase translation modeling. We also focus on developing a better evaluation metric for tuning SMT systems. Above all, we try to provide fundamental and solid work on building and improving the overall performance of the English-Vietnamese SMT system.

Although we focus on the English-Vietnamese pair, in every aspect we also apply and compare our research on the English-French pair to gain a deeper view. We hope our research will be a solid basis for other studies on deploying and improving English-Vietnamese SMT systems.
Publications:
• Cuong Hoang, Cuong-Anh Le, Thai-Phuong Nguyen, Bao-Tu Ho. Exploiting Non-Parallel Corpora for Statistical Machine Translation. In Proceedings of the international conference on Information and Communication Technologies (RIVF 2012). Best Student Paper Award.
• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. A Systematic Comparison Between Various Statistical Alignment Models for Statistical English-Vietnamese Phrase-Based Translation. In Proceedings of the 4th international conference on Knowledge and Systems Engineering (KSE 2012).
• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. Refining Lexical Translation Training Scheme for Improving The Quality of Statistical Phrase-Based Translation. In Proceedings of the 3rd international symposium on Information and Communication Technology (SoICT 2012).
• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. Improving the Quality of Word Alignment By Integrating Pearson’s Chi-square Test Information. In Proceedings of the international conference on Asian Language Processing (IALP 2012).
Life is so valuable when we have found something that merits the chase.

First, I would like to express my deep gratitude to my supervisor, Prof. Pham Bao Son, who has been my iconic researcher in Vietnam since I was a freshman. I also want to thank Prof. Le Anh Cuong for his very careful supervision, even though I was not registered with the school as his student. To both of them I owe thanks for their patient guidance and support throughout the years.

I would like to give my honest appreciation to my other unofficial supervisors, Prof. Ho Tu Bao (JAIST, Japan), Prof. Zhang Min (I2R, Singapore) and Prof. Nguyen Xuan Long (Michigan, USA), for their great support. In diverse ways, they have helped my passion for Computer Science grow intensively.

I sincerely acknowledge Vietnam National University, Hanoi. I want to thank some of my best teachers, Dr. Nguyen Van Vinh, Dr. Nguyen Phuong Thai and Prof. Nguyen Le Minh (JAIST, Japan), who held many useful discussions with me. In particular, I want to give my honest appreciation to the Statistical Language Processing Laboratory where I work, at the Institute for Infocomm Research, Singapore (I2R), for the infrastructure and countless other support. I would like to thank my Chinese friends here: Yun Huang, Prof. Yue Zhang, Jun SUN and Yanxia Qin. I wish I could work with them for as long as possible. I also want to express my appreciation to some of my best friends, Dinh Xuan Nhat and Nguyen Dao Thai, for their help in my work.

Finally, this thesis would not have been possible without the support and love of my Family: Dad, Mum, my Sister Lan Ni and her small family. Without their support in so many ways, I am sure that I could not have finished my Master Degree in this way!
And
To love, My Mimosa ♥ !!!
Table of Contents
1.1 Statistical Machine Translation - An Overview
1.2 Literature Survey on English-Vietnamese Machine Translation
1.3 Our Work
1.4 Thesis Contents
1.4.1 Exploiting non-parallel corpora for statistical machine translation
1.4.2 Systematic comparison between various statistical alignment models for statistical English-Vietnamese phrase-based translation
1.4.3 Improving word alignment models
1.4.4 Improving phrase translation modeling
1.4.5 Developing an evaluation metric for SMT
List of Figures
1.1 The architecture of the translation approach based on source-channel models
1.2 An example of word alignments between the pair of English-French
1.3 An example of phrase alignments between the pair of English-German
1.4 The architecture of the translation approach based on log-linear models
List of Tables
List of Abbreviations
AFA Average Fraction of best Alignment
GIS Generalized Iterative Scaling
MERT Minimum Error Rate Training
MIRA Margin Infused Relaxed Algorithm
SMT Statistical Machine Translation
TER Translation Edit Rate
Chapter 1
Introduction
Machine translation (MT) is the automatic translation from one natural language into another using computers. It has long remained a key application in the field of natural language processing (NLP). Statistical machine translation (SMT) is a machine translation approach in which we treat translation as a machine-learning problem. From the analysis of many “samples” of human-produced translation, the probabilistic parameters of the translation system are estimated on the basis of statistical models.
1.1 Statistical Machine Translation - An Overview

SMT treats translation as a machine learning problem. This means that we apply a learning algorithm to a large body of previously translated text, known variously as a parallel corpus, parallel text, bitext, or multitext. The learner is then able to translate previously unseen sentences. The quality of any SMT system depends crucially on the quantity, quality, and domain of the data. As a result, constructing bilingual corpora which contain millions of words is vital work in building a statistical machine translation system.

We are given a source sentence $f = f_1^J = f_1, \ldots, f_j, \ldots, f_J$, which is to be translated into a target sentence $e = e_1^I = e_1, \ldots, e_i, \ldots, e_I$. Among all possible target sentences, we will choose the sentence with the highest probability:

$$\hat{e}_1^I = \arg\max_{e_1^I} Pr(e_1^I \mid f_1^J)$$

Applying Bayes' rule, we perform the following maximization:

$$\hat{e}_1^I = \arg\max_{e_1^I} Pr(e_1^I) \cdot Pr(f_1^J \mid e_1^I)$$
Figure 1.1: The architecture of the translation approach based on source-channel models
The two “features” in this model are the language model $Pr(e_1^I)$ and the translation model $Pr(f_1^J \mid e_1^I)$. For the first one, it is comparatively easy to model the “generative language”, yet the role of the language model is very important. Building language models from large monolingual corpora can significantly improve the quality of an SMT system (Brants et al., 2007). It is also worth noting that syntactic language models are gaining attention in language model research (Charniak et al., 2003).
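To make the source-channel decision rule concrete, the following minimal Python sketch picks, from a small list of candidate translations, the one that maximizes the sum of language-model and translation-model log-probabilities. The candidates, the probability values and the name `noisy_channel_best` are illustrative assumptions only; a real decoder searches a far larger hypothesis space.

```python
import math

def noisy_channel_best(candidates, lm_logprob, tm_logprob):
    """Return the candidate e maximizing log Pr(e) + log Pr(f | e)."""
    return max(candidates, key=lambda e: lm_logprob[e] + tm_logprob[e])

# Made-up probabilities for a single source sentence f (hypothetical values).
lm = {"the house": math.log(0.6), "house the": math.log(0.1)}  # log Pr(e)
tm = {"the house": math.log(0.5), "house the": math.log(0.5)}  # log Pr(f | e)

best = noisy_channel_best(list(lm), lm, tm)
print(best)
```

Here the translation model cannot separate the two candidates, so the language model decides in favour of the fluent word order.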
For the translation model, the statistical translation models were initially word based. The idea of word-based translation can be traced back to (Brown et al., 1990). They introduce the idea of an alignment between a pair of strings as an object indicating, for each word in the French string, the word in the English string from which it arose. We take an example from (Brown et al., 1993). Alignments are shown graphically, as in Figure 1.2, by drawing lines, which we call connections, from some of the English words to some of the French words. To estimate the word translation parameters, we use a word-based translation model such as IBM Models 1-5 (Brown et al., 1993), the Hidden Markov Model (HMM) (Vogel et al., 1996) or IBM Model 6 (Och & Ney, 2003).
Figure 1.2: An example of word alignments between the pair of English-French
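As an illustration of how word translation parameters can be estimated, here is a minimal sketch of EM training in the style of IBM Model 1. It is a simplification under stated assumptions: the NULL word is omitted, the toy French-English corpus is invented, and the function name `train_ibm1` is hypothetical rather than taken from any toolkit.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens).
    Returns t, where t[(f, e)] approximates Pr(f | e)."""
    src_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (f, e) links
        total = defaultdict(float)  # expected counts per target word e
        for fs, es in pairs:
            for f in fs:
                # E-step: distribute one unit of mass for f over all e.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Invented toy corpus: French source, English target.
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_ibm1(corpus)
print(t[("la", "the")] > t[("maison", "the")])
```

Because "la" co-occurs with "the" in both sentence pairs, EM gradually shifts probability mass towards that pairing and, as a consequence, towards "maison"/"house" and "fleur"/"flower".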
However, significant advances were made with the introduction of phrase-based models (Koehn et al., 2003). We take an example from (Koehn, 2010). Phrase alignments are shown graphically, as in Figure 1.3, by drawing lines, which we call connections, from groups of English words to groups of German words.
Figure 1.3: An example of phrase alignments between the pair of English-German
In fact, the best performing SMT systems are based in some way on phrases (groups of words). The basic idea of phrase-based translation is to learn to break a given source sentence into phrases, then translate each phrase and finally compose the target sentence from these phrase translations (Koehn et al., 2003; Och & Ney, 2004). However, the step of phrase learning, which is a vital component of a phrase-based SMT system, relies heavily on the alignments between words.
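The dependence of phrase learning on word alignments can be sketched as follows. The code below extracts phrase pairs that are consistent with a given word alignment, meaning that no word inside the pair is linked to a word outside it. It is a simplified sketch: the usual extension over unaligned boundary words is left out, and the sentence pair, the alignment and the name `extract_phrases` are assumptions made for illustration.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """alignment: set of (i, j) links between src[i] and tgt[j].
    Returns the set of (source phrase, target phrase) string pairs."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 >= max_len:
                continue
            # Consistency: no target word in [j1, j2] links outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

# Invented example with a local reordering (adjective after noun in French).
src = ["la", "maison", "bleue"]
tgt = ["the", "blue", "house"]
links = {(0, 0), (1, 2), (2, 1)}
phrases = extract_phrases(src, tgt, links)
print(sorted(phrases))
```

The span "la maison" yields no phrase pair because its target span would also cover "blue", which is aligned to a word outside the span.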
One of the main disadvantages of phrase-based translation is that phrase extraction and phrase translation modeling are unrelated to the lexical probability values estimated by word-based translation models. Some research also tries to extract phrases automatically without relying on the alignments (Marcu & Wong, 2002). However, it has been shown in practice that this does not give better results than the traditional approach (Koehn et al., 2003).

Another important disadvantage of the pure phrase-based translation model is the problem of integrating linguistic information. It is very hard to integrate linguistic information to improve phrase extraction (Koehn et al., 2003). Some research focuses on a syntactic transformation model in the pre-processing phase, which reorders the structure of the source sentence so that it is closer to the structure of the target sentence (Collins et al., 2005). The idea of pre-processing is simple, but it can yield a significant improvement.
Lately, the modern SMT paradigm has been moving from the source-channel approach to a more general model, the log-linear model. Instead of using only two features, we can combine many features (including both of them) in a mixture framework. That framework contains the source-channel approach as a special case. We combine many features $h_m(e_1^I, f_1^J)$ that are weighted by $\lambda_m$ ($m = 1, \ldots, M$) and multiplied together. We have the decision rule:

$$\hat{e}_1^I = \arg\max_{e_1^I} \prod_{m=1}^{M} \exp\left(\lambda_m h_m(e_1^I, f_1^J)\right) = \arg\max_{e_1^I} \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)$$

The architecture of the translation approach based on log-linear models can be described in Fig. 1.4 (Och & Ney, 2002).
This approach was suggested by (Och & Ney, 2002; Och & Ney, 2004). By deploying the log-linear model, we can obtain a significantly better translation system. To date, almost all state-of-the-art translation frameworks use this model (Och & Ney, 2004; Mariño et al., 2006; Chiang, 2007). Using more features generally helps us obtain a much better translation system. As a result, much research focuses on automatically finding appropriate features, whose number can range from 10-15 (Och & Ney, 2002) to thousands (Chiang et al., 2009) or even millions of features.
Figure 1.4: The architecture of the translation approach based on log-linear models
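A log-linear decision rule of this kind can be sketched in a few lines of Python. The feature functions, probability tables and weights below are invented for illustration, and `loglinear_best` is a hypothetical name. With weights (1, 1, 0) the sketch reduces to the source-channel case, since only the log language-model and log translation-model features contribute.

```python
import math

# Made-up model scores for one source sentence f (hypothetical values).
LM = {"the house": 0.6, "house the": 0.1}  # Pr(e)
TM = {"the house": 0.3, "house the": 0.5}  # Pr(f | e)

def h_lm(e, f):
    return math.log(LM[e])          # language-model feature

def h_tm(e, f):
    return math.log(TM[e])          # translation-model feature

def h_len(e, f):
    return -float(len(e.split()))   # word-penalty feature

def loglinear_best(candidates, feature_fns, weights, f):
    """Return argmax over e of sum_m lambda_m * h_m(e, f)."""
    def score(e):
        return sum(w * h(e, f) for w, h in zip(weights, feature_fns))
    return max(candidates, key=score)

features = [h_lm, h_tm, h_len]
source = "la maison"
best_channel = loglinear_best(list(LM), features, [1.0, 1.0, 0.0], source)
best_tm_only = loglinear_best(list(LM), features, [0.0, 1.0, 0.0], source)
print(best_channel, "|", best_tm_only)
```

Changing the weights changes the winner, which is why tuning the weights on held-out data matters so much in practice.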
Another big advantage of this approach is the easy integration of phrase-based translation with syntax-based translation models. Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words (Yamada, 2003; Galley et al., 2004). Syntax-based translation has attracted wide attention with the advent of strong stochastic parsing techniques (Collins, 2003; Steedman, 2000). The idea behind that integration is hierarchical phrase-based translation (Chiang, 2005; Chiang, 2007). In practice, hierarchical phrase pairs improve translation accuracy significantly compared with a state-of-the-art phrase-based system.
Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. Hence, from the early days, the GIS (Generalized Iterative Scaling) algorithm (Darroch & Ratcliff, 1972) has been used. However, a general problem of this approach is that there is only a loose relation to the final translation quality on unseen text. Interestingly, (Och, 2003) presented alternative training criteria for log-linear statistical machine translation models which are directly related to translation quality and give significantly better results.
Therefore, we encounter a very exciting problem: given a development corpus, how can we develop a metric that yields better optimized weights? In the field, BLEU (Papineni et al., 2002) is widely considered the standard automatic SMT evaluation method. The original idea of BLEU is to compare SMT output with expert reference translations in terms of the statistics of short sequences of words (word n-grams, 1 ≤ n ≤ N, where N is fixed).

Following BLEU, researchers also developed other variants, such as METEOR (Banerjee & Lavie, 2005) and NIST (Doddington, 2002). This idea is elegant in its simplicity. In fact, the n-gram matching approach has frequently been reported as correlating well with human judgement (Burch et al., ). Integrating those evaluation metrics, we use MERT (Och, 2003) at a small scale or MIRA (Margin Infused Relaxed Algorithm) (Crammer et al., 2006) at a larger scale to tune the weights. However, the traditional n-gram approach has many problems, which we will examine later.
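The n-gram matching idea behind BLEU can be illustrated with a small sentence-level sketch. It assumes a single reference, no smoothing and N = 4 by default, so it is a simplification of the corpus-level metric of (Papineni et al., 2002) and not a replacement for a standard scoring tool; the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against a single reference."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def bleu(hyp, ref, max_n=4):
    """Geometric mean of clipped precisions times the brevity penalty."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any empty n-gram match gives zero
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
print(bleu(reference, reference))
print(modified_precision("the the the".split(), "the cat".split(), 1))
```

Clipping is what stops a hypothesis such as "the the the" from receiving full unigram credit: each reference word can be matched only as many times as it occurs in the reference.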
1.2 Literature Survey on English-Vietnamese Machine Translation
Since the 1990s, much research has focused on English-Vietnamese machine translation from the statistical perspective. These studies mostly address the upper layers of SMT, such as adapting syntax-based translation systems to our language pair or incorporating linguistic and syntactic information into the translation model. Others try to improve other aspects, such as exploiting bilingual corpora or improving word alignment quality. We take a glance at the notable studies.

Some research focuses on a syntactic transformation model in the pre-processing step. That is, we reorder the structure of the source sentence so that it is closer to the structure of the target sentence. This idea is inspired by (Collins et al., 2005). Also in the pre-processing step, a dependency-based parser together with a set of additional “hand-crafted” rules is used to generate the transformation (Hoang et al., 2008). Besides, reordering at trunk level and incorporating the global reordering model into the decoder can also improve not only the quality but also the speed of the system (Nguyen et al., 2008b).
Another problem is how to incorporate linguistic and syntactic information directly into the translation model. From the above we know that syntactic information can be integrated into phrase-based SMT using syntax-based SMT approaches. It is expected that the combination models will outperform the best phrase-based systems in the near