Improving Word Alignment for Statistical Machine
Translation based on Constraints
Le Quang Hung
Faculty of Information Technology, Quynhon University, Vietnam. Email: hungqnu@gmail.com
Le Anh Cuong
University of Engineering and Technology, Vietnam National University, Hanoi. Email: cuongla@vnu.edu.vn
Abstract—Word alignment is an important and fundamental task for building a statistical machine translation (SMT) system. However, obtaining word-level alignments in parallel corpora with high accuracy is still a challenge. In this paper, we propose a new method, based on a constraint approach, to improve the quality of word alignment. Our experiments show that using constraints for the parameter estimation of the IBM models reduces the alignment error rate by 7.26% (relative) and increases the BLEU score by 5% (relative), in the case of translation from English to Vietnamese.
I. INTRODUCTION

Word alignment is a core component of every SMT system. In fact, the initial quality of statistical word alignment dominates the quality of SMT [1]. Most current SMT systems [2], [3] use statistical models for word alignment, such as GIZA++, which implements the IBM models [4]. However, the quality of alignment is typically quite low for language pairs that differ greatly in syntactic structure, such as English-Vietnamese and English-Chinese. It is therefore necessary to incorporate auxiliary information to alleviate this problem.
In our opinion, there are two factors that could reduce the alignment error rate: (1) adding more training data (parallel corpora); (2) developing more efficient methods for exploiting existing training data. In this work, we take the second path. We address the problem of efficiently exploiting existing parallel corpora by directly adding constraints to the Expectation Maximization (EM) parameter estimation procedure of the IBM models. From our surveys, the original IBM models contain no mechanism to prevent undesirable alignments, and thus each word in the source sentence may align to any word in the target sentence. To guide the model toward correct alignments, we employ constraints to limit the range of words that a given word may align with (in a parallel sentence pair).
Some previous works have incorporated auxiliary information into the process of estimating the IBM models' parameters, such as [5], [6]. Och and Ney [5] used a bilingual dictionary as an additional knowledge source for extending parameter estimation. Talbot [6] used knowledge sources such as cognate relations, a bilingual dictionary, and numeric pattern matching to generate anchor constraints, which were then used to decide which word pairings were permissible during parameter estimation. We differ from these two approaches in that we do not use a bilingual dictionary to generate anchor constraints. Instead, we use lexical pairs and cognates [7] extracted from the training data as anchor points. Thus, our method has the advantage of requiring no extra resources compared to the approaches in [5], [6].
In this paper, we first propose: (1) a new constraint type that relies on the distance between the position of a source word and the position of a target word in a parallel sentence pair; and (2) a novel method to generate anchor constraints. We then incorporate the constraints into the parameter estimation of the IBM models to improve the quality of word alignments.

The remainder of the paper is organized as follows. Section II presents IBM Model 1 and the EM algorithm. Section III describes our method for improving word alignment models. Experimental results are shown in Section IV. Finally, conclusions are drawn in Section V.
II. IBM MODEL-1 AND EM ALGORITHM

Suppose that we are working with the two languages English and Vietnamese. Given an English sentence $e$ consisting of $I$ words $e_1, \ldots, e_I$ and a Vietnamese sentence $f$ consisting of $J$ words $f_1, \ldots, f_J$, we define the alignment $a$ between $e$ and $f$ as a subset of the Cartesian product of the word positions:

$$a \subseteq \{(j, i) : j = 1, \ldots, J;\ i = 1, \ldots, I\} \qquad (1)$$
In statistical alignment models, a hidden alignment $a$ connects words in the target language sentence to words in the source language sentence. The set of alignments $a$ is defined as the set of all possible connections of each word position $j$ in the target language sentence to exactly one word position $i$ in the source language sentence. The translation probability $Pr(f|e)$ can be calculated as the sum of $Pr(f, a|e)$ over all possible alignments:

$$Pr(f|e) = \sum_{a} Pr(f, a|e) \qquad (2)$$
The joint probability $Pr(f, a|e)$ depends only on the parameter $t(f_j|e_{a_j})$, the probability that the target word $f_j$ is a translation of the source word $e_{a_j}$ at position $a_j$:

$$Pr(f, a|e) = \frac{\epsilon}{(I + 1)^J} \prod_{j=1}^{J} t(f_j|e_{a_j}) \qquad (3)$$
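For concreteness, equation (3) can be computed directly from a lexical table. Below is a minimal sketch (the identifiers are ours, not from the paper, and $\epsilon$ is left as a constant):

```python
def model1_joint_prob(f_words, e_words, a, t, eps=1.0):
    # Eq. (3): Pr(f, a|e) = eps / (I + 1)^J * prod_j t(f_j | e_{a_j}).
    # a[j] holds, for each target position j (0-based here), the source
    # position it aligns to in e padded with the NULL word at position 0.
    I, J = len(e_words), len(f_words)
    e_padded = ["NULL"] + list(e_words)  # e_0 = NULL
    prob = eps / (I + 1) ** J
    for j, f in enumerate(f_words):
        prob *= t[f, e_padded[a[j]]]
    return prob
```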
The EM algorithm [8] is used to iteratively estimate the alignment model probabilities according to the likelihood of the model on a parallel corpus. The algorithm consists of two steps:
1) Expectation step (E-step): apply the model to the data; alignment probabilities are computed from the model parameters.
2) Maximization step (M-step): estimate the model from the data; parameter values are re-estimated based on the alignment probabilities and the corpus.
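To make these two steps concrete, here is a minimal sketch of the unconstrained Model 1 estimator on a toy corpus (the identifiers are ours; the NULL source word of [4] is omitted for brevity, and the default of 6 iterations matches the training scheme of Section IV):

```python
from collections import defaultdict

def ibm1_em(corpus, iterations=6):
    """Unconstrained IBM Model 1 EM.

    corpus: list of (f_words, e_words) sentence pairs, where f is the
    target (Vietnamese) side and e the source (English) side.
    Returns the lexical translation table t(f|e).
    """
    e_vocab = {e for _, e_words in corpus for e in e_words}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f|e)
        total = defaultdict(float)  # normalizer per source word e
        # E-step: distribute each target word f over all source words e
        # of its sentence pair, in proportion to the current t(f|e).
        for f_words, e_words in corpus:
            for f in f_words:
                z = sum(t[f, e] for e in e_words)
                for e in e_words:
                    p = t[f, e] / z
                    count[f, e] += p
                    total[e] += p
        # M-step: re-estimate t(f|e) from the expected counts.
        for (f, e), c in count.items():
            t[f, e] = c / total[e]
    return t

toy = [(["tôi", "yêu", "mèo"], ["i", "love", "cats"]),
       (["tôi", "yêu", "chó"], ["i", "love", "dogs"])]
table = ibm1_em(toy)
print(round(table["tôi", "i"], 3))  # rises well above the uniform 1/4
```

Note how, absent any constraint, every source word of a sentence pair receives a share of every target word's count; this is exactly the behavior the constraints of Section III restrict.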
IBM Model 1 [4] was originally developed to provide reasonable initial parameter estimates for more complex word-alignment models [9]. The next section presents how to use the constraints in the parameter estimation of this model.
III. THE PROPOSED APPROACH
In this work, we add constraints directly to the EM algorithm and modify the standard parameter estimation procedure for IBM Model 1. Note that the constrained alignment parameter is $t(f_j|e_i)$.¹ We use two types of constraint: anchor and distance. The constraints are formulated as boolean functions. These are then used in the standard forward-backward recursions to directly restrict the posterior distribution inferred in the E-step. Both types of constraint are described below.
A. Anchor constraint
Anchor constraints are exclusive constraints that force a confident alignment between two words. The alignment between the words of an anchor point is forced by setting the translation probabilities at that position to zero for all other words in the E-step [6]. Given a sentence pair $(f, e)$, if the word pair $(f_j, e_i)$ is an anchor point, then we assign translation probabilities $t(f_j|e_k) = 0$ for all $k \neq i$, and $t(f_l|e_i) = 0$ for all $l \neq j$. Figure 1 shows an example of anchor constraints. As can be seen in the figure, the word pair (tôi, me) is an anchor point; thus, the translation probabilities between tôi and other words, such as (tôi, a), (tôi, car), and (tôi, passed), are set to zero. In this work, we used the following knowledge sources to generate anchor constraints:
1) Cognate: According to Kondrak [10], the term cognate denotes words in different languages that are similar in their orthographic or phonetic form and are possible translations of each other. Cognates are particularly useful when machine-readable bilingual dictionaries are not available. We differ from Kondrak's method [10]: he used three word similarity measures (Simard's condition, Dice's coefficient, and LCSR) to extract cognates, whereas we select words that are not translated and co-occur in an aligned sentence pair (e.g., abbreviations, numbers, punctuation, ...). Note that in our method, these cognates are extracted directly from the corpus during training.

¹ In the following sections, we use $t(f_j|e_i)$ instead of $t(f_j|e_{a_j})$.

Fig. 1. An example of anchor constraints (black), setting translation probabilities to zero for all other word pairs (dark grey).
2) Lexical pairs: From our observations, the most frequent source words are likely to be translated into words that are also frequent on the target side. In order to extract anchor points of this kind, we combine the translation probability $t(f_j|e_i)$ and the frequency $count(f_j, e_i)$ of word pairs in the training data. This lets us select word pairs with high accuracy to generate anchor points. We define a lexical list $L$ as a set of entries:

$$L = \{(f_j, e_i) \mid t(f_j|e_i) > \alpha,\ count(f_j, e_i) > \beta\} \qquad (4)$$

Here, $e_i$ is a source language word, $f_j$ is a target language word, and $\alpha, \beta$ are predefined thresholds (in the experiments, we used $\alpha = 0.5$ and $\beta = 10$). We now formulate the constraint based on anchor points by a boolean function $anchor\_constraint(f_j, e_i)$, as follows:

$$anchor\_constraint(f_j, e_i) = \begin{cases} \text{true} & \text{if } (f_j = e_i) \lor (f_j, e_i) \in L \\ \text{false} & \text{otherwise} \end{cases} \qquad (5)$$
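As an illustration, the following sketch generates anchor points per equations (4) and (5) and applies the zeroing described above in the E-step. The paper gives no pseudocode, so the function names are ours; `t` and `counts` are assumed to be the lexical table and co-occurrence counts from training:

```python
def build_lexical_list(t, counts, alpha=0.5, beta=10):
    # Eq. (4): keep word pairs that are both probable and frequent.
    return {(f, e) for (f, e), n in counts.items()
            if n > beta and t[f, e] > alpha}

def anchor_constraint(f, e, lex):
    # Eq. (5): identical surface form (our cognate test) or lexical pair.
    return f == e or (f, e) in lex

def apply_anchor_constraints(f_words, e_words, t, lex):
    # For each anchor point (f_j, e_i), zero the translation probability
    # of every competing pair in this sentence pair (Section III-A).
    t_local = dict(t)  # sentence-local copy; the global table is untouched
    for j, f in enumerate(f_words):
        for i, e in enumerate(e_words):
            if anchor_constraint(f, e, lex):
                for k, e_k in enumerate(e_words):
                    if k != i:
                        t_local[f, e_k] = 0.0  # t(f_j|e_k) = 0 for k != i
                for l, f_l in enumerate(f_words):
                    if l != j:
                        t_local[f_l, e] = 0.0  # t(f_l|e_i) = 0 for l != j
    return t_local
```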
B. Distance constraint

From our surveys, we observe that words in the source language sentence usually have a positional distance relationship with words in the target language sentence.² From this point of view, we propose a new constraint type that relies on the distance between the position of the source word and the position of the target word in a parallel sentence pair. We formulate this by the boolean function $distance\_constraint(i, j)$, as follows:

$$distance\_constraint(i, j) = \begin{cases} \text{true} & \text{if } |i - j| \le \delta \\ \text{false} & \text{otherwise} \end{cases} \qquad (6)$$

Here, $|i - j|$ is the distance from a source position $i$ to a target position $j$, and $\delta$ is a predefined threshold (in the experiments, we set $\delta = 2$). This means that, given a sentence pair $(f, e)$, each target position $j$ is aligned only with source positions in the range $[j - \delta, j + \delta]$. As can be seen in Figure 2, the target word at position 5 (the word vợ) aligns only with the source words at positions in the range [3, 7] (the words a, wife, and, kid, to). It is worth emphasizing that, unlike the anchor constraints, which set translation probabilities to zero for all unconstrained words, we estimate the translation probability $t(f_j|e_i)$ as a mixture of constrained and unconstrained statistics for each word pair $(f_j, e_i)$. To amplify the contribution of constraints of this kind, we weight the statistics collected in the EM algorithm. We use a single parameter $\lambda$ that assigns a high weight if a word pair $(f_j, e_i)$ is constrained and a very low weight otherwise. That is, the translation probability $t(f_j|e_i)$ is multiplied by $\lambda$ when the constraint is satisfied and by $(1 - \lambda)$ otherwise (in the experiments, $\lambda$ is set to 0.99).

² This will be confirmed in the experiments section.

Fig. 2. An example of the distance constraint with threshold $\delta = 2$: each target position $j$ (black) aligns only with source positions in the range $[j - \delta, j + \delta]$ (dark grey).
Here, the translation probability $t(f_j|e_i)$ is used to collect counts in the EM algorithm. Similar to Brown et al. [4], we call the expected number of times that $e$ connects to $f$ in the translation $(f|e)$ the count of $f$ given $e$ for $(f|e)$, and denote it by $c(f|e; f, e)$. Let $E_1$ and $E_2$ denote the sets of English words for which the distance constraint is satisfied and unsatisfied, respectively. We collect the counts $c(f|e; f, e)$ from a sentence pair $(f, e)$ as follows³:

$$c(f|e; f, e) = \left( \frac{\lambda\, t(f|e)}{\sum_{e_k \in E_1} t(f|e_k)} + \frac{(1 - \lambda)\, t(f|e)}{\sum_{e_l \in E_2} t(f|e_l)} \right) \sum_{j=1}^{J} \delta(f, f_j) \sum_{i=0}^{I} \delta(e, e_i) \qquad (7)$$

³ In equation (7), $\sum_{i=0}^{I} \delta(e, e_i)$ is the count of $e$ in $e$, and $\sum_{j=1}^{J} \delta(f, f_j)$ is the count of $f$ in $f$.
After collecting these counts over the corpus, we estimate the translation probability $t(f|e; f, e)$ by equation (8):

$$t(f|e; f, e) = \frac{\sum_{(f,e)} c(f|e; f, e)}{\sum_{f} \sum_{(f,e)} c(f|e; f, e)} \qquad (8)$$
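Under our reading of equation (7), where $E_1$ and $E_2$ are taken relative to each target position $j$, the distance-weighted count collection might look as follows (a sketch, not the authors' code; identifiers are ours):

```python
from collections import defaultdict

def collect_distance_counts(f_words, e_words, t, delta=2, lam=0.99):
    # One sentence pair's contribution to c(f|e; f, e) per eq. (7).
    # Positions are 1-based, as in the paper; the NULL source word is
    # omitted for brevity.
    count = defaultdict(float)
    for j, f in enumerate(f_words, start=1):
        # E1/E2: source words whose position satisfies / violates
        # the distance constraint abs(i - j) <= delta of eq. (6).
        E1 = [e for i, e in enumerate(e_words, start=1) if abs(i - j) <= delta]
        E2 = [e for i, e in enumerate(e_words, start=1) if abs(i - j) > delta]
        z1 = sum(t[f, e] for e in E1)
        z2 = sum(t[f, e] for e in E2)
        for i, e in enumerate(e_words, start=1):
            if abs(i - j) <= delta:
                if z1 > 0:
                    count[f, e] += lam * t[f, e] / z1       # weight lambda
            elif z2 > 0:
                count[f, e] += (1 - lam) * t[f, e] / z2     # weight 1 - lambda
    return count
```

The M-step of equation (8) then renormalizes these counts per source word, exactly as in the unconstrained estimator of Section II.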
IV. EXPERIMENTAL SETUP

In this section, we present the results of experiments on an English-Vietnamese parallel corpus. The experiments were carried out using the English-Vietnamese data sets provided by [11]. We designed four training data sets consisting of 60k, 70k, 80k, and 90k sentence pairs, respectively. The main experiments fell into the following categories:
1) Verifying that the use of the proposed constraints has an impact on the quality of alignments.
2) Evaluating whether improved parameter estimates of alignment quality lead to improved translation quality.
To choose the thresholds $\alpha$, $\beta$, $\delta$, and the parameter $\lambda$, we trained on the 60k sentence pairs and selected $\alpha = 0.5$, $\beta = 10$, $\delta = 2$, and $\lambda = 0.99$, which achieved the highest performance.

Our baseline is a phrase-based SMT system that uses the Moses toolkit [2] for translation model training and decoding, and GIZA++ [5] for word alignment.
A. Word alignment experiments
We used the alignment error rate (AER) metric, as defined by Och and Ney [5], to measure the quality of alignments:

$$precision = \frac{|A \cap P|}{|A|}, \qquad recall = \frac{|A \cap S|}{|S|}, \qquad AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$$

where $S$ denotes the annotated set of sure alignments, $P$ denotes the annotated set of possible alignments, and $A$ denotes the set of alignments produced by the model under test [9].
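These three quantities are straightforward to compute once $A$, $S$, and $P$ are represented as sets of $(j, i)$ links. A minimal helper (ours, not from the paper):

```python
def alignment_scores(A, S, P):
    # A: predicted links, S: sure links, P: possible links (S is a
    # subset of P), each a set of (j, i) pairs, per Och and Ney [5].
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    aer = 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, aer
```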
In all the experiments below, we perform the same training scheme; the actual training order is 6 iterations of Model 1, 3 iterations of Model 2, and 3 iterations of Model 3. We used 150 sentence pairs as a held-out hand-aligned set to measure word alignment quality. Table I gives the quality of alignments for the IBM models when trained with GIZA++ on the four training data sets.

We obtained better results when incorporating the constraints into the parameter estimation of the IBM models. Table II shows the results for the different corpus sizes. The best-performing model trained with GIZA++ used 90k sentence pairs and had an alignment error rate of 24.48%; the best-performing of our modified IBM models was likewise trained on 90k sentence pairs, with an alignment error rate of 22.53%. We obtain a relative reduction of AER of about 7.26% over all training data sets. In the baseline, increasing the size of the training data reduces the alignment error rate, but only minimally, with an average improvement of 0.91% per 10k sentence pairs (see Figure 3).
B. Machine translation experiments
In order to test that our improved parameter estimates lead to better translation quality, we used a phrase-based decoder [2] to translate a set of English sentences into Vietnamese. The phrase-based decoder extracts phrases from the word alignments produced by the experiments in the previous section
TABLE I
QUALITY OF ALIGNMENTS FOR IBM MODELS (BASELINE).
Size of training data Precision Recall AER
60k 67.79 83.49 25.16
70k 68.22 83.68 24.83
80k 68.66 83.49 24.64
90k 68.63 83.93 24.48
TABLE II
QUALITY OF ALIGNMENTS FOR IBM MODELS TRAINED WITH CONSTRAINTS.
Size of training data Precision Recall AER
60k 71.82 81.44 23.66
70k 71.82 82.87 23.04
80k 72.21 83.18 22.68
90k 72.57 83.06 22.53
(the word alignment experiments). We trained a language model using the 90k Vietnamese sentences from the training set. For the evaluation of translation quality, we used the BLEU metric [12]. A test set of 5k sentence pairs was used to evaluate SMT quality. Table III shows that our method leads to better translation quality than the baseline [2]. We achieve a higher BLEU score on all training data sets. The average improvement is 1.04 BLEU points absolute (5.0% relative) compared to the baseline.
V. CONCLUSION

In this paper, we have proposed a novel method to improve word alignments by incorporating constraints into the parameter estimation of the IBM models. These constraints are used to prevent undesirable alignments that cannot be excluded in the standard IBM models. Experimental results show that our proposed method significantly improves alignment accuracy and increases translation quality for the case of translating from English to Vietnamese. When we improve IBM Model 1, the initial estimates transferred to the higher IBM models are better, and therefore the overall quality improves. We believe that our method can be applied to other language pairs, because the constraints used in the proposed method are language independent. In the future, we will extend our work with more advanced constraints to further improve the quality of alignments.
ACKNOWLEDGEMENT

This work is supported by the project "Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application", which is funded by Vietnam National University, Hanoi.
REFERENCES

[1] J.-H. Lee, S.-W. Lee, G. Hong, Y.-S. Hwang, S.-B. Kim, and H.-C. Rim, "A post-processing approach to statistical word alignment reflecting alignment tendency between part-of-speeches," in Coling 2010: Posters. Beijing, China: Coling 2010 Organizing Committee, August 2010, pp. 623-629.
Fig. 3. Comparison of the word alignment quality of the baseline with our method.
TABLE III
COMPARISON OF SMT QUALITY OF THE BASELINE WITH OUR METHOD.

Size of training data Baseline Our method Δ(%)
[2] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, ser. NAACL '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 48-54.
[3] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Comput. Linguist., vol. 30, no. 4, 2004.
[4] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Comput. Linguist., vol. 19, no. 2, pp. 263-311, Jun. 1993.
[5] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, 2003.
[6] D. Talbot, "Constrained EM for parallel text alignment," Nat. Lang. Eng., vol. 11, no. 3, pp. 263-277, Sep. 2005.
[7] M. Simard, G. F. Foster, and P. Isabelle, "Using cognates to align sentences in bilingual corpora," in Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing - Volume 2, ser. CASCON '93. IBM Press, 1993, pp. 1071-1082.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
[9] R. C. Moore, "Improving IBM word-alignment model 1," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ser. ACL '04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004.
[10] G. Kondrak, D. Marcu, and K. Knight, "Cognates can improve statistical translation models," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers - Volume 2, ser. NAACL-Short '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 46-48.
[11] C. Hoang, A.-C. Le, P.-T. Nguyen, and T.-B. Ho, "Exploiting non-parallel corpora for statistical machine translation," in RIVF. IEEE, 2012, pp. 1-6.
[12] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02, 2002, pp. 311-318.