Improving the Quality of Word Alignment By Integrating Pearson’s Chi-square Test Information
Cuong Hoang1, Cuong Anh Le1, Son Bao Pham1,2
1 University of Engineering and Technology, Vietnam National University, Hanoi
2 Information Technology Institute, Vietnam National University, Hanoi
{cuongh.mi10, cuongla, sonpb}@vnu.edu.vn
Abstract—Previous research mainly focuses on approaches that are essentially inspired by the log-linear model background in machine learning, or on other adaptations. However, few studies focus deeply on improving the word-alignment models themselves in order to enhance the quality of the phrase translation table. This research follows that approach. The experiments show that this scheme also improves the quality of the word-alignment component. Hence, the improvement raises the overall quality of the translation system by around 1% in the BLEU score metric.
Keywords-Machine Translation, Pearson’s Chi-Square Test,
Word-Alignment Model, Log-Linear Model
I. INTRODUCTION
Modern Statistical Machine Translation (SMT) systems are usually built from log-linear models [1][2]. In addition, the best performing systems are based in some way on phrases (or groups of words) [2][3]. The basic idea of phrase-based translation is to learn to break a given source sentence into phrases, then translate each phrase, and finally compose the target sentence from these phrase translations.

The phrase learning step in a statistical phrase-based translation system usually relies on the alignments between words. To find the best alignments between phrases, we first generate word alignments; phrase alignments are then heuristically "extracted" from them [3] (a minimal sketch of this heuristic is given below). In fact, previous experiments point out that the automatic word-based alignment process is a vital component of an SMT system [3].
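To illustrate the extraction step, here is a minimal Python sketch, our own illustration rather than the authors' or Moses' code, of the consistency heuristic of [3]: a source span and a target span form a phrase pair if and only if no alignment link crosses the boundary of the rectangle they define. The function name and data layout are assumptions made for this sketch.

    def extract_phrases(alignment, J, max_len=7):
        """Consistency-based phrase-pair extraction over a word alignment.

        alignment -- set of (j, i) links between source position j and
                     target position i (0-based); J is the source length.
        Returns a set of ((j1, j2), (i1, i2)) inclusive span pairs.
        """
        phrases = set()
        for j1 in range(J):
            for j2 in range(j1, min(j1 + max_len, J)):
                # target positions linked to the source span [j1, j2]
                linked = [i for (j, i) in alignment if j1 <= j <= j2]
                if not linked:
                    continue
                i1, i2 = min(linked), max(linked)
                # consistent iff no link from the target span [i1, i2]
                # points outside the source span [j1, j2]
                if all(j1 <= j <= j2 for (j, i) in alignment
                       if i1 <= i <= i2) and i2 - i1 < max_len:
                    phrases.add(((j1, j2), (i1, i2)))
        return phrases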
There are many studies, inspired by the log-linear model background in machine learning or by other adaptations [4][5][6], that focus on improving the quality of the translation system. However, not many works scrutinize how to enhance the quality of the word-alignment component in order to improve the accuracy of the phrase translation table and hence the system overall [7]. This work focuses on the observation that improving the lexical translation modelling allows us to "boost" the "hidden" merit of the higher IBM alignment models and therefore improves the quality of statistical phrase-based translation. We found that this is a quite important aspect; however, only a few works concentrate deeply on this point [8].
II. IBM MODELS AND THEIR EFFECTS ON HIGHER ALIGNMENT MODELS

A. IBM Model 1
Model 1 is a probabilistic generative model within a framework that assumes a source sentence $f_1^J$ of length $J$ translates as a target sentence $e_1^I$ of length $I$. It is defined as a particularly simple instance of this framework: assuming that all possible lengths for $f_1^J$ (less than some arbitrary upper bound) have a uniform probability, [9] derives the following summarization equation:

$$\Pr(f \mid e) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i) \qquad (1)$$

The parameters of Model 1 for a given pair of languages are normally estimated using EM [10]. We call the expected number of times that word $e$ connects to $f$ in the translation pair $(f_1^J, e_1^I)$ the count of $f$ given $e$, and denote it by $c(f \mid e; f_1^J, e_1^I)$. Following the mathematical derivation of [9], $c(f \mid e; f_1^J, e_1^I)$ can be calculated by the following equation:

$$c(f \mid e; f_1^J, e_1^I) = \frac{t(f \mid e)}{\sum_{i=0}^{I} t(f \mid e_i)} \cdot \sum_{j=1}^{J} \sigma(f, f_j) \cdot \sum_{i=0}^{I} \sigma(e, e_i) \qquad (2)$$

where $\sigma$ denotes the Kronecker delta function.
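To make equation (2) concrete, the following is a minimal Python sketch of one EM iteration for Model 1. The function name, the corpus layout, and the dictionary-based tables are our own assumptions for illustration, not part of any toolkit:

    from collections import defaultdict

    def em_iteration(corpus, t):
        """One EM iteration for IBM Model 1, following equation (2).

        corpus -- list of (f_sent, e_sent) token-list pairs; e_sent is
                  assumed to contain the NULL word at position 0.
        t      -- dict (f, e) -> current t(f | e), assumed initialized
                  (e.g. uniformly) for every co-occurring pair.
        Returns the re-normalized translation table for the next iteration.
        """
        counts = defaultdict(float)   # expected counts c(f | e)
        totals = defaultdict(float)   # normalizer per target word e
        for f_sent, e_sent in corpus:
            for f in f_sent:
                # denominator of equation (2): sum_i t(f | e_i)
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z     # fractional count of this link
                    counts[(f, e)] += c
                    totals[e] += c
        # M-step: t(f | e) = c(f | e) / sum over f of c(f | e)
        return {(f, e): c / totals[e] for (f, e), c in counts.items()}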
B. The Effect of IBM Model 1 on Higher Models
Simply, for Model 2 we make the same assumptions as in Model 1, except that we assume $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)$ depends on $j$, $a_j$, and $J$, as well as on $I$. The following equation gives the Model 2 estimate for the probability of a target sentence given a source sentence:

$$\Pr(f \mid e) = \epsilon \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i)\, a(i \mid j, J, I) \qquad (4)$$
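As an illustration of equation (4), the following short sketch, again our own with assumed names and table layouts, computes the Model 2 sentence likelihood directly:

    def model2_likelihood(f_sent, e_sent, t, a, epsilon=1.0):
        """Pr(f | e) under IBM Model 2, following equation (4).

        t -- dict (f, e) -> lexical probability t(f | e)
        a -- dict (i, j, J, I) -> alignment probability a(i | j, J, I)
        e_sent is assumed to contain the NULL word at position 0.
        """
        J, I = len(f_sent), len(e_sent) - 1
        prob = epsilon
        for j, f in enumerate(f_sent, start=1):
            # sum over all target positions i = 0..I for source position j
            prob *= sum(t[(f, e_sent[i])] * a[(i, j, J, I)]
                        for i in range(I + 1))
        return prob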
IBM Models 3-5 yield more accurate results than Models 1-2, mainly thanks to their fertility-based scheme. However, consider the original and general equation for IBM Models 3-5 proposed by [9], which describes the "joint likelihood" for a tableau $\tau$ and a permutation $\pi$:

$$\begin{aligned}
\Pr(\tau, \pi \mid e) ={} & \prod_{i=1}^{I} \Pr(\phi_i \mid \phi_1^{i-1}, e)\, \Pr(\phi_0 \mid \phi_1^{I}, e) \\
& \times \prod_{i=0}^{I} \prod_{k=1}^{\phi_i} \Pr(\tau_{ik} \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^{I}, e) \\
& \times \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} \Pr(\pi_{ik} \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^{I}, \phi_0^{I}, e) \\
& \times \prod_{k=1}^{\phi_0} \Pr(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{I}, \tau_0^{I}, \phi_0^{I}, e) \qquad (5)
\end{aligned}$$
We can see that the alignment position information and the fertility information are great ideas. However, the problem is quite intuitive here: for Models 1 and 2, we restrict ourselves to alignments that are purely "lexical" connections, where each "cept" connection is either a single source word or the NULL word. In contrast, as in equations (6) and (7), these lexical translation probabilities are then used as the initial result for parameterizing $\phi$, for calculating $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)$, and for the other translation probabilities. This fact deeply reduces the value of these higher models, because the lexical translation probabilities are quite error-prone.
III. IMPROVING IBM MODELS

A. The Problem
Basically, in order to estimate the word translation probability $t(e_i \mid f_j)$, we consider the translation probabilities of all possible equivalent words $e_k$ ($k \neq i$) in $e_1^I$ for $f_j$. In more detail, our focus is on the case in which two words $f_j$ and $e_i$ co-occur many times. The lexical translation probability $t(f_j \mid e_i)$ may then be assigned a high value, even though the two words $f_j$ and $e_i$ actually have no "meaning relationship" in linguistics and simply appear together many times by "chance". This case leads to the following important consequence.

That is, a "correct" translation word of $e_i$, for example $f_k$, never gains the translation probability it should have. The translation probability $t(f_k \mid e_i)$ is small for two reasons. First, it is not guaranteed that the two words co-occur many times. Second, it suffers from the impact of the "noisy" probability $t(f_j \mid e_i)$. This is quite important. In addition, [8] points out that $t(f_k \mid e_i)$ will usually be smaller than $t(f_j \mid e_i)$ when $f_j$ occurs less frequently than $f_k$.
In order to abate the "wrong" translation probabilities $t(f_j \mid e_i)$, following [8], a purely statistical method based on co-occurrence alone can hardly achieve that goal. Previous works mainly focus on integrating syntactic knowledge to improve the quality of alignment [11]. In this work, we propose a new approach: we combine the traditional IBM models with another statistical method, the Pearson's Chi-square test information.
B. Adding Mutual Information
In fact, the essence of Pearson's Chi-square test is to compare the observed frequencies in a table with the frequencies expected under independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence. In the simplest case, the $X^2$ test is applied to 2-by-2 tables. The $X^2$ statistic sums the differences between observed and expected values over all squares of the table, scaled by the magnitude of the expected values, as follows:

$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (6)$$

where $i$ ranges over the rows of the table (sometimes called a "contingency table"), $j$ ranges over the columns, $O_{ij}$ is the observed value for cell $(i, j)$, and $E_{ij}$ is the expected value [12].
The authors of [12] realized that this statistic seems to be a particularly good choice for exploiting the "independence" information. Actually, they used a measure of association which they call $\phi^2$, a $X^2$-like statistic. The value of $\phi^2$ is bounded between 0 and 1. For more detail on how to calculate $\phi^2$, please refer to [12].
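For illustration, here is a minimal sketch, our own rather than the code of [12], of computing $\phi^2$ from the 2-by-2 contingency table of a word pair, assuming sentence-level occurrence counts are available:

    def phi_squared(n11, n1_, n_1, n):
        """phi^2 association score for the 2-by-2 contingency table of a
        word pair (e, f), in the spirit of [12]; bounded between 0 and 1.

        n11 -- number of sentence pairs in which e and f co-occur
        n1_ -- number of sentence pairs containing e
        n_1 -- number of sentence pairs containing f
        n   -- total number of sentence pairs
        """
        n12 = n1_ - n11                # e present, f absent
        n21 = n_1 - n11                # f present, e absent
        n22 = n - n11 - n12 - n21      # both absent
        num = (n11 * n22 - n12 * n21) ** 2
        den = n1_ * (n21 + n22) * n_1 * (n12 + n22)  # product of marginals
        return num / den if den else 0.0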
Admittedly, the performance of identifying word correspondences using the $\phi^2$ method alone is not as good as using IBM Model 1 together with the EM training scheme [13]. However, we believe this information is quite valuable. We adjust the count equation of each IBM model to obtain more accurate lexical translation results. In more detail, for convenience, let $\varphi_{ij}$ denote the value $\phi^2(e_i, f_j)$. The count $c(f \mid e; f_1^J, e_1^I)$ can then be calculated by the following equation:
$$c(f \mid e; f_1^J, e_1^I) = \frac{t(f \mid e) \cdot (\lambda + (1-\lambda)\,\varphi_{ij})}{\sum_{i=0}^{I} t(f \mid e_i) \cdot (\lambda + (1-\lambda)\,\varphi_{ij})} \cdot \sum_{j=1}^{J} \sigma(f, f_j) \cdot \sum_{i=0}^{I} \sigma(e, e_i) \qquad (7)$$
The parameter $\lambda$ defines the weight of the Pearson's Chi-square test information. Good values for this parameter are around 0.3.
IV. EXPERIMENT
A. The Preparation
This experiment is deployed on two pairs of languages, English-Vietnamese (E-V) and English-French (E-F), in order to obtain an accurate and reliable result. The 60,000-sentence E-V training corpus is credited to [14]. Similarly, the 60,000-sentence E-F training corpus is drawn from the Hansards corpus [13]. In this work, we directly test our improved method on the phrase-based translation system as a whole. We learn phrase alignments from a corpus that has been word-aligned by the alignment toolkit; our phrase learning component uses the best "Viterbi" alignment sequences trained from each IBM model. We use LGIZA (available at http://code.google.com/p/lgiza/) as a lightweight statistical machine translation toolkit to train IBM Models 1-3. More information on LGIZA can be found in [15].
We deploy the training scheme $1^5 2^3 3^3$ (five iterations of Model 1, then three of Model 2, then three of Model 3) on the training data for the E-V pair; this scheme is suggested for that pair by [15] for the best performance. We also deploy the training scheme $1^5 2^5 3^3$, which was suggested by [13], for the E-F pair. In addition, we use MOSES [16] as the phrase-based SMT framework. We measure performance using the BLEU metric [17], which estimates the accuracy of translation output with respect to a reference translation.
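As a reference point for how the scores in Tables I and II are computed, the following is a minimal single-reference, unsmoothed sentence-level BLEU sketch in the spirit of [17]; this is our own illustration, while the experiments themselves use the standard corpus-level implementation shipped with the evaluation tools:

    import math
    from collections import Counter

    def bleu(hyp, ref, max_n=4):
        """Unsmoothed single-reference BLEU [17]: geometric mean of the
        modified n-gram precisions times the brevity penalty.
        hyp and ref are token lists."""
        log_prec = 0.0
        for n in range(1, max_n + 1):
            h = Counter(tuple(hyp[k:k + n]) for k in range(len(hyp) - n + 1))
            r = Counter(tuple(ref[k:k + n]) for k in range(len(ref) - n + 1))
            clipped = sum(min(c, r[g]) for g, c in h.items())
            total = max(sum(h.values()), 1)
            if clipped == 0:
                return 0.0               # some n-gram level has no match
            log_prec += math.log(clipped / total) / max_n
        # brevity penalty: penalize hypotheses shorter than the reference
        bp = 1.0 if len(hyp) > len(ref) \
            else math.exp(1 - len(ref) / max(len(hyp), 1))
        return bp * math.exp(log_prec)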
B. The “Boosting” Performance
We concentrate our evaluations on two aspects. The first is the “boosting” performance on each word-alignment model. The second is the “boosting” performance of a higher alignment model based on its own improvement plus the improvement inherited from the lower alignment models. Table I describes the “boosting” results of our proposed method for the E-V translation system. Similarly, Table II describes the improvements for the E-F pair.
Table I. THE “BOOSTING” PERFORMANCE FOR THE ENGLISH-VIETNAMESE TRANSLATION SYSTEM

Word-alignment model             Baseline BLEU   Improved BLEU   Gain
Model 2 (+Improved Model 1)      19.58           20.31           0.73
Model 3 (+Improved Model 1-2)    18.88           20.12           1.24
Table II. THE “BOOSTING” PERFORMANCE FOR THE ENGLISH-FRENCH TRANSLATION SYSTEM

Word-alignment model             Baseline BLEU   Improved BLEU   Gain
Model 2 (+Improved Model 1)      26.51           27.05           0.54
Model 3 (+Improved Model 1-2)    26.30           27.44           1.14
1) Improving Word Alignment Quality: From Tables I and II, we can see that adding the Pearson's Chi-square information improves the quality of our systems. It boosts the performance by around 0.51% for the E-V pair and 0.44% for the E-F pair when we use IBM Model 1 as the word-alignment component, and we obtain similar improvements for the other IBM models.

There are two interesting findings here. First, we obtain a larger improvement for the fertility-based alignment models (0.84% for the first pair and 0.61% for the second pair). Second, we obtain a larger improvement for the E-V pair than for the second pair. This is consistent with the previous work of [15], which points out that modelling the word alignment for a pair of languages that differ considerably in grammar, such as E-V, is more difficult than for the other cases; it means that for such a pair, the $t$ translation parameter is considerably less accurate.
2) The total performance: This experimental evaluation focuses on the other aspect we concentrate on. That is, for a given training scheme (for example, $1^5 2^3 3^3$ for the E-V pair or $1^5 2^5 3^3$ for the E-F pair), we scrutinize whether simultaneously improving not only Models 1-2 but also Model 3 yields a larger improvement than simply improving Model 3 alone. The experimental result is quite impressive (1.24% vs. 0.84%, and 1.14% vs. 0.61%). This firmly confirms our analysis in Section II: improving the lexical translation modelling allows us to “boost” all of the “hidden” power of the higher IBM models and therefore deeply improves the quality of statistical phrase-based translation.
V. CONCLUSION
Phrase-based models represent the current state of the art in statistical machine translation, and the phrase learning step in statistical phrase-based translation is accordingly important. This work focuses on an approach in which we integrate the Pearson's Chi-square test information into the IBM models to obtain a better performance. We directly test our improvement on the overall system to obtain a more reliable evaluation. Besides, we also point out that improving the lexical translation modelling allows us to “boost” all of the “hidden” power of the higher IBM models and therefore deeply improves the quality of statistical phrase-based translation.

In summary, we believe attacking lexical translation is a good way to improve the overall quality of statistical phrase-based translation. Hence, the quality of statistical phrase-based systems, with improved phrase learning together with integrated linguistic information, could come close to the state of the art.
VI. ACKNOWLEDGEMENT

This work is supported by the project "Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application", which is funded by Vietnam National University, Hanoi. It is also supported by the project KC.01.TN04/11-15.
REFERENCES
[1] F. J. Och and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 295–302. [Online]. Available: http://dx.doi.org/10.3115/1073083.1073133

[2] ——, "The alignment template approach to statistical machine translation," Comput. Linguist., vol. 30, pp. 19–51, June 2004. [Online]. Available: http://dx.doi.org/10.1162/089120103321337421

[3] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, ser. NAACL '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 48–54.

[4] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ser. ACL '05. Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 263–270. [Online]. Available: http://dx.doi.org/10.3115/1219840.1219873

[5] G. Sanchis-Trilles and F. Casacuberta, "Log-linear weight optimisation via bayesian adaptation in statistical machine translation," in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, ser. COLING '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 1077–1085. [Online]. Available: http://dl.acm.org/citation.cfm?id=1944566.1944690

[6] J. B. Mariño, R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A. R. Fonollosa, and M. R. Costa-jussà, "N-gram-based machine translation," Comput. Linguist., vol. 32, no. 4, pp. 527–549, Dec. 2006. [Online]. Available: http://dx.doi.org/10.1162/coli.2006.32.4.527

[7] D. Vilar, M. Popović, and H. Ney, "AER: Do we need to "improve" our alignments?" in International Workshop on Spoken Language Translation, Kyoto, Japan, Nov. 2006, pp. 205–212.

[8] C. Hoang, A. Le, and B. Pham, "Refining lexical translation training scheme for improving the quality of statistical phrase-based translation (to appear)," in Proceedings of The 3rd International Symposium on Information and Communication Technology. ACM digital library, 2012.

[9] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Comput. Linguist., vol. 19, pp. 263–311, June 1993. [Online]. Available: http://dl.acm.org/citation.cfm?id=972470.972474

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[11] H. Wu, H. Wang, and Z. Liu, "Alignment model adaptation for domain-specific word alignment," in ACL, 2005.

[12] W. A. Gale and K. W. Church, "Identifying word correspondence in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, ser. HLT '91. Stroudsburg, PA, USA: Association for Computational Linguistics, 1991, pp. 152–157. [Online]. Available: http://dx.doi.org/10.3115/112405.112428

[13] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Comput. Linguist., vol. 29, pp. 19–51, March 2003. [Online]. Available: http://dx.doi.org/10.1162/089120103321337421

[14] C. Hoang, A. Le, P. Nguyen, and T. Ho, "Exploiting non-parallel corpora for statistical machine translation," in Proceedings of The 9th IEEE-RIVF International Conference on Computing and Communication Technologies. IEEE Computer Society, 2012, pp. 97–102.

[15] C. Hoang, A. Le, and B. Pham, "A systematic comparison of various statistical alignment models for statistical English-Vietnamese phrase-based translation (to appear)," in Proceedings of The 4th International Conference on Knowledge and Systems Engineering. IEEE Computer Society, 2012.

[16] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ser. ACL '07. Stroudsburg, PA, USA: Association for Computational Linguistics, 2007, pp. 177–180.

[17] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 311–318.