Improving the Quality of Word Alignment By Integrating Pearson’s Chi-square Test Information
Cuong Hoang1, Cuong Anh Le1, Son Bao Pham1,2
1 University of Engineering and Technology, Vietnam National University, Hanoi
2 Information Technology Institute, Vietnam National University, Hanoi
{cuongh.mi10, cuongla, sonpb}@vnu.edu.vn
Abstract—Previous research mainly focuses on approaches that are essentially inspired by the log-linear model background in machine learning, or on other adaptations. However, few studies focus deeply on improving the word-alignment models themselves in order to enhance the quality of the phrase translation table. This research follows that approach. The experiments show that this scheme also improves the quality of the word-alignment component. Hence, the improvement raises the overall quality of the translation system by around 1% in the BLEU score metric.
Keywords-Machine Translation, Pearson’s Chi-Square Test,
Word-Alignment Model, Log-Linear Model
I. INTRODUCTION
Modern Statistical Machine Translation (SMT) systems are usually built from log-linear models [1][2]. In addition, the best performing systems are based in some way on phrases (or groups of words) [2][3]. The basic idea of phrase-based translation is to learn to break a given source sentence into phrases, then translate each phrase, and finally compose the target sentence from these phrase translations.

The phrase learning step in a statistical phrase-based translation system usually relies on the alignments between words. To find the best alignments between phrases, we first generate word alignments; phrase alignments are then heuristically "extracted" from them [3] (a minimal sketch of this heuristic is given below). In fact, previous experiments point out that the automatic word-based alignment process is a vital component of an SMT system [3].
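To illustrate the extraction step, here is a minimal Python sketch, our own illustration rather than the authors' or Moses' code, of the consistency heuristic of [3]: a source span and a target span form a phrase pair if and only if no alignment link crosses the boundary of the rectangle they define. The function name and data layout are assumptions made for this sketch.

    def extract_phrases(alignment, J, max_len=7):
        """Consistency-based phrase-pair extraction over a word alignment.

        alignment -- set of (j, i) links between source position j and
                     target position i (0-based); J is the source length.
        Returns a set of ((j1, j2), (i1, i2)) inclusive span pairs.
        """
        phrases = set()
        for j1 in range(J):
            for j2 in range(j1, min(j1 + max_len, J)):
                # target positions linked to the source span [j1, j2]
                linked = [i for (j, i) in alignment if j1 <= j <= j2]
                if not linked:
                    continue
                i1, i2 = min(linked), max(linked)
                # consistent iff no link from the target span [i1, i2]
                # points outside the source span [j1, j2]
                if all(j1 <= j <= j2 for (j, i) in alignment
                       if i1 <= i <= i2) and i2 - i1 < max_len:
                    phrases.add(((j1, j2), (i1, i2)))
        return phrases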
There are many studies, inspired by the log-linear model background in machine learning or by other adaptations [4][5][6], that focus on improving the quality of the translation system. However, not many works scrutinize how to enhance the quality of the word-alignment component in order to improve the accuracy of the phrase translation table and hence the system overall [7]. This work focuses on the observation that improving the lexical translation modelling allows us to "boost" the "hidden" merit of the higher IBM alignment models and therefore improves the quality of statistical phrase-based translation. We found that this is a quite important aspect; however, only a few works concentrate deeply on this point [8].
II. IBM MODELS AND THEIR EFFECTS ON HIGHER ALIGNMENT MODELS

A. IBM Model 1
Model 1 is a probabilistic generative model within a framework that assumes a source sentence $f_1^J$ of length $J$ translates as a target sentence $e_1^I$ of length $I$. It is defined as a particularly simple instance of this framework: assuming that all possible lengths for $f_1^J$ (less than some arbitrary upper bound) have a uniform probability, [9] derives the following summarization equation:

$$\Pr(f \mid e) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i) \qquad (1)$$

The parameters of Model 1 for a given pair of languages are normally estimated using EM [10]. We call the expected number of times that word $e$ connects to $f$ in the translation pair $(f_1^J, e_1^I)$ the count of $f$ given $e$, and denote it by $c(f \mid e; f_1^J, e_1^I)$. Following the mathematical derivation of [9], $c(f \mid e; f_1^J, e_1^I)$ can be calculated by the following equation:

$$c(f \mid e; f_1^J, e_1^I) = \frac{t(f \mid e)}{\sum_{i=0}^{I} t(f \mid e_i)} \cdot \sum_{j=1}^{J} \sigma(f, f_j) \cdot \sum_{i=0}^{I} \sigma(e, e_i) \qquad (2)$$

where $\sigma$ denotes the Kronecker delta function.
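To make equation (2) concrete, the following is a minimal Python sketch of one EM iteration for Model 1. The function name, the corpus layout, and the dictionary-based tables are our own assumptions for illustration, not part of any toolkit:

    from collections import defaultdict

    def em_iteration(corpus, t):
        """One EM iteration for IBM Model 1, following equation (2).

        corpus -- list of (f_sent, e_sent) token-list pairs; e_sent is
                  assumed to contain the NULL word at position 0.
        t      -- dict (f, e) -> current t(f | e), assumed initialized
                  (e.g. uniformly) for every co-occurring pair.
        Returns the re-normalized translation table for the next iteration.
        """
        counts = defaultdict(float)   # expected counts c(f | e)
        totals = defaultdict(float)   # normalizer per target word e
        for f_sent, e_sent in corpus:
            for f in f_sent:
                # denominator of equation (2): sum_i t(f | e_i)
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z     # fractional count of this link
                    counts[(f, e)] += c
                    totals[e] += c
        # M-step: t(f | e) = c(f | e) / sum over f of c(f | e)
        return {(f, e): c / totals[e] for (f, e), c in counts.items()}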
B. The Effect of IBM Model 1 on Higher Models
Simply, for Model 2 we make the same assumptions as in Model 1, except that we assume $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)$ depends on $j$, $a_j$, and $J$, as well as on $I$. The following equation gives the Model 2 estimate for the probability of a target sentence given a source sentence:

$$\Pr(f \mid e) = \epsilon \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i)\, a(i \mid j, J, I) \qquad (4)$$
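As an illustration of equation (4), the following short sketch, again our own with assumed names and table layouts, computes the Model 2 sentence likelihood directly:

    def model2_likelihood(f_sent, e_sent, t, a, epsilon=1.0):
        """Pr(f | e) under IBM Model 2, following equation (4).

        t -- dict (f, e) -> lexical probability t(f | e)
        a -- dict (i, j, J, I) -> alignment probability a(i | j, J, I)
        e_sent is assumed to contain the NULL word at position 0.
        """
        J, I = len(f_sent), len(e_sent) - 1
        prob = epsilon
        for j, f in enumerate(f_sent, start=1):
            # sum over all target positions i = 0..I for source position j
            prob *= sum(t[(f, e_sent[i])] * a[(i, j, J, I)]
                        for i in range(I + 1))
        return prob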
IBM Models 3-5 yield more accurate results than Models 1-2, mainly thanks to their fertility-based scheme. However, consider the original and general equation for IBM Models 3-5 proposed by [9], which describes the "joint likelihood" for a tableau $\tau$ and a permutation $\pi$:

$$\begin{aligned}
\Pr(\tau, \pi \mid e) ={} & \prod_{i=1}^{I} \Pr(\phi_i \mid \phi_1^{i-1}, e)\, \Pr(\phi_0 \mid \phi_1^{I}, e) \\
& \times \prod_{i=0}^{I} \prod_{k=1}^{\phi_i} \Pr(\tau_{ik} \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^{I}, e) \\
& \times \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} \Pr(\pi_{ik} \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^{I}, \phi_0^{I}, e) \\
& \times \prod_{k=1}^{\phi_0} \Pr(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{I}, \tau_0^{I}, \phi_0^{I}, e) \qquad (5)
\end{aligned}$$
We can see that the alignment position information and the fertility information are great ideas. However, the problem is quite intuitive here: for Models 1 and 2, we restrict ourselves to alignments that are purely "lexical" connections, where each "cept" connection is either a single source word or the NULL word. In contrast, as in equations (6) and (7), these lexical translation probabilities are then used as the initial result for parameterizing $\phi$, for calculating $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)$, and for the other translation probabilities. This fact deeply reduces the value of these higher models, because the lexical translation probabilities are quite error-prone.
III. IMPROVING IBM MODELS

A. The Problem
Basically, in order to estimate the word translation probability $t(e_i \mid f_j)$, we consider the translation probabilities of all possible equivalent words $e_k$ ($k \neq i$) in $e_1^I$ for $f_j$. In more detail, our focus is on the case in which two words $f_j$ and $e_i$ co-occur many times. The lexical translation probability $t(f_j \mid e_i)$ may then be assigned a high value, even though the two words $f_j$ and $e_i$ actually have no "meaning relationship" in linguistics and simply appear together many times by "chance". This case leads to the following important consequence.

That is, a "correct" translation word of $e_i$, for example $f_k$, never gains the translation probability it should have. The translation probability $t(f_k \mid e_i)$ is small for two reasons. First, it is not guaranteed that the two words co-occur many times. Second, it suffers from the impact of the "noisy" probability $t(f_j \mid e_i)$. This is quite important. In addition, [8] points out that $t(f_k \mid e_i)$ will usually be smaller than $t(f_j \mid e_i)$ when $f_j$ occurs less frequently than $f_k$.
In order to abate the "wrong" translation probabilities $t(f_j \mid e_i)$, following [8], a purely statistical method based on co-occurrence alone can hardly achieve that goal. Previous works mainly focus on integrating syntactic knowledge to improve the quality of alignment [11]. In this work, we propose a new approach: we combine the traditional IBM models with another statistical method, the Pearson's Chi-square test information.
B. Adding Mutual Information
In fact, the essence of Pearson's Chi-square test is to compare the observed frequencies in a table with the frequencies expected under independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence. In the simplest case, the $X^2$ test is applied to 2-by-2 tables. The $X^2$ statistic sums the differences between observed and expected values over all squares of the table, scaled by the magnitude of the expected values, as follows:

$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (6)$$

where $i$ ranges over the rows of the table (sometimes called a "contingency table"), $j$ ranges over the columns, $O_{ij}$ is the observed value for cell $(i, j)$, and $E_{ij}$ is the expected value [12].
The authors of [12] realized that this statistic seems to be a particularly good choice for exploiting the "independence" information. Actually, they used a measure of association which they call $\phi^2$, a $X^2$-like statistic. The value of $\phi^2$ is bounded between 0 and 1. For more detail on how to calculate $\phi^2$, please refer to [12].
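For illustration, here is a minimal sketch, our own rather than the code of [12], of computing $\phi^2$ from the 2-by-2 contingency table of a word pair, assuming sentence-level occurrence counts are available:

    def phi_squared(n11, n1_, n_1, n):
        """phi^2 association score for the 2-by-2 contingency table of a
        word pair (e, f), in the spirit of [12]; bounded between 0 and 1.

        n11 -- number of sentence pairs in which e and f co-occur
        n1_ -- number of sentence pairs containing e
        n_1 -- number of sentence pairs containing f
        n   -- total number of sentence pairs
        """
        n12 = n1_ - n11                # e present, f absent
        n21 = n_1 - n11                # f present, e absent
        n22 = n - n11 - n12 - n21      # both absent
        num = (n11 * n22 - n12 * n21) ** 2
        den = n1_ * (n21 + n22) * n_1 * (n12 + n22)  # product of marginals
        return num / den if den else 0.0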
Admittedly, the performance of identifying word correspondences using the $\phi^2$ method alone is not as good as using IBM Model 1 together with the EM training scheme [13]. However, we believe this information is quite valuable. We adjust the count equation of each IBM model to obtain more accurate lexical translation results. In more detail, for convenience, let $\varphi_{ij}$ denote the value $\phi^2(e_i, f_j)$. The count $c(f \mid e; f_1^J, e_1^I)$ can then be calculated by the following equation:
$$c(f \mid e; f_1^J, e_1^I) = \frac{t(f \mid e) \cdot (\lambda + (1-\lambda)\,\varphi_{ij})}{\sum_{i=0}^{I} t(f \mid e_i) \cdot (\lambda + (1-\lambda)\,\varphi_{ij})} \cdot \sum_{j=1}^{J} \sigma(f, f_j) \cdot \sum_{i=0}^{I} \sigma(e, e_i) \qquad (7)$$
The parameter $\lambda$ defines the weight of the Pearson's Chi-square test information. Good values for this parameter are around 0.3.
IV. EXPERIMENT
A. The Preparation
This experiment is deployed on two pairs of languages, English-Vietnamese (E-V) and English-French (E-F), in order to obtain an accurate and reliable result. The 60,000-sentence E-V training corpus is credited to [14]. Similarly, the 60,000-sentence E-F training corpus is drawn from the Hansards corpus [13]. In this work, we directly test our improved method on the phrase-based translation system as a whole. We learn phrase alignments from a corpus that has been word-aligned by the alignment toolkit; our phrase learning component uses the best "Viterbi" alignment sequences trained from each IBM model. We use LGIZA (available at http://code.google.com/p/lgiza/) as a lightweight statistical machine translation toolkit to train IBM Models 1-3. More information on LGIZA can be found in [15].
We deploy the training scheme $1^5 2^3 3^3$ (five iterations of Model 1, then three of Model 2, then three of Model 3) on the training data for the E-V pair; this scheme is suggested for that pair by [15] for the best performance. We also deploy the training scheme $1^5 2^5 3^3$, which was suggested by [13], for the E-F pair. In addition, we use MOSES [16] as the phrase-based SMT framework. We measure performance using the BLEU metric [17], which estimates the accuracy of translation output with respect to a reference translation.
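As a reference point for how the scores in Tables I and II are computed, the following is a minimal single-reference, unsmoothed sentence-level BLEU sketch in the spirit of [17]; this is our own illustration, while the experiments themselves use the standard corpus-level implementation shipped with the evaluation tools:

    import math
    from collections import Counter

    def bleu(hyp, ref, max_n=4):
        """Unsmoothed single-reference BLEU [17]: geometric mean of the
        modified n-gram precisions times the brevity penalty.
        hyp and ref are token lists."""
        log_prec = 0.0
        for n in range(1, max_n + 1):
            h = Counter(tuple(hyp[k:k + n]) for k in range(len(hyp) - n + 1))
            r = Counter(tuple(ref[k:k + n]) for k in range(len(ref) - n + 1))
            clipped = sum(min(c, r[g]) for g, c in h.items())
            total = max(sum(h.values()), 1)
            if clipped == 0:
                return 0.0               # some n-gram level has no match
            log_prec += math.log(clipped / total) / max_n
        # brevity penalty: penalize hypotheses shorter than the reference
        bp = 1.0 if len(hyp) > len(ref) \
            else math.exp(1 - len(ref) / max(len(hyp), 1))
        return bp * math.exp(log_prec)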
B. The “Boosting” Performance
We concentrate our evaluations on two aspects. The first is the “boosting” performance on each word-alignment model. The second is the “boosting” performance of a higher alignment model based on its own improvement plus the improvement inherited from the lower alignment models. Table I describes the “boosting” results of our proposed method for the E-V translation system. Similarly, Table II describes the improvements for the E-F pair.
Table I. THE “BOOSTING” PERFORMANCE FOR THE ENGLISH-VIETNAMESE TRANSLATION SYSTEM

Word-alignment model             Baseline BLEU   Improved BLEU   Gain
Model 2 (+Improved Model 1)      19.58           20.31           0.73
Model 3 (+Improved Model 1-2)    18.88           20.12           1.24
Table II. THE “BOOSTING” PERFORMANCE FOR THE ENGLISH-FRENCH TRANSLATION SYSTEM

Word-alignment model             Baseline BLEU   Improved BLEU   Gain
Model 2 (+Improved Model 1)      26.51           27.05           0.54
Model 3 (+Improved Model 1-2)    26.30           27.44           1.14
1) Improving Word Alignment Quality: From Tables I and II, we can see that adding the Pearson's Chi-square information improves the quality of our systems. It boosts the performance by around 0.51% for the E-V pair and 0.44% for the E-F pair when we use IBM Model 1 as the word-alignment component, and we obtain similar improvements for the other IBM models.

There are two interesting findings here. First, we obtain a larger improvement for the fertility-based alignment models (0.84% for the first pair and 0.61% for the second pair). Second, we obtain a larger improvement for the E-V pair than for the second pair. This is consistent with the previous work of [15], which points out that modelling the word alignment for a pair of languages that differ considerably in grammar, such as E-V, is more difficult than for the other cases; it means that for such a pair, the $t$ translation parameter is considerably less accurate.
2) The total performance: This experimental evaluation focuses on the other aspect we concentrate on. That is, for a given training scheme (for example, $1^5 2^3 3^3$ for the E-V pair or $1^5 2^5 3^3$ for the E-F pair), we scrutinize whether simultaneously improving not only Models 1-2 but also Model 3 yields a larger improvement than simply improving Model 3 alone. The experimental result is quite impressive (1.24% vs. 0.84%, and 1.14% vs. 0.61%). This firmly confirms our analysis in Section II: improving the lexical translation modelling allows us to “boost” all of the “hidden” power of the higher IBM models and therefore deeply improves the quality of statistical phrase-based translation.
V. CONCLUSION
Phrase-based models represent the current state of the art in statistical machine translation, and the phrase learning step in statistical phrase-based translation is accordingly important. This work focuses on an approach in which we integrate the Pearson's Chi-square test information into the IBM models to obtain a better performance. We directly test our improvement on the overall system to obtain a more reliable evaluation. Besides, we also point out that improving the lexical translation modelling allows us to “boost” all of the “hidden” power of the higher IBM models and therefore deeply improves the quality of statistical phrase-based translation.

In summary, we believe attacking lexical translation is a good way to improve the overall quality of statistical phrase-based translation. Hence, the quality of statistical phrase-based systems, with improved phrase learning together with integrated linguistic information, could come close to the state of the art.
VI. ACKNOWLEDGEMENT

This work is supported by the project "Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application", which is funded by Vietnam National University, Hanoi. It is also supported by the project KC.01.TN04/11-15.
REFERENCES
[1] F. J. Och and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 295–302. [Online]. Available: http://dx.doi.org/10.3115/1073083.1073133

[2] ——, "The alignment template approach to statistical machine translation," Comput. Linguist., vol. 30, pp. 19–51, June 2004. [Online]. Available: http://dx.doi.org/10.1162/089120103321337421

[3] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, ser. NAACL '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 48–54.

[4] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ser. ACL '05. Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 263–270. [Online]. Available: http://dx.doi.org/10.3115/1219840.1219873

[5] G. Sanchis-Trilles and F. Casacuberta, "Log-linear weight optimisation via bayesian adaptation in statistical machine translation," in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, ser. COLING '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 1077–1085. [Online]. Available: http://dl.acm.org/citation.cfm?id=1944566.1944690

[6] J. B. Mariño, R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A. R. Fonollosa, and M. R. Costa-jussà, "N-gram-based machine translation," Comput. Linguist., vol. 32, no. 4, pp. 527–549, Dec. 2006. [Online]. Available: http://dx.doi.org/10.1162/coli.2006.32.4.527

[7] D. Vilar, M. Popović, and H. Ney, "AER: Do we need to "improve" our alignments?" in International Workshop on Spoken Language Translation, Kyoto, Japan, Nov. 2006, pp. 205–212.

[8] C. Hoang, A. Le, and B. Pham, "Refining lexical translation training scheme for improving the quality of statistical phrase-based translation (to appear)," in Proceedings of The 3rd International Symposium on Information and Communication Technology. ACM digital library, 2012.

[9] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Comput. Linguist., vol. 19, pp. 263–311, June 1993. [Online]. Available: http://dl.acm.org/citation.cfm?id=972470.972474

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[11] H. Wu, H. Wang, and Z. Liu, "Alignment model adaptation for domain-specific word alignment," in ACL, 2005.

[12] W. A. Gale and K. W. Church, "Identifying word correspondence in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, ser. HLT '91. Stroudsburg, PA, USA: Association for Computational Linguistics, 1991, pp. 152–157. [Online]. Available: http://dx.doi.org/10.3115/112405.112428

[13] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Comput. Linguist., vol. 29, pp. 19–51, March 2003. [Online]. Available: http://dx.doi.org/10.1162/089120103321337421

[14] C. Hoang, A. Le, P. Nguyen, and T. Ho, "Exploiting non-parallel corpora for statistical machine translation," in Proceedings of The 9th IEEE-RIVF International Conference on Computing and Communication Technologies. IEEE Computer Society, 2012, pp. 97–102.

[15] C. Hoang, A. Le, and B. Pham, "A systematic comparison of various statistical alignment models for statistical English-Vietnamese phrase-based translation (to appear)," in Proceedings of The 4th International Conference on Knowledge and Systems Engineering. IEEE Computer Society, 2012.

[16] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ser. ACL '07. Stroudsburg, PA, USA: Association for Computational Linguistics, 2007, pp. 177–180.

[17] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 311–318.