Improving Word Alignment for Statistical Machine
Translation based on Constraints
Le Quang Hung
Faculty of Information Technology, Quynhon University, Vietnam. Email: hungqnu@gmail.com
Le Anh Cuong
University of Engineering and Technology, Vietnam National University, Hanoi. Email: cuongla@vnu.edu.vn
Abstract—Word alignment is an important and fundamental task for building a statistical machine translation (SMT) system. However, obtaining word-level alignments in parallel corpora with high accuracy is still a challenge. In this paper, we propose a new method, based on a constraint approach, to improve the quality of word alignment. Our experiments show that using constraints for the parameter estimation of the IBM models reduces the alignment error rate by 7.26% (relative) and increases the BLEU score by 5% (relative), in the case of translation from English to Vietnamese.
I. INTRODUCTION

Word alignment is a core component of every SMT system. In fact, the initial quality of statistical word alignment dominates the quality of SMT [1]. Most current SMT systems [2], [3] use statistical models for word alignment, such as GIZA++, which implements the IBM models [4]. However, the quality of alignment is typically quite low for language pairs that differ greatly in syntactic structure, such as English-Vietnamese and English-Chinese. It is therefore necessary to incorporate auxiliary information to alleviate this problem.
In our opinion, there are two factors that could reduce the alignment error rate: (1) adding more training data (parallel corpora); (2) developing more efficient methods for exploiting existing training data. In this work, we take the second path. We address the problem of efficiently exploiting existing parallel corpora by directly adding constraints to the Expectation Maximization (EM) parameter estimation procedure of the IBM models. From our surveys, the original IBM models contain no mechanism to prevent undesirable alignments, and thus each word in the source sentence may align to any word in the target sentence. To guide the model toward correct alignments, we employ constraints to limit the range of words that a given word may align with (in a parallel sentence pair).
Some previous works have incorporated auxiliary information into the process of estimating the IBM models' parameters, such as [5], [6]. Och and Ney [5] used a bilingual dictionary as an additional knowledge source for extending parameter estimation. Talbot [6] used knowledge sources such as cognate relations, a bilingual dictionary, and numeric pattern matching to generate anchor constraints, which were then used to decide which word pairings were permissible during parameter estimation. We differ from these two approaches in that we do not use a bilingual dictionary to generate anchor constraints. Instead, we use lexical pairs and cognates [7] extracted from the training data as anchor points. Thus, our method has the advantage of requiring no extra resources compared to the approaches in [5], [6].
In this paper, we first propose: (1) a new constraint type that relies on the distance between the position of a source word and the position of a target word in a parallel sentence pair; and (2) a novel method to generate anchor constraints. We then incorporate the constraints into the parameter estimation of the IBM models to improve the quality of word alignments.

The remainder of the paper is organized as follows. Section II presents IBM Model 1 and the EM algorithm. Section III describes our method for improving word alignment models. Experimental results are shown in Section IV. Finally, conclusions are drawn in Section V.
II. IBM MODEL-1 AND EM ALGORITHM

Suppose that we are working with the two languages English and Vietnamese. Given an English sentence $e$ consisting of $I$ words $e_1, \ldots, e_I$ and a Vietnamese sentence $f$ consisting of $J$ words $f_1, \ldots, f_J$, we define the alignment $a$ between $e$ and $f$ as a subset of the Cartesian product of the word positions:

$$a \subseteq \{(j, i) : j = 1, \ldots, J;\ i = 1, \ldots, I\} \qquad (1)$$
In statistical alignment models, a hidden alignment $a$ connects words in the target language sentence to words in the source language sentence. The set of alignments $a$ is defined as the set of all possible connections of each word position $j$ in the target language sentence to exactly one word position $i$ in the source language sentence. The translation probability $Pr(f|e)$ can be calculated as the sum of $Pr(f, a|e)$ over all possible alignments:

$$Pr(f|e) = \sum_{a} Pr(f, a|e) \qquad (2)$$
The joint probability $Pr(f, a|e)$ depends only on the parameter $t(f_j|e_{a_j})$, the probability that the target word $f_j$ is a translation of the source word $e_{a_j}$ at position $a_j$:

$$Pr(f, a|e) = \frac{\epsilon}{(I + 1)^J} \prod_{j=1}^{J} t(f_j|e_{a_j}) \qquad (3)$$
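For concreteness, equation (3) can be computed directly from a lexical table. Below is a minimal sketch (the identifiers are ours, not from the paper, and $\epsilon$ is left as a constant):

```python
def model1_joint_prob(f_words, e_words, a, t, eps=1.0):
    # Eq. (3): Pr(f, a|e) = eps / (I + 1)^J * prod_j t(f_j | e_{a_j}).
    # a[j] holds, for each target position j (0-based here), the source
    # position it aligns to in e padded with the NULL word at position 0.
    I, J = len(e_words), len(f_words)
    e_padded = ["NULL"] + list(e_words)  # e_0 = NULL
    prob = eps / (I + 1) ** J
    for j, f in enumerate(f_words):
        prob *= t[f, e_padded[a[j]]]
    return prob
```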
The EM algorithm [8] is used to iteratively estimate the alignment model probabilities according to the likelihood of the model on a parallel corpus. The algorithm consists of two steps:
1) Expectation step (E-step): apply the model to the data; alignment probabilities are computed from the model parameters.
2) Maximization step (M-step): estimate the model from the data; parameter values are re-estimated based on the alignment probabilities and the corpus.
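To make these two steps concrete, here is a minimal sketch of the unconstrained Model 1 estimator on a toy corpus (the identifiers are ours; the NULL source word of [4] is omitted for brevity, and the default of 6 iterations matches the training scheme of Section IV):

```python
from collections import defaultdict

def ibm1_em(corpus, iterations=6):
    """Unconstrained IBM Model 1 EM.

    corpus: list of (f_words, e_words) sentence pairs, where f is the
    target (Vietnamese) side and e the source (English) side.
    Returns the lexical translation table t(f|e).
    """
    e_vocab = {e for _, e_words in corpus for e in e_words}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f|e)
        total = defaultdict(float)  # normalizer per source word e
        # E-step: distribute each target word f over all source words e
        # of its sentence pair, in proportion to the current t(f|e).
        for f_words, e_words in corpus:
            for f in f_words:
                z = sum(t[f, e] for e in e_words)
                for e in e_words:
                    p = t[f, e] / z
                    count[f, e] += p
                    total[e] += p
        # M-step: re-estimate t(f|e) from the expected counts.
        for (f, e), c in count.items():
            t[f, e] = c / total[e]
    return t

toy = [(["tôi", "yêu", "mèo"], ["i", "love", "cats"]),
       (["tôi", "yêu", "chó"], ["i", "love", "dogs"])]
table = ibm1_em(toy)
print(round(table["tôi", "i"], 3))  # rises well above the uniform 1/4
```

Note how, absent any constraint, every source word of a sentence pair receives a share of every target word's count; this is exactly the behavior the constraints of Section III restrict.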
IBM Model 1 [4] was originally developed to provide reasonable initial parameter estimates for more complex word-alignment models [9]. The next section presents how to use the constraints in the parameter estimation of this model.
III. THE PROPOSED APPROACH
In this work, we add constraints directly to the EM algorithm and modify the standard parameter estimation procedure for IBM Model 1. Note that the constrained alignment parameter is $t(f_j|e_i)$.¹ We use two types of constraint: anchor and distance. The constraints are formulated as boolean functions. These are then used in the standard forward-backward recursions to directly restrict the posterior distribution inferred in the E-step. Both types of constraint are described below.
A. Anchor constraint
Anchor constraints are exclusive constraints that force a confident alignment between two words. The alignment between the words of an anchor point is forced by setting the translation probabilities at that position to zero for all other words in the E-step [6]. Given a sentence pair $(f, e)$, if the word pair $(f_j, e_i)$ is an anchor point, then we assign translation probabilities $t(f_j|e_k) = 0$ for all $k \neq i$, and $t(f_l|e_i) = 0$ for all $l \neq j$. Figure 1 shows an example of anchor constraints. As can be seen in the figure, the word pair (tôi, me) is an anchor point; thus, the translation probabilities between tôi and other words, such as (tôi, a), (tôi, car), and (tôi, passed), are set to zero. In this work, we used the following knowledge sources to generate anchor constraints:
1) Cognate: According to Kondrak [10], the term cognate denotes words in different languages that are similar in their orthographic or phonetic form and are possible translations of each other. Cognates are particularly useful when machine-readable bilingual dictionaries are not available. We differ from Kondrak's method [10]: he used three word similarity measures (Simard's condition, Dice's coefficient, and LCSR) to extract cognates, whereas we select words that are not translated and co-occur in an aligned sentence pair (e.g., abbreviations, numbers, punctuation, ...). Note that in our method, these cognates are extracted directly from the corpus during training.

¹ In the following sections, we use $t(f_j|e_i)$ instead of $t(f_j|e_{a_j})$.

Fig. 1. An example of anchor constraints (black), setting translation probabilities to zero for all other word pairs (dark grey).
2) Lexical pairs: From our observations, the most frequent source words are likely to be translated into words that are also frequent on the target side. In order to extract anchor points of this kind, we combine the translation probability $t(f_j|e_i)$ and the frequency $count(f_j, e_i)$ of word pairs in the training data. This lets us select word pairs with high accuracy to generate anchor points. We define a lexical list $L$ as a set of entries:

$$L = \{(f_j, e_i) \mid t(f_j|e_i) > \alpha,\ count(f_j, e_i) > \beta\} \qquad (4)$$

Here, $e_i$ is a source language word, $f_j$ is a target language word, and $\alpha, \beta$ are predefined thresholds (in the experiments, we used $\alpha = 0.5$ and $\beta = 10$). We now formulate the constraint based on anchor points by a boolean function $anchor\_constraint(f_j, e_i)$, as follows:

$$anchor\_constraint(f_j, e_i) = \begin{cases} \text{true} & \text{if } (f_j = e_i) \lor (f_j, e_i) \in L \\ \text{false} & \text{otherwise} \end{cases} \qquad (5)$$
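As an illustration, the following sketch generates anchor points per equations (4) and (5) and applies the zeroing described above in the E-step. The paper gives no pseudocode, so the function names are ours; `t` and `counts` are assumed to be the lexical table and co-occurrence counts from training:

```python
def build_lexical_list(t, counts, alpha=0.5, beta=10):
    # Eq. (4): keep word pairs that are both probable and frequent.
    return {(f, e) for (f, e), n in counts.items()
            if n > beta and t[f, e] > alpha}

def anchor_constraint(f, e, lex):
    # Eq. (5): identical surface form (our cognate test) or lexical pair.
    return f == e or (f, e) in lex

def apply_anchor_constraints(f_words, e_words, t, lex):
    # For each anchor point (f_j, e_i), zero the translation probability
    # of every competing pair in this sentence pair (Section III-A).
    t_local = dict(t)  # sentence-local copy; the global table is untouched
    for j, f in enumerate(f_words):
        for i, e in enumerate(e_words):
            if anchor_constraint(f, e, lex):
                for k, e_k in enumerate(e_words):
                    if k != i:
                        t_local[f, e_k] = 0.0  # t(f_j|e_k) = 0 for k != i
                for l, f_l in enumerate(f_words):
                    if l != j:
                        t_local[f_l, e] = 0.0  # t(f_l|e_i) = 0 for l != j
    return t_local
```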
B. Distance constraint

From our surveys, we observe that words in the source language sentence usually have a positional distance relationship with words in the target language sentence.² From this point of view, we propose a new constraint type that relies on the distance between the position of the source word and the position of the target word in a parallel sentence pair. We formulate this by the boolean function $distance\_constraint(i, j)$, as follows:

$$distance\_constraint(i, j) = \begin{cases} \text{true} & \text{if } |i - j| \le \delta \\ \text{false} & \text{otherwise} \end{cases} \qquad (6)$$

Here, $|i - j|$ is the distance from a source position $i$ to a target position $j$, and $\delta$ is a predefined threshold (in the experiments, we set $\delta = 2$). This means that, given a sentence pair $(f, e)$, each target position $j$ is aligned only with source positions in the range $[j - \delta, j + \delta]$. As can be seen in Figure 2, the target word at position 5 (the word vợ) aligns only with the source words at positions in the range [3, 7] (the words a, wife, and, kid, to). It is worth emphasizing that, unlike the anchor constraints, which set translation probabilities to zero for all unconstrained words, we estimate the translation probability $t(f_j|e_i)$ as a mixture of constrained and unconstrained statistics for each word pair $(f_j, e_i)$. To amplify the contribution of constraints of this kind, we weight the statistics collected in the EM algorithm. We use a single parameter $\lambda$ that assigns a high weight if a word pair $(f_j, e_i)$ is constrained and a very low weight otherwise. That is, the translation probability $t(f_j|e_i)$ is multiplied by $\lambda$ when the constraint is satisfied and by $(1 - \lambda)$ otherwise (in the experiments, $\lambda$ is set to 0.99).

² This will be confirmed in the experiments section.

Fig. 2. An example of the distance constraint with threshold $\delta = 2$: each target position $j$ (black) aligns only with source positions in the range $[j - \delta, j + \delta]$ (dark grey).
Here, the translation probability $t(f_j|e_i)$ is used to collect counts in the EM algorithm. Similar to Brown et al. [4], we call the expected number of times that $e$ connects to $f$ in the translation $(f|e)$ the count of $f$ given $e$ for $(f|e)$, and denote it by $c(f|e; f, e)$. Let $E_1$ and $E_2$ denote the sets of English words for which the distance constraint is satisfied and unsatisfied, respectively. We collect the counts $c(f|e; f, e)$ from a sentence pair $(f, e)$ as follows³:

$$c(f|e; f, e) = \left( \frac{\lambda\, t(f|e)}{\sum_{e_k \in E_1} t(f|e_k)} + \frac{(1 - \lambda)\, t(f|e)}{\sum_{e_l \in E_2} t(f|e_l)} \right) \sum_{j=1}^{J} \delta(f, f_j) \sum_{i=0}^{I} \delta(e, e_i) \qquad (7)$$

³ In equation (7), $\sum_{i=0}^{I} \delta(e, e_i)$ is the count of $e$ in $e$, and $\sum_{j=1}^{J} \delta(f, f_j)$ is the count of $f$ in $f$.
After collecting these counts over the corpus, we estimate the translation probability $t(f|e; f, e)$ by equation (8):

$$t(f|e; f, e) = \frac{\sum_{(f,e)} c(f|e; f, e)}{\sum_{f} \sum_{(f,e)} c(f|e; f, e)} \qquad (8)$$
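Under our reading of equation (7), where $E_1$ and $E_2$ are taken relative to each target position $j$, the distance-weighted count collection might look as follows (a sketch, not the authors' code; identifiers are ours):

```python
from collections import defaultdict

def collect_distance_counts(f_words, e_words, t, delta=2, lam=0.99):
    # One sentence pair's contribution to c(f|e; f, e) per eq. (7).
    # Positions are 1-based, as in the paper; the NULL source word is
    # omitted for brevity.
    count = defaultdict(float)
    for j, f in enumerate(f_words, start=1):
        # E1/E2: source words whose position satisfies / violates
        # the distance constraint abs(i - j) <= delta of eq. (6).
        E1 = [e for i, e in enumerate(e_words, start=1) if abs(i - j) <= delta]
        E2 = [e for i, e in enumerate(e_words, start=1) if abs(i - j) > delta]
        z1 = sum(t[f, e] for e in E1)
        z2 = sum(t[f, e] for e in E2)
        for i, e in enumerate(e_words, start=1):
            if abs(i - j) <= delta:
                if z1 > 0:
                    count[f, e] += lam * t[f, e] / z1       # weight lambda
            elif z2 > 0:
                count[f, e] += (1 - lam) * t[f, e] / z2     # weight 1 - lambda
    return count
```

The M-step of equation (8) then renormalizes these counts per source word, exactly as in the unconstrained estimator of Section II.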
IV. EXPERIMENTAL SETUP

In this section, we present the results of experiments on an English-Vietnamese parallel corpus. The experiments were carried out using the English-Vietnamese data sets provided by [11]. We designed four training data sets consisting of 60k, 70k, 80k, and 90k sentence pairs, respectively. The main experiments fell into the following categories:
1) Verifying that the use of the proposed constraints has an impact on the quality of alignments.
2) Evaluating whether improved parameter estimates of alignment quality lead to improved translation quality.
To choose the thresholds $\alpha$, $\beta$, $\delta$, and the parameter $\lambda$, we trained on the 60k sentence pairs and selected $\alpha = 0.5$, $\beta = 10$, $\delta = 2$, and $\lambda = 0.99$, which achieved the highest performance.

Our baseline is a phrase-based SMT system that uses the Moses toolkit [2] for translation model training and decoding, and GIZA++ [5] for word alignment.
A. Word alignment experiments
We used the alignment error rate (AER) metric, as defined by Och and Ney [5], to measure the quality of alignments:

$$precision = \frac{|A \cap P|}{|A|}, \qquad recall = \frac{|A \cap S|}{|S|}, \qquad AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$$

where $S$ denotes the annotated set of sure alignments, $P$ denotes the annotated set of possible alignments, and $A$ denotes the set of alignments produced by the model under test [9].
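These three quantities are straightforward to compute once $A$, $S$, and $P$ are represented as sets of $(j, i)$ links. A minimal helper (ours, not from the paper):

```python
def alignment_scores(A, S, P):
    # A: predicted links, S: sure links, P: possible links (S is a
    # subset of P), each a set of (j, i) pairs, per Och and Ney [5].
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    aer = 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, aer
```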
In all the experiments below, we perform the same training scheme; the actual training order is 6 iterations of Model 1, 3 iterations of Model 2, and 3 iterations of Model 3. We used 150 sentence pairs as a held-out hand-aligned set to measure word alignment quality. Table I gives the quality of alignments for the IBM models when trained with GIZA++ on the four training data sets.

We obtained better results when incorporating the constraints into the parameter estimation of the IBM models. Table II shows the results for the different corpus sizes. The best-performing model trained with GIZA++ used 90k sentence pairs and had an alignment error rate of 24.48%; the best-performing of our modified IBM models was likewise trained on 90k sentence pairs, with an alignment error rate of 22.53%. We obtain a relative reduction of AER of about 7.26% over all training data sets. In the baseline, increasing the size of the training data reduces the alignment error rate, but only minimally, with an average improvement of 0.91% per 10k sentence pairs (see Figure 3).
B. Machine translation experiments
In order to test that our improved parameter estimates lead to better translation quality, we used a phrase-based decoder [2] to translate a set of English sentences into Vietnamese. The phrase-based decoder extracts phrases from the word alignments produced by the experiments in the previous section
TABLE I
QUALITY OF ALIGNMENTS FOR IBM MODELS (BASELINE).
Size of training data Precision Recall AER
60k 67.79 83.49 25.16
70k 68.22 83.68 24.83
80k 68.66 83.49 24.64
90k 68.63 83.93 24.48
TABLE II
QUALITY OF ALIGNMENTS FOR IBM MODELS TRAINED WITH CONSTRAINTS.
Size of training data Precision Recall AER
60k 71.82 81.44 23.66
70k 71.82 82.87 23.04
80k 72.21 83.18 22.68
90k 72.57 83.06 22.53
(the word alignment experiments). We trained a language model using the 90k Vietnamese sentences from the training set. For the evaluation of translation quality, we used the BLEU metric [12]. A test set of 5k sentence pairs was used to evaluate SMT quality. Table III shows that our method leads to better translation quality than the baseline [2]. We achieve a higher BLEU score on all training data sets. The average improvement is 1.04 BLEU points absolute (5.0% relative) compared to the baseline.
V. CONCLUSION

In this paper, we have proposed a novel method to improve word alignments by incorporating constraints into the parameter estimation of the IBM models. These constraints are used to prevent undesirable alignments that cannot be excluded in the standard IBM models. Experimental results show that our proposed method significantly improves alignment accuracy and increases translation quality for the case of translating from English to Vietnamese. When we improve IBM Model 1, the initial estimates transferred to the higher IBM models are better, and therefore the overall quality improves. We believe that our method can be applied to other language pairs, because the constraints used in the proposed method are language independent. In the future, we will extend our work with more advanced constraints to further improve the quality of alignments.
ACKNOWLEDGEMENT

This work is supported by the project "Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application", which is funded by Vietnam National University, Hanoi.
REFERENCES

[1] J.-H. Lee, S.-W. Lee, G. Hong, Y.-S. Hwang, S.-B. Kim, and H.-C. Rim, "A post-processing approach to statistical word alignment reflecting alignment tendency between part-of-speeches," in Coling 2010: Posters. Beijing, China: Coling 2010 Organizing Committee, August 2010, pp. 623-629.
Fig. 3. Comparison of the word alignment quality of the baseline with our method.
TABLE III
COMPARISON OF SMT QUALITY OF THE BASELINE WITH OUR METHOD.

Size of training data Baseline Our method Δ(%)
[2] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, ser. NAACL '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 48-54.
[3] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Comput. Linguist., vol. 30, no. 4, 2004.
[4] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Comput. Linguist., vol. 19, no. 2, pp. 263-311, Jun. 1993.
[5] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, 2003.
[6] D. Talbot, "Constrained EM for parallel text alignment," Nat. Lang. Eng., vol. 11, no. 3, pp. 263-277, Sep. 2005.
[7] M. Simard, G. F. Foster, and P. Isabelle, "Using cognates to align sentences in bilingual corpora," in Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing - Volume 2, ser. CASCON '93. IBM Press, 1993, pp. 1071-1082.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
[9] R. C. Moore, "Improving IBM word-alignment model 1," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ser. ACL '04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004.
[10] G. Kondrak, D. Marcu, and K. Knight, "Cognates can improve statistical translation models," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers - Volume 2, ser. NAACL-Short '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 46-48.
[11] C. Hoang, A.-C. Le, P.-T. Nguyen, and T.-B. Ho, "Exploiting non-parallel corpora for statistical machine translation," in RIVF. IEEE, 2012, pp. 1-6.
[12] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02, 2002, pp. 311-318.