VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
HAI-LONG TRIEU
BILINGUAL SENTENCE ALIGNMENT
BASED ON SENTENCE LENGTH AND
WORD TRANSLATION
MASTER THESIS OF INFORMATION TECHNOLOGY
Hanoi - 2014
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
HAI-LONG TRIEU
BILINGUAL SENTENCE ALIGNMENT
BASED ON SENTENCE LENGTH AND
WORD TRANSLATION
Major: Computer Science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
SUPERVISOR: PhD Phuong-Thai Nguyen
Hanoi - 2014
ORIGINALITY STATEMENT
I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception, or in style, presentation, and linguistic expression, is acknowledged.
Signed
Acknowledgements
I would like to thank my advisor, PhD Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge I have been given during my Master's course. I would also like to show my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation process. I would like to thank PhD Van-Vinh Nguyen for examining my work and giving advice on it, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh Nguyen for supporting and checking some issues in my research.

In addition, I would like to express my thanks to the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who have taught and helped me during my whole time studying at UET.

Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.
Abstract
Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. In order to apply these abundant materials in useful applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in texts of the source language to their corresponding units in texts of the target language. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including statistical machine translation, word sense disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.

There have been a number of algorithms proposed with different approaches for sentence alignment. However, they may be classified into a few major categories. First of all, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs that have a high similarity in sentence lengths. The second set of methods is based on word correspondences, or lexicons. These methods take into account lexical information about the texts, matching content across the texts or using cognates. An external dictionary may be used in these methods, so they are more accurate but slower than the first ones. There are also methods based on hybrids of these first two approaches that combine their advantages, so they obtain alignments of quite high quality.

In this thesis, I summarize general issues related to sentence alignment, evaluate the approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. From analyzing the limits of this method, I propose an algorithm using a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced by analyzing its framework, and I describe the advantages as well as the weaknesses of this approach. In addition to this, I describe the background knowledge, the algorithm of bilingual word clustering, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, together with evaluations that demonstrate the benefits of the proposed method.

Keywords: sentence alignment, parallel corpora, natural language processing, word clustering
Table of Contents
2.6.1 Microsoft's Bilingual Sentence Aligner: Moore, 2002
2.7 Other Proposals
CHAPTER THREE: Our Approach
3.2 Moore's Approach

List of Figures
Paragraph length (Gale and Church, 1993)
Equation in dynamic programming (Gale and Church, 1993)
A bitext space in Melamed's method (Melamed, 1996)
The method of Varga et al., 2005
The method of Braune and Fraser, 2010
Framework of sentence alignment in our algorithm
An example of Brown's cluster algorithm
English word clustering data
Looking up the probability of a word pair
Looking up in a word cluster
Handling the case where one word is contained in the dictionary
Comparison in Precision
Comparison in Recall

List of Tables
An entry in a probabilistic dictionary (Gale and Church, 1993)
Alignment pairs (Sennrich and Volk, 2010)
Training data
CHAPTER ONE

Introduction

Parallel corpora are useful in many applications such as machine translation, cross-language information retrieval, word sense disambiguation, bilingual lexicography, automatic translation verification, and automatic acquisition of knowledge about translation. Building a parallel corpus, therefore, helps connect the languages under consideration [1, 5, 7, 12-13, 15-16].
Parallel texts, however, are useful only when they are sentence-aligned. A parallel corpus is first collected from various resources, and the translated segments forming it are very large: their size is usually of the order of entire documents, which makes learning word correspondences an ambiguous task. The solution is to reduce the ambiguity by first decreasing the size of the segments within each pair, which is known as the sentence alignment task [7, 12-13, 16].
Sentence alignment is a process that maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 12, 14, 20]. This task constructs a detailed map of the correspondence between a text and its translation (a bitext map) [14]. It is the first stage for statistical machine translation. With aligned sentences, we can perform further analyses such as phrase and word alignment, bilingual terminology extraction, and collocation extraction, as well as other applications [3, 7-9, 17]. Efficient and powerful sentence alignment algorithms, therefore, become increasingly important.
A number of sentence alignment algorithms have been proposed [1, 7, 9, 12, 15, 17, 20]. Some of these algorithms are based on sentence length [3, 8, 20]; some use word correspondences [5, 11, 13-14]; some are hybrids of these two approaches [2, 6, 15, 19]. Additionally, there are also some other outstanding methods for this task [7, 17]. For details of these sentence alignment algorithms, see Sections 2.4, 2.5, 2.6, and 2.7.
I propose an improvement to an effective hybrid algorithm [15] used in sentence alignment. For details of our approach, see Section 3.4. I also carry out experiments to illustrate my research. For details of the corpora used in our experiments, see Section 4.2. For the results and discussion of the experiments, see Sections 4.4 and 4.5.
In the rest of this chapter, I describe some issues related to the sentence alignment task. In addition, I introduce the objectives of the thesis and our contributions. Finally, I describe the structure of this thesis.
1.2 Parallel Corpora
1.2.1 Definitions
Parallel corpora are collections of documents which are translations of each other [16]. Aligned parallel corpora are collections of pairs of sentences where one sentence is a translation of the other [1].
1.2.2 Applications
Bilingual corpora are an essential resource in multilingual natural language processing systems. This resource helps to develop data-driven natural language processing approaches. It also contributes to applying machine learning to machine translation [15-16].
1.2.3 Aligned Parallel Corpora
Once a parallel text is sentence-aligned, it provides the maximum utility [13]. Therefore, the task of aligning parallel corpora is of considerable interest, and a number of approaches have been proposed and developed to resolve this issue.

1.3 Sentence Alignment

This section gives more definitions of "alignment" as well as issues related to it.
Brown et al., 1991 assumed that every parallel corpus can be aligned in terms of a sequence of minimal alignment segments, which they call "beads", in which sentences align 1-to-1, 1-to-2, 2-to-1, 1-to-0, or 0-to-1.
Figure 1.1. A sequence of beads (Brown et al., 1991)
Groups of sentence lengths are circled to show the correct alignment. Each of the groupings is called a bead, and a number indicates the length of each sentence in the bead. In Figure 1.1, "17e" denotes the sentence length (17 words) of an English sentence, and "19f" denotes the sentence length (19 words) of a French sentence. The figure shows a sequence of beads as follows:
• An ef-bead (one English sentence aligned with one French sentence), followed by
• An eff-bead (one English sentence aligned with two French sentences), followed by
• An e-bead (one English sentence on its own), followed by
• A ¶ef-bead (one English paragraph aligned with one French paragraph)
An alignment, then, is simply a sequence of beads that accounts for the observed sequences of sentence lengths and paragraph markers [3]
There are quite a number of bead types, but it is possible to consider only some of them, including 1-to-1 (one sentence of the source language aligned with one sentence of the target language), 1-to-2 (one sentence of the source language aligned with two sentences of the target language), etc. Brown et al., 1991 [3] consider the beads 1-to-1, 1-to-0, 0-to-1, 1-to-2, and 2-to-1, together with paragraph beads (¶e, ¶f, ¶ef), because their method considers alignments by paragraphs. Moore, 2002 [15] only considers five of these beads, 1-to-1, 1-to-0, 0-to-1, 1-to-2, and 2-to-1, each of which is named as follows:
• 1-to-1 bead (a match)
• 1-to-0 bead (a deletion)
• 0-to-1 bead (an insertion)
• 1-to-2 bead (an expansion)
• 2-to-1 bead (a contraction)
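As a small illustration only (mine, not the thesis's), these five bead types can be represented together with prior probabilities. A minimal Python sketch follows; the numeric values are placeholders in the spirit of the frequency tables below, not statistics from any particular corpus.

# Illustrative only: bead types keyed by (source sentences, target sentences),
# with placeholder prior probabilities reflecting that 1-to-1 beads dominate.
BEAD_PRIORS = {
    (1, 1): 0.94,  # match
    (1, 0): 0.01,  # deletion
    (0, 1): 0.01,  # insertion
    (1, 2): 0.02,  # expansion
    (2, 1): 0.02,  # contraction
}

def bead_prior(src_count: int, tgt_count: int) -> float:
    """Prior probability of a bead type; 0.0 for unsupported configurations."""
    return BEAD_PRIORS.get((src_count, tgt_count), 0.0)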
The common information related to these bead types is their frequency. Table 1.1 shows the frequencies of the bead types reported by Gale and Church, 1993 [8].
Table 1.1. Frequency of alignments (Gale and Church, 1993)

Category        Prob(match)
1-1             0.89
1-0 or 0-1      0.0099
2-1 or 1-2      0.089
Meanwhile, the frequencies reported by Ma, 2006 [13] are illustrated in Table 1.2:

Table 1.2. Frequency of beads (Ma, 2006)
Table 1.3 also describes these bead-type frequencies as reported in Moore, 2002 [15].

Table 1.3. Frequency of beads (Moore, 2002)
Generally, the frequency of the 1-to-1 bead is the largest among all bead types in almost all corpora, at around 90%, whereas the other types each account for only a few percent.

1.3.3 Applications
Sentence alignment is an important topic in machine translation and an important first step for statistical machine translation. It is also the first stage in extracting structural and semantic information and deriving statistical parameters from bilingual corpora [17, 20]. Moreover, it is the first step in constructing a probabilistic dictionary (Table 1.4) for use in aligning words in machine translation, or in constructing a bilingual concordance.

Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993)

English    French
the        les
the        se
the        il
the        de
the        a
the        que
1.3.4 Challenges

Although this process might seem very easy, it has some important challenges which make the task difficult [9]:

The sentence alignment task is non-trivial because sentences do not always align 1-to-1. At times, a single sentence in one language might be translated as two or more sentences in the other language. The input text also affects the accuracy: the performance of sentence alignment algorithms decreases significantly when the input data becomes very noisy. Noisy data means that there are more 1-0 and 0-1 alignments in the data. For example, there are 89% 1-1 alignments in an English-French corpus (Gale and Church, 1991), and 1-0 and 0-1 alignments make up only 1.3% of that corpus, whereas in the UN Chinese-English corpus (Ma, 2006) there are 89% 1-1 alignments but 6.4% 1-0 or 0-1 alignments. Although some methods work very well on clean data, their performance goes down quickly as the data becomes noisy [13].
In addition, it is difficult to achieve perfectly accurate alignments even if the texts are easy and "clean". For instance, the success of an alignment program may decline dramatically when it is applied to a novel or a philosophical text, even though the same program gives excellent results when applied to a scientific text.
Alignment performance also depends on the languages of the corpus. For example, an algorithm based on cognates (words in language pairs that resemble each other phonetically) is likely to work better for English-French than for English-Hindi, because there are fewer cognates for English-Hindi [1].
1.3.5 Algorithms

A sentence alignment program is called "ideal" if it is fast, highly accurate, and requires no special knowledge about the corpus or the two languages [2, 9, 15]. A common requirement for sentence alignment approaches is the achievement of both high accuracy and minimal consumption of computational resources [2, 9]. Furthermore, a method for sentence alignment should also work in an unsupervised fashion and be language-pair independent in order to be applicable to parallel corpora in any language without requiring a separate training set. A method is unsupervised if it induces its alignment model directly from the data set to be aligned. Meanwhile, language-pair independence means that the approach requires no specific knowledge about the languages of the parallel texts to align.
1.4 Thesis Contents

This section introduces the organization of the contents of this thesis, including the objectives, our contributions, and the outline.
1.4.1 Objectives of the Thesis
In this thesis, I report the results of my study of sentence alignment and of the approaches proposed for this task. Especially, I focus on Moore's method (2002), an outstanding method with a number of advantages. I also explore a new feature, word clustering, which may be applied to this task to improve alignment accuracy. I examine this proposal in experiments and compare the results with those of the baseline method to demonstrate the advantages of my approach.
1.4.2 Contributions

My main contributions are as follows:

• Evaluating methods for sentence alignment and introducing an algorithm that improves Moore's method.
• Using a new feature, word clustering, which helps to improve alignment accuracy. This contributes to complementing existing strategies for the sentence alignment problem.

1.4.3 Outline
The rest of the thesis is organized as follows:
Chapter 2 — Related Works
In this chapter, I introduce some recent research on sentence alignment. In order to give a general view of the methods proposed to deal with this problem, an overall presentation of sentence alignment methods is provided. The methods are classified into several types, and each method is presented by describing its algorithm along with evaluations related to it.
Chapter 3— Our Approach
This chapter describes the method we propose for sentence alignment to improve Moore's method. First, an analysis of Moore's method and evaluations of it are presented. The major content of this chapter is the framework of the proposed method, an algorithm using bilingual word clustering. An example is described to illustrate the approach clearly.
Chapter 4— Experiments
This chapter presents the experiments performed with our approach. The data corpora used in the experiments are described in full. The results of the experiments, as well as discussions of them, are clearly described in order to evaluate our approach against the baseline method.
Chapter 5—Conclusions and Future Works
In this last chapter, the advantages and limitations of my work are summarized in a general conclusion. Besides, some research directions are mentioned for improving the current model in the future.

Finally, references are given to the published research that my work relies on.
1.5 Summary

This chapter has introduced my research work. I have given background information about parallel corpora and sentence alignment, definitions of the relevant issues, and some initial problems related to sentence alignment algorithms. The alignment terms used in this task have been defined in this chapter. In addition, an outline of my research work in this thesis has been provided, and a discussion of proposed future work is also presented.
CHAPTER TWO
Related Works
2.1 Overview
This chapter is an introduction to some research on sentence alignment in recent years, together with some evaluations of these approaches. A number of problems related to this work are also discussed: factors that affect the performance of alignment algorithms, searching, and the resources required by each method. Evaluations of each algorithm are given to provide a general view of its advantages as well as its weaknesses.
Section 2.2 provides an overview of sentence alignment approaches, and Section 2.3 discusses some important problems that affect them. Section 2.4 introduces and evaluates the primary length-based proposals. Section 2.5 introduces and evaluates word-correspondence-based proposals. Hybrid proposals, together with evaluations of each of them, are presented in Section 2.6. There are also some other outstanding approaches to this task, which are introduced in Section 2.7, and a summary concludes the chapter.
2.2 Overview of Approaches
2.2.1 Classification
From the first approaches proposed in the 1990s, there have been a number of publications on sentence alignment using different techniques.

Among the various sentence alignment algorithms that have been proposed, there are three widespread approaches, based respectively on a comparison of sentence lengths, on lexical correspondence, and on a combination of these first two methods.
There are also some other techniques, such as methods based on the BLEU score, support vector machines, and hidden Markov model classifiers.
2.2.2 Length-based Methods
Length-based approaches model the relationship between the lengths of sentences that are mutual translations. Length is measured in characters or words. In these approaches, the semantics of the text are not considered; statistical methods are used instead of the content of the texts. In other words, these methods only consider the lengths of sentences in order to make the alignment decision.
These methods are based on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences (in characters) and the variance of this difference. There are two random variables, l1 and l2, which are the lengths of the two sentences under consideration. It is assumed that these random variables are independent and identically distributed with a normal distribution [8].
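For concreteness, the length score Gale and Church derive from this model can be written as follows. This is a reconstruction from the literature rather than an equation reproduced in this chapter; l1 and l2 are the character lengths of the two sentences, c is the expected number of target-language characters per source-language character, and s^2 is the variance of that ratio:

\delta(l_1, l_2) = \frac{l_2 - l_1 c}{\sqrt{l_1 s^2}}

Under the normality assumption, delta is approximately standard normal for true translation pairs, and the cost assigned to a candidate correspondence is essentially -log P(match | delta).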
Given two parallel texts ST (source text) and TT (target text), the goal of this task is to find the alignment A with the highest probability:

A* = \arg\max_A P(A, ST, TT)
In order to estimate this probability, the aligned text is decomposed into a sequence of aligned sentence beads, where each bead is assumed to be independent of the others.
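Under this independence assumption, the probability factors over beads; written here as a reconstruction of the standard formulation:

P(A, ST, TT) \approx \prod_{b \in A} P(b)

so the most probable alignment can be found by minimizing the total cost \sum_{b \in A} -\log P(b) with dynamic programming.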
Algorithms of this type were first proposed by Brown et al., 1991 and Gale and Church, 1993. These approaches use sentence-length statistics in order to model the relationship between groups of sentences that are translations of each other. Wu (Wu, 1994) also uses the length-based method, applying the algorithm proposed by Gale and Church, and further uses lexical cues from a corpus-specific bilingual lexicon to improve alignment.

The methods of this type are based solely on the lengths of sentences, so they require almost no prior knowledge. Furthermore, these methods are highly accurate despite their simplicity, and they also run at high speed. When aligning texts whose languages are similar or have a high length correlation, such as English, French, and German, these approaches are especially useful and work remarkably well. They also perform fairly well if the input text is clean, such as the Canadian Hansards corpus [3]. The Gale and Church algorithm is still widely used today, for instance to align Europarl (Koehn, 2005).
Nevertheless, these methods are not robust, since they only use sentence-length information. They are no longer reliable if there is too much noise in the input bilingual texts. As shown in (Chen, 1993) [5], the accuracy of sentence-length-based methods decreases drastically when aligning texts containing small deletions or free translations; they can easily misalign small passages because they ignore word identities. The algorithm of Brown et al. requires corpus-dependent anchor points, while the method proposed by Gale and Church depends on a prior alignment of paragraphs to constrain the search. When aligning texts where the length correlation breaks down, such as the Chinese-English language pair, the performance of length-based algorithms declines quickly.
2.2.3 Word-Correspondence Methods
The second approach, one that tries to overcome the disadvantages of length-based approaches, is the word-based method, which relies on lexical information from translation lexicons and/or on the recognition of cognates. These methods take into account lexical information about the texts. Most algorithms match content words in one text with their correspondences in the other text and use these matches as anchor points in the sentence alignment task. Words which are translations of each other tend to have similar distributions in the source-language and target-language texts. Meanwhile, some methods use cognates (words in language pairs that resemble each other phonetically) rather than the content of word pairs to determine sentence beads.

This type of sentence alignment method is illustrated by some outstanding approaches such as Kay and Röscheisen, 1993 [11], Chen, 1993 [5], Melamed, 1996 [14], and Ma, 2006 [13]. Kay's work has not proved efficient enough to be suitable for large corpora, while Chen constructs a word-to-word translation model during alignment to assess the probability of an alignment. Word correspondence was further developed in IBM Model 1 (Brown et al., 1993) for statistical machine translation. Meanwhile, word correspondence used in another way (geometric correspondence) for sentence alignment was proposed by Melamed, 1996.
These algorithms have higher accuracy in comparison with length-based methods. Because they use lexical information from source texts and translation lexicons rather than only sentence length to determine the translation relationship between sentences in the source text and the target text, these algorithms are usually more robust than the length-based algorithms.
Nevertheless, algorithms based on a lexicon are slower than those based on sentence length because they require considerably more expensive computation. In addition, they usually depend on cognates or a bilingual lexicon. The method of Chen requires an initial bilingual lexicon; the proposal of Melamed, meanwhile, depends on finding cognates in the two languages to suggest word correspondences.
2.2.4 Hybrid Methods
Sentence length and lexical information can also be combined so that the different approaches complement each other and yield more efficient algorithms.

Such approaches are proposed in Moore, 2002, Varga et al., 2005, and Braune and Fraser, 2010. These approaches have two passes, in which a length-based method is used for a first alignment that subsequently serves as training data for a translation model, which is then used in a more complex similarity score. Moore, 2002 proposes a two-phase method that combines sentence length (word count) in the first pass and word correspondences (IBM Model 1) in the second one. Varga et al. (2005) also use the hybrid technique in sentence alignment by combining sentence length with word correspondences (using a dictionary-based translation model in which the dictionary can be manually expanded). Braune and Fraser, 2010 also propose an algorithm similar to Moore's, except that this approach has a technique to build 1-to-many and many-to-1 alignments rather than focusing only on 1-to-1 alignments as Moore's method does.
The hybrid approaches achieve relatively high performance and overcome the limits of the first two families of methods while combining their advantages. The approach of Moore, 2002 obtains high precision (the fraction of retrieved alignments that are in fact correct) and computational efficiency. Meanwhile, the algorithm proposed by Varga et al., 2005, which follows the same idea as Moore, 2002, attains a very high recall rate (the fraction of correct alignments that are retrieved by the algorithm).

Nonetheless, there are still weaknesses which should be handled in order to obtain a more efficient sentence alignment algorithm. In Moore's method, the recall rate is rather low, and this is especially problematic when aligning parallel corpora with much noise or sparse data. The approach of Varga et al., 2005, meanwhile, achieves a very high recall value; however, it still has a rather low precision rate.
2.3 Some Important Problems
2.3.2 Linguistic Distances
Another parameter which can also affect the performance of sentence alignment algorithms is the linguistic distance between the source language and the target language. Linguistic distance means the extent to which languages differ from each other. For example, English is linguistically "closer" to Western European languages (such as French and German) than it is to East Asian languages (such as Korean and Japanese). There are some measures to assess this linguistic distance, such as the number of cognate words and syntactic features. It is important to recognize that some algorithms may not perform so well if they rely on the closeness between languages while these languages are distant. An obvious example of this is that a method based on cognates is likely to work better for English-French or English-German than for English-Hindi because of the fewer cognates in English-Hindi. Hindi belongs to the Indo-Aryan branch, whereas English and German belong to the Germanic one.
2.3.3 Searching
Dynamic programming is the technique that most sentence alignment tools use to search for the best path of sentence pairs through a parallel text. This also means that the texts are assumed to be ordered monotonically, and none of these algorithms is able to extract sentence pairs in crossing positions. Nevertheless, most of these programs benefit from this search technique, and none of them reports weaknesses with it, because a characteristic of translations is that almost all sentences occur in the same order in both the source and target texts.

In this respect, algorithms may be confronted with problems of search-space size. Thus, pruning strategies to restrict the search space are also an issue that algorithms have to resolve.
2.4 Length-based Proposals

2.4.1 Brown et al., 1991

To perform the search for the best alignment, Brown et al. use dynamic programming. This technique requires time quadratic in the length of the texts being aligned, so it is not practical to align a large corpus as a single unit. The computation of the search may be reduced dramatically if the bilingual corpus is subdivided into smaller chunks. This subdivision is performed by using anchors. An anchor is a piece of text likely to be present at the same location in both halves of a bilingual corpus. Dynamic programming is first used to align the anchors, and then the technique is applied again to align the text between anchors.

The alignment computation of this algorithm is fast, since it makes no use of the lexical details of the sentences. Therefore, it is practical to apply this method to very large collections of text, especially for language pairs with a high length correlation.
2.4.2 Vanilla: Gale and Church, 1993
This algorithm performs sentence alignment based on a statistical model of sentence lengths measured in characters. It uses the fact that longer sentences in one language tend to be translated into longer sentences in another language.

This algorithm is similar to the proposal of Brown et al., except that Brown et al. measure sentence length in words whereas this algorithm measures it in characters. In addition, the algorithm of Brown et al. aligns a subset of the corpus for further research instead of focusing on entire articles. The work of Gale and Church (1991) supports this promise of wider applicability.

This sentence alignment program has two steps: first, paragraphs are aligned, and then sentences within each paragraph are aligned. The authors report that paragraph lengths are highly correlated; Figure 2.1 illustrates this correlation for the language pair English-German.
Figure 2.1. Paragraph length (Gale and Church, 1993)
A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences and the variance of this difference. This score is used in a dynamic programming framework to find the maximum-likelihood alignment of sentences. The use of dynamic programming allows the system to consider all possible alignments and find the minimum-cost alignment efficiently.
A distance function d is defined in a general way to allow for insertions, deletions, substitutions, etc. The function takes four arguments: x1, y1, x2, y2.

Let d(x1, y1; 0, 0) be the cost of substituting x1 with y1,
d(x1, 0; 0, 0) be the cost of deleting x1,
d(0, y1; 0, 0) be the cost of inserting y1,
d(x1, y1; x2, 0) be the cost of contracting x1 and x2 to y1,
d(x1, y1; 0, y2) be the cost of expanding x1 to y1 and y2, and
d(x1, y1; x2, y2) be the cost of merging x1 and x2 and matching them with y1 and y2.
The dynamic programming algorithm is summarized in the following recursion equation.

Let s_i, i = 1, ..., I, be the sentences of one language, and t_j, j = 1, ..., J, be the translations of those sentences in the other language. Let d be the distance function defined above, and let D(i, j) be the minimum distance between the sentences s_1, ..., s_i and their translations t_1, ..., t_j under the maximum-likelihood alignment. D(i, j) is computed recursively by minimizing over six cases (substitution, deletion, insertion, contraction, expansion, and merger), which, in effect, impose a set of slope constraints. The recursion uses the initial condition D(0, 0) = 0.
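The recursion itself does not survive legibly in the source text, so the following is a reconstruction of Gale and Church's recurrence in the notation just defined:

D(i, j) = \min \begin{cases}
D(i, j-1) + d(0, t_j; 0, 0) & \text{(insertion)} \\
D(i-1, j) + d(s_i, 0; 0, 0) & \text{(deletion)} \\
D(i-1, j-1) + d(s_i, t_j; 0, 0) & \text{(substitution)} \\
D(i-1, j-2) + d(s_i, t_{j-1}; 0, t_j) & \text{(expansion)} \\
D(i-2, j-1) + d(s_{i-1}, t_j; s_i, 0) & \text{(contraction)} \\
D(i-2, j-2) + d(s_{i-1}, t_{j-1}; s_i, t_j) & \text{(merger)}
\end{cases}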
This algorithm has some main characteristics, as follows:

• Firstly, it is a simple algorithm. The number of characters in the sentences is simply counted, and a dynamic programming model is used to find the correct alignment pairs. Many later researchers have integrated this method into their own because of this simplicity.
• Secondly, the algorithm can be used for any pair of languages because it does not use any lexical information.
• Thirdly, it has a low time cost, one of the most important criteria for applying a method to a very large bilingual corpus.
• Finally, it is also quite an accurate algorithm, especially when aligning data of language pairs with a high correlation, such as English-French or English-German.
As Gale and Church report, in comparison with a length-based method that counts words, such as the proposal of Brown et al., it is better to use characters rather than words to measure sentence length. The performance is better because there is less variability in the differences of sentence lengths so measured; using words as units increases the error rate by half. This method performs well at least on related languages. Its accuracy also depends on the type of alignment: it obtains the best results on 1-to-1 alignments but has a high error rate on more difficult alignments. The algorithm is still widely used today to align corpora such as the Europarl corpus (Koehn, 2005) and the JRC-Acquis (Steinberger et al., 2006).
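To make the preceding description concrete, here is a minimal Python sketch (my illustration, not code from the thesis or from Gale and Church's implementation). It aligns two lists of sentences with the six-case recursion above, but the match-probability model is reduced to a simple squared, scaled character-length difference and illustrative bead costs for brevity.

import math

# Illustrative -log prior costs for bead types (source count, target count).
BEAD_COST = {(1, 1): 0.0, (1, 0): 4.6, (0, 1): 4.6,
             (1, 2): 2.4, (2, 1): 2.4, (2, 2): 4.5}

def length_cost(src_chars: int, tgt_chars: int, c: float = 1.0, s2: float = 6.8) -> float:
    """Crude stand-in for -log P(delta): squared, scaled length difference."""
    delta = (tgt_chars - src_chars * c) / math.sqrt(max(src_chars, 1) * s2)
    return delta * delta

def align(src: list, tgt: list) -> list:
    """Return beads as (source sentence indices, target sentence indices) tuples."""
    I, J = len(src), len(tgt)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    slen, tlen = [len(s) for s in src], [len(t) for t in tgt]
    for i in range(I + 1):
        for j in range(J + 1):
            if D[i][j] == INF:
                continue
            for (di, dj), prior in BEAD_COST.items():
                ni, nj = i + di, j + dj
                if ni > I or nj > J:
                    continue
                cost = D[i][j] + prior + length_cost(sum(slen[i:ni]), sum(tlen[j:nj]))
                if cost < D[ni][nj]:
                    D[ni][nj], back[ni][nj] = cost, (i, j)
    beads, i, j = [], I, J                      # trace back the best path
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((tuple(range(pi, i)), tuple(range(pj, j))))
        i, j = pi, pj
    return list(reversed(beads))

For example, align(["Hello world.", "How are you?"], ["Bonjour le monde.", "Comment allez-vous ?"]) returns [((0,), (0,)), ((1,), (1,))], i.e., two 1-to-1 beads.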
2.4.3 Wu, 1994

Wu applies Gale and Church's method to the language pair English-Chinese. In order to improve the accuracy, he utilizes lexical information from translation lexicons and/or from the identification of cognates. The lexical cues used in this method take the form of a small corpus-specific lexicon.
This method is important in two respects:

• Firstly, it indicated that length-based methods give satisfactory results even between unrelated languages (languages from unrelated families, such as English and Chinese), a surprising result.
• Secondly, the lexical cues used in this method increase the accuracy of alignment. This demonstrates the effect of lexical information on accuracy when it is added to a length-based method.
2.5 Word-based Proposals
2.5.1 Kay and Röscheisen, 1993
This algorithm is based on word correspondences and follows an iterative relaxation approach. The iterations start from the assumption that the first and last sentences of the two texts align; these are the initial anchors. The steps below are then repeated until most sentences are aligned (sketched schematically after the list):

• Step 1: Form an envelope of possible alignments.
• Step 2: Choose word pairs that tend to co-occur in these potential partial alignments.
• Step 3: Find pairs of source and target sentences which contain many possible lexical correspondences.

A set of partial alignments which will be part of the final result is deduced by using the most reliable of these pairs.
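The loop below is only my schematic rendering of this procedure; the envelope construction, the co-occurrence statistic, and the thresholds are deliberate simplifications and not the formulas of Kay and Röscheisen.

from bisect import bisect_left
from collections import Counter

def kay_roscheisen_sketch(src_sents, tgt_sents, iterations=3, window=3, min_links=3):
    """Rough sketch of iterative relaxation over tokenized sentences (lists of words)."""
    n, m = len(src_sents), len(tgt_sents)
    anchors = {0: 0, n - 1: m - 1}                      # assume the two ends align
    for _ in range(iterations):
        keys = sorted(anchors)

        def expected(i):                                # interpolate between anchors
            k = min(bisect_left(keys, i), len(keys) - 1)
            lo, hi = keys[max(k - 1, 0)], keys[k]
            if hi == lo:
                return float(anchors[lo])
            return anchors[lo] + (i - lo) / (hi - lo) * (anchors[hi] - anchors[lo])

        # Step 1: envelope of candidate sentence pairs around the anchor path.
        envelope = [(i, j) for i in range(n)
                    for j in range(max(0, int(expected(i)) - window),
                                   min(m, int(expected(i)) + window + 1))]
        # Step 2: word pairs that co-occur often inside the envelope.
        cooc = Counter((ws, wt) for i, j in envelope
                       for ws in set(src_sents[i]) for wt in set(tgt_sents[j]))
        # Step 3: sentence pairs linked by many frequent word pairs become anchors.
        for i, j in envelope:
            links = sum(cooc[(ws, wt)] >= min_links
                        for ws in set(src_sents[i]) for wt in set(tgt_sents[j]))
            if links >= min_links:
                anchors[i] = j
    return dict(sorted(anchors.items()))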
However, a weakness of this method is that it is not efficient enough to apply to large corpora
2.5.2 Chen, 1993

Dynamic programming with thresholding is the search strategy of this algorithm. The search is linear in the length of the corpus because of the use of the threshold; as a result, the corpus need not be subdivided into smaller chunks. The method also deals well with deletions: the search strategy remains robust despite large deletions, because the beginning and end of a deletion can be identified confidently thanks to the lexical information. With an intelligent threshold, great benefits are obtained, since most alignments are one-to-one. The computation of this algorithm is reduced to a linear one because only a subset of all possible alignments is considered, as illustrated below.
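As an illustration of the pruning idea only (my sketch, not Chen's implementation), a thresholded search can discard partial alignments whose cost falls too far behind the best partial alignment found so far:

def prune(partial_hypotheses: dict, threshold: float = 5.0) -> dict:
    """Keep only partial alignments whose cost is within `threshold` of the best.

    `partial_hypotheses` maps a state (i, j), the numbers of source and target
    sentences consumed so far, to the cost of the best path reaching that state.
    """
    best = min(partial_hypotheses.values())
    return {state: cost for state, cost in partial_hypotheses.items()
            if cost - best <= threshold}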
This method gives better accuracy than the length-based ones, but is "tens of times slower than the Brown [3] and Gale [8] algorithms" [5]. Furthermore, it is language independent and can handle large deletions in the text.

The algorithm requires that, for each language pair, about 100 sentence pairs be aligned by hand to bootstrap the translation model, so it depends on a minimum of human intervention. The method has a great computational cost because of its use of lexical information. However, alignment is a one-time cost, the aligned corpus may be very useful after it is produced, and computing power is increasingly available. For these reasons, the computational cost may sometimes be acceptable.
2.5.3 Melamed, 1996
This method is based on word correspondences [14]. A bitext map of words is used to mark the points of correspondence between these words in a two-dimensional graph. After marking all possible points, the true correspondence points in the graph are found by some rules, sentence location information, and sentence boundaries.

Melamed uses the term bitext, which comprises two versions of a text, such as a text in two different languages. A bitext is created each time translators translate a text. Figure 2.3 illustrates the rectangular bitext space that each bitext defines. The lengths of the two component texts (in characters) are represented by the width and the height of the rectangle.
Figure 2.3. A bitext space in Melamed's method (Melamed, 1996); the x-axis is the character position in text 1.
This algorithm has slightly better accuracy than Gale and Church's. It can give almost perfect alignment results if a good bitext map can be formed, and this is the strength of the method. For popular languages, where a good bitext map can be acquired, it may be the best choice.

However, this method requires a good bitext map in order to achieve satisfactory accuracy.
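As a toy illustration only (not Melamed's SIMR algorithm), candidate points in the bitext space can be generated from tokens that look like cognates, recording the character position of each matching token in the two texts:

def candidate_points(text1: str, text2: str, min_len: int = 4, prefix: int = 4):
    """Return (x, y) character positions of token pairs sharing a long prefix.

    A crude cognate heuristic: tokens of at least `min_len` characters whose
    first `prefix` characters match are treated as likely correspondences.
    """
    def token_positions(text):
        pos, out = 0, []
        for tok in text.split():
            idx = text.index(tok, pos)
            out.append((tok.lower(), idx))
            pos = idx + len(tok)
        return out

    p1, p2 = token_positions(text1), token_positions(text2)
    return [(x, y) for tok1, x in p1 for tok2, y in p2
            if len(tok1) >= min_len and len(tok2) >= min_len
            and tok1[:prefix] == tok2[:prefix]]

A usable bitext map would then be obtained by filtering these noisy points using rules and sentence-boundary information, as described above.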
2.5.4 Champollion: Ma, 2006
This algorithm, which is designed for robust alignment of potentially noisy parallel text, is a lexicon-based sentence aligner [13]. It was first developed for aligning Chinese-English parallel text before being ported to other language pairs such as Hindi-English or Arabic-English. In this method, a match is considered possible only if lexical matches are present. As a stronger indication that two segments are a match, higher weights are assigned to less frequent words. In order to weed out bogus matches, the algorithm also uses sentence-length information.

This method focuses on dealing with noisy data. It overcomes the limitation of existing methods, which work very well on clean data but decline quickly in performance when the data becomes noisy.
There are also some characteristics that distinguish this method from other sentence aligners:

• Noisy data are resources that contain a larger percentage of alignments which are not 1-to-1; they have a significant number of deletions and insertions. This method assumes such input data. Because sentence-length information is unreliable for noisy data, it plays only a minor role here, and the method is based mainly on lexical evidence.
• Translated words are treated equally in most sentence alignment algorithms; in other words, when a method decides sentence correspondences, an equal weight is assigned to translated word pairs. This method assigns different weights to translated words, which makes it different from other lexicon-based algorithms.
• There are two steps in applying translation lexicons in sentence aligners: entries from a translation lexicon are used to identify translated words, and then sentence correspondences are identified by using statistics over the translated words.
In this method, assigning greater weights to less frequent translated words helps to increase the robustness of the alignment, especially when dealing with noisy data. It achieves high precision and recall rates, and it is easy to use for new language pairs. However, it requires an externally supplied bilingual lexicon.
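To illustrate the weighting idea only (my own simplification, not Champollion's actual formula), a similarity score between two segments could weight each matched lexicon pair by the rarity of the target word involved:

import math

def weighted_match_score(src_segment, tgt_segment, lexicon, tgt_word_counts, total_tgt_words):
    """Sum of rarity weights over lexicon pairs linking two tokenized segments.

    lexicon: set of (source_word, target_word) translation pairs.
    tgt_word_counts: dict mapping target words to corpus frequencies.
    """
    score = 0.0
    tgt_words = set(tgt_segment)
    for s in set(src_segment):
        for t in tgt_words:
            if (s, t) in lexicon:
                # Rare target words contribute more, an idf-style weight.
                score += math.log(total_tgt_words / (1 + tgt_word_counts.get(t, 0)))
    return score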
2.6 Hybrid Proposals
2.6.1 Microsoft's Bilingual Sentence Aligner: Moore, 2002
This algorithm combines length-based and word-based approaches to achieve high accuracy at a modest computational cost [15]. A problem with using lexical information is that it limits an algorithm to a particular pair of languages. Moore resolves this problem by using a method similar to the IBM translation models, trained on the texts at hand. The method has two passes. Sentence-length-based statistics are first used to extract training data for the IBM Model 1 translation tables; the sentence-length model is then combined with the acquired lexical statistics in order to extract 1-to-1 correspondences with high accuracy. A forward-backward computation is used as the search heuristic, in which the forward pass is a pruned dynamic programming procedure.
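The heart of the second pass is a translation table trained on the sentence pairs that the length-based first pass aligned with high confidence. A minimal EM training loop for such an IBM Model 1 table might look like the following Python sketch; this is my illustration, not code from Moore's aligner:

from collections import defaultdict

def train_ibm_model1(sentence_pairs, iterations=5):
    """Minimal IBM Model 1 EM training.

    sentence_pairs: list of (source_tokens, target_tokens) pairs drawn from the
    high-confidence output of the length-based first pass.
    Returns t[(f, e)] = P(f | e), with "NULL" standing for the empty source word.
    """
    t = defaultdict(lambda: 1e-4)                 # near-uniform initialization
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for src, tgt in sentence_pairs:
            src = ["NULL"] + list(src)
            for f in tgt:
                z = sum(t[(f, e)] for e in src)   # normalization for word f
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

In the full aligner, the forward-backward search of the second pass then scores each candidate 1-to-1 bead by combining the length-based probability with the Model 1 probability of the sentence pair.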
This is a highly accurate and language-independent algorithm, so it is a very promising method. It consistently achieves high precision. Furthermore, it is fast in comparison with methods that use solely lexical information. Requiring no knowledge of the languages or the corpus is another advantage of this method. While lexical methods are generally more accurate than sentence-length-based ones, most of them require additional linguistic resources or knowledge. Moore has tried to overcome this issue by using IBM Model 1 trained on the texts being aligned themselves.