VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
MASTER THESIS IN INFORMATION TECHNOLOGY
SUPERVISOR: Dr. Phuong-Thai Nguyen
Hanoi - 2014
ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no material previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."
Signed
Acknowledgements

I would like to thank my advisor, Dr. Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge he has given me during my Master's course. I would also like to express my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation process. I would like to thank Dr. Van-Vinh Nguyen for examining my work and giving advice on it, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh Nguyen for supporting and checking some issues in my research.

In addition, I would like to express my thanks to the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who taught and helped me throughout my time at UET.

Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.
Abstract

Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. In order to turn these abundant materials into useful applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in texts of the source language to their corresponding units in texts of the target language. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including Statistical Machine Translation, word disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.
There have been a number of algorithms proposed with different approaches for sentence alignment. However, they may be classified into a few major categories. First of all, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs that have a high similarity in sentence lengths. The second set of methods is based on word correspondences or lexicons. These methods take into account lexical information about the texts, matching content across the texts or using cognates. An external dictionary may be used in these methods, so they are more accurate but slower than the first ones. There are also methods based on hybrids of these first two approaches that combine their advantages, so they obtain quite high-quality alignments.

In this thesis, I summarize general issues related to sentence alignment, evaluate approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. From analyzing the limits of this method, I propose an algorithm using a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge, the bilingual word clustering algorithm, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, together with evaluations that demonstrate the benefits of the proposed method.
Keywords: sentence alignment, parallel corpora, natural language processing, word clustering
Table of Contents
ORIGINALITY STATEMENT 3
Acknowledgements 4
Abstract 5
Table of Contents 6
List of Figures 9
List of Tables 10
CHAPTER ONE Introduction 11
1.1 Background 11
1.2 Parallel Corpora 12
1.2.1 Definitions 12
1.2.2 Applications 12
1.2.3 Aligned Parallel Corpora 12
1.3 Sentence Alignment 12
1.3.1 Definition 12
1.3.2 Types of Alignments 12
1.3.3 Applications 15
1.3.4 Challenges 15
1.3.5 Algorithms 16
1.4 Thesis Contents 16
1.4.1 Objectives of the Thesis 16
1.4.2 Contributions 17
1.4.3 Outline 17
1.5 Summary 18
CHAPTER TWO Related Works 19
2.1 Overview 19
2.2 Overview of Approaches 19
2.2.1 Classification 19
2.2.2 Length-based Methods 19
2.2.3 Word Correspondences Methods 21
2.2.4 Hybrid Methods 21
2.3 Some Important Problems 22
2.3.1 Noise of Texts 22
2.3.2 Linguistic Distances 22
2.3.3 Searching 23
2.3.4 Resources 23
2.4 Length-based Proposals 23
2.4.1 Brown et al., 1991 23
2.4.2 Vanilla: Gale and Church, 1993 24
2.4.3 Wu, 1994 27
2.5 Word-based Proposals 27
2.5.1 Kay and Roscheisen, 1993 27
2.5.2 Chen, 1993 27
2.5.3 Melamed, 1996 28
2.5.4 Champollion: Ma, 2006 29
2.6 Hybrid Proposals 30
2.6.1 Microsoft’s Bilingual Sentence Aligner: Moore, 2002 30
2.6.2 Hunalign: Varga et al., 2005 31
2.6.3 Deng et al., 2007 32
2.6.4 Gargantua: Braune and Fraser, 2010 33
2.6.5 Fast-Champollion: Li et al., 2010 34
2.7 Other Proposals 35
2.7.1 Bleu-align: Sennrich and Volk, 2010 35
2.7.2 MSVM and HMM: Fattah, 2012 36
2.8 Summary 37
CHAPTER THREE Our Approach 39
3.1 Overview 39
3.2 Moore's Approach 39
3.2.1 Description 39
3.2.2 The Algorithm 40
3.3 Evaluation of Moore's Approach 42
3.4 Our Approach 42
3.4.1 Framework 42
3.4.2 Word Clustering 43
3.4.3 Proposed Algorithm 45
3.4.4 An Example 49
3.5 Summary 50
CHAPTER FOUR Experiments 51
4.1 Overview 51
4.2 Data 51
4.2.1 Bilingual Corpora 51
4.2.2 Word Clustering Data 53
4.3 Metrics 54
4.4 Discussion of Results 54
4.5 Summary 57
CHAPTER FIVE Conclusion and Future Work 58
5.1 Overview 58
5.2 Summary 58
5.3 Contributions 58
5.4 Future Work 59
5.4.1 Better Word Translation Models 59
5.4.2 Word-Phrase 59
Bibliography 60
List of Figures
Figure 1.1 A sequence of beads (Brown et al., 1991) 13
Figure 2.1 Paragraph length (Gale and Church, 1993) 25
Figure 2.2 Equation in dynamic programming (Gale and Church, 1993) 26
Figure 2.3 A bitext space in Melamed's method (Melamed, 1996) 29
Figure 2.4 The method of Varga et al., 2005 31
Figure 2.5 The method of Braune and Fraser, 2010 33
Figure 2.6 Sentence Alignment Approaches Review 38
Figure 3.1 Framework of sentence alignment in our algorithm 43
Figure 3.2 An example of Brown's cluster algorithm 44
Figure 3.3 English word clustering data 44
Figure 3.4 Vietnamese word clustering data 44
Figure 3.5 Bilingual dictionary 46
Figure 3.6 Looking up the probability of a word pair 47
Figure 3.7 Looking up in a word cluster 48
Figure 3.8 Handling in the case: one word is contained in dictionary 48
Figure 4.1 Comparison in Precision 55
Figure 4.2 Comparison in Recall 56
Figure 4.3 Comparison in F-measure 57
List of Tables
Table 1.1 Frequency of alignments (Gale and Church, 1993) 14
Table 1.2 Frequency of beads (Ma, 2006) 14
Table 1.3 Frequency of beads (Moore, 2002) 14
Table 1.4 An entry in a probabilistic dictionary (Gale and Church, 1993) 15
Table 2.1 Alignment pairs (Sennrich and Volk, 2010) 36
Table 4.1 Training data-1 51
Table 4.2 Topics in Training data-1 52
Table 4.3 Training data-2 52
Table 4.4 Topics in Training data-2 52
Table 4.5 Input data for training clusters 53
Table 4.6 Topics for Vietnamese input data to train clusters 53
Table 4.7 Word clustering data sets 54
CHAPTER ONE
Introduction

1.1 Background
Parallel texts, however, are useful only when they are sentence-aligned. A parallel corpus is first collected from various resources, and the translated segments forming it are usually very large, often on the order of entire documents, which makes learning word correspondences ambiguous. The solution is to reduce the size of the segments within each pair, a task known as sentence alignment [7, 12-13, 16].
Sentence alignment is a process that maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 12, 14, 20]. This task is the work of constructing a detailed map of the correspondence between a text and its translation (a bitext map) [14]. It is the first stage of Statistical Machine Translation. With aligned sentences, we can perform further analyses such as phrase and word alignment, bilingual terminology extraction, and collocation extraction, as well as other applications [3, 7-9, 17]. Efficient and powerful sentence alignment algorithms therefore become increasingly important.
A number of sentence alignment algorithms have been proposed [1, 7, 9, 12, 15, 17, 20]. Some of these algorithms are based on sentence length [3, 8, 20]; some use word correspondences [5, 11, 13-14]; some are hybrids of these two approaches [2, 6, 15, 19]. Additionally, there are also some other outstanding methods for this task [7, 17]. For details of these sentence alignment algorithms, see Sections 2.4, 2.5, 2.6, and 2.7.
I propose an improvement to an effective hybrid algorithm [15] used in sentence alignment. For details of our approach, see Section 3.4. I also conduct experiments to illustrate my research. For details of the corpora used in our experiments, see Section 4.2. For results and discussions of the experiments, see Sections 4.4 and 4.5.
In the rest of this chapter, I describe some issues related to the sentence alignment task. In addition, I introduce the objectives of the thesis and our contributions. Finally, I describe the structure of this thesis.
1.2 Parallel Corpora
1.2.1 Definitions
Parallel corpora are collections of documents which are translations of each other [16]. Aligned parallel corpora are collections of pairs of sentences where one sentence is a translation of the other [1].
1.2.2 Applications
Bilingual corpora are an essential resource in multilingual natural language processing systems. This resource helps to develop data-driven natural language processing approaches. It also contributes to applying machine learning to machine translation [15-16].
1.2.3 Aligned Parallel Corpora
Once a parallel text is sentence-aligned, it provides the maximum utility [13]. This makes the task of aligning parallel corpora of considerable interest, and a number of approaches have been proposed and developed to resolve this issue.
1.3 Sentence Alignment
1.3.1 Definition
Sentence alignment is the task of extracting pairs of sentences that are translations of one another from parallel corpora. Given a pair of texts, this process maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 13].
1.3.2 Types of Alignments

Figure 1.1 A sequence of beads (Brown et al., 1991).
Groups of sentence lengths are circled to show the correct alignment. Each of the groupings is called a bead, and a number shows the length of each sentence in the bead. In Figure 1.1, "17e" means the sentence length (17 words) of an English sentence, and "19f" means the sentence length (19 words) of a French sentence. There is a sequence of beads as follows:

A 1-to-1 bead (one English sentence aligned with one French sentence) followed by
A 1-to-2 bead (one English sentence aligned with two French sentences) followed by
A 1-to-0 bead (one English sentence aligned with nothing) followed by
A ¶-¶ bead (one English paragraph marker and one French paragraph marker).

An alignment, then, is simply a sequence of beads that accounts for the observed sequences of sentence lengths and paragraph markers [3].
There are quite a number of bead types, but it is possible to consider only some of them, including 1-to-1 (one sentence of the source language aligned with one sentence of the target language), 1-to-2 (one sentence of the source language aligned with two sentences of the target language), etc. Brown et al., 1991 [3] considered the beads 1-to-1, 1-to-0, 0-to-1, 1-to-2, 2-to-1, and a bead of paragraph markers, because this method also aligns by paragraphs. Moore, 2002 [15] considers only five of these beads, 1-to-1, 1-to-0, 0-to-1, 1-to-2, 2-to-1, each of which is named as follows:
1-to-1 bead (a match)
1-to-0 bead (a deletion)
0-to-1 bead (an insertion)
1-to-2 bead (an expansion)
2-to-1 bead (a contraction)
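As an illustration of this terminology, the bead types above can be represented as (source count, target count) pairs. The following Python sketch is my own illustration, not code from any of the cited aligners: it checks that a bead sequence accounts for every sentence in both texts and returns the aligned sentence-index groups.

```python
# The five bead types considered by Moore (2002), as
# (source sentence count, target sentence count) pairs.
BEAD_NAMES = {
    (1, 1): "match",
    (1, 0): "deletion",
    (0, 1): "insertion",
    (1, 2): "expansion",
    (2, 1): "contraction",
}

def bead_groups(beads, n_src, n_tgt):
    """Walk a bead sequence left to right and return, for each bead,
    the source and target sentence indices it covers. Raises if the
    sequence does not account for every sentence in both texts."""
    i = j = 0
    groups = []
    for s, t in beads:
        if (s, t) not in BEAD_NAMES:
            raise ValueError(f"unsupported bead type {(s, t)}")
        groups.append((list(range(i, i + s)), list(range(j, j + t))))
        i, j = i + s, j + t
    if (i, j) != (n_src, n_tgt):
        raise ValueError("bead sequence does not cover both texts")
    return groups
```

For instance, the sequence match, expansion, deletion over three source and three target sentences yields the groups ([0], [0]), ([1], [1, 2]), ([2], []).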
The common information related to this is the frequency of beads. Table 1.1 shows the frequencies of the types of beads proposed by Gale and Church, 1993 [8].
Table 1.1 Frequency of alignments (Gale and Church, 1993)
Categories: 1-1; 1-0 or 0-1; 2-1 or 1-2; 2-2
Meanwhile, the frequencies reported by Ma, 2006 [13] are illustrated in Table 1.2:
Table 1.2 Frequency of beads (Ma, 2006)
Categories: 1-1; 1-0 or 0-1; 1-2 or 2-1; Others; Total

Table 1.3 also describes these frequencies of bead types in Moore, 2002 [15]:
Table 1.3 Frequency of beads (Moore, 2002)
Categories: 1-1; 1-2; 2-1; 1-0; 0-1; Total
Generally, the frequency of the 1-to-1 bead is the largest of all bead types in almost all corpora, at around 90%, whereas the other types each account for only a few percent.
1.3.3 Applications
Sentence alignment is an important topic in Machine Translation. It is an important first step for Statistical Machine Translation. It is also the first stage in extracting structural and semantic information and deriving statistical parameters from bilingual corpora [17, 20]. Moreover, it is the first step in constructing a probabilistic dictionary (Table 1.4) for use in aligning words in machine translation, or in constructing a bilingual concordance for use in lexicography.
Table 1.4 An entry in a probabilistic dictionary (Gale and Church, 1993)
1.3.4 Challenges
Although this process might seem very easy, it has some important challenges which make the task difficult [9]:
The sentence alignment task is non-trivial because sentences do not always align 1-to-1. At times a single sentence in one language might be translated as two or more sentences in the other language. The input text also affects accuracy: the performance of sentence alignment algorithms decreases significantly when the input data becomes very noisy. Noisy data means that there are more 1-0 and 0-1 alignments in the data. For example, there are 89% 1-1 alignments in the English-French corpus (Gale and Church, 1991), and 1-0 and 0-1 alignments make up only 1.3% of this corpus, whereas in the UN Chinese-English corpus (Ma, 2006) there are 89% 1-1 alignments but 6.4% 1-0 or 0-1 alignments. Although some methods work very well on clean data, their performance degrades quickly as data becomes noisy [13].
In addition, it is difficult to achieve perfectly accurate alignments even if the texts are easy and "clean". For instance, the success of an alignment program may decline dramatically when it is applied to a novel or a philosophical text, even though the same program gives excellent results on a scientific text.
The alignment performance also depends on the languages of the corpus. For example, an algorithm based on cognates (words in language pairs that resemble each other phonetically) is likely to work better for English-French than for English-Hindi because there are fewer cognates for English-Hindi [1].
1.3.5 Algorithms
A sentence alignment program is called "ideal" if it is fast, highly accurate, and requires no special knowledge about the corpus or the two languages [2, 9, 15]. A common requirement for sentence alignment approaches is achieving both high accuracy and minimal consumption of computational resources [2, 9]. Furthermore, a method for sentence alignment should also work in an unsupervised fashion and be language-pair independent in order to be applicable to parallel corpora in any language without requiring a separate training set. A method is unsupervised if it learns an alignment model directly from the data set to be aligned. Meanwhile, language-pair independence means that the approach requires no specific knowledge about the languages of the parallel texts to be aligned.
1.4 Thesis Contents
This section introduces the organization of this thesis, including its objectives, our contributions, and the outline.
1.4.1 Objectives of the Thesis
In this thesis, I report the results of my study of sentence alignment and of the approaches proposed for this task. In particular, I focus on Moore's method (2002), an outstanding method with a number of advantages. I also investigate a new feature, word clustering, which may be applied to this task to improve alignment accuracy. I examine this proposal in experiments and compare the results to those of the baseline method to demonstrate the advantages of my approach.
Trang 171.4.2 Contributions
My main contributions are as follows:
Evaluating methods in sentence alignment and introducing an algorithm that improves Moore's method.

Using a new feature, word clustering, which helps to improve alignment accuracy. This contributes complementary strategies to the sentence alignment problem.
1.4.3 Outline
The rest of the thesis is organized as follows:
Chapter 2 – Related Works
In this chapter, I introduce some recent research on sentence alignment. In order to give a general view of the methods proposed to deal with this problem, an overall presentation of sentence alignment methods is provided. The methods are classified into several types, and each method is presented by describing its algorithm along with related evaluations.
Chapter 3 – Our Approach
This chapter describes the method we propose to improve Moore's sentence alignment method. First, an analysis of Moore's method and evaluations of it are presented. The major content of this chapter is the framework of the proposed method, an algorithm using bilingual word clustering. An example is described in this chapter to illustrate the approach clearly.
Chapter 4 – Experiments
This chapter presents the experiments performed with our approach. The data corpora used in the experiments are presented in full. The results of the experiments, as well as discussions of them, are clearly described to evaluate our approach against the baseline method.
Chapter 5 – Conclusions and Future Works
In this last chapter, the advantages and limitations of my work are summarized in a general conclusion. Besides, some research directions to improve the current model in the future are mentioned.

Finally, references are given listing the published research that my work refers to.
1.5 Summary

This chapter has introduced my research work. I have given background information about parallel corpora and sentence alignment, definitions of the relevant issues, and some initial problems related to sentence alignment algorithms. The alignment terms used in this task have been defined in this chapter. In addition, an outline of my research work in this thesis has been provided, and the proposed future work has been briefly discussed.
CHAPTER TWO
Related Works

2.1 Overview

Section 2.2 provides an overview of sentence alignment approaches, and Section 2.3 discusses some important problems in this task. Section 2.4 introduces and evaluates the primary length-based proposals. Section 2.5 introduces and evaluates proposals of word-correspondence-based approaches. Proposals in hybrid methods, as well as evaluations of each of them, are presented in Section 2.6. There are also some other outstanding approaches to this task, which are introduced in Section 2.7. Section 2.8 concludes this chapter.
There are also some other techniques, such as methods based on the BLEU score, support vector machines, and hidden Markov model classifiers.
2.2.2 Length-based Methods
Length-based approaches model the relationship between the lengths of sentences that are mutual translations. Length is measured in characters or words of a sentence. In these approaches, the semantics of the text are not considered; statistical methods are used instead of the content of the texts. In other words, these methods consider only the lengths of sentences in order to make alignment decisions.

These methods are based on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences (in characters) and the variance of this difference. There are two random variables, l1 and l2, which are the lengths of the two sentences under consideration. It is assumed that these random variables are independent and identically distributed with a normal distribution [8]. Given two parallel texts S (source text) and T (target text), the goal of this task is to find the alignment A with the highest probability.
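To make the length-based score concrete, the following Python sketch computes the standardized length difference and turns it into an alignment cost. This is my own illustration: the constants c = 1 and s2 = 6.8 are the slope and variance values Gale and Church report for their data, and the function names are assumptions, not code from the cited systems.

```python
import math

def delta(len_s, len_t, c=1.0, s2=6.8):
    """Standardized difference between the target length and its
    expectation given the source length (both in characters)."""
    return (len_t - len_s * c) / math.sqrt(len_s * s2)

def length_cost(len_s, len_t):
    """-log of the two-tailed probability of a |delta| at least this
    large under a standard normal distribution: low cost for sentence
    pairs whose lengths fit the model, high cost otherwise."""
    d = abs(delta(len_s, len_t))
    # two-tailed tail probability computed via the error function
    two_tail = 2 * (1 - 0.5 * (1 + math.erf(d / math.sqrt(2))))
    return -math.log(max(two_tail, 1e-12))
```

A 20-character sentence paired with a 21-character one costs far less than pairing it with a 60-character one, which is exactly the signal the dynamic programming search exploits.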
The methods of this type are based solely on the lengths of sentences, so they require almost no prior knowledge. Furthermore, these methods are highly accurate despite their simplicity, and they run at high speed. When aligning texts whose languages are similar or have a high length correlation, such as English, French, and German, these approaches are especially useful and work remarkably well. They also perform fairly well if the input text is clean, as in the Canadian Hansards corpus [3]. The Gale and Church algorithm is still widely used today, for instance to align Europarl (Koehn, 2005).
Nevertheless, these methods are not robust, since they use only sentence length information. They are no longer reliable if there is too much noise in the input bilingual texts. As shown in Chen, 1993 [5], the accuracy of sentence-length-based methods decreases drastically when aligning texts containing small deletions or free translations; they can easily misalign small passages because they ignore word identities. The algorithm of Brown et al. requires corpus-dependent anchor points, while the method proposed by Gale and Church depends on a prior alignment of paragraphs to constrain the search. When aligning texts where the length correlation breaks down, such as the Chinese-English language pair, the performance of length-based algorithms declines quickly.
2.2.3 Word Correspondences Methods
The second approach, which tries to overcome the disadvantages of length-based approaches, is the word-based method, which relies on lexical information from translation lexicons and/or on the recognition of cognates. These methods take into account lexical information about the texts. Most algorithms match content words in one text with their correspondences in the other text and use these matches as anchor points in the sentence alignment task. Words which are translations of each other tend to have similar distributions in the source language and target language texts. Meanwhile, some methods use cognates (words in language pairs that resemble each other phonetically) rather than the content of word pairs to determine beads of sentences.
This type of sentence alignment method is illustrated by some outstanding approaches such as Kay and Röscheisen, 1993 [11], Chen, 1993 [5], Melamed, 1996 [14], and Ma, 2006 [13]. Kay's work has not proved efficient enough to be suitable for large corpora, while Chen constructs a word-to-word translation model during alignment to assess the probability of an alignment. Word correspondence was further developed in IBM Model-1 (Brown et al., 1993) for statistical machine translation. Meanwhile, word correspondence of another kind (geometric correspondence) was proposed for sentence alignment by Melamed, 1996.
These algorithms have higher accuracy than length-based methods. Because they use lexical information from source and translation lexicons rather than only sentence length to determine the translation relationship between sentences in the source text and the target text, these algorithms are usually more robust than the length-based ones.
Nevertheless, algorithms based on a lexicon are slower than those based on sentence length because they require considerably more expensive computation. In addition, they usually depend on cognates or a bilingual lexicon. The method of Chen requires an initial bilingual lexicon; the proposal of Melamed, meanwhile, depends on finding cognates in the two languages to suggest word correspondences.
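As an illustration of the cognate idea, a commonly used heuristic treats two words as cognate candidates when both are reasonably long and share a common prefix. The rule below is a simplification of mine, sketched in Python; real aligners refine it with edit distance or character n-gram overlap.

```python
def cognate_candidates(w_src, w_tgt, prefix_len=4):
    """Heuristic cognate test: both words have at least `prefix_len`
    characters and share that prefix. Deliberately crude: it is meant
    to find anchor points, not to be a linguistic definition."""
    a, b = w_src.lower(), w_tgt.lower()
    return (len(a) >= prefix_len and len(b) >= prefix_len
            and a[:prefix_len] == b[:prefix_len])
```

English "parliament" and French "parlement" pass the test, while unrelated short function words such as "the" and "le" do not, so the matched pairs can serve as anchor points for alignment.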
2.2.4 Hybrid Methods
Sentence length and lexical information are also combined so that the different approaches can complement each other and yield more efficient algorithms.
These approaches are proposed in Moore, 2002; Varga et al., 2005; and Braune and Fraser, 2010. These approaches have two passes, in which a length-based method is used for a first alignment that subsequently serves as training data for a translation model, which is then used in a more complex similarity score. Moore, 2002 proposes a two-phase method that combines sentence length (word count) in the first pass with word correspondences (IBM Model-1) in the second. Varga et al. (2005) also use the hybrid technique, combining sentence length with word correspondences (using a dictionary-based translation model in which the dictionary can be manually expanded). Braune and Fraser, 2010 propose an algorithm similar to Moore's, except that their approach has a technique for building 1-to-many and many-to-1 alignments rather than focusing only on 1-to-1 alignments as Moore's method does.
The hybrid approaches achieve relatively high performance and overcome the limits of the first two kinds of methods while combining their advantages. The approach of Moore, 2002 obtains high precision (the fraction of retrieved sentence pairs that are in fact correct) and computational efficiency. Meanwhile, the algorithm proposed by Varga et al., 2005, which follows the same idea as Moore, 2002, attains a very high recall rate (the fraction of correct sentence pairs that are retrieved by the algorithm).
Nonetheless, there are still weaknesses to be addressed in order to obtain a more efficient sentence alignment algorithm. In Moore's method, the recall rate is rather low, which is especially problematic when aligning parallel corpora with much noise or sparse data. The approach of Varga et al., 2005, meanwhile, achieves a very high recall value but still has a rather low precision rate.
2.3 Some Important Problems
2.3.2 Linguistic Distances
Another parameter which can also affect the performance of sentence alignment algorithms is the linguistic distance between the source language and the target language. Linguistic distance means the extent to which languages differ from each other. For example, English is linguistically "closer" to Western European languages (such as French and German) than it is to East Asian languages (such as Korean and Japanese). There are some measures for assessing linguistic distance, such as the number of cognate words or syntactic features. It is important to recognize that some algorithms may not perform well if they rely on closeness between languages while the languages in question are distant. An obvious example is that a method based on cognates is likely to work better for English-French or English-German than for English-Hindi because there are fewer cognates in English-Hindi; Hindi belongs to the Indo-Aryan branch, whereas English and German belong to the Germanic one.
2.3.3 Searching
Dynamic programming is the technique that most sentence alignment tools use to search for the best path of sentence pairs through a parallel text. This also means that the texts are ordered monotonically, and none of these algorithms is able to extract sentence pairs in crossing positions. Nevertheless, most of these programs benefit from using this technique in searching, and none of them reports weaknesses with it, because a characteristic of translations is that almost all sentences appear in the same order in both source and target texts.
In this respect, algorithms may be confronted with problems of search space size. Thus, pruning strategies to restrict the search space are also an issue that algorithms have to resolve.
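A simple pruning strategy of the sort alluded to here keeps only the dynamic programming cells that lie near the diagonal of the bitext, on the assumption that a sentence's translation appears at roughly the same relative position. The rule below is my own illustrative sketch, not one taken from a specific aligner.

```python
def near_diagonal(i, j, n_src, n_tgt, width=0.1):
    """Keep cell (i, j) only if the relative positions i/n_src and
    j/n_tgt differ by at most `width`; everything else is pruned."""
    return abs(i / n_src - j / n_tgt) <= width

def candidate_cells(n_src, n_tgt, width=0.1):
    """Enumerate the unpruned cells of the (n_src + 1) x (n_tgt + 1)
    dynamic programming table."""
    return [(i, j)
            for i in range(n_src + 1)
            for j in range(n_tgt + 1)
            if near_diagonal(i, j, n_src, n_tgt, width)]
```

With width = 0.1, a 1000 x 1000 table shrinks to roughly a tenth of its cells, while monotonic, roughly diagonal alignments remain reachable.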
2.3.4 Resources
All systems learn their respective models from the parallel text itself. Only some algorithms support the use of external resources, such as Hunalign (Varga et al., 2005) with bilingual dictionaries and Bleualign (Sennrich and Volk, 2010) with existing MT systems.
2.4 Length-based Proposals

2.4.1 Brown et al., 1991

To search for the best alignment, Brown et al. use dynamic programming. This technique requires time quadratic in the length of the texts aligned, so it is not practical to align a large corpus as a single unit. The computation may be reduced dramatically if the bilingual corpus is subdivided into smaller chunks. In this algorithm, the subdivision is performed using anchors. An anchor is a piece of text likely to be present at the same location in both halves of a bilingual corpus. Dynamic programming is first used to align the anchors, and then the technique is applied again to align the text between anchors.
The alignment computation of this algorithm is fast, since it makes no use of the lexical details of the sentences. Therefore, it is practical to apply this method to very large collections of text, especially for language pairs with high length correlation.
2.4.2 Vanilla: Gale and Church, 1993
This algorithm performs sentence alignment based on a statistical model of sentence lengths measured in characters. It uses the fact that longer sentences in one language tend to be translated into longer sentences in another language.
This algorithm is similar to the proposal of Brown et al., except that Brown et al. measure sentence length in words whereas this algorithm measures it in characters. In addition, the algorithm of Brown et al. aligns only a subset of the corpus for further research instead of focusing on entire articles. The work of Gale and Church (1991) supports this promise of wider applicability.
This sentence alignment program has two steps: first paragraphs are aligned, and then sentences within each paragraph are aligned. The authors report that paragraph lengths are highly correlated. Figure 2.1 illustrates this correlation for the language pair English-German.
Figure 2.1 Paragraph length (Gale and Church, 1993).
A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences and the variance of this difference. This score is used in a dynamic programming framework to find the maximum likelihood alignment of sentences. The use of dynamic programming allows the system to consider all possible alignments and find the minimum cost alignment efficiently.
A distance function is defined in a general way to allow for insertions, deletions, substitutions, etc. The function takes four arguments: x1, y1, x2, y2.

Let d(x1, y1; 0, 0) be the cost of substituting x1 with y1,
d(x1, 0; 0, 0) be the cost of deleting x1,
d(0, y1; 0, 0) be the cost of inserting y1,
d(x1, y1; x2, 0) be the cost of contracting x1 and x2 to y1,
d(x1, y1; 0, y2) be the cost of expanding x1 to y1 and y2, and
d(x1, y1; x2, y2) be the cost of merging x1 and x2 and matching them with y1 and y2.
The dynamic programming algorithm is summarized in the following recursion equation.

Let x_i, i = 1, ..., I, be the sentences of one language, and y_j, j = 1, ..., J, be the translations of those sentences in the other language. Let d be the distance function, and let D(i, j) be the minimum distance between sentences x_1, ..., x_i and their translations y_1, ..., y_j under the maximum likelihood alignment. D(i, j) is computed by minimizing over the six cases (substitution, deletion, insertion, contraction, expansion, and merger), which in effect constrain the possible alignments. D(i, j) is defined with the initial condition D(0, 0) = 0 by the recursion:

D(i, j) = min(
    D(i, j-1)   + d(0, y_j; 0, 0),
    D(i-1, j)   + d(x_i, 0; 0, 0),
    D(i-1, j-1) + d(x_i, y_j; 0, 0),
    D(i-1, j-2) + d(x_i, y_{j-1}; 0, y_j),
    D(i-2, j-1) + d(x_{i-1}, y_j; x_i, 0),
    D(i-2, j-2) + d(x_{i-1}, y_{j-1}; x_i, y_j) )

Figure 2.2 Equation in dynamic programming (Gale and Church, 1993)
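The recursion can be sketched in Python. This is a minimal illustration that uses the prior match-category probabilities reported by Gale and Church and takes the per-pair length cost as a parameter; it does not commit to their exact distance function.

```python
import math

# Prior probabilities of the alignment categories from Gale and
# Church (1993); a category's cost is its negative log prior plus
# the length-based cost of the grouped sentences.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def align(src_lens, tgt_lens, length_cost):
    """Minimum-cost alignment over the six categories. src_lens and
    tgt_lens are character counts per sentence; length_cost(l1, l2)
    is a caller-supplied cost function (an assumption here)."""
    I, J = len(src_lens), len(tgt_lens)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj:
                    continue
                l1 = sum(src_lens[i - di:i])
                l2 = sum(tgt_lens[j - dj:j])
                cost = D[i - di][j - dj] - math.log(prior) + length_cost(l1, l2)
                if cost < D[i][j]:
                    D[i][j], back[i][j] = cost, (di, dj)
    # Trace back the best path as a list of (di, dj) categories.
    path, i, j = [], I, J
    while i or j:
        di, dj = back[i][j]
        path.append((di, dj))
        i, j = i - di, j - dj
    return list(reversed(path))
```

With two roughly matching sentence lengths on each side, the minimizer prefers two 1-to-1 beads over a single 2-to-2 bead, because the 1-to-1 prior is much larger.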
This algorithm has the following main characteristics:

Firstly, it is a simple algorithm. The number of characters in the sentences is simply counted, and a dynamic programming model is used to find the correct alignment pairs. Because of this simplicity, many later researchers have integrated this method into their own.
Secondly, this algorithm can be used with any pair of languages because it does not use any lexical information.

Thirdly, it has a low time cost, one of the most important criteria for applying a method to a very large bilingual corpus.
Finally, this is also quite an accurate algorithm, especially when aligning data of language pairs with high length correlation like English-French or English-German.

As Gale and Church report, in comparison with a length-based method that counts words, like the proposal of Brown et al., it is better to use characters rather than words to measure sentence length. The performance of this approach is better since there is less variability in the differences of sentence lengths so measured; using words as units increases the error rate by half. This method performs well at least on related languages. The accuracy of this method also depends on the type of alignment: it gets the best results on 1-to-1 alignments, but it has a high error rate on more difficult alignments. This algorithm is still widely used today to align corpora like the Europarl corpus (Koehn, 2005) and the JRC-Acquis (Steinberger et al., 2006).
2.4.3 Wu, 1994
Wu applies Gale and Church's method to the language pair English-Chinese. In order to improve the accuracy, he utilizes lexical information from translation lexicons and/or through the identification of cognates. The lexical cues used in this method take the form of a small corpus-specific lexicon.
This method is important in two respects:
Firstly, this method has indicated that length-based methods give satisfactory results even between unrelated languages (languages from unrelated families such as English and Chinese), a surprising result.

Secondly, the lexical cues used in this method increase alignment accuracy. This demonstrates the benefit of adding lexical information to a length-based method.
2.5 Word-based Proposals
2.5.1 Kay and Roscheisen, 1993
This algorithm is based on word correspondences and uses an iterative relaxation approach. The iterations start from the assumption that the first and last sentences of the texts align; these are the initial anchors. The steps below are then repeated until most sentences are aligned:
Step 1: Form an envelope of possible alignments
Step 2: Choose word pairs that tend to co-occur in these potential partial alignments
Step 3: Find pairs of source and target sentences which contain many possible lexical correspondences
A set of partial alignments, which will be part of the final result, is induced by using the most reliable of these pairs.
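Step 2 of the loop can be sketched as a single relaxation pass. The data layout (whitespace-tokenized sentence strings) and the co-occurrence threshold are illustrative assumptions, not Kay and Roscheisen's exact formulation.

```python
from collections import Counter

def cooccurring_pairs(envelope, src_sents, tgt_sents, min_count=2):
    """One simplified relaxation step: count how often each
    (source word, target word) pair co-occurs inside the current
    envelope of candidate sentence alignments, and keep the frequent
    pairs as lexical evidence for the next pass."""
    counts = Counter()
    for i, j in envelope:                    # candidate sentence pair
        for w in set(src_sents[i].split()):
            for v in set(tgt_sents[j].split()):
                counts[(w, v)] += 1
    return {pair for pair, c in counts.items() if c >= min_count}
```

Word pairs that recur across several candidate alignments survive the threshold, and the sentences containing many such pairs become the new anchors.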
However, a weakness of this method is that it is not efficient enough to apply to large corpora.
2.5.2 Chen, 1993

Dynamic programming with thresholding is the search strategy of this algorithm. The search is linear in the length of the corpus because of the use of a threshold; as a result, the corpus need not be subdivided into smaller chunks. The algorithm also deals well with deletions: lexical information makes it possible to identify the beginning and end of deletions confidently, which keeps the search strategy robust even for large deletions. With an intelligent threshold, great benefits may be obtained because most alignments are one-to-one. The computation is reduced to a linear one by considering only a subset of all possible alignments.
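The thresholding idea can be sketched as a pruning step over one column of the dynamic programming table; this is an illustrative simplification, not the algorithm's exact search procedure.

```python
def prune(column, threshold):
    """Keep only the DP states in one column whose cost is within
    `threshold` of the best state in that column. Pruning this way
    keeps a small number of live states per column, so the overall
    search is linear in the length of the corpus."""
    best = min(column.values())
    return {state: cost for state, cost in column.items()
            if cost <= best + threshold}
```

A state whose cost already trails the best candidate by more than the threshold is unlikely to lie on the optimal path, so it is discarded before the next column is expanded.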
This method gives better accuracy than the length-based one, but is “tens of times slower than the Brown [3] and Gale [8] algorithms” [5]. Furthermore, it is language independent and can handle large deletions in the text.
This algorithm requires that, for each language pair, 100 sentence pairs be aligned by hand to bootstrap the translation model; it therefore depends on a minimum of human intervention. The method incurs a great computational cost because of its use of lexical information. However, alignment is a one-time cost, its result remains useful afterwards, and computing power is readily available; for these reasons, the computational cost may often be acceptable.
2.5.3 Melamed, 1996
This method is based on word correspondences [14]. A bitext map of words is used to mark the points of correspondence between words in a two-dimensional graph. After all possible points are marked, the true correspondence points are found using some rules, sentence location information, and sentence boundaries.

Melamed uses the term bitext for the two versions of a text, such as a text in two different languages. A bitext is created each time translators translate a text. Figure 2.3 illustrates the rectangular bitext space that each bitext defines: the lengths of the two component texts (in characters) are represented by the width and the height of the rectangle.
Figure 2.3 A bitext space in Melamed’s method (Melamed, 1996).
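Generating candidate correspondence points in the bitext space can be sketched as follows. The matching predicate (for example, identical tokens or cognates) is left as a parameter, and the single-space tokenization is an illustrative assumption.

```python
def candidate_points(src_text, tgt_text, match):
    """Simplified bitext map: x is a token's character position in
    the source text, y the character position of a matching token in
    the target text. `match` decides whether two tokens correspond
    (e.g. identical tokens or cognates)."""
    points = []
    x = 0
    for s_tok in src_text.split():
        y = 0
        for t_tok in tgt_text.split():
            if match(s_tok, t_tok):
                points.append((x, y))
            y += len(t_tok) + 1       # advance past token and space
        x += len(s_tok) + 1
    return points
```

True correspondence points cluster along the main diagonal of the bitext space, which is what the subsequent filtering rules exploit.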
This algorithm has slightly better accuracy than Gale and Church's. It can give almost perfect alignment results if a good bitext map can be formed; this is the power of the method. For popular languages, where a good bitext map is easier to acquire, it may be among the best methods. However, it requires a good bitext map to achieve satisfactory accuracy.
2.5.4 Champollion: Ma, 2006
This algorithm, which is designed for robust alignment of potentially noisy parallel text, is a lexicon-based sentence aligner [13]. It was first developed for aligning English-Chinese parallel text before being ported to other language pairs such as Hindi-English or Arabic-English. In this method, a match is considered possible only if lexical matches are present. Less frequent words are assigned higher weights, since they are a stronger indication that two segments are a match. In order to weed out bogus matches, the algorithm also uses sentence length information.

This method focuses on dealing with noisy data. It improves on existing methods, which work very well on clean data but decline quickly in performance when the data become noisy.
This method also differs from other sentence aligners in several respects:
Noisy data are resources that contain a larger percentage of alignments that are not 1-to-1, with a significant number of deletions and insertions. This method assumes such input data. Sentence length information is unreliable for noisy data, so it plays only a minor role; the method relies primarily on lexical evidence.
Most sentence alignment algorithms treat translated words equally; in other words, when deciding sentence correspondences, an equal weight is assigned to all translated word pairs. This method assigns individual weights to translated words, which distinguishes it from other lexicon-based algorithms.
Translation lexicons are applied in two steps: entries from a translation lexicon are used to identify translated words, and then sentence correspondences are identified using statistics over those translated words.
In this method, assigning greater weights to less frequent translated words helps to increase the robustness of the alignment, especially when dealing with noisy data. It attains high precision and recall rates and may be easy to adapt to new language pairs. However, it requires an externally supplied bilingual lexicon.
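The frequency-sensitive weighting can be sketched with an idf-like formula; this formula is an illustrative assumption standing in for Ma's exact weighting scheme.

```python
import math

def word_weight(word, freq, corpus_size):
    """Rarer translated words are stronger evidence that two segments
    match, so weight falls as corpus frequency rises (idf-style)."""
    return math.log(corpus_size / freq[word])

def segment_score(matched_words, freq, corpus_size):
    """Score a candidate segment pair by summing the weights of the
    translated word pairs found in it."""
    return sum(word_weight(w, freq, corpus_size) for w in matched_words)
```

A rare content word thus contributes far more to a segment pair's score than a ubiquitous function word, which is what makes the aligner robust on noisy data.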
2.6 Hybrid Proposals
2.6.1 Microsoft’s Bilingual Sentence Aligner: Moore, 2002
This algorithm combines length-based and word-based approaches to achieve high accuracy at a modest computational cost [15]. A problem with using lexical information is that it restricts an algorithm to a particular pair of languages. Moore resolves this problem by using a method similar to the IBM translation models to learn lexical statistics from the texts at hand. The method has two passes. Sentence-length statistics are first used to extract training data for the IBM Model 1 translation tables; the sentence-length model is then combined with the acquired lexical statistics to extract 1-to-1 correspondences with high accuracy. A forward-backward computation is used as the search heuristic, in which the forward pass is a pruned dynamic programming procedure.
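The lexical component of the second pass rests on IBM Model 1, which can be sketched with a minimal EM loop. This illustration omits the NULL word and other refinements, so it is a simplification of the model Moore uses, not his implementation.

```python
from collections import defaultdict

def train_model1(pairs, iters=5):
    """Minimal IBM Model 1 EM over (source, target) tokenized
    sentence pairs: estimate translation probabilities t(f | e) by
    repeatedly collecting expected co-occurrence counts and
    renormalizing."""
    t = defaultdict(lambda: 1.0)              # t(f | e), uniform start
    for _ in range(iters):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)   # normalization
                for e in src:
                    c = t[(f, e)] / z             # expected count
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t
```

Even on two tiny sentence pairs, the recurring pair ("le", "the") accumulates more expected counts than spurious pairings, so its probability grows with each iteration.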
This is a highly accurate and language-independent algorithm, and thus a very promising method. It consistently achieves high precision. Furthermore, it is fast in comparison with methods that rely solely on lexical information. Requiring no knowledge of the languages or the corpus is another advantage. While lexical methods are generally more accurate than sentence-length ones, most of them require additional linguistic resources or knowledge. Moore overcomes this issue by training IBM Model 1 on sentence pairs extracted by the initial length-based model, which does not use any extra information. However, using Model 1 also results in rather slow computation; the method therefore considers only alignments close to the initially found alignment, which sometimes restricts the search space. This algorithm focuses only on one-to-one alignments and excludes one-to-many and many-to-many alignments. For two languages with different sentence structuring conventions, this may lead to losing amounts of aligned material, and in that case it may be difficult to maximize alignment recall.
2.6.2 Hunalign: Varga et al., 2005
This method is based on both sentence length and lexical similarity. Generally, it is similar to Moore's approach; however, it uses a crude word-by-word dictionary-based replacement rather than IBM Model 1 as in Moore's proposal. The dictionary in this method can be manually expanded [19].
This algorithm may be described as shown in Figure 2.4. In the first step, a crude translation (T') of the source text (S) is produced by converting each word token into the dictionary translation that has the highest frequency in the target corpus, or into itself in case of lookup failure. This is considered a pseudo target language text. This text (T') is then compared with the actual target text (T) on a sentence-by-sentence basis.
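The crude translation step can be sketched as follows. The dictionary layout (word mapped to a list of candidate translations) and the frequency table are illustrative assumptions, not hunalign's actual data structures.

```python
def pseudo_translate(src_tokens, dictionary, tgt_freq):
    """Hunalign-style crude translation (simplified): replace each
    source token with its dictionary translation that is most
    frequent in the target corpus, falling back to the token itself
    on lookup failure."""
    out = []
    for tok in src_tokens:
        cands = dictionary.get(tok)
        if cands:
            out.append(max(cands, key=lambda w: tgt_freq.get(w, 0)))
        else:
            out.append(tok)            # lookup failure: keep token
    return out
```

The resulting pseudo target text shares the target language's surface vocabulary wherever the dictionary covers the source, which makes the subsequent sentence-by-sentence comparison with T meaningful.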
Figure 2.4 The method of Varga et al., 2005