
Bilingual sentence alignment based on sentence length and word translation


VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY


MASTER THESIS OF INFORMATION TECHNOLOGY

SUPERVISOR: PhD Phuong-Thai Nguyen


ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation, and linguistic expression is acknowledged."

Signed


Acknowledgements

I would like to thank my advisor, PhD Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge he has shared with me throughout my Master's course. I would also like to express my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation process. I would like to thank PhD Van-Vinh Nguyen for examining my work and giving advice on it, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh Nguyen for supporting me and checking several issues in my research.

In addition, I would like to express my thanks to the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who taught and helped me throughout my time studying at UET.

Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.


Abstract

Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. In order to turn these abundant materials into useful applications, parallel corpora first have to be aligned at the sentence level.

This process maps sentences in texts of the source language to their corresponding units in texts of the target language. Parallel corpora aligned at the sentence level are a useful resource for a number of applications in natural language processing, including statistical machine translation, word disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.

There have been a number of algorithms proposed with different approaches to sentence alignment. They may, however, be classified into a few major categories. First, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs that have a high similarity in sentence lengths. The second set of methods is based on word correspondences, or lexical information: these methods take lexical information about the texts into account, matching content across the texts or using cognates. An external dictionary may be used in these methods, so they are more accurate but slower than the first kind. There are also hybrid methods, built from the first two approaches, that combine their advantages and thus obtain alignments of quite high quality.

In this thesis, I summarize general issues related to sentence alignment, evaluate approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high precision. From an analysis of the limits of this method, I propose an algorithm using a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge, the bilingual word clustering algorithm, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, along with evaluations that demonstrate the benefits of the proposed method.

Keywords: sentence alignment, parallel corpora, natural language processing, word clustering


Table of Contents

ORIGINALITY STATEMENT 3

Acknowledgements 4

Abstract 5

Table of Contents 6

List of Figures 9

List of Tables 10

CHAPTER ONE Introduction 11

1.1 Background 11

1.2 Parallel Corpora 12

1.2.1 Definitions 12

1.2.2 Applications 12

1.2.3 Aligned Parallel Corpora 12

1.3 Sentence Alignment 12

1.3.1 Definition 12

1.3.2 Types of Alignments 12

1.3.3 Applications 15

1.3.4 Challenges 15

1.3.5 Algorithms 16

1.4 Thesis Contents 16

1.4.1 Objectives of the Thesis 16

1.4.2 Contributions 17

1.4.3 Outline 17

1.5 Summary 18

CHAPTER TWO Related Works 19

2.1 Overview 19


2.2.1 Classification 19

2.2.2 Length-based Methods 19

2.2.3 Word Correspondences Methods 21

2.2.4 Hybrid Methods 21

2.3 Some Important Problems 22

2.3.1 Noise of Texts 22

2.3.2 Linguistic Distances 22

2.3.3 Searching 23

2.3.4 Resources 23

2.4 Length-based Proposals 23

2.4.1 Brown et al., 1991 23

2.4.2 Vanilla: Gale and Church, 1993 24

2.4.3 Wu, 1994 27

2.5 Word-based Proposals 27

2.5.1 Kay and Roscheisen, 1993 27

2.5.2 Chen, 1993 27

2.5.3 Melamed, 1996 28

2.5.4 Champollion: Ma, 2006 29

2.6 Hybrid Proposals 30

2.6.1 Microsoft’s Bilingual Sentence Aligner: Moore, 2002 30

2.6.2 Hunalign: Varga et al., 2005 31

2.6.3 Deng et al., 2007 32

2.6.4 Gargantua: Braune and Fraser, 2010 33

2.6.5 Fast-Champollion: Li et al., 2010 34

2.7 Other Proposals 35

2.7.1 Bleu-align: Sennrich and Volk, 2010 35

2.7.2 MSVM and HMM: Fattah, 2012 36

2.8 Summary 37

CHAPTER THREE Our Approach 39

3.1 Overview 39


3.2 Moore's Approach 39

3.2.1 Description 39

3.2.2 The Algorithm 40

3.3 Evaluation of Moore's Approach 42

3.4 Our Approach 42

3.4.1 Framework 42

3.4.2 Word Clustering 43

3.4.3 Proposed Algorithm 45

3.4.4 An Example 49

3.5 Summary 50

CHAPTER FOUR Experiments 51

4.1 Overview 51

4.2 Data 51

4.2.1 Bilingual Corpora 51

4.2.2 Word Clustering Data 53

4.3 Metrics 54

4.4 Discussion of Results 54

4.5 Summary 57

CHAPTER FIVE Conclusion and Future Work 58

5.1 Overview 58

5.2 Summary 58

5.3 Contributions 58

5.4 Future Work 59

5.4.1 Better Word Translation Models 59

5.4.2 Word-Phrase 59

Bibliography 60


List of Figures

Figure 1.1 A sequence of beads (Brown et al., 1991) 13

Figure 2.1 Paragraph length (Gale and Church, 1993) 25

Figure 2.2 Equation in dynamic programming (Gale and Church, 1993) 26

Figure 2.3 A bitext space in Melamed's method (Melamed, 1996) 29

Figure 2.4 The method of Varga et al., 2005 31

Figure 2.5 The method of Braune and Fraser, 2010 33

Figure 2.6 Sentence Alignment Approaches Review 38

Figure 3.1 Framework of sentence alignment in our algorithm 43

Figure 3.2 An example of Brown's cluster algorithm 44

Figure 3.3 English word clustering data 44

Figure 3.4 Vietnamese word clustering data 44

Figure 3.5 Bilingual dictionary 46

Figure 3.6 Looking up the probability of a word pair 47

Figure 3.7 Looking up in a word cluster 48

Figure 3.8 Handling in the case: one word is contained in dictionary 48

Figure 4.1 Comparison in Precision 55

Figure 4.2 Comparison in Recall 56

Figure 4.3 Comparison in F-measure 57


List of Tables

Table 1.1 Frequency of alignments (Gale and Church, 1993) 14

Table 1.2 Frequency of beads (Ma, 2006) 14

Table 1.3 Frequency of beads (Moore, 2002) 14

Table 1.4 An entry in a probabilistic dictionary (Gale and Church, 1993) 15

Table 2.1 Alignment pairs (Sennrich and Volk, 2010) 36

Table 4.1 Training data-1 51

Table 4.2 Topics in Training data-1 52

Table 4.3 Training data-2 52

Table 4.4 Topics in Training data-2 52

Table 4.5 Input data for training clusters 53

Table 4.6 Topics for Vietnamese input data to train clusters 53

Table 4.7 Word clustering data sets 54



Parallel texts, however, are useful only once they have been sentence-aligned. A parallel corpus is first collected from various resources, and the translated segments forming it are very large, usually on the order of entire documents, which makes learning word correspondences ambiguous. The solution to reduce this ambiguity is to first decrease the size of the segments within each pair, a task known as sentence alignment [7, 12-13, 16].

Sentence alignment is a process that maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 12, 14, 20]. This task is the work of constructing a detailed map of the correspondence between a text and its translation (a bitext map) [14]. It is the first stage for statistical machine translation. With aligned sentences, we can perform further analyses such as phrase and word alignment, bilingual terminology extraction, and collocation extraction, as well as other applications [3, 7-9, 17]. Efficient and powerful sentence alignment algorithms therefore become increasingly important.

A number of sentence alignment algorithms have been proposed [1, 7, 9, 12, 15, 17, 20]. Some of these algorithms are based on sentence length [3, 8, 20]; some use word correspondences [5, 11, 13-14]; some are hybrids of these two approaches [2, 6, 15, 19]. Additionally, there are also some other outstanding methods for this task [7, 17]. For details of these sentence alignment algorithms, see Sections 2.4, 2.5, 2.6, and 2.7.

I propose an improvement to an effective hybrid algorithm [15] used in sentence alignment. For details of our approach, see Section 3.4. I also carry out experiments to illustrate my research. For details of the corpora used in our experiments, see Section 4.2. For results and discussion of the experiments, see Sections 4.4 and 4.5.

In the rest of this chapter, I describe some issues related to the sentence alignment task. In addition, I introduce the objectives of the thesis and our contributions. Finally, I describe the structure of this thesis.

1.2 Parallel Corpora

1.2.1 Definitions

Parallel corpora are collections of documents which are translations of each other [16]. Aligned parallel corpora are collections of pairs of sentences where one sentence is a translation of the other [1].

1.2.2 Applications

Bilingual corpora are an essential resource in multilingual natural language processing systems. This resource helps to develop data-driven natural language processing approaches. It also contributes to applying machine learning to machine translation [15-16].

1.2.3 Aligned Parallel Corpora

Once a parallel text is sentence-aligned, it provides its maximum utility [13]. This makes the task of aligning parallel corpora of considerable interest, and a number of approaches have been proposed and developed to resolve this issue.

1.3 Sentence Alignment

1.3.1 Definition

Sentence alignment is the task of extracting pairs of sentences that are translations of one another from parallel corpora. Given a pair of texts, this process maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 13].

1.3.2 Types of Alignments

Figure 1.1 A sequence of beads (Brown et al., 1991)

Groups of sentence lengths are circled to show the correct alignment. Each grouping is called a bead, and a number indicates the length of each sentence in the bead. In Figure 1.1, "17e" means the sentence length (17 words) of an English sentence, and "19f" means the sentence length (19 words) of a French sentence. There is a sequence of beads as follows:

 An 𝑒𝑓-bead (one English sentence aligned with one French sentence), followed by

 an 𝑒𝑓𝑓-bead (one English sentence aligned with two French sentences), followed by

 an 𝑒-bead (one English sentence), followed by

 a ¶𝑒¶𝑓-bead (one English paragraph and one French paragraph)

An alignment, then, is simply a sequence of beads that accounts for the observed sequences of sentence lengths and paragraph markers [3]

There are quite a number of bead types, but it is possible to consider only some of them, including 1-to-1 (one sentence of the source language aligned with one sentence of the target language), 1-to-2 (one sentence of the source language aligned with two sentences of the target language), etc. Brown et al., 1991 [3] consider the beads 1-to-1, 1-to-0, 0-to-1, 1-to-2, and 2-to-1, plus paragraph beads (¶𝑒, ¶𝑓, ¶𝑒𝑓), because their method also aligns by paragraphs. Moore, 2002 [15] considers only five of these beads, 1-to-1, 1-to-0, 0-to-1, 1-to-2, and 2-to-1, each of which is named as follows:

 1-to-1 bead (a match)

 1-to-0 bead (a deletion)

 0-to-1 bead (an insertion)

 1-to-2 bead (an expansion)

 2-to-1 bead (a contraction)
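As a quick illustration, the five bead types above can be written down as a small lookup table, and an alignment as a sequence of beads. This is only a sketch for exposition, not code from Moore's aligner:

```python
# Minimal sketch: the five bead types of Moore (2002), keyed by
# (number of source sentences, number of target sentences).
BEAD_TYPES = {
    (1, 1): "match",
    (1, 0): "deletion",
    (0, 1): "insertion",
    (1, 2): "expansion",
    (2, 1): "contraction",
}

def describe_alignment(beads):
    """Name each bead in an alignment, i.e. a sequence of beads that
    covers both texts in order."""
    return [BEAD_TYPES[bead] for bead in beads]

# Example: four beads covering four source sentences in order.
print(describe_alignment([(1, 1), (1, 2), (1, 0), (1, 1)]))
# ['match', 'expansion', 'deletion', 'match']
```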


A common statistic related to this is the frequency of beads. Table 1.1 shows the frequencies of bead types reported by Gale and Church, 1993 [8].

Table 1.1 Frequency of alignments (Gale and Church, 1993)

Category Frequency Prob(match)

Meanwhile, the corresponding frequencies of Ma, 2006 [13] are given in Table 1.2:

Table 1.2 Frequency of beads (Ma, 2006)

Category Frequency Percentage

Table 1.3 likewise gives the frequencies of bead types in Moore, 2002 [15]:

Table 1.3 Frequency of beads (Moore, 2002)


Generally, the 1-to-1 bead is the most frequent type in almost all corpora, at around 90%, whereas the other types account for only a few percent each.

1.3.3 Applications

Sentence alignment is an important topic in machine translation and an important first step for statistical machine translation. It is also the first stage in extracting structural and semantic information and deriving statistical parameters from bilingual corpora [17, 20]. Moreover, it is the first step in constructing a probabilistic dictionary (Table 1.4) for use in aligning words in machine translation, or in constructing a bilingual concordance for use in lexicography.

Table 1.4 An entry in a probabilistic dictionary (Gale and Church, 1993)

1.3.4 Challenges

The sentence alignment task is non-trivial because sentences do not always align 1-to-1. At times, a single sentence in one language might be translated as two or more sentences in the other language. The input text also affects accuracy: the performance of sentence alignment algorithms decreases significantly when the input data becomes very noisy. Noisy data means that there are more 1-0 and 0-1 alignments in the data. For example, 89% of the alignments in an English-French corpus are 1-1 (Gale and Church, 1991), and 1-0 and 0-1 alignments make up only 1.3% of that corpus, whereas in the UN Chinese-English corpus (Ma, 2006) there are 89% 1-1 alignments but 6.4% 1-0 or 0-1 alignments. Although some methods work very well on clean data, their performance goes down quickly as the data becomes noisy [13].

In addition, it is difficult to achieve perfectly accurate alignments even when the texts are simple and "clean". For instance, an alignment program that gives excellent results on a scientific text may decline dramatically when applied to a novel or a philosophical text.

Alignment performance also depends on the languages of the corpus. For example, an algorithm based on cognates (words in language pairs that resemble each other phonetically) is likely to work better for English-French than for English-Hindi because there are fewer cognates in English-Hindi [1].

1.3.5 Algorithms

A sentence alignment program is called "ideal" if it is fast, highly accurate, and requires no special knowledge about the corpus or the two languages [2, 9, 15]. A common requirement for sentence alignment approaches is to achieve both high accuracy and minimal consumption of computational resources [2, 9]. Furthermore, a method for sentence alignment should also work in an unsupervised fashion and be language-pair independent, in order to be applicable to parallel corpora in any language without requiring a separate training set. A method is unsupervised if it learns an alignment model directly from the data set to be aligned. Meanwhile, language-pair independence means that the approach requires no specific knowledge about the languages of the parallel texts to be aligned.

1.4 Thesis Contents

This section introduces the organization of this thesis: its objectives, our contributions, and the outline.

1.4.1 Objectives of the Thesis

In this thesis, I report the results of my study of sentence alignment and of approaches proposed for this task. In particular, I focus on Moore's method (2002), an outstanding method with a number of advantages. I also introduce a new feature, word clustering, which may be applied to this task to improve alignment accuracy. I examine this proposal in experiments and compare the results to those of the baseline method to demonstrate the advantages of my approach.


1.4.2 Contributions

My main contributions are as follows:

 Evaluating methods in sentence alignment and introducing an algorithm that improves Moore's method

 Using a new feature, word clustering, to help improve alignment accuracy; this contributes a complementary strategy to the sentence alignment problem

1.4.3 Outline

The rest of the thesis is organized as follows:

Chapter 2 – Related Works

In this chapter, I introduce some recent research on sentence alignment. To give a general view of the methods proposed to deal with this problem, an overall presentation of sentence alignment methods is provided. The methods are classified into several types, and each is presented by describing its algorithm along with evaluations related to it.

Chapter 3 – Our Approach

This chapter describes the method we propose to improve Moore's sentence alignment method. Initially, an analysis of Moore's method and evaluations of it are presented. The major content of this chapter is the framework of the proposed method, an algorithm using bilingual word clustering. An example is given to illustrate the approach clearly.

Chapter 4 – Experiments

This chapter presents the experiments performed with our approach. The data corpora used in the experiments are fully described. The results of the experiments, as well as discussion of them, are clearly described in order to evaluate our approach against the baseline method.

Chapter 5 – Conclusions and Future Work

In this last chapter, the advantages and limitations of my work are summarized in a general conclusion. In addition, some research directions to improve the current model in the future are mentioned.

Finally, references are given to the published research that my work draws on.


1.5 Summary

This chapter has introduced my research. I have given background information about parallel corpora and sentence alignment, definitions of key issues, and some initial problems related to sentence alignment algorithms. The alignment terms used in this task have been defined in this chapter. In addition, an outline of the research in this thesis has been provided, along with a brief mention of proposed future work.


Section 2.2 provides an overview of sentence alignment approaches. Section 2.4 introduces and evaluates some primary length-based proposals, and Section 2.5 introduces and evaluates proposals based on word correspondences. Proposals in hybrid methods, along with evaluations of each, are presented in Section 2.6. There are also some other outstanding approaches to this task, which are introduced in Section 2.7, and Section 2.8 concludes this chapter.

There are also some other techniques, such as methods based on the BLEU score, support vector machines, and hidden Markov model classifiers.

2.2.2 Length-based Methods

Length-based approaches model the relationship between the lengths of sentences that are mutual translations. Length is measured in characters or in words. In these approaches, the semantics of the text are not considered; statistical methods are used instead of the content of the texts. In other words, these methods consider only the lengths of sentences in order to make alignment decisions.

These methods are based on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences (in characters) and the variance of this difference. There are two random variables, 𝑙1 and 𝑙2, which are the lengths of the two sentences under consideration. It is assumed that these random variables are independent and identically distributed with a normal distribution [8].
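The length score just described can be sketched as follows. This is an illustrative approximation rather than the original implementation: the constants (an expected length ratio c ≈ 1 and variance s² ≈ 6.8, the values Gale and Church report for character counts) and the averaging of the two lengths in the denominator are assumptions of this sketch.

```python
import math

# Sketch of a Gale-Church-style length score (illustrative, not the
# original implementation). Assumed constants: C ~ 1 (expected ratio of
# target to source length), S2 ~ 6.8 (variance of that ratio), as
# reported for character counts by Gale and Church (1993).
C, S2 = 1.0, 6.8

def delta(l1, l2):
    """Scaled difference of the two sentence lengths l1 and l2."""
    if l1 == 0 and l2 == 0:
        return 0.0
    return (l2 - l1 * C) / math.sqrt(((l1 + l2) / 2.0) * S2)

def length_cost(l1, l2):
    """-log of the two-tailed probability of a delta at least this large
    under a standard normal distribution: a low cost means the lengths
    are consistent with the sentences being mutual translations."""
    d = abs(delta(l1, l2))
    phi = 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))  # normal CDF at d
    two_tailed = max(2.0 * (1.0 - phi), 1e-300)       # avoid log(0)
    return -math.log(two_tailed)

# Similar lengths score much better than very different ones.
print(length_cost(40, 42) < length_cost(40, 90))  # True
```

Such a cost can then be plugged into the dynamic programming search described later in this chapter.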

Given two parallel texts 𝑆𝑇 (source text) and 𝑇𝑇 (target text), the goal of this task is to find the alignment A with the highest probability.

The methods of this type are based solely on the lengths of sentences, so they require almost no prior knowledge. Furthermore, these methods are highly accurate despite their simplicity, and they run at high speed. When aligning texts whose languages are similar or have a high length correlation, such as English, French, and German, these approaches are especially useful and work remarkably well. They also perform fairly well if the input text is clean, as in the Canadian Hansards corpus [3]. The Gale and Church algorithm is still widely used today, for instance to align Europarl (Koehn, 2005).

Nevertheless, these methods are not robust, since they use only sentence length information. They are no longer reliable if there is too much noise in the input bilingual texts. As shown in (Chen, 1993) [5], the accuracy of sentence-length-based methods decreases drastically when aligning texts containing small deletions or free translations. Moreover, the method proposed by Gale and Church depends on a prior alignment of paragraphs to constrain the search. When aligning texts where the length correlation breaks down, such as the Chinese-English language pair, the performance of length-based algorithms declines quickly.

2.2.3 Word Correspondences Methods

The second approach, which tries to overcome the disadvantages of length-based approaches, is the word-based method, built on lexical information from translation lexicons and/or the recognition of cognates. These methods take lexical information about the texts into account. Most algorithms match content in one text with its correspondences in the other text and use these matches as anchor points for the sentence alignment task. Words that are translations of each other tend to have similar distributions in the source-language and target-language texts. Meanwhile, some methods use cognates (words in language pairs that resemble each other phonetically) rather than the content of word pairs to determine sentence beads.
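One widely used cognate heuristic is the longest common subsequence ratio (LCSR). The sketch below is illustrative only; the helper names and the 0.58 threshold are assumptions of this sketch, not taken from any of the systems discussed in this chapter:

```python
# Sketch of the longest-common-subsequence ratio (LCSR), a widely used
# cognate heuristic: a word pair is a cognate candidate when the length
# of its longest common subsequence, divided by the length of the longer
# word, exceeds a threshold. (Illustrative helpers and threshold.)

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    """LCS length divided by the length of the longer word."""
    return lcs_len(a, b) / max(len(a), len(b)) if a and b else 0.0

def cognate_candidates(a, b, threshold=0.58):
    return lcsr(a.lower(), b.lower()) >= threshold

print(cognate_candidates("government", "gouvernement"))  # True (EN/FR pair)
print(cognate_candidates("dog", "chien"))                # False
```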

This type of sentence alignment method is illustrated by some outstanding approaches such as Kay and Röscheisen, 1993 [11], Chen, 1993 [5], Melamed, 1996 [14], and Ma, 2006 [13]. Kay's work has not proved efficient enough to be suitable for large corpora, while Chen constructs a word-to-word translation model during alignment to assess the probability of an alignment. Word correspondence was further developed in IBM Model-1 (Brown et al., 1993) for statistical machine translation. Meanwhile, Melamed, 1996 proposes word correspondence in another form (geometric correspondence) for sentence alignment.

These algorithms have higher accuracy than length-based methods. Because they use lexical information from source and translation lexicons, rather than only sentence length, to determine the translation relationship between sentences in the source and target texts, they are usually also more robust than the length-based algorithms.

Nevertheless, lexicon-based algorithms are slower than length-based ones because they require considerably more expensive computation. In addition, they usually depend on cognates or a bilingual lexicon: the method of Chen requires an initial bilingual lexicon, while the proposal of Melamed depends on finding cognates in the two languages to suggest word correspondences.

2.2.4 Hybrid Methods

Sentence length and lexical information are also combined so that the different approaches can complement each other and yield more efficient algorithms.


Such approaches are proposed in Moore, 2002; Varga et al., 2005; and Braune and Fraser, 2010. These approaches have two passes: a length-based method produces a first alignment, which subsequently serves as training data for a translation model, and that model is then used in a more complex similarity score. Moore, 2002 proposes a two-phase method that combines sentence length (word count) in the first pass with word correspondences (IBM Model-1) in the second. Varga et al. (2005) also use the hybrid technique, combining sentence length with word correspondences (using a dictionary-based translation model in which the dictionary can be manually expanded). Braune and Fraser, 2010 propose an algorithm similar to Moore's, except that it includes a technique to build 1-to-many and many-to-1 alignments rather than focusing only on 1-to-1 alignments as Moore's method does.

The hybrid approaches achieve relatively high performance, overcoming the limits of the first two kinds of methods while combining their advantages. The approach of Moore, 2002 obtains high precision (the fraction of retrieved items that are in fact relevant) and computational efficiency. Meanwhile, the algorithm of Varga et al., 2005, which follows the same idea as Moore, 2002, attains a very high recall rate (the fraction of relevant items that are retrieved by the algorithm).

Nonetheless, there are still weaknesses that should be addressed in order to obtain a more efficient sentence alignment algorithm. In Moore's method, the recall rate is rather low, which is especially problematic when aligning parallel corpora with much noise or sparse data. The approach of Varga et al., 2005, meanwhile, achieves a very high recall value but still has a rather low precision rate.

2.3 Some Important Problems

2.3.2 Linguistic Distances

Another parameter that can affect the performance of sentence alignment is the linguistic distance between the languages of the corpus. For example, English is linguistically "closer" to Western European languages (such as French and German) than to East Asian languages (such as Korean and Japanese). There are measures to assess linguistic distance, such as the number of cognate words and syntactic features. It is important to recognize that some algorithms may not perform well if they rely on closeness between languages that are in fact distant. An obvious example is that a method based on cognates is likely to work better for English-French or English-German than for English-Hindi, because there are fewer cognates in English-Hindi: Hindi belongs to the Indo-Aryan branch, whereas English and German belong to the Germanic one.

2.3.3 Searching

Dynamic programming is the technique most sentence alignment tools use to search for the best path of sentence pairs through a parallel text. This also means that the texts are ordered monotonically and that none of these algorithms can extract sentence pairs in crossing positions. Nevertheless, most of these programs benefit from this search technique, and none of them reports weaknesses in it, because a characteristic of translations is that almost all sentences appear in the same order in both the source and target texts.

In this respect, algorithms may be confronted with problems of search-space size. Thus, pruning strategies to restrict the search space are also an issue that algorithms have to resolve.

2.3.4 Resources

All systems learn their respective models from the parallel text itself. Only some algorithms support the use of external resources, such as Hunalign (Varga et al., 2005) with bilingual dictionaries and Bleualign (Sennrich and Volk, 2010) with existing MT systems.

2.4 Length-based Proposals

2.4.1 Brown et al., 1991

To search for the best alignment, Brown et al. use dynamic programming. This technique requires time quadratic in the length of the texts aligned, so it is not practical to align a large corpus as a single unit. The computation may be reduced dramatically if the bilingual corpus is subdivided into smaller chunks. This subdivision is performed using anchors: an anchor is a piece of text likely to be present at the same location in both halves of a bilingual corpus. Dynamic programming is first used to align the anchors, and then applied again to align the text between anchors.
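The subdivision step can be sketched as follows, under the assumption that the anchors have already been aligned one-to-one; `subdivide` is a hypothetical helper, not Brown et al.'s implementation:

```python
# Sketch of anchor-based subdivision: given anchor positions already
# aligned 1-to-1 across the two texts, yield the chunks between anchors
# so that quadratic dynamic programming runs only on short segments.
# (Illustrative helper, not Brown et al.'s implementation.)

def subdivide(src_sents, tgt_sents, src_anchors, tgt_anchors):
    """Yield (source_chunk, target_chunk) pairs lying between anchors.
    src_anchors[k] and tgt_anchors[k] are the indices of the k-th
    aligned anchor in each sentence list."""
    prev_s, prev_t = 0, 0
    for s, t in zip(src_anchors, tgt_anchors):
        yield src_sents[prev_s:s], tgt_sents[prev_t:t]
        prev_s, prev_t = s + 1, t + 1  # skip past the anchor itself
    yield src_sents[prev_s:], tgt_sents[prev_t:]

src = ["s1", "s2", "ANCHOR", "s3", "s4"]
tgt = ["t1", "ANCHOR", "t2", "t3", "t4"]
for chunk in subdivide(src, tgt, [2], [1]):
    print(chunk)
# (['s1', 's2'], ['t1'])
# (['s3', 's4'], ['t2', 't3', 't4'])
```

Each yielded chunk pair can then be aligned independently, keeping the dynamic program's quadratic cost bounded by the chunk size.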

The alignment computation of this algorithm is fast, since it makes no use of the lexical details of the sentences. It is therefore practical to apply this method to very large collections of text, especially for language pairs with high length correlation.

2.4.2 Vanilla: Gale and Church, 1993

This algorithm performs sentence alignment based on a statistical model of sentence lengths measured in characters. It uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language.

This algorithm is similar to the proposal of Brown et al., except that Brown et al. measure sentence length in words whereas Gale and Church measure it in characters. In addition, the algorithm of Brown et al. aligns a subset of the corpus for further research instead of focusing on entire articles. The work of Gale and Church (1991) supports the promise of wider applicability.

This sentence alignment program has two steps: first, paragraphs are aligned, and then sentences within each paragraph. The authors report that paragraph lengths are highly correlated; Figure 2.1 illustrates this correlation for the English-German language pair.


Figure 2.1 Paragraph length (Gale and Church, 1993)

A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences and the variance of this difference. This score is used in a dynamic programming framework to find the maximum likelihood alignment of sentences. The use of dynamic programming allows the system to consider all possible alignments and find the minimum cost alignment efficiently.

A distance function 𝑑 is defined in a general way to allow for insertions, deletions, substitutions, etc. The function takes four arguments: 𝑥1, 𝑦1, 𝑥2, 𝑦2.

 Let 𝑑(𝑥1, 𝑦1; 0, 0) be the cost of substitution 𝑥1 with 𝑦1,

 𝑑(𝑥1, 0; 0, 0) be the cost of deleting 𝑥1,

 𝑑(0, 𝑦1; 0, 0) be the cost of insertion of 𝑦1,

 𝑑(𝑥1, 𝑦1; 𝑥2, 0) be the cost of contracting 𝑥1 and 𝑥2 to 𝑦1,

 𝑑(𝑥1, 𝑦1; 0, 𝑦2) be the cost of expanding 𝑥1 to 𝑦1 and 𝑦2, and

 𝑑(𝑥1, 𝑦1; 𝑥2, 𝑦2) be the cost of merging 𝑥1 and 𝑥2 and matching with 𝑦1 and 𝑦2 The Dynamic Programming Algorithm is summarized in the following recursion equation


Figure 2.2 Equation in dynamic programming (Gale and Church, 1993)

Let 𝑠𝑖, 𝑖 = 1 … 𝐼, be the sentences of one language, and 𝑡𝑗, 𝑗 = 1 … 𝐽, be the translations of those sentences in the other language.

Let 𝑑 be the distance function, and let 𝐷(𝑖, 𝑗) be the minimum distance between sentences 𝑠1, …, 𝑠𝑖 and their translations 𝑡1, …, 𝑡𝑗 under the maximum-likelihood alignment. 𝐷(𝑖, 𝑗) is computed by minimizing over six cases (substitution, deletion, insertion, contraction, expansion, and merger), which, in effect, impose a set of slope constraints. The recursion is defined with the initial condition 𝐷(0, 0) = 0.
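
A direct transcription of this recursion might look like the following sketch (the cost function d is supplied by the caller; only the minimum total cost is returned, with no backtrace for recovering the alignment itself):

```python
def align(src, tgt, d):
    """Minimum-cost sentence alignment by dynamic programming.

    src, tgt : lists of sentence lengths (in characters)
    d(x1, y1, x2, y2) : cost of one alignment bead, mirroring the
        four-argument distance d(x1, y1; x2, y2) above; zeros stand
        for absent sentences.
    Returns D(I, J), the cost of the best alignment of all sentences.
    """
    I, J = len(src), len(tgt)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0  # initial condition D(0, 0) = 0
    for i in range(I + 1):
        for j in range(J + 1):
            if i == 0 and j == 0:
                continue
            moves = []  # (rows consumed, cols consumed, bead cost)
            if i >= 1 and j >= 1:
                moves.append((1, 1, d(src[i-1], tgt[j-1], 0, 0)))         # substitution
            if i >= 1:
                moves.append((1, 0, d(src[i-1], 0, 0, 0)))                # deletion
            if j >= 1:
                moves.append((0, 1, d(0, tgt[j-1], 0, 0)))                # insertion
            if i >= 2 and j >= 1:
                moves.append((2, 1, d(src[i-1], tgt[j-1], src[i-2], 0)))  # contraction
            if i >= 1 and j >= 2:
                moves.append((1, 2, d(src[i-1], tgt[j-1], 0, tgt[j-2])))  # expansion
            if i >= 2 and j >= 2:
                moves.append((2, 2, d(src[i-1], tgt[j-1], src[i-2], tgt[j-2])))  # merger
            D[i][j] = min(D[i-di][j-dj] + c for di, dj, c in moves)
    return D[I][J]
```

Each cell considers the same six cases as the recursion above; keeping back-pointers alongside D would recover the alignment itself.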

This algorithm has some main characteristics, as follows:

• Firstly, it is a simple algorithm. The number of characters in the sentences is simply counted, and a dynamic programming model is used to find the correct alignment pairs. Many later researchers have integrated this method into their own approaches because of this simplicity.

• Secondly, the algorithm can be used for any pair of languages because it does not use any lexical information.

• Thirdly, it has a low time cost, one of the most important criteria for applying a method to a very large bilingual corpus.

• Finally, it is also quite an accurate algorithm, especially when aligning data of highly correlated language pairs like English-French or English-German.

As Gale and Church report, in comparison with word-count length-based methods such as the proposal of Brown et al., it is better to use characters rather than words when counting sentence length. The performance of the character-based approach is better because there is less variability in the differences of sentence lengths so measured; using words as units increases the error rate by half. This method performs well at least on related languages. Its accuracy also depends on the type of alignment: it obtains the best results on 1-to-1 alignments but has a high error rate on more difficult alignments.


2.4.3 Wu, 1994

Wu applies Gale and Church's method to the language pair English-Chinese. In order to improve the accuracy, he utilizes lexical information from translation lexicons and/or through the identification of cognates. Lexical cues used in this method are in the form of a small corpus-specific lexicon.

This method is important in two respects:

• Firstly, it indicated that length-based methods give satisfactory results even between unrelated languages (languages from unrelated families, such as English and Chinese), which was a surprising result.

• Secondly, the lexical cues used in this method increase the accuracy of alignment. This demonstrates the benefit of adding lexical information to a length-based method.

2.5 Word-based Proposals

2.5.1 Kay and Roscheisen, 1993

This algorithm is an iterative relaxation approach based on word correspondences. The iterations start from the assumption that the first and last sentences of the texts align; these serve as the initial anchors, and the steps below are repeated until most sentences are aligned:

 Step 1: Form an envelope of possible alignments

 Step 2: Choose word pairs that tend to co-occur in these potential partial alignments

 Step 3: Find pairs of source and target sentences which contain many possible lexical correspondences

A set of partial alignments which will be part of the final result is induced by using the most reliable of these pairs

However, a weakness of this method is that it is not efficient enough to apply to large corpora


2.5.2 Chen, 1993

Dynamic programming with a threshold is the search strategy of this algorithm. The use of the threshold makes the search linear in the length of the corpus, so the corpus need not be subdivided into smaller chunks. The method also deals well with deletions: the search strategy remains robust despite large deletions, because lexical information allows the beginnings and ends of deletions to be identified confidently. With an intelligent threshold, great benefits may be obtained from the fact that most alignments are one-to-one. The computation is reduced to a linear one because only a subset of all possible alignments is considered.
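
The effect of thresholding can be approximated by restricting the search to a diagonal band of the alignment space, as in this illustrative helper (the band width is an assumed tuning parameter, not a value from the original work):

```python
def banded_indices(i, I, J, width=20):
    """Target positions considered for source position i: only a diagonal
    band of the (I x J) alignment space is searched, which makes the
    dynamic programming linear in the corpus length instead of quadratic.
    width is an assumed tuning parameter."""
    center = round(i * J / max(I, 1))  # expected diagonal position
    return range(max(0, center - width), min(J, center + width) + 1)
```

Cells far from the expected diagonal are never scored, which is why the corpus does not have to be pre-chunked.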

This method gives better accuracy than the length-based ones, but it is “tens of times slower than the Brown [3] and Gale [8] algorithms” [5]. Furthermore, it is language independent and can handle large deletions in text.

This algorithm requires that, for each language pair, 100 sentences be aligned by hand to bootstrap the translation model; it therefore depends on a minimum of human intervention. The method has a great computational cost because it uses lexical information. However, alignment is a one-time cost, the result may be very useful after it is produced, and computing power is increasingly available; for these reasons, the computational cost may sometimes be acceptable.

2.5.3 Melamed, 1996

This method is based on word correspondences [14]. A bitext map of words marks the points of correspondence between words in a two-dimensional graph. After all possible points have been marked, the true correspondence points in the graph are found using heuristic rules, sentence location information, and sentence boundaries.

Melamed uses the term bitext for the two versions of a text, such as a text in two different languages. A bitext is created each time a translator translates a text. Figure 2.3 illustrates the rectangular bitext space that each bitext defines. The lengths of the two component texts (in characters) are represented by the width and the height of the rectangle.


Figure 2.3 A bitext space in Melamed’s method (Melamed, 1996)

This algorithm has slightly better accuracy than Gale and Church's. If a good bitext map can be formed, the method can give almost perfect alignment results; this is its main strength. For widely used languages, where a good bitext map is easier to acquire, it may be the best choice.

However, this method requires a good bitext map to have satisfactory accuracy

2.5.4 Champollion: Ma, 2006

This algorithm, which is designed for robust alignment of potentially noisy parallel text, is a lexicon-based sentence aligner [13]. It was first developed for aligning English-Chinese parallel text before being ported to other language pairs such as Hindi-English or Arabic-English. In this method, a match is considered only if lexical matches are present. As a stronger indication that two segments are a match, higher weights are assigned to less frequent words. In order to weed out bogus matches, the algorithm also uses sentence length information.

This method focuses on dealing with noisy data. It overcomes a limitation of existing methods, which work very well on clean data but whose performance declines quickly when the data become noisy.

There are also some characteristics that distinguish this method from other sentence aligners:

• Noisy data are resources in which a larger percentage of the alignments are not 1-to-1 and which contain a significant number of deletions and insertions. This method assumes such input data. Sentence length information is unreliable for dealing with noisy data, so it plays only a minor role; the method relies mainly on lexical evidence.

• In most sentence alignment algorithms, translated words are treated equally: when deciding sentence correspondences, an equal weight is assigned to all translated word pairs. This method instead assigns individual weights to translated words, which distinguishes it from other lexicon-based algorithms.

• Translation lexicons are applied in two steps: entries from a translation lexicon are used to identify translated words, and then sentence correspondences are identified using statistics over the translated words.

In this method, assigning greater weights to less frequent translated words helps to increase the robustness of the alignment, especially when dealing with noisy data. The method attains high precision and recall rates and may be easy to use for new language pairs. However, it requires an externally supplied bilingual lexicon.
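
The frequency-based weighting can be sketched as an idf-style score (the exact formula Champollion uses differs in detail; this is an illustrative approximation):

```python
import math

def word_weights(word_counts, total_segments):
    """Idf-style weights: words occurring in fewer segments get larger
    weights, so a rare lexical match is stronger evidence that two
    segments correspond. word_counts maps a word to the number of
    segments it occurs in."""
    return {w: math.log(total_segments / c) for w, c in word_counts.items()}
```

Under this scheme, frequent function words contribute little to a match score while rare content words dominate it.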

2.6 Hybrid Proposals

2.6.1 Microsoft’s Bilingual Sentence Aligner: Moore, 2002

This algorithm combines length-based and word-based approaches to achieve high accuracy at a modest computational cost [15]. A problem in using lexical information is that it limits the algorithm to a particular pair of languages. Moore resolves this problem by using a method similar to the IBM translation models to learn lexical statistics from the texts at hand. The method has two passes: sentence-length statistics are first used to extract training data for an IBM Model-1 translation table, and then the sentence-length model is combined with the acquired lexical statistics in order to extract 1-to-1 correspondences with high accuracy. A forward-backward computation is used as the search heuristic, in which the forward pass is a pruned dynamic programming procedure.
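
In the spirit of the second pass, a candidate 1-to-1 bead can be scored by combining a length-based cost with an IBM Model-1 style lexical cost, as in this sketch (the interpolation weight lam, the smoothing floor, and the function names are assumptions for illustration, not Moore's actual formulation):

```python
import math

def bead_score(length_cost, src_words, tgt_words, t, lam=0.5):
    """Cost of a candidate 1-to-1 sentence pair, combining a length-based
    cost with an IBM Model-1 style lexical cost. t[(f, e)] holds lexical
    translation probabilities estimated in the first pass; lam and the
    smoothing floor are illustrative assumptions. Lower scores are better."""
    lex = 0.0
    for f in src_words:
        # Model-1: a source word may be generated by any target word or NULL
        p = sum(t.get((f, e), 0.0) for e in list(tgt_words) + ["NULL"])
        lex += -math.log(max(p / (len(tgt_words) + 1), 1e-12))
    return lam * length_cost + (1.0 - lam) * lex
```

A pair whose words are well attested in the learned translation table receives a much lower cost than one with no lexical support, even when the two candidates have identical lengths.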

This is a highly accurate and language-independent algorithm, so it is a very promising method. It consistently achieves high precision. Furthermore, it is fast in comparison with methods that use solely lexical information. Requiring no knowledge of the languages or the corpus is another advantage of this method. While lexical methods are generally more accurate than sentence-length-based ones, most of them require additional
