VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
MASTER THESIS IN INFORMATION TECHNOLOGY
SUPERVISOR: Dr. Phuong-Thai Nguyen
Hanoi - 2014
ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no material previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."
Signed
Acknowledgements

I would like to thank my advisor, Dr. Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge he has given me during my Master's course. I would also like to express my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation process. I would like to thank Dr. Van-Vinh Nguyen for examining my work and giving advice on it, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh Nguyen for supporting and checking some issues in my research.

In addition, I would like to express my thanks to the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who taught and helped me throughout my time at UET.

Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.
Abstract

Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. In order to turn these abundant materials into useful applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in texts of the source language to their corresponding units in texts of the target language. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including Statistical Machine Translation, word disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.
There have been a number of algorithms proposed with different approaches for sentence alignment. However, they may be classified into a few major categories. First of all, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs that have a high similarity in sentence lengths. The second set of methods is based on word correspondences or lexicons. These methods take into account lexical information about the texts, matching content across the texts or using cognates. An external dictionary may be used in these methods, so they are more accurate but slower than the first ones. There are also methods based on hybrids of these first two approaches that combine their advantages, so they obtain quite high-quality alignments.

In this thesis, I summarize general issues related to sentence alignment, evaluate approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. From analyzing the limits of this method, I propose an algorithm using a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge, the bilingual word clustering algorithm, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, together with evaluations that demonstrate the benefits of the proposed method.
Keywords: sentence alignment, parallel corpora, natural language processing, word clustering
Table of Contents
ORIGINALITY STATEMENT 3
Acknowledgements 4
Abstract 5
Table of Contents 6
List of Figures 9
List of Tables 10
CHAPTER ONE Introduction 11
1.1 Background 11
1.2 Parallel Corpora 12
1.2.1 Definitions 12
1.2.2 Applications 12
1.2.3 Aligned Parallel Corpora 12
1.3 Sentence Alignment 12
1.3.1 Definition 12
1.3.2 Types of Alignments 12
1.3.3 Applications 15
1.3.4 Challenges 15
1.3.5 Algorithms 16
1.4 Thesis Contents 16
1.4.1 Objectives of the Thesis 16
1.4.2 Contributions 17
1.4.3 Outline 17
1.5 Summary 18
CHAPTER TWO Related Works 19
2.1 Overview 19
2.2 Overview of Approaches 19
2.2.1 Classification 19
2.2.2 Length-based Methods 19
2.2.3 Word Correspondences Methods 21
2.2.4 Hybrid Methods 21
2.3 Some Important Problems 22
2.3.1 Noise of Texts 22
2.3.2 Linguistic Distances 22
2.3.3 Searching 23
2.3.4 Resources 23
2.4 Length-based Proposals 23
2.4.1 Brown et al., 1991 23
2.4.2 Vanilla: Gale and Church, 1993 24
2.4.3 Wu, 1994 27
2.5 Word-based Proposals 27
2.5.1 Kay and Roscheisen, 1993 27
2.5.2 Chen, 1993 27
2.5.3 Melamed, 1996 28
2.5.4 Champollion: Ma, 2006 29
2.6 Hybrid Proposals 30
2.6.1 Microsoft’s Bilingual Sentence Aligner: Moore, 2002 30
2.6.2 Hunalign: Varga et al., 2005 31
2.6.3 Deng et al., 2007 32
2.6.4 Gargantua: Braune and Fraser, 2010 33
2.6.5 Fast-Champollion: Li et al., 2010 34
2.7 Other Proposals 35
2.7.1 Bleu-align: Sennrich and Volk, 2010 35
2.7.2 MSVM and HMM: Fattah, 2012 36
2.8 Summary 37
CHAPTER THREE Our Approach 39
3.1 Overview 39
3.2 Moore's Approach 39
3.2.1 Description 39
3.2.2 The Algorithm 40
3.3 Evaluation of Moore's Approach 42
3.4 Our Approach 42
3.4.1 Framework 42
3.4.2 Word Clustering 43
3.4.3 Proposed Algorithm 45
3.4.4 An Example 49
3.5 Summary 50
CHAPTER FOUR Experiments 51
4.1 Overview 51
4.2 Data 51
4.2.1 Bilingual Corpora 51
4.2.2 Word Clustering Data 53
4.3 Metrics 54
4.4 Discussion of Results 54
4.5 Summary 57
CHAPTER FIVE Conclusion and Future Work 58
5.1 Overview 58
5.2 Summary 58
5.3 Contributions 58
5.4 Future Work 59
5.4.1 Better Word Translation Models 59
5.4.2 Word-Phrase 59
Bibliography 60
List of Figures
Figure 1.1 A sequence of beads (Brown et al., 1991) 13
Figure 2.1 Paragraph length (Gale and Church, 1993) 25
Figure 2.2 Equation in dynamic programming (Gale and Church, 1993) 26
Figure 2.3 A bitext space in Melamed's method (Melamed, 1996) 29
Figure 2.4 The method of Varga et al., 2005 31
Figure 2.5 The method of Braune and Fraser, 2010 33
Figure 2.6 Sentence Alignment Approaches Review 38
Figure 3.1 Framework of sentence alignment in our algorithm 43
Figure 3.2 An example of Brown's cluster algorithm 44
Figure 3.3 English word clustering data 44
Figure 3.4 Vietnamese word clustering data 44
Figure 3.5 Bilingual dictionary 46
Figure 3.6 Looking up the probability of a word pair 47
Figure 3.7 Looking up in a word cluster 48
Figure 3.8 Handling in the case: one word is contained in dictionary 48
Figure 4.1 Comparison in Precision 55
Figure 4.2 Comparison in Recall 56
Figure 4.3 Comparison in F-measure 57
List of Tables
Table 1.1 Frequency of alignments (Gale and Church, 1993) 14
Table 1.2 Frequency of beads (Ma, 2006) 14
Table 1.3 Frequency of beads (Moore, 2002) 14
Table 1.4 An entry in a probabilistic dictionary (Gale and Church, 1993) 15
Table 2.1 Alignment pairs (Sennrich and Volk, 2010) 36
Table 4.1 Training data-1 51
Table 4.2 Topics in Training data-1 52
Table 4.3 Training data-2 52
Table 4.4 Topics in Training data-2 52
Table 4.5 Input data for training clusters 53
Table 4.6 Topics for Vietnamese input data to train clusters 53
Table 4.7 Word clustering data sets 54
CHAPTER ONE
Introduction

1.1 Background
Parallel texts, however, are useful only when they are sentence-aligned. A parallel corpus is first collected from various resources, and the translated segments forming it are usually very large, often on the order of entire documents, which makes learning word correspondences ambiguous. The solution is to reduce the size of the segments within each pair, a task known as sentence alignment [7, 12-13, 16].
Sentence alignment is a process that maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 12, 14, 20]. This task is the work of constructing a detailed map of the correspondence between a text and its translation (a bitext map) [14]. It is the first stage of Statistical Machine Translation. With aligned sentences, we can perform further analyses such as phrase and word alignment, bilingual terminology extraction, and collocation extraction, as well as other applications [3, 7-9, 17]. Efficient and powerful sentence alignment algorithms therefore become increasingly important.
A number of sentence alignment algorithms have been proposed [1, 7, 9, 12, 15, 17, 20]. Some of these algorithms are based on sentence length [3, 8, 20]; some use word correspondences [5, 11, 13-14]; some are hybrids of these two approaches [2, 6, 15, 19]. Additionally, there are also some other outstanding methods for this task [7, 17]. For details of these sentence alignment algorithms, see Sections 2.4, 2.5, 2.6, and 2.7.
I propose an improvement to an effective hybrid algorithm [15] used in sentence alignment. For details of our approach, see Section 3.4. I also conduct experiments to illustrate my research. For details of the corpora used in our experiments, see Section 4.2. For results and discussions of the experiments, see Sections 4.4 and 4.5.
In the rest of this chapter, I describe some issues related to the sentence alignment task. In addition, I introduce the objectives of the thesis and our contributions. Finally, I describe the structure of this thesis.
1.2 Parallel Corpora
1.2.1 Definitions
Parallel corpora are collections of documents which are translations of each other [16]. Aligned parallel corpora are collections of pairs of sentences where one sentence is a translation of the other [1].
1.2.2 Applications
Bilingual corpora are an essential resource in multilingual natural language processing systems. This resource helps to develop data-driven natural language processing approaches. It also contributes to applying machine learning to machine translation [15-16].
1.2.3 Aligned Parallel Corpora
Once a parallel text is sentence-aligned, it provides the maximum utility [13]. This makes the task of aligning parallel corpora of considerable interest, and a number of approaches have been proposed and developed to resolve this issue.
1.3 Sentence Alignment
1.3.1 Definition
Sentence alignment is the task of extracting pairs of sentences that are translations of one another from parallel corpora. Given a pair of texts, this process maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 13].
1.3.2 Types of Alignments

Figure 1.1 A sequence of beads (Brown et al., 1991).
Groups of sentence lengths are circled to show the correct alignment. Each of the groupings is called a bead, and a number shows the length of each sentence in the bead. In Figure 1.1, "17e" means the sentence length (17 words) of an English sentence, and "19f" means the sentence length (19 words) of a French sentence. There is a sequence of beads as follows:

A 1-to-1 bead (one English sentence aligned with one French sentence) followed by
A 1-to-2 bead (one English sentence aligned with two French sentences) followed by
A 1-to-0 bead (one English sentence aligned with nothing) followed by
A ¶-¶ bead (one English paragraph marker and one French paragraph marker).

An alignment, then, is simply a sequence of beads that accounts for the observed sequences of sentence lengths and paragraph markers [3].
There are quite a number of bead types, but it is possible to consider only some of them, including 1-to-1 (one sentence of the source language aligned with one sentence of the target language), 1-to-2 (one sentence of the source language aligned with two sentences of the target language), etc. Brown et al., 1991 [3] considered the beads 1-to-1, 1-to-0, 0-to-1, 1-to-2, 2-to-1, and a bead of paragraph markers, because this method also aligns by paragraphs. Moore, 2002 [15] considers only five of these beads, 1-to-1, 1-to-0, 0-to-1, 1-to-2, 2-to-1, each of which is named as follows:
1-to-1 bead (a match)
1-to-0 bead (a deletion)
0-to-1 bead (an insertion)
1-to-2 bead (an expansion)
2-to-1 bead (a contraction)
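As an illustration of this terminology, the bead types above can be represented as (source count, target count) pairs. The following Python sketch is my own illustration, not code from any of the cited aligners: it checks that a bead sequence accounts for every sentence in both texts and returns the aligned sentence-index groups.

```python
# The five bead types considered by Moore (2002), as
# (source sentence count, target sentence count) pairs.
BEAD_NAMES = {
    (1, 1): "match",
    (1, 0): "deletion",
    (0, 1): "insertion",
    (1, 2): "expansion",
    (2, 1): "contraction",
}

def bead_groups(beads, n_src, n_tgt):
    """Walk a bead sequence left to right and return, for each bead,
    the source and target sentence indices it covers. Raises if the
    sequence does not account for every sentence in both texts."""
    i = j = 0
    groups = []
    for s, t in beads:
        if (s, t) not in BEAD_NAMES:
            raise ValueError(f"unsupported bead type {(s, t)}")
        groups.append((list(range(i, i + s)), list(range(j, j + t))))
        i, j = i + s, j + t
    if (i, j) != (n_src, n_tgt):
        raise ValueError("bead sequence does not cover both texts")
    return groups
```

For instance, the sequence match, expansion, deletion over three source and three target sentences yields the groups ([0], [0]), ([1], [1, 2]), ([2], []).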
The common information related to this is the frequency of beads. Table 1.1 shows the frequencies of the types of beads proposed by Gale and Church, 1993 [8].
Table 1.1 Frequency of alignments (Gale and Church, 1993)
Categories: 1-1; 1-0 or 0-1; 2-1 or 1-2; 2-2
Meanwhile, the frequencies reported by Ma, 2006 [13] are illustrated in Table 1.2:
Table 1.2 Frequency of beads (Ma, 2006)
Categories: 1-1; 1-0 or 0-1; 1-2 or 2-1; Others; Total

Table 1.3 also describes these frequencies of bead types in Moore, 2002 [15]:
Table 1.3 Frequency of beads (Moore, 2002)
Categories: 1-1; 1-2; 2-1; 1-0; 0-1; Total
Generally, the frequency of the 1-to-1 bead is the largest of all bead types in almost all corpora, at around 90%, whereas the other types each account for only a few percent.
1.3.3 Applications
Sentence alignment is an important topic in Machine Translation. It is an important first step for Statistical Machine Translation. It is also the first stage in extracting structural and semantic information and deriving statistical parameters from bilingual corpora [17, 20]. Moreover, it is the first step in constructing a probabilistic dictionary (Table 1.4) for use in aligning words in machine translation, or in constructing a bilingual concordance for use in lexicography.
Table 1.4 An entry in a probabilistic dictionary (Gale and Church, 1993)
1.3.4 Challenges
Although this process might seem very easy, it has some important challenges which make the task difficult [9]:
The sentence alignment task is non-trivial because sentences do not always align 1-to-1. At times a single sentence in one language might be translated as two or more sentences in the other language. The input text also affects accuracy: the performance of sentence alignment algorithms decreases significantly when the input data becomes very noisy. Noisy data means that there are more 1-0 and 0-1 alignments in the data. For example, there are 89% 1-1 alignments in the English-French corpus (Gale and Church, 1991), and 1-0 and 0-1 alignments make up only 1.3% of this corpus, whereas in the UN Chinese-English corpus (Ma, 2006) there are 89% 1-1 alignments but 6.4% 1-0 or 0-1 alignments. Although some methods work very well on clean data, their performance degrades quickly as data becomes noisy [13].
In addition, it is difficult to achieve perfectly accurate alignments even if the texts are easy and "clean". For instance, the success of an alignment program may decline dramatically when it is applied to a novel or a philosophical text, even though the same program gives excellent results on a scientific text.
The alignment performance also depends on the languages of the corpus. For example, an algorithm based on cognates (words in language pairs that resemble each other phonetically) is likely to work better for English-French than for English-Hindi because there are fewer cognates for English-Hindi [1].
1.3.5 Algorithms
A sentence alignment program is called "ideal" if it is fast, highly accurate, and requires no special knowledge about the corpus or the two languages [2, 9, 15]. A common requirement for sentence alignment approaches is achieving both high accuracy and minimal consumption of computational resources [2, 9]. Furthermore, a method for sentence alignment should also work in an unsupervised fashion and be language-pair independent in order to be applicable to parallel corpora in any language without requiring a separate training set. A method is unsupervised if it learns an alignment model directly from the data set to be aligned. Meanwhile, language-pair independence means that the approach requires no specific knowledge about the languages of the parallel texts to be aligned.
1.4 Thesis Contents
This section introduces the organization of this thesis, including its objectives, our contributions, and the outline.
1.4.1 Objectives of the Thesis
In this thesis, I report the results of my study of sentence alignment and of the approaches proposed for this task. In particular, I focus on Moore's method (2002), an outstanding method with a number of advantages. I also investigate a new feature, word clustering, which may be applied to this task to improve alignment accuracy. I examine this proposal in experiments and compare the results to those of the baseline method to demonstrate the advantages of my approach.
Trang 171.4.2 Contributions
My main contributions are as follows:
Evaluating methods in sentence alignment and introducing an algorithm that improves Moore's method.

Using a new feature, word clustering, which helps to improve alignment accuracy. This contributes complementary strategies to the sentence alignment problem.
1.4.3 Outline
The rest of the thesis is organized as follows:
Chapter 2 – Related Works
In this chapter, I introduce some recent research on sentence alignment. In order to give a general view of the methods proposed to deal with this problem, an overall presentation of sentence alignment methods is provided. The methods are classified into several types, and each method is presented by describing its algorithm along with related evaluations.
Chapter 3 – Our Approach
This chapter describes the method we propose to improve Moore's sentence alignment method. First, an analysis of Moore's method and evaluations of it are presented. The major content of this chapter is the framework of the proposed method, an algorithm using bilingual word clustering. An example is described in this chapter to illustrate the approach clearly.
Chapter 4 – Experiments
This chapter presents the experiments performed with our approach. The data corpora used in the experiments are presented in full. The results of the experiments, as well as discussions of them, are clearly described to evaluate our approach against the baseline method.
Chapter 5 – Conclusions and Future Works
In this last chapter, the advantages and limitations of my work are summarized in a general conclusion. Besides, some research directions to improve the current model in the future are mentioned.

Finally, references are given listing the published research that my work refers to.
1.5 Summary

This chapter has introduced my research work. I have given background information about parallel corpora and sentence alignment, definitions of the relevant issues, and some initial problems related to sentence alignment algorithms. The alignment terms used in this task have been defined in this chapter. In addition, an outline of my research work in this thesis has been provided, and the proposed future work has been briefly discussed.
CHAPTER TWO
Related Works

2.1 Overview

Section 2.2 provides an overview of sentence alignment approaches, and Section 2.3 discusses some important problems in this task. Section 2.4 introduces and evaluates the primary length-based proposals. Section 2.5 introduces and evaluates proposals of word-correspondence-based approaches. Proposals in hybrid methods, as well as evaluations of each of them, are presented in Section 2.6. There are also some other outstanding approaches to this task, which are introduced in Section 2.7. Section 2.8 concludes this chapter.
There are also some other techniques, such as methods based on the BLEU score, support vector machines, and hidden Markov model classifiers.
2.2.2 Length-based Methods
Length-based approaches model the relationship between the lengths of sentences that are mutual translations. Length is measured in characters or words of a sentence. In these approaches, the semantics of the text are not considered; statistical methods are used instead of the content of the texts. In other words, these methods consider only the lengths of sentences in order to make alignment decisions.

These methods are based on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences (in characters) and the variance of this difference. There are two random variables, l1 and l2, which are the lengths of the two sentences under consideration. It is assumed that these random variables are independent and identically distributed with a normal distribution [8]. Given two parallel texts S (source text) and T (target text), the goal of this task is to find the alignment A with the highest probability.
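To make the length-based score concrete, the following Python sketch computes the standardized length difference and turns it into an alignment cost. This is my own illustration: the constants c = 1 and s2 = 6.8 are the slope and variance values Gale and Church report for their data, and the function names are assumptions, not code from the cited systems.

```python
import math

def delta(len_s, len_t, c=1.0, s2=6.8):
    """Standardized difference between the target length and its
    expectation given the source length (both in characters)."""
    return (len_t - len_s * c) / math.sqrt(len_s * s2)

def length_cost(len_s, len_t):
    """-log of the two-tailed probability of a |delta| at least this
    large under a standard normal distribution: low cost for sentence
    pairs whose lengths fit the model, high cost otherwise."""
    d = abs(delta(len_s, len_t))
    # two-tailed tail probability computed via the error function
    two_tail = 2 * (1 - 0.5 * (1 + math.erf(d / math.sqrt(2))))
    return -math.log(max(two_tail, 1e-12))
```

A 20-character sentence paired with a 21-character one costs far less than pairing it with a 60-character one, which is exactly the signal the dynamic programming search exploits.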
The methods of this type are based solely on the lengths of sentences, so they require almost no prior knowledge. Furthermore, these methods are highly accurate despite their simplicity, and they run at high speed. When aligning texts whose languages are similar or have a high length correlation, such as English, French, and German, these approaches are especially useful and work remarkably well. They also perform fairly well if the input text is clean, as in the Canadian Hansards corpus [3]. The Gale and Church algorithm is still widely used today, for instance to align Europarl (Koehn, 2005).
Nevertheless, these methods are not robust, since they use only sentence length information. They are no longer reliable if there is too much noise in the input bilingual texts. As shown in Chen, 1993 [5], the accuracy of sentence-length-based methods decreases drastically when aligning texts containing small deletions or free translations; they can easily misalign small passages because they ignore word identities. The algorithm of Brown et al. requires corpus-dependent anchor points, while the method proposed by Gale and Church depends on a prior alignment of paragraphs to constrain the search. When aligning texts where the length correlation breaks down, such as the Chinese-English language pair, the performance of length-based algorithms declines quickly.
2.2.3 Word Correspondences Methods
The second approach, which tries to overcome the disadvantages of length-based approaches, is the word-based method, which relies on lexical information from translation lexicons and/or on the recognition of cognates. These methods take into account lexical information about the texts. Most algorithms match content words in one text with their correspondences in the other text and use these matches as anchor points in the sentence alignment task. Words which are translations of each other tend to have similar distributions in the source language and target language texts. Meanwhile, some methods use cognates (words in language pairs that resemble each other phonetically) rather than the content of word pairs to determine beads of sentences.
This type of sentence alignment method is illustrated by some outstanding approaches such as Kay and Röscheisen, 1993 [11], Chen, 1993 [5], Melamed, 1996 [14], and Ma, 2006 [13]. Kay's work has not proved efficient enough to be suitable for large corpora, while Chen constructs a word-to-word translation model during alignment to assess the probability of an alignment. Word correspondence was further developed in IBM Model-1 (Brown et al., 1993) for statistical machine translation. Meanwhile, word correspondence of another kind (geometric correspondence) was proposed for sentence alignment by Melamed, 1996.
These algorithms have higher accuracy than length-based methods. Because they use lexical information from source and translation lexicons rather than only sentence length to determine the translation relationship between sentences in the source text and the target text, these algorithms are usually more robust than the length-based ones.
Nevertheless, algorithms based on a lexicon are slower than those based on sentence length because they require considerably more expensive computation. In addition, they usually depend on cognates or a bilingual lexicon. The method of Chen requires an initial bilingual lexicon; the proposal of Melamed, meanwhile, depends on finding cognates in the two languages to suggest word correspondences.
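As an illustration of the cognate idea, a commonly used heuristic treats two words as cognate candidates when both are reasonably long and share a common prefix. The rule below is a simplification of mine, sketched in Python; real aligners refine it with edit distance or character n-gram overlap.

```python
def cognate_candidates(w_src, w_tgt, prefix_len=4):
    """Heuristic cognate test: both words have at least `prefix_len`
    characters and share that prefix. Deliberately crude: it is meant
    to find anchor points, not to be a linguistic definition."""
    a, b = w_src.lower(), w_tgt.lower()
    return (len(a) >= prefix_len and len(b) >= prefix_len
            and a[:prefix_len] == b[:prefix_len])
```

English "parliament" and French "parlement" pass the test, while unrelated short function words such as "the" and "le" do not, so the matched pairs can serve as anchor points for alignment.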
2.2.4 Hybrid Methods
Sentence length and lexical information are also combined so that the different approaches can complement each other and yield more efficient algorithms.
These approaches are proposed in Moore, 2002; Varga et al., 2005; and Braune and Fraser, 2010. These approaches have two passes, in which a length-based method is used for a first alignment that subsequently serves as training data for a translation model, which is then used in a more complex similarity score. Moore, 2002 proposes a two-phase method that combines sentence length (word count) in the first pass with word correspondences (IBM Model-1) in the second. Varga et al. (2005) also use the hybrid technique, combining sentence length with word correspondences (using a dictionary-based translation model in which the dictionary can be manually expanded). Braune and Fraser, 2010 propose an algorithm similar to Moore's, except that their approach has a technique for building 1-to-many and many-to-1 alignments rather than focusing only on 1-to-1 alignments as Moore's method does.
The hybrid approaches achieve relatively high performance and overcome the limits of the first two kinds of methods while combining their advantages. The approach of Moore, 2002 obtains high precision (the fraction of retrieved sentence pairs that are in fact correct) and computational efficiency. Meanwhile, the algorithm proposed by Varga et al., 2005, which follows the same idea as Moore, 2002, attains a very high recall rate (the fraction of correct sentence pairs that are retrieved by the algorithm).
Nonetheless, there are still weaknesses to be addressed in order to obtain a more efficient sentence alignment algorithm. In Moore's method, the recall rate is rather low, which is especially problematic when aligning parallel corpora with much noise or sparse data. The approach of Varga et al., 2005, meanwhile, achieves a very high recall value but still has a rather low precision rate.
2.3 Some Important Problems
2.3.2 Linguistic Distances
Another parameter which can also affect the performance of sentence alignment algorithms is the linguistic distance between the source language and the target language. Linguistic distance means the extent to which languages differ from each other. For example, English is linguistically "closer" to Western European languages (such as French and German) than it is to East Asian languages (such as Korean and Japanese). There are some measures for assessing linguistic distance, such as the number of cognate words or syntactic features. It is important to recognize that some algorithms may not perform well if they rely on closeness between languages while the languages in question are distant. An obvious example is that a method based on cognates is likely to work better for English-French or English-German than for English-Hindi because there are fewer cognates in English-Hindi; Hindi belongs to the Indo-Aryan branch, whereas English and German belong to the Germanic one.
2.3.3 Searching
Dynamic programming is the technique that most sentence alignment tools use to search for the best path of sentence pairs through a parallel text. This also means that the texts are ordered monotonically, and none of these algorithms is able to extract sentence pairs in crossing positions. Nevertheless, most of these programs benefit from using this technique in searching, and none of them reports weaknesses with it, because a characteristic of translations is that almost all sentences appear in the same order in both source and target texts.
In this respect, algorithms may be confronted with problems of search space size. Thus, pruning strategies to restrict the search space are also an issue that algorithms have to resolve.
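A simple pruning strategy of the sort alluded to here keeps only the dynamic programming cells that lie near the diagonal of the bitext, on the assumption that a sentence's translation appears at roughly the same relative position. The rule below is my own illustrative sketch, not one taken from a specific aligner.

```python
def near_diagonal(i, j, n_src, n_tgt, width=0.1):
    """Keep cell (i, j) only if the relative positions i/n_src and
    j/n_tgt differ by at most `width`; everything else is pruned."""
    return abs(i / n_src - j / n_tgt) <= width

def candidate_cells(n_src, n_tgt, width=0.1):
    """Enumerate the unpruned cells of the (n_src + 1) x (n_tgt + 1)
    dynamic programming table."""
    return [(i, j)
            for i in range(n_src + 1)
            for j in range(n_tgt + 1)
            if near_diagonal(i, j, n_src, n_tgt, width)]
```

With width = 0.1, a 1000 x 1000 table shrinks to roughly a tenth of its cells, while monotonic, roughly diagonal alignments remain reachable.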
2.3.4 Resources
All systems learn their respective models from the parallel text itself. Only some algorithms support the use of external resources, such as Hunalign (Varga et al., 2005) with bilingual dictionaries and Bleualign (Sennrich and Volk, 2010) with existing MT systems.
2.4 Length-based Proposals

2.4.1 Brown et al., 1991

To search for the best alignment, Brown et al. use dynamic programming. This technique requires time quadratic in the length of the texts aligned, so it is not practical to align a large corpus as a single unit. The computation may be reduced dramatically if the bilingual corpus is subdivided into smaller chunks. In this algorithm, the subdivision is performed using anchors. An anchor is a piece of text likely to be present at the same location in both halves of a bilingual corpus. Dynamic programming is first used to align the anchors, and then the technique is applied again to align the text between anchors.
The alignment computation of this algorithm is fast, since it makes no use of the lexical details of the sentences. Therefore, it is practical to apply this method to very large collections of text, especially for language pairs with high length correlation.
2.4.2 Vanilla: Gale and Church, 1993
This algorithm performs sentence alignment based on a statistical model of sentence lengths measured in characters. It uses the fact that longer sentences in one language tend to be translated into longer sentences in another language.
This algorithm is similar to the proposal of Brown et al., except that Brown et al. measure sentence length in words whereas this algorithm measures it in characters. In addition, the algorithm of Brown et al. aligns only a subset of the corpus for further research instead of focusing on entire articles. The work of Gale and Church (1991) supports this promise of wider applicability.
This sentence alignment program has two steps: first paragraphs are aligned, and then sentences within each paragraph are aligned. The authors report that paragraph lengths are highly correlated. Figure 2.1 illustrates this correlation for the language pair English-German.
Figure 2.1 Paragraph length (Gale and Church, 1993).
A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences and the variance of this difference. This score is used in a dynamic programming framework to find the maximum likelihood alignment of sentences. The use of dynamic programming allows the system to consider all possible alignments and find the minimum cost alignment efficiently.
A distance function is defined in a general way to allow for insertions, deletions, substitutions, etc. The function takes four arguments: x1, y1, x2, y2.

Let d(x1, y1; 0, 0) be the cost of substituting x1 with y1,
d(x1, 0; 0, 0) be the cost of deleting x1,
d(0, y1; 0, 0) be the cost of inserting y1,
d(x1, y1; x2, 0) be the cost of contracting x1 and x2 to y1,
d(x1, y1; 0, y2) be the cost of expanding x1 to y1 and y2, and
d(x1, y1; x2, y2) be the cost of merging x1 and x2 and matching them with y1 and y2.
The dynamic programming algorithm is summarized in the following recursion equation.

Let x_i, i = 1, ..., I, be the sentences of one language, and y_j, j = 1, ..., J, be the translations of those sentences in the other language. Let d be the distance function, and let D(i, j) be the minimum distance between sentences x_1, ..., x_i and their translations y_1, ..., y_j under the maximum likelihood alignment. D(i, j) is computed by minimizing over the six cases (substitution, deletion, insertion, contraction, expansion, and merger), which in effect constrain the possible alignments. D(i, j) is defined with the initial condition D(0, 0) = 0 by the recursion:

D(i, j) = min(
    D(i, j-1)   + d(0, y_j; 0, 0),
    D(i-1, j)   + d(x_i, 0; 0, 0),
    D(i-1, j-1) + d(x_i, y_j; 0, 0),
    D(i-1, j-2) + d(x_i, y_{j-1}; 0, y_j),
    D(i-2, j-1) + d(x_{i-1}, y_j; x_i, 0),
    D(i-2, j-2) + d(x_{i-1}, y_{j-1}; x_i, y_j) )

Figure 2.2 Equation in dynamic programming (Gale and Church, 1993)
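The recursion can be sketched in Python. This is a minimal illustration that uses the prior match-category probabilities reported by Gale and Church and takes the per-pair length cost as a parameter; it does not commit to their exact distance function.

```python
import math

# Prior probabilities of the alignment categories from Gale and
# Church (1993); a category's cost is its negative log prior plus
# the length-based cost of the grouped sentences.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def align(src_lens, tgt_lens, length_cost):
    """Minimum-cost alignment over the six categories. src_lens and
    tgt_lens are character counts per sentence; length_cost(l1, l2)
    is a caller-supplied cost function (an assumption here)."""
    I, J = len(src_lens), len(tgt_lens)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj:
                    continue
                l1 = sum(src_lens[i - di:i])
                l2 = sum(tgt_lens[j - dj:j])
                cost = D[i - di][j - dj] - math.log(prior) + length_cost(l1, l2)
                if cost < D[i][j]:
                    D[i][j], back[i][j] = cost, (di, dj)
    # Trace back the best path as a list of (di, dj) categories.
    path, i, j = [], I, J
    while i or j:
        di, dj = back[i][j]
        path.append((di, dj))
        i, j = i - di, j - dj
    return list(reversed(path))
```

With two roughly matching sentence lengths on each side, the minimizer prefers two 1-to-1 beads over a single 2-to-2 bead, because the 1-to-1 prior is much larger.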
This algorithm has the following main characteristics:

Firstly, it is a simple algorithm. The number of characters in the sentences is simply counted, and a dynamic programming model is used to find the correct alignment pairs. Because of this simplicity, many later researchers have integrated this method into their own.
Secondly, this algorithm can be used with any pair of languages because it does not use any lexical information.

Thirdly, it has a low time cost, one of the most important criteria for applying a method to a very large bilingual corpus.
Finally, this is also quite an accurate algorithm, especially when aligning data of language pairs with high length correlation like English-French or English-German.

As Gale and Church report, in comparison with a length-based method that counts words, like the proposal of Brown et al., it is better to use characters rather than words to measure sentence length. The performance of this approach is better since there is less variability in the differences of sentence lengths so measured; using words as units increases the error rate by half. This method performs well at least on related languages. The accuracy of this method also depends on the type of alignment: it gets the best results on 1-to-1 alignments, but it has a high error rate on more difficult alignments. This algorithm is still widely used today to align corpora like the Europarl corpus (Koehn, 2005) and the JRC-Acquis (Steinberger et al., 2006).
2.4.3 Wu, 1994
Wu applies Gale and Church's method to the language pair English-Chinese. In order to improve the accuracy, he utilizes lexical information from translation lexicons and/or through the identification of cognates. The lexical cues used in this method take the form of a small corpus-specific lexicon.
This method is important in two respects:
Firstly, this method has indicated that length-based methods give satisfactory results even between unrelated languages (languages from unrelated families such as English and Chinese), a surprising result.

Secondly, the lexical cues used in this method increase alignment accuracy. This demonstrates the benefit of adding lexical information to a length-based method.
2.5 Word-based Proposals
2.5.1 Kay and Roscheisen, 1993
This algorithm is based on word correspondences and uses an iterative relaxation approach. The iterations start from the assumption that the first and last sentences of the texts align; these are the initial anchors. The steps below are then repeated until most sentences are aligned:
Step 1: Form an envelope of possible alignments
Step 2: Choose word pairs that tend to co-occur in these potential partial alignments
Step 3: Find pairs of source and target sentences which contain many possible lexical correspondences
A set of partial alignments, which will be part of the final result, is induced by using the most reliable of these pairs.
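Step 2 of the loop can be sketched as a single relaxation pass. The data layout (whitespace-tokenized sentence strings) and the co-occurrence threshold are illustrative assumptions, not Kay and Roscheisen's exact formulation.

```python
from collections import Counter

def cooccurring_pairs(envelope, src_sents, tgt_sents, min_count=2):
    """One simplified relaxation step: count how often each
    (source word, target word) pair co-occurs inside the current
    envelope of candidate sentence alignments, and keep the frequent
    pairs as lexical evidence for the next pass."""
    counts = Counter()
    for i, j in envelope:                    # candidate sentence pair
        for w in set(src_sents[i].split()):
            for v in set(tgt_sents[j].split()):
                counts[(w, v)] += 1
    return {pair for pair, c in counts.items() if c >= min_count}
```

Word pairs that recur across several candidate alignments survive the threshold, and the sentences containing many such pairs become the new anchors.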
However, a weakness of this method is that it is not efficient enough to apply to large corpora.
2.5.2 Chen, 1993

Dynamic programming with thresholding is the search strategy of this algorithm. The search is linear in the length of the corpus because of the use of a threshold; as a result, the corpus need not be subdivided into smaller chunks. The algorithm also deals well with deletions: lexical information makes it possible to identify the beginning and end of deletions confidently, which keeps the search strategy robust even for large deletions. With an intelligent threshold, great benefits may be obtained because most alignments are one-to-one. The computation is reduced to a linear one by considering only a subset of all possible alignments.
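The thresholding idea can be sketched as a pruning step over one column of the dynamic programming table; this is an illustrative simplification, not the algorithm's exact search procedure.

```python
def prune(column, threshold):
    """Keep only the DP states in one column whose cost is within
    `threshold` of the best state in that column. Pruning this way
    keeps a small number of live states per column, so the overall
    search is linear in the length of the corpus."""
    best = min(column.values())
    return {state: cost for state, cost in column.items()
            if cost <= best + threshold}
```

A state whose cost already trails the best candidate by more than the threshold is unlikely to lie on the optimal path, so it is discarded before the next column is expanded.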
This method gives better accuracy than the length-based one, but is “tens of times slower than the Brown [3] and Gale [8] algorithms” [5]. Furthermore, it is language independent and can handle large deletions in the text.
This algorithm requires that, for each language pair, 100 sentence pairs be aligned by hand to bootstrap the translation model; it therefore depends on a minimum of human intervention. The method incurs a great computational cost because of its use of lexical information. However, alignment is a one-time cost, its result remains useful afterwards, and computing power is readily available; for these reasons, the computational cost may often be acceptable.
2.5.3 Melamed, 1996
This method is based on word correspondences [14]. A bitext map of words is used to mark the points of correspondence between words in a two-dimensional graph. After all possible points are marked, the true correspondence points are found using some rules, sentence location information, and sentence boundaries.

Melamed uses the term bitext for the two versions of a text, such as a text in two different languages. A bitext is created each time translators translate a text. Figure 2.3 illustrates the rectangular bitext space that each bitext defines: the lengths of the two component texts (in characters) are represented by the width and the height of the rectangle.
Figure 2.3 A bitext space in Melamed’s method (Melamed, 1996).
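Generating candidate correspondence points in the bitext space can be sketched as follows. The matching predicate (for example, identical tokens or cognates) is left as a parameter, and the single-space tokenization is an illustrative assumption.

```python
def candidate_points(src_text, tgt_text, match):
    """Simplified bitext map: x is a token's character position in
    the source text, y the character position of a matching token in
    the target text. `match` decides whether two tokens correspond
    (e.g. identical tokens or cognates)."""
    points = []
    x = 0
    for s_tok in src_text.split():
        y = 0
        for t_tok in tgt_text.split():
            if match(s_tok, t_tok):
                points.append((x, y))
            y += len(t_tok) + 1       # advance past token and space
        x += len(s_tok) + 1
    return points
```

True correspondence points cluster along the main diagonal of the bitext space, which is what the subsequent filtering rules exploit.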
This algorithm has slightly better accuracy than Gale and Church's. It can give almost perfect alignment results if a good bitext map can be formed; this is the power of the method. For popular languages, where a good bitext map is easier to acquire, it may be among the best methods. However, it requires a good bitext map to achieve satisfactory accuracy.
2.5.4 Champollion: Ma, 2006
This algorithm, which is designed for robust alignment of potentially noisy parallel text, is a lexicon-based sentence aligner [13]. It was first developed for aligning English-Chinese parallel text before being ported to other language pairs such as Hindi-English or Arabic-English. In this method, a match is considered possible only if lexical matches are present. Less frequent words are assigned higher weights, since they are a stronger indication that two segments are a match. In order to weed out bogus matches, the algorithm also uses sentence length information.

This method focuses on dealing with noisy data. It improves on existing methods, which work very well on clean data but decline quickly in performance when the data become noisy.
This method also differs from other sentence aligners in several respects:
Noisy data are resources that contain a larger percentage of alignments that are not 1-to-1, with a significant number of deletions and insertions. This method assumes such input data. Sentence length information is unreliable for noisy data, so it plays only a minor role; the method relies primarily on lexical evidence.
Most sentence alignment algorithms treat translated words equally; in other words, when deciding sentence correspondences, an equal weight is assigned to all translated word pairs. This method assigns individual weights to translated words, which distinguishes it from other lexicon-based algorithms.
Translation lexicons are applied in two steps: entries from a translation lexicon are used to identify translated words, and then sentence correspondences are identified using statistics over those translated words.
In this method, assigning greater weights to less frequent translated words helps to increase the robustness of the alignment, especially when dealing with noisy data. It attains high precision and recall rates and may be easy to adapt to new language pairs. However, it requires an externally supplied bilingual lexicon.
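The frequency-sensitive weighting can be sketched with an idf-like formula; this formula is an illustrative assumption standing in for Ma's exact weighting scheme.

```python
import math

def word_weight(word, freq, corpus_size):
    """Rarer translated words are stronger evidence that two segments
    match, so weight falls as corpus frequency rises (idf-style)."""
    return math.log(corpus_size / freq[word])

def segment_score(matched_words, freq, corpus_size):
    """Score a candidate segment pair by summing the weights of the
    translated word pairs found in it."""
    return sum(word_weight(w, freq, corpus_size) for w in matched_words)
```

A rare content word thus contributes far more to a segment pair's score than a ubiquitous function word, which is what makes the aligner robust on noisy data.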
2.6 Hybrid Proposals
2.6.1 Microsoft’s Bilingual Sentence Aligner: Moore, 2002
This algorithm combines length-based and word-based approaches to achieve high accuracy at a modest computational cost [15]. A problem with using lexical information is that it restricts an algorithm to a particular pair of languages. Moore resolves this problem by using a method similar to the IBM translation models to learn lexical statistics from the texts at hand. The method has two passes. Sentence-length statistics are first used to extract training data for the IBM Model 1 translation tables; the sentence-length model is then combined with the acquired lexical statistics to extract 1-to-1 correspondences with high accuracy. A forward-backward computation is used as the search heuristic, in which the forward pass is a pruned dynamic programming procedure.
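The lexical component of the second pass rests on IBM Model 1, which can be sketched with a minimal EM loop. This illustration omits the NULL word and other refinements, so it is a simplification of the model Moore uses, not his implementation.

```python
from collections import defaultdict

def train_model1(pairs, iters=5):
    """Minimal IBM Model 1 EM over (source, target) tokenized
    sentence pairs: estimate translation probabilities t(f | e) by
    repeatedly collecting expected co-occurrence counts and
    renormalizing."""
    t = defaultdict(lambda: 1.0)              # t(f | e), uniform start
    for _ in range(iters):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)   # normalization
                for e in src:
                    c = t[(f, e)] / z             # expected count
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t
```

Even on two tiny sentence pairs, the recurring pair ("le", "the") accumulates more expected counts than spurious pairings, so its probability grows with each iteration.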
This is a highly accurate and language-independent algorithm, and thus a very promising method. It consistently achieves high precision. Furthermore, it is fast in comparison with methods that rely solely on lexical information. Requiring no knowledge of the languages or the corpus is another advantage. While lexical methods are generally more accurate than sentence-length ones, most of them require additional linguistic resources or knowledge. Moore overcomes this issue by training IBM Model 1 on sentence pairs extracted by the initial length-based model, which does not use any extra information. However, using Model 1 also results in rather slow computation; the method therefore considers only alignments close to the initially found alignment, which sometimes restricts the search space. This algorithm focuses only on one-to-one alignments and excludes one-to-many and many-to-many alignments. For two languages with different sentence structuring conventions, this may lead to losing amounts of aligned material, and in that case it may be difficult to maximize alignment recall.
2.6.2 Hunalign: Varga et al., 2005
This method is based on both sentence length and lexical similarity. Generally, it is similar to Moore's approach; however, it uses a crude word-by-word dictionary-based replacement rather than IBM Model 1 as in Moore's proposal. The dictionary in this method can be manually expanded [19].
This algorithm may be described as shown in Figure 2.4. In the first step, a crude translation (T') of the source text (S) is produced by converting each word token into the dictionary translation that has the highest frequency in the target corpus, or into itself in case of lookup failure. This is considered a pseudo target language text. This text (T') is then compared with the actual target text (T) on a sentence-by-sentence basis.
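The crude translation step can be sketched as follows. The dictionary layout (word mapped to a list of candidate translations) and the frequency table are illustrative assumptions, not hunalign's actual data structures.

```python
def pseudo_translate(src_tokens, dictionary, tgt_freq):
    """Hunalign-style crude translation (simplified): replace each
    source token with its dictionary translation that is most
    frequent in the target corpus, falling back to the token itself
    on lookup failure."""
    out = []
    for tok in src_tokens:
        cands = dictionary.get(tok)
        if cands:
            out.append(max(cands, key=lambda w: tgt_freq.get(w, 0)))
        else:
            out.append(tok)            # lookup failure: keep token
    return out
```

The resulting pseudo target text shares the target language's surface vocabulary wherever the dictionary covers the source, which makes the subsequent sentence-by-sentence comparison with T meaningful.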
Figure 2.4 The method of Varga et al., 2005