VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
HAI-LONG TRIEU
BILINGUAL SENTENCE ALIGNMENT
BASED ON SENTENCE LENGTH AND
WORD TRANSLATION
MASTER THESIS OF INFORMATION TECHNOLOGY
Hanoi - 2014
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
HAI-LONG TRIEU
BILINGUAL SENTENCE ALIGNMENT
BASED ON SENTENCE LENGTH AND
WORD TRANSLATION
Major: Computer Science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
SUPERVISOR: PhD Phuong-Thai Nguyen
Hanoi - 2014
ORIGINALITY STATEMENT
I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception, or in style, presentation, and linguistic expression, is acknowledged.
Signed
Acknowledgements
I would like to thank my advisor, PhD Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge I have been given during my Master's course. I would also like to show my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation process. I would like to thank PhD Van-Vinh Nguyen for examining my work and giving advice on it, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh Nguyen for supporting and checking some issues in my research.

In addition, I would like to express my thanks to the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who have taught and helped me during my whole time studying at UET.

Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.
Abstract
Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. In order to apply these abundant materials in useful applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in texts of the source language to their corresponding units in texts of the target language. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including statistical machine translation, word sense disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.

There have been a number of algorithms proposed with different approaches for sentence alignment. However, they may be classified into a few major categories. First of all, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs that have a high similarity in sentence lengths. The second set of methods is based on word correspondences, or lexicons. These methods take into account lexical information about the texts, matching content across the texts or using cognates. An external dictionary may be used in these methods, so they are more accurate but slower than the first ones. There are also methods based on hybrids of these first two approaches that combine their advantages, so they obtain alignments of quite high quality.

In this thesis, I summarize general issues related to sentence alignment, evaluate the approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. From analyzing the limits of this method, I propose an algorithm using a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced by analyzing its framework, and I describe the advantages as well as the weaknesses of this approach. In addition to this, I describe the background knowledge, the algorithm of bilingual word clustering, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, together with evaluations that demonstrate the benefits of the proposed method.

Keywords: sentence alignment, parallel corpora, natural language processing, word clustering
Table of Contents
2.6.1 Microsoft's Bilingual Sentence Aligner: Moore, 2002
2.7 Other Proposals
CHAPTER THREE: Our Approach
3.2 Moore's Approach

List of Figures
Paragraph length (Gale and Church, 1993)
Equation in dynamic programming (Gale and Church, 1993)
A bitext space in Melamed's method (Melamed, 1996)
The method of Varga et al., 2005
The method of Braune and Fraser, 2010
Framework of sentence alignment in our algorithm
An example of Brown's cluster algorithm
English word clustering data
Looking up the probability of a word pair
Looking up in a word cluster
Handling the case where one word is contained in the dictionary
Comparison in Precision
Comparison in Recall

List of Tables
An entry in a probabilistic dictionary (Gale and Church, 1993)
Alignment pairs (Sennrich and Volk, 2010)
Training data
CHAPTER ONE

Introduction

Parallel corpora are useful in many applications such as machine translation, cross-language information retrieval, word sense disambiguation, bilingual lexicography, automatic translation verification, and automatic acquisition of knowledge about translation. Building a parallel corpus, therefore, helps connect the languages under consideration [1, 5, 7, 12-13, 15-16].
Parallel texts, however, are useful only when they are sentence-aligned. A parallel corpus is first collected from various resources, and the translated segments forming it are very large: their size is usually of the order of entire documents, which makes learning word correspondences an ambiguous task. The solution is to reduce the ambiguity by first decreasing the size of the segments within each pair, which is known as the sentence alignment task [7, 12-13, 16].
Sentence alignment is a process that maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 12, 14, 20]. This task constructs a detailed map of the correspondence between a text and its translation (a bitext map) [14]. It is the first stage for statistical machine translation. With aligned sentences, we can perform further analyses such as phrase and word alignment, bilingual terminology extraction, and collocation extraction, as well as other applications [3, 7-9, 17]. Efficient and powerful sentence alignment algorithms, therefore, become increasingly important.
A number of sentence alignment algorithms have been proposed [1, 7, 9, 12, 15, 17, 20]. Some of these algorithms are based on sentence length [3, 8, 20]; some use word correspondences [5, 11, 13-14]; some are hybrids of these two approaches [2, 6, 15, 19]. Additionally, there are also some other outstanding methods for this task [7, 17]. For details of these sentence alignment algorithms, see Sections 2.4, 2.5, 2.6, and 2.7.
I propose an improvement to an effective hybrid algorithm [15] used in sentence alignment. For details of our approach, see Section 3.4. I also carry out experiments to illustrate my research. For details of the corpora used in our experiments, see Section 4.2. For the results and discussion of the experiments, see Sections 4.4 and 4.5.
In the rest of this chapter, I describe some issues related to the sentence alignment task. In addition, I introduce the objectives of the thesis and our contributions. Finally, I describe the structure of this thesis.
1.2 Parallel Corpora
1.2.1 Definitions
Parallel corpora are collections of documents which are translations of each other [16]. Aligned parallel corpora are collections of pairs of sentences where one sentence is a translation of the other [1].
1.2.2 Applications
Bilingual corpora are an essential resource in multilingual natural language processing systems. This resource helps to develop data-driven natural language processing approaches. It also contributes to applying machine learning to machine translation [15-16].
1.2.3 Aligned Parallel Corpora
Once a parallel text is sentence-aligned, it provides the maximum utility [13]. Therefore, the task of aligning parallel corpora is of considerable interest, and a number of approaches have been proposed and developed to resolve this issue.

1.3 Sentence Alignment

This section gives more definitions of "alignment" as well as issues related to it.
Brown et al., 1991 assumed that every parallel corpus can be aligned in terms of a sequence of minimal alignment segments, which they call "beads", in which sentences align 1-to-1, 1-to-2, 2-to-1, 1-to-0, or 0-to-1.
Figure 1.1. A sequence of beads (Brown et al., 1991)
Groups of sentence lengths are circled to show the correct alignment. Each of the groupings is called a bead, and a number indicates the length of each sentence in the bead. In Figure 1.1, "17e" denotes the sentence length (17 words) of an English sentence, and "19f" denotes the sentence length (19 words) of a French sentence. The figure shows a sequence of beads as follows:
• An ef-bead (one English sentence aligned with one French sentence), followed by
• An eff-bead (one English sentence aligned with two French sentences), followed by
• An e-bead (one English sentence on its own), followed by
• A ¶ef-bead (one English paragraph aligned with one French paragraph)
An alignment, then, is simply a sequence of beads that accounts for the observed sequences of sentence lengths and paragraph markers [3]
There are quite a number of bead types, but it is possible to consider only some of them, including 1-to-1 (one sentence of the source language aligned with one sentence of the target language), 1-to-2 (one sentence of the source language aligned with two sentences of the target language), etc. Brown et al., 1991 [3] consider the beads 1-to-1, 1-to-0, 0-to-1, 1-to-2, and 2-to-1, together with paragraph beads (¶e, ¶f, ¶ef), because their method considers alignments by paragraphs. Moore, 2002 [15] only considers five of these beads, 1-to-1, 1-to-0, 0-to-1, 1-to-2, and 2-to-1, each of which is named as follows:
• 1-to-1 bead (a match)
• 1-to-0 bead (a deletion)
• 0-to-1 bead (an insertion)
• 1-to-2 bead (an expansion)
• 2-to-1 bead (a contraction)
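As a small illustration only (mine, not the thesis's), these five bead types can be represented together with prior probabilities. A minimal Python sketch follows; the numeric values are placeholders in the spirit of the frequency tables below, not statistics from any particular corpus.

# Illustrative only: bead types keyed by (source sentences, target sentences),
# with placeholder prior probabilities reflecting that 1-to-1 beads dominate.
BEAD_PRIORS = {
    (1, 1): 0.94,  # match
    (1, 0): 0.01,  # deletion
    (0, 1): 0.01,  # insertion
    (1, 2): 0.02,  # expansion
    (2, 1): 0.02,  # contraction
}

def bead_prior(src_count: int, tgt_count: int) -> float:
    """Prior probability of a bead type; 0.0 for unsupported configurations."""
    return BEAD_PRIORS.get((src_count, tgt_count), 0.0)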
The common information related to these bead types is their frequency. Table 1.1 shows the frequencies of the bead types reported by Gale and Church, 1993 [8].
Table 1.1. Frequency of alignments (Gale and Church, 1993)

Category        Prob(match)
1-1             0.89
1-0 or 0-1      0.0099
2-1 or 1-2      0.089
Meanwhile, the frequencies reported by Ma, 2006 [13] are illustrated in Table 1.2:

Table 1.2. Frequency of beads (Ma, 2006)
Table 1.3 also describes these bead-type frequencies as reported in Moore, 2002 [15].

Table 1.3. Frequency of beads (Moore, 2002)
Generally, the frequency of the 1-to-1 bead is the largest among all bead types in almost all corpora, at around 90%, whereas the other types each account for only a few percent.

1.3.3 Applications
Sentence alignment is an important topic in machine translation and an important first step for statistical machine translation. It is also the first stage in extracting structural and semantic information and deriving statistical parameters from bilingual corpora [17, 20]. Moreover, it is the first step in constructing a probabilistic dictionary (Table 1.4) for use in aligning words in machine translation, or in constructing a bilingual concordance.

Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993)

English    French
the        les
the        se
the        il
the        de
the        a
the        que
1.3.4 Challenges

Although this process might seem very easy, it has some important challenges which make the task difficult [9]:

The sentence alignment task is non-trivial because sentences do not always align 1-to-1. At times, a single sentence in one language might be translated as two or more sentences in the other language. The input text also affects the accuracy: the performance of sentence alignment algorithms decreases significantly when the input data becomes very noisy. Noisy data means that there are more 1-0 and 0-1 alignments in the data. For example, there are 89% 1-1 alignments in an English-French corpus (Gale and Church, 1991), and 1-0 and 0-1 alignments make up only 1.3% of that corpus, whereas in the UN Chinese-English corpus (Ma, 2006) there are 89% 1-1 alignments but 6.4% 1-0 or 0-1 alignments. Although some methods work very well on clean data, their performance goes down quickly as the data becomes noisy [13].
In addition, it is difficult to achieve perfectly accurate alignments even if the texts are easy and "clean". For instance, the success of an alignment program may decline dramatically when it is applied to a novel or a philosophical text, even though the same program gives excellent results when applied to a scientific text.
Alignment performance also depends on the languages of the corpus. For example, an algorithm based on cognates (words in language pairs that resemble each other phonetically) is likely to work better for English-French than for English-Hindi, because there are fewer cognates for English-Hindi [1].
1.3.5 Algorithms

A sentence alignment program is called "ideal" if it is fast, highly accurate, and requires no special knowledge about the corpus or the two languages [2, 9, 15]. A common requirement for sentence alignment approaches is the achievement of both high accuracy and minimal consumption of computational resources [2, 9]. Furthermore, a method for sentence alignment should also work in an unsupervised fashion and be language-pair independent in order to be applicable to parallel corpora in any language without requiring a separate training set. A method is unsupervised if it induces its alignment model directly from the data set to be aligned. Meanwhile, language-pair independence means that the approach requires no specific knowledge about the languages of the parallel texts to align.
1.4 Thesis Contents

This section introduces the organization of the contents of this thesis, including the objectives, our contributions, and the outline.
1.4.1 Objectives of the Thesis
In this thesis, I report the results of my study of sentence alignment and of the approaches proposed for this task. Especially, I focus on Moore's method (2002), an outstanding method with a number of advantages. I also explore a new feature, word clustering, which may be applied to this task to improve alignment accuracy. I examine this proposal in experiments and compare the results with those of the baseline method to demonstrate the advantages of my approach.
1.4.2 Contributions

My main contributions are as follows:

• Evaluating methods for sentence alignment and introducing an algorithm that improves Moore's method.
• Using a new feature, word clustering, which helps to improve alignment accuracy. This contributes to complementing existing strategies for the sentence alignment problem.

1.4.3 Outline
The rest of the thesis is organized as follows:
Chapter 2 — Related Works
In this chapter, I introduce some recent research on sentence alignment. In order to give a general view of the methods proposed to deal with this problem, an overall presentation of sentence alignment methods is provided. The methods are classified into several types, and each method is presented by describing its algorithm along with evaluations related to it.
Chapter 3— Our Approach
This chapter describes the method we propose for sentence alignment to improve Moore's method. First, an analysis of Moore's method and evaluations of it are presented. The major content of this chapter is the framework of the proposed method, an algorithm using bilingual word clustering. An example is described to illustrate the approach clearly.
Chapter 4— Experiments
This chapter presents the experiments performed with our approach. The data corpora used in the experiments are described in full. The results of the experiments, as well as discussions of them, are clearly described in order to evaluate our approach against the baseline method.
Chapter 5—Conclusions and Future Works
In this last chapter, the advantages and limitations of my work are summarized in a general conclusion. Besides, some research directions are mentioned for improving the current model in the future.

Finally, references are given to the published research that my work relies on.
1.5 Summary

This chapter has introduced my research work. I have given background information about parallel corpora and sentence alignment, definitions of the relevant issues, and some initial problems related to sentence alignment algorithms. The alignment terms used in this task have been defined in this chapter. In addition, an outline of my research work in this thesis has been provided, and a discussion of proposed future work is also presented.
CHAPTER TWO
Related Works
2.1 Overview
This chapter is an introduction to some research on sentence alignment in recent years, together with some evaluations of these approaches. A number of problems related to this work are also discussed: factors that affect the performance of alignment algorithms, searching, and the resources required by each method. Evaluations of each algorithm are given to provide a general view of its advantages as well as its weaknesses.
Section 2.2 provides an overview of sentence alignment approaches, and Section 2.3 discusses some important problems that affect them. Section 2.4 introduces and evaluates the primary length-based proposals. Section 2.5 introduces and evaluates word-correspondence-based proposals. Hybrid proposals, together with evaluations of each of them, are presented in Section 2.6. There are also some other outstanding approaches to this task, which are introduced in Section 2.7, and a summary concludes the chapter.
2.2 Overview of Approaches
2.2.1 Classification
From the first approaches proposed in the 1990s, there have been a number of publications on sentence alignment using different techniques.

Among the various sentence alignment algorithms that have been proposed, there are three widespread approaches, based respectively on a comparison of sentence lengths, on lexical correspondence, and on a combination of these first two methods.
There are also some other techniques, such as methods based on the BLEU score, support vector machines, and hidden Markov model classifiers.
2.2.2 Length-based Methods
Length-based approaches model the relationship between the lengths of sentences that are mutual translations. Length is measured in characters or words. In these approaches, the semantics of the text are not considered; statistical methods are used instead of the content of the texts. In other words, these methods only consider the lengths of sentences in order to make the alignment decision.
These methods are based on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences (in characters) and the variance of this difference. There are two random variables, l1 and l2, which are the lengths of the two sentences under consideration. It is assumed that these random variables are independent and identically distributed with a normal distribution [8].
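For concreteness, the length score Gale and Church derive from this model can be written as follows. This is a reconstruction from the literature rather than an equation reproduced in this chapter; l1 and l2 are the character lengths of the two sentences, c is the expected number of target-language characters per source-language character, and s^2 is the variance of that ratio:

\delta(l_1, l_2) = \frac{l_2 - l_1 c}{\sqrt{l_1 s^2}}

Under the normality assumption, delta is approximately standard normal for true translation pairs, and the cost assigned to a candidate correspondence is essentially -log P(match | delta).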
Given two parallel texts ST (source text) and TT (target text), the goal of this task is to find the alignment A with the highest probability:

A* = \arg\max_A P(A, ST, TT)
In order to estimate this probability, the aligned text is decomposed into a sequence of aligned sentence beads, where each bead is assumed to be independent of the others.
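Under this independence assumption, the probability factors over beads; written here as a reconstruction of the standard formulation:

P(A, ST, TT) \approx \prod_{b \in A} P(b)

so the most probable alignment can be found by minimizing the total cost \sum_{b \in A} -\log P(b) with dynamic programming.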
Algorithms of this type were first proposed by Brown et al., 1991 and Gale and Church, 1993. These approaches use sentence-length statistics in order to model the relationship between groups of sentences that are translations of each other. Wu (Wu, 1994) also uses the length-based method, applying the algorithm proposed by Gale and Church, and further uses lexical cues from a corpus-specific bilingual lexicon to improve alignment.

The methods of this type are based solely on the lengths of sentences, so they require almost no prior knowledge. Furthermore, these methods are highly accurate despite their simplicity, and they also run at high speed. When aligning texts whose languages are similar or have a high length correlation, such as English, French, and German, these approaches are especially useful and work remarkably well. They also perform fairly well if the input text is clean, such as the Canadian Hansards corpus [3]. The Gale and Church algorithm is still widely used today, for instance to align Europarl (Koehn, 2005).
Nevertheless, these methods are not robust, since they only use sentence-length information. They are no longer reliable if there is too much noise in the input bilingual texts. As shown in (Chen, 1993) [5], the accuracy of sentence-length-based methods decreases drastically when aligning texts containing small deletions or free translations; they can easily misalign small passages because they ignore word identities. The algorithm of Brown et al. requires corpus-dependent anchor points, while the method proposed by Gale and Church depends on a prior alignment of paragraphs to constrain the search. When aligning texts where the length correlation breaks down, such as the Chinese-English language pair, the performance of length-based algorithms declines quickly.
2.2.3 Word-Correspondence Methods
The second approach, one that tries to overcome the disadvantages of length-based approaches, is the word-based method, which relies on lexical information from translation lexicons and/or on the recognition of cognates. These methods take into account lexical information about the texts. Most algorithms match content words in one text with their correspondences in the other text and use these matches as anchor points in the sentence alignment task. Words which are translations of each other tend to have similar distributions in the source-language and target-language texts. Meanwhile, some methods use cognates (words in language pairs that resemble each other phonetically) rather than the content of word pairs to determine sentence beads.

This type of sentence alignment method is illustrated by some outstanding approaches such as Kay and Röscheisen, 1993 [11], Chen, 1993 [5], Melamed, 1996 [14], and Ma, 2006 [13]. Kay's work has not proved efficient enough to be suitable for large corpora, while Chen constructs a word-to-word translation model during alignment to assess the probability of an alignment. Word correspondence was further developed in IBM Model 1 (Brown et al., 1993) for statistical machine translation. Meanwhile, word correspondence used in another way (geometric correspondence) for sentence alignment was proposed by Melamed, 1996.
These algorithms have higher accuracy in comparison with length-based methods. Because they use lexical information from source texts and translation lexicons rather than only sentence length to determine the translation relationship between sentences in the source text and the target text, these algorithms are usually more robust than the length-based algorithms.
Nevertheless, algorithms based on a lexicon are slower than those based on sentence length because they require considerably more expensive computation. In addition, they usually depend on cognates or a bilingual lexicon. The method of Chen requires an initial bilingual lexicon; the proposal of Melamed, meanwhile, depends on finding cognates in the two languages to suggest word correspondences.
2.2.4 Hybrid Methods
Sentence length and lexical information can also be combined so that the different approaches complement each other and yield more efficient algorithms.

Such approaches are proposed in Moore, 2002, Varga et al., 2005, and Braune and Fraser, 2010. These approaches have two passes, in which a length-based method is used for a first alignment that subsequently serves as training data for a translation model, which is then used in a more complex similarity score. Moore, 2002 proposes a two-phase method that combines sentence length (word count) in the first pass and word correspondences (IBM Model 1) in the second one. Varga et al. (2005) also use the hybrid technique in sentence alignment by combining sentence length with word correspondences (using a dictionary-based translation model in which the dictionary can be manually expanded). Braune and Fraser, 2010 also propose an algorithm similar to Moore's, except that this approach has a technique to build 1-to-many and many-to-1 alignments rather than focusing only on 1-to-1 alignments as Moore's method does.
The hybrid approaches achieve relatively high performance and overcome the limits of the first two families of methods while combining their advantages. The approach of Moore, 2002 obtains high precision (the fraction of retrieved alignments that are in fact correct) and computational efficiency. Meanwhile, the algorithm proposed by Varga et al., 2005, which follows the same idea as Moore, 2002, attains a very high recall rate (the fraction of correct alignments that are retrieved by the algorithm).

Nonetheless, there are still weaknesses which should be handled in order to obtain a more efficient sentence alignment algorithm. In Moore's method, the recall rate is rather low, and this is especially problematic when aligning parallel corpora with much noise or sparse data. The approach of Varga et al., 2005, meanwhile, achieves a very high recall value; however, it still has a rather low precision rate.
2.3 Some Important Problems
2.3.2 Linguistic Distances
Another parameter which can also affect the performance of sentence alignment algorithms is the linguistic distance between the source language and the target language. Linguistic distance means the extent to which languages differ from each other. For example, English is linguistically "closer" to Western European languages (such as French and German) than it is to East Asian languages (such as Korean and Japanese). There are some measures to assess this linguistic distance, such as the number of cognate words and syntactic features. It is important to recognize that some algorithms may not perform so well if they rely on the closeness between languages while these languages are distant. An obvious example of this is that a method based on cognates is likely to work better for English-French or English-German than for English-Hindi because of the fewer cognates in English-Hindi. Hindi belongs to the Indo-Aryan branch, whereas English and German belong to the Germanic one.
2.3.3 Searching
Dynamic programming is the technique that most sentence alignment tools use to search for the best path of sentence pairs through a parallel text. This also means that the texts are assumed to be ordered monotonically, and none of these algorithms is able to extract sentence pairs in crossing positions. Nevertheless, most of these programs benefit from this search technique, and none of them reports weaknesses with it, because a characteristic of translations is that almost all sentences occur in the same order in both the source and target texts.

In this respect, algorithms may be confronted with problems of search-space size. Thus, pruning strategies to restrict the search space are also an issue that algorithms have to resolve.
2.4 Length-based Proposals

2.4.1 Brown et al., 1991

To perform the search for the best alignment, Brown et al. use dynamic programming. This technique requires time quadratic in the length of the texts being aligned, so it is not practical to align a large corpus as a single unit. The computation of the search may be reduced dramatically if the bilingual corpus is subdivided into smaller chunks. This subdivision is performed by using anchors. An anchor is a piece of text likely to be present at the same location in both halves of a bilingual corpus. Dynamic programming is first used to align the anchors, and then the technique is applied again to align the text between anchors.

The alignment computation of this algorithm is fast, since it makes no use of the lexical details of the sentences. Therefore, it is practical to apply this method to very large collections of text, especially for language pairs with a high length correlation.
2.4.2 Vanilla: Gale and Church, 1993
This algorithm performs sentence alignment based on a statistical model of sentence lengths measured in characters. It uses the fact that longer sentences in one language tend to be translated into longer sentences in another language.

This algorithm is similar to the proposal of Brown et al., except that Brown et al. measure sentence length in words whereas this algorithm measures it in characters. In addition, the algorithm of Brown et al. aligns a subset of the corpus for further research instead of focusing on entire articles. The work of Gale and Church (1991) supports this promise of wider applicability.

This sentence alignment program has two steps: first, paragraphs are aligned, and then sentences within each paragraph are aligned. The authors report that paragraph lengths are highly correlated; Figure 2.1 illustrates this correlation for the language pair English-German.
Figure 2.1. Paragraph length (Gale and Church, 1993)
A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences and the variance of this difference. This score is used in a dynamic programming framework to find the maximum-likelihood alignment of sentences. The use of dynamic programming allows the system to consider all possible alignments and find the minimum-cost alignment efficiently.
A distance function d is defined in a general way to allow for insertions, deletions, substitutions, etc. The function takes four arguments: x1, y1, x2, y2.

Let d(x1, y1; 0, 0) be the cost of substituting x1 with y1,
d(x1, 0; 0, 0) be the cost of deleting x1,
d(0, y1; 0, 0) be the cost of inserting y1,
d(x1, y1; x2, 0) be the cost of contracting x1 and x2 to y1,
d(x1, y1; 0, y2) be the cost of expanding x1 to y1 and y2, and
d(x1, y1; x2, y2) be the cost of merging x1 and x2 and matching them with y1 and y2.
The dynamic programming algorithm is summarized in the following recursion equation.

Let s_i, i = 1, ..., I, be the sentences of one language, and t_j, j = 1, ..., J, be the translations of those sentences in the other language. Let d be the distance function defined above, and let D(i, j) be the minimum distance between the sentences s_1, ..., s_i and their translations t_1, ..., t_j under the maximum-likelihood alignment. D(i, j) is computed recursively by minimizing over six cases (substitution, deletion, insertion, contraction, expansion, and merger), which, in effect, impose a set of slope constraints. The recursion uses the initial condition D(0, 0) = 0.
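The recursion itself does not survive legibly in the source text, so the following is a reconstruction of Gale and Church's recurrence in the notation just defined:

D(i, j) = \min \begin{cases}
D(i, j-1) + d(0, t_j; 0, 0) & \text{(insertion)} \\
D(i-1, j) + d(s_i, 0; 0, 0) & \text{(deletion)} \\
D(i-1, j-1) + d(s_i, t_j; 0, 0) & \text{(substitution)} \\
D(i-1, j-2) + d(s_i, t_{j-1}; 0, t_j) & \text{(expansion)} \\
D(i-2, j-1) + d(s_{i-1}, t_j; s_i, 0) & \text{(contraction)} \\
D(i-2, j-2) + d(s_{i-1}, t_{j-1}; s_i, t_j) & \text{(merger)}
\end{cases}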
This algorithm has some main characteristics, as follows:

• Firstly, it is a simple algorithm. The number of characters in the sentences is simply counted, and a dynamic programming model is used to find the correct alignment pairs. Many later researchers have integrated this method into their own because of this simplicity.
• Secondly, the algorithm can be used for any pair of languages because it does not use any lexical information.
• Thirdly, it has a low time cost, one of the most important criteria for applying a method to a very large bilingual corpus.
• Finally, it is also quite an accurate algorithm, especially when aligning data of language pairs with a high correlation, such as English-French or English-German.
As Gale and Church report, in comparison with a length-based method that counts words, such as the proposal of Brown et al., it is better to use characters rather than words to measure sentence length. The performance is better because there is less variability in the differences of sentence lengths so measured; using words as units increases the error rate by half. This method performs well at least on related languages. Its accuracy also depends on the type of alignment: it obtains the best results on 1-to-1 alignments but has a high error rate on more difficult alignments. The algorithm is still widely used today to align corpora such as the Europarl corpus (Koehn, 2005) and the JRC-Acquis (Steinberger et al., 2006).
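To make the preceding description concrete, here is a minimal Python sketch (my illustration, not code from the thesis or from Gale and Church's implementation). It aligns two lists of sentences with the six-case recursion above, but the match-probability model is reduced to a simple squared, scaled character-length difference and illustrative bead costs for brevity.

import math

# Illustrative -log prior costs for bead types (source count, target count).
BEAD_COST = {(1, 1): 0.0, (1, 0): 4.6, (0, 1): 4.6,
             (1, 2): 2.4, (2, 1): 2.4, (2, 2): 4.5}

def length_cost(src_chars: int, tgt_chars: int, c: float = 1.0, s2: float = 6.8) -> float:
    """Crude stand-in for -log P(delta): squared, scaled length difference."""
    delta = (tgt_chars - src_chars * c) / math.sqrt(max(src_chars, 1) * s2)
    return delta * delta

def align(src: list, tgt: list) -> list:
    """Return beads as (source sentence indices, target sentence indices) tuples."""
    I, J = len(src), len(tgt)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    slen, tlen = [len(s) for s in src], [len(t) for t in tgt]
    for i in range(I + 1):
        for j in range(J + 1):
            if D[i][j] == INF:
                continue
            for (di, dj), prior in BEAD_COST.items():
                ni, nj = i + di, j + dj
                if ni > I or nj > J:
                    continue
                cost = D[i][j] + prior + length_cost(sum(slen[i:ni]), sum(tlen[j:nj]))
                if cost < D[ni][nj]:
                    D[ni][nj], back[ni][nj] = cost, (i, j)
    beads, i, j = [], I, J                      # trace back the best path
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((tuple(range(pi, i)), tuple(range(pj, j))))
        i, j = pi, pj
    return list(reversed(beads))

For example, align(["Hello world.", "How are you?"], ["Bonjour le monde.", "Comment allez-vous ?"]) returns [((0,), (0,)), ((1,), (1,))], i.e., two 1-to-1 beads.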
2.4.3 Wu, 1994

Wu applies Gale and Church's method to the language pair English-Chinese. In order to improve the accuracy, he utilizes lexical information from translation lexicons and/or from the identification of cognates. The lexical cues used in this method take the form of a small corpus-specific lexicon.
This method is important in two respects:

• Firstly, it indicated that length-based methods give satisfactory results even between unrelated languages (languages from unrelated families, such as English and Chinese), a surprising result.
• Secondly, the lexical cues used in this method increase the accuracy of alignment. This demonstrates the effect of lexical information on accuracy when it is added to a length-based method.
2.5 Word-based Proposals
2.5.1 Kay and Röscheisen, 1993
This algorithm is based on word correspondences and follows an iterative relaxation approach. The iterations start from the assumption that the first and last sentences of the two texts align; these are the initial anchors. The steps below are then repeated until most sentences are aligned (sketched schematically after the list):

• Step 1: Form an envelope of possible alignments.
• Step 2: Choose word pairs that tend to co-occur in these potential partial alignments.
• Step 3: Find pairs of source and target sentences which contain many possible lexical correspondences.

A set of partial alignments which will be part of the final result is deduced by using the most reliable of these pairs.
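The loop below is only my schematic rendering of this procedure; the envelope construction, the co-occurrence statistic, and the thresholds are deliberate simplifications and not the formulas of Kay and Röscheisen.

from bisect import bisect_left
from collections import Counter

def kay_roscheisen_sketch(src_sents, tgt_sents, iterations=3, window=3, min_links=3):
    """Rough sketch of iterative relaxation over tokenized sentences (lists of words)."""
    n, m = len(src_sents), len(tgt_sents)
    anchors = {0: 0, n - 1: m - 1}                      # assume the two ends align
    for _ in range(iterations):
        keys = sorted(anchors)

        def expected(i):                                # interpolate between anchors
            k = min(bisect_left(keys, i), len(keys) - 1)
            lo, hi = keys[max(k - 1, 0)], keys[k]
            if hi == lo:
                return float(anchors[lo])
            return anchors[lo] + (i - lo) / (hi - lo) * (anchors[hi] - anchors[lo])

        # Step 1: envelope of candidate sentence pairs around the anchor path.
        envelope = [(i, j) for i in range(n)
                    for j in range(max(0, int(expected(i)) - window),
                                   min(m, int(expected(i)) + window + 1))]
        # Step 2: word pairs that co-occur often inside the envelope.
        cooc = Counter((ws, wt) for i, j in envelope
                       for ws in set(src_sents[i]) for wt in set(tgt_sents[j]))
        # Step 3: sentence pairs linked by many frequent word pairs become anchors.
        for i, j in envelope:
            links = sum(cooc[(ws, wt)] >= min_links
                        for ws in set(src_sents[i]) for wt in set(tgt_sents[j]))
            if links >= min_links:
                anchors[i] = j
    return dict(sorted(anchors.items()))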
However, a weakness of this method is that it is not efficient enough to apply to large corpora
2.5.2 Chen, 1993

Dynamic programming with thresholding is the search strategy of this algorithm. The search is linear in the length of the corpus because of the use of the threshold; as a result, the corpus need not be subdivided into smaller chunks. The method also deals well with deletions: the search strategy remains robust despite large deletions, because the beginning and end of a deletion can be identified confidently thanks to the lexical information. With an intelligent threshold, great benefits are obtained, since most alignments are one-to-one. The computation of this algorithm is reduced to a linear one because only a subset of all possible alignments is considered, as illustrated below.
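As an illustration of the pruning idea only (my sketch, not Chen's implementation), a thresholded search can discard partial alignments whose cost falls too far behind the best partial alignment found so far:

def prune(partial_hypotheses: dict, threshold: float = 5.0) -> dict:
    """Keep only partial alignments whose cost is within `threshold` of the best.

    `partial_hypotheses` maps a state (i, j), the numbers of source and target
    sentences consumed so far, to the cost of the best path reaching that state.
    """
    best = min(partial_hypotheses.values())
    return {state: cost for state, cost in partial_hypotheses.items()
            if cost - best <= threshold}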
This method gives better accuracy than the length-based ones, but is "tens of times slower than the Brown [3] and Gale [8] algorithms" [5]. Furthermore, it is language independent and can handle large deletions in the text.

The algorithm requires that, for each language pair, about 100 sentence pairs be aligned by hand to bootstrap the translation model, so it depends on a minimum of human intervention. The method has a great computational cost because of its use of lexical information. However, alignment is a one-time cost, the aligned corpus may be very useful after it is produced, and computing power is increasingly available. For these reasons, the computational cost may sometimes be acceptable.
2.5.3 Melamed, 1996
This method is based on word correspondences [14]. A bitext map of words is used to mark the points of correspondence between these words in a two-dimensional graph. After marking all possible points, the true correspondence points in the graph are found by some rules, sentence location information, and sentence boundaries.

Melamed uses the term bitext, which comprises two versions of a text, such as a text in two different languages. A bitext is created each time translators translate a text. Figure 2.3 illustrates the rectangular bitext space that each bitext defines. The lengths of the two component texts (in characters) are represented by the width and the height of the rectangle.
Figure 2.3. A bitext space in Melamed's method (Melamed, 1996); the x-axis is the character position in text 1.
This algorithm has slightly better accuracy than Gale and Church's. It can give almost perfect alignment results if a good bitext map can be formed, and this is the strength of the method. For popular languages, where a good bitext map can be acquired, it may be the best choice.

However, this method requires a good bitext map in order to achieve satisfactory accuracy.
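As a toy illustration only (not Melamed's SIMR algorithm), candidate points in the bitext space can be generated from tokens that look like cognates, recording the character position of each matching token in the two texts:

def candidate_points(text1: str, text2: str, min_len: int = 4, prefix: int = 4):
    """Return (x, y) character positions of token pairs sharing a long prefix.

    A crude cognate heuristic: tokens of at least `min_len` characters whose
    first `prefix` characters match are treated as likely correspondences.
    """
    def token_positions(text):
        pos, out = 0, []
        for tok in text.split():
            idx = text.index(tok, pos)
            out.append((tok.lower(), idx))
            pos = idx + len(tok)
        return out

    p1, p2 = token_positions(text1), token_positions(text2)
    return [(x, y) for tok1, x in p1 for tok2, y in p2
            if len(tok1) >= min_len and len(tok2) >= min_len
            and tok1[:prefix] == tok2[:prefix]]

A usable bitext map would then be obtained by filtering these noisy points using rules and sentence-boundary information, as described above.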
2.5.4 Champollion: Ma, 2006
This algorithm, which is designed for robust alignment of potentially noisy parallel text, is a lexicon-based sentence aligner [13]. It was first developed for aligning Chinese-English parallel text before being ported to other language pairs such as Hindi-English or Arabic-English. In this method, a match is considered possible only if lexical matches are present. As a stronger indication that two segments are a match, higher weights are assigned to less frequent words. In order to weed out bogus matches, the algorithm also uses sentence-length information.

This method focuses on dealing with noisy data. It overcomes the limitation of existing methods, which work very well on clean data but decline quickly in performance when the data becomes noisy.
There are also some characteristics that distinguish this method from other sentence aligners:

• Noisy data are resources that contain a larger percentage of alignments which are not 1-to-1; they have a significant number of deletions and insertions. This method assumes such input data. Because sentence-length information is unreliable for noisy data, it plays only a minor role here, and the method is based mainly on lexical evidence.
• Translated words are treated equally in most sentence alignment algorithms; in other words, when a method decides sentence correspondences, an equal weight is assigned to translated word pairs. This method assigns different weights to translated words, which makes it different from other lexicon-based algorithms.
• There are two steps in applying translation lexicons in sentence aligners: entries from a translation lexicon are used to identify translated words, and then sentence correspondences are identified by using statistics over the translated words.
In this method, assigning greater weights to less frequent translated words helps to increase the robustness of the alignment, especially when dealing with noisy data. It achieves high precision and recall rates, and it is easy to use for new language pairs. However, it requires an externally supplied bilingual lexicon.
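To illustrate the weighting idea only (my own simplification, not Champollion's actual formula), a similarity score between two segments could weight each matched lexicon pair by the rarity of the target word involved:

import math

def weighted_match_score(src_segment, tgt_segment, lexicon, tgt_word_counts, total_tgt_words):
    """Sum of rarity weights over lexicon pairs linking two tokenized segments.

    lexicon: set of (source_word, target_word) translation pairs.
    tgt_word_counts: dict mapping target words to corpus frequencies.
    """
    score = 0.0
    tgt_words = set(tgt_segment)
    for s in set(src_segment):
        for t in tgt_words:
            if (s, t) in lexicon:
                # Rare target words contribute more, an idf-style weight.
                score += math.log(total_tgt_words / (1 + tgt_word_counts.get(t, 0)))
    return score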
2.6 Hybrid Proposals
2.6.1 Microsoft's Bilingual Sentence Aligner: Moore, 2002
This algorithm combines length-based and word-based approaches to achieve high accuracy at a modest computational cost [15]. A problem with using lexical information is that it limits an algorithm to a particular pair of languages. Moore resolves this problem by using a method similar to the IBM translation models, trained on the texts at hand. The method has two passes. Sentence-length-based statistics are first used to extract training data for the IBM Model 1 translation tables; the sentence-length model is then combined with the acquired lexical statistics in order to extract 1-to-1 correspondences with high accuracy. A forward-backward computation is used as the search heuristic, in which the forward pass is a pruned dynamic programming procedure.
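The heart of the second pass is a translation table trained on the sentence pairs that the length-based first pass aligned with high confidence. A minimal EM training loop for such an IBM Model 1 table might look like the following Python sketch; this is my illustration, not code from Moore's aligner:

from collections import defaultdict

def train_ibm_model1(sentence_pairs, iterations=5):
    """Minimal IBM Model 1 EM training.

    sentence_pairs: list of (source_tokens, target_tokens) pairs drawn from the
    high-confidence output of the length-based first pass.
    Returns t[(f, e)] = P(f | e), with "NULL" standing for the empty source word.
    """
    t = defaultdict(lambda: 1e-4)                 # near-uniform initialization
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for src, tgt in sentence_pairs:
            src = ["NULL"] + list(src)
            for f in tgt:
                z = sum(t[(f, e)] for e in src)   # normalization for word f
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

In the full aligner, the forward-backward search of the second pass then scores each candidate 1-to-1 bead by combining the length-based probability with the Model 1 probability of the sentence pair.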
This is a highly accurate and language-independent algorithm, so it is a very promising method. It consistently achieves high precision. Furthermore, it is fast in comparison with methods that use solely lexical information. Requiring no knowledge of the languages or the corpus is another advantage of this method. While lexical methods are generally more accurate than sentence-length-based ones, most of them require additional linguistic resources or knowledge. Moore has tried to overcome this issue by using IBM Model 1 trained on the texts being aligned themselves.