This approach usually uses lexicon translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts.
from the Web
by
Le Quang Hung
Faculty of Information Technology
University of Engineering and Technology Vietnam National University, Hanoi
Supervised by
Dr Le Anh Cuong
A thesis submitted in fulfillment of the requirements for
the degree of Master of Information Technology
December, 2010
Contents
1.1 Parallel corpus and its role 1
3.1.2 Content-based filtering module 18
3.1.2.1 The method based on cognation 20
3.1.2.2 The method based on identifying translation segments
3.1.3 Structure analysis 28
List of Figures
1.1 An example of English-Vietnamese parallel texts 2
2.5 The algorithm of translation pairs finder [3]
2.6 Architecture of the PTI system [4]
2.7 An example of the two links in the text
3.1 Architecture of the Parallel Text Mining system 17
3.4 Description of the process of the content-based filtering module 20
3.5 An example of two corresponding texts of English and Vietnamese 22
3.6 The algorithm measures similarity of cognates between a text pair 25
3.9 Identifying translation paragraphs 37
3.10 A sample code written in Java to perform translation from English into Vietnamese via the Google AJAX API
3.11 Web documents and the source HTML code for two parallel translations
3.12 An example of the publication date feature extracted from an HTML page 30
3.13 Classification model 31
4.1 Figure for precision and recall measures 32
4.2 The format of training and testing data 34
4.3 Performance of the identifying translation segments method 38
4.4 Comparison of the methods 39
List of Tables
1.1 Europarl parallel corpus: 10 aligned language pairs, all of which
Overall results of each method (P-Pr
URLs from three sites: BBC, VOA News and VietnamPlus 33
No. of pages downloaded and No. of candidate pairs
Structure-based method
Content-based method
Method based on cognation
Combining structural features and cognate information
Identifying translation at document level
Chapter 1
Introduction
In this chapter, we first introduce the parallel corpus and its role in NLP applications. The objectives and contributions are then presented. Finally, the thesis' structure is briefly described.
Parallel text
Different definitions of the term “parallel text” (also known as bitext) can be found in the literature. As a common understanding, a parallel text is a text in one language together with its translation in another language. Dan Tufis [5] gives a definition: “parallel text is an association between two texts in different languages that represent translations of each other”. Figure 1.1 shows an example of English-Vietnamese parallel texts.
Parallel corpus
A parallel corpus is a collection of parallel texts. According to [6], the simplest case is where only two languages are involved: one of the corpora is an exact translation of the other (e.g., the COMPARA corpus [7]). However, parallel corpora exist in several languages. For instance, the Europarl parallel corpus [8] includes versions as reported in Table 1.1. In addition, the same parallel
[Figure 1.1 appears here: a two-column table of five aligned English-Vietnamese paragraph pairs about Olaudah Equiano and early American slave narratives.]
FIGURE 1.1: An example of English-Vietnamese parallel texts.
corpus may have been translated from language L1 to language L2, and other parts the other way around. The direction of the translation may not even be known.
Parallel corpora exist in several formats. They can be raw parallel texts, or they can be aligned texts. The texts can be aligned at the paragraph level, the sentence level, or even at the phrase and word levels. The alignment of the texts is useful for extracting semantically equivalent components of the parallel texts; such components as words, phrases, and sentences are useful for bilingual dictionary construction [14, 15]. Parallel texts are also used for acquisition of lexical translations [16] or word sense disambiguation [17]. For most of the mentioned tasks, parallel corpora currently play a crucial role in NLP applications.
[Table 1.1 appears here: Europarl parallel corpus statistics, giving word counts per aligned language pair (e.g., Portuguese-English, Swedish-English), on the order of tens of millions of words each.]
processing. For that reason, many studies [1-4, 18-22] pay attention to mining parallel corpora from the Web. Basically, we can classify these studies into three groups: content-based (CB) [3, 4, 22], structure-based (SB) [1, 2, 18], and hybrid (a combination of both methods) [19-21].
The CB approach uses the textual content of the parallel document pairs being evaluated. This approach usually uses lexicon translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts. When the bilingual dictionary is available, documents are translated word by word into the target language. The translated documents are then used to find the best matching parallel documents by applying similarity score functions such as cosine, Jaccard, Dice, etc. However, using a bilingual dictionary may face difficulty because a word usually has many translations.
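The scoring step described above can be sketched as follows. This is a minimal illustration, not any cited system's implementation: a toy bilingual dictionary maps each source word to a single target word (which is exactly the ambiguity problem the text notes), and the translated documents are compared with cosine, Jaccard, and Dice scores.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two token lists, via count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dice(a, b):
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

def translate(tokens, dictionary):
    """Word-by-word translation; unknown words are passed through.
    A real dictionary offers MANY translations per word, hence the
    ambiguity problem discussed in the text."""
    return [dictionary.get(t, t) for t in tokens]
```

The best-matching parallel document for a given source document would then be the candidate maximizing one of these scores over the translated tokens.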
Meanwhile, the SB approach relies on analysis of the HTML structure of the pages. This approach uses the hypothesis that parallel web pages are presented with similar structures. The similarity of the web pages is estimated based on their HTML structure. Note that this approach does not require linguistic knowledge. In addition, this approach is very effective in filtering out a large number of unmatched pairs. However, in some cases on the Web, the structure of two pages is similar but their content is different. For that reason, the HTML structure-based approach is not applicable in some cases.
As we have introduced, the parallel corpus is a valuable resource for different NLP tasks. Unfortunately, the available parallel corpora are not only relatively small in size, but also unbalanced, even for the major languages [3]. Some resources are available, such as for English-French, but the data are usually restricted to government documents (e.g., the Hansard corpus) or newswire texts. Others have limited availability due to licensing restrictions, as in [23]. According to [24], there are now some reliable parallel corpora: the Hansard Corpus¹, the JRC-Acquis Parallel Corpus², the Europarl parallel corpus³, and the COMPARA corpus⁴. There are only a few studies on mining parallel corpora from the Web, which is a big motivation for many studies on this task.
The objective of this research is to extract parallel texts from bilingual web sites, using methods based on cognation and on identifying translation segments. Then, we combine content-based features with structural features under a machine learning framework.
¹http://www.isi.edu/natural-language/download/hansard/
²http://langtech.jrc.it/JRC-Acquis.html
³http://www.statmt.org/europarl/
⁴http://www.linguateca.pt/COMPARA/
Encouraged by [20], we formulate this problem as a classification problem to utilize as much as possible the knowledge from structural information and the similarity of content. The most important contribution of our work is that we propose two new methods of designing content-based features, combined with structure-based features, to extract parallel texts from bilingual web sites:
• The first method is based on cognation. It is worth emphasizing that, differently from previous studies [2, 20], we use cognate information in place of word-by-word translation. From our observation, when translating a text from one language to another, some special parts are kept the same or changed only a little. These parts are usually abbreviations, proper nouns, and numbers. We also use other content-based features, such as the length of tokens and the length of paragraphs, which do not require any linguistic analysis. It is worth noting that this approach does not need any dictionary; thus we think it can be applied to other language pairs.
• The second method, based on identifying translation segments, is used to match translation paragraphs. That helps us to extract proper translation units in bilingual web pages. Previous studies usually use lexicon translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts, such as in [4, 20]. This approach may face difficulty because a word usually has many translations. Differently, we use the Google translator, because by using it we can utilize the advantages of statistical machine translation: it helps in disambiguating lexical ambiguity, translating phrases, and reordering words.
Given below is a brief outline of the topics discussed in the next chapters of this thesis:
Chapter 2 - Related works
The studies that have close relations with our work are introduced in this chapter.
Chapter 3 - The proposed approach
We show our proposed model, including the general architecture of the model and how structural features and content-based features are designed and estimated.
Chapter 4 - Experiment
This chapter evaluates the effectiveness of our proposed method for extracting parallel texts from the Web. The performance of our proposed method and the baseline are presented here.
Chapter 5 - Conclusion and Future works
Final conclusions about our work as a whole and the evaluation of the results in particular are presented, followed by suggestions of possible future work that could be done.
Finally, the references introduce research that is closely related to our work.
Chapter 2
Related works
In this chapter, we outline the general framework for building a parallel corpus. Then, we review the studies that are closely related to our work.
FIGURE 2.1: General architecture in building parallel corpus
In general, there are two approaches to building a parallel corpus (illustrated in Figure 2.1). The first one is to automatically collect bilingual documents from the Web to extract parallel texts (the detail of this task is presented in the next sections). The other one is based on monolingual corpora [25]. As seen from the diagram, starting with two large monolingual corpora (a non-parallel corpus) divided into documents, this approach is composed of three steps: (1) selecting pairs of similar documents; (2) from each such pair, generating all possible sentence pairs and passing them through a simple word-overlap-based filter, thus obtaining candidate sentence pairs; and (3) presenting the candidates to a maximum entropy (ME) classifier that decides whether the sentences in each pair are mutual translations.
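Step (2), the word-overlap-based filter, can be sketched as below. This is an illustrative simplification under assumed conventions (a lexicon mapping each source word to a set of possible target words), not the exact filter of [25]:

```python
def word_overlap_ratio(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens that have at least one translation
    appearing in the target sentence. `lexicon` maps a source word
    to a set of possible target words (illustrative format)."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt)
    return hits / len(src_tokens)

def filter_candidates(pairs, lexicon, threshold=0.5):
    """Keep only sentence pairs whose overlap ratio meets the threshold;
    the survivors would go on to the ME classifier."""
    return [(s, t) for s, t in pairs
            if word_overlap_ratio(s, t, lexicon) >= threshold]
```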
The original STRAND is an architecture for structural translation recognition, acquiring natural data. Its goal is to identify pairs of web pages that are mutual translations. In order to do this, it exploits an observation about the way that web page authors disseminate information in multiple languages: when presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure. STRAND therefore locates pages that might be translations of each other, via a number of different strategies, and filters out page pairs whose page structures diverge by too much. The STRAND architecture has three basic steps (illustrated in Figure 2.2):
[Figure 2.2 appears here: candidate pair generation, followed by a structural, language-independent evaluation that outputs translation pairs.]
FIGURE 2.2: The STRAND architecture [1]
• Location of pages that might have parallel translations,
• Generation of candidate pairs that might be translations, and
• Structural filtering out of non-translation candidate pairs.
The heart of STRAND is a structural filtering process that relies on analysis of the pages' underlying HTML to determine a set of pair-specific structural values, and then uses those values to decide whether the pages are translations of one another. The first step in this process is to linearize the HTML structure and ignore the actual linguistic content of the documents. Both documents in the candidate pair are run through a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of tokens:
[START:element_label] e.g., [START:H3]
[END:element_label] e.g., [END:H3]
[Chunk:length] e.g., [Chunk:174], for non-markup text
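A minimal sketch of such a markup transducer, using only the Python standard library, is shown below. It is an illustration of the token scheme just described, not the original STRAND implementation:

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Linearize a page into [START:label], [END:label], and
    [Chunk:length] tokens, discarding the linguistic content."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # keep only the length of non-markup text
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    p = Linearizer()
    p.feed(html)
    return p.tokens
```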
Trang 15The socond step is to align the linearized sequences using a standard dynamic programming technique For example, consider two documents that begin as Fig-
Ficure 2.3: An example of aligning two documents
Using this alignment, the authors compute four values from the aligned strue- tures which indicate the amount of non-shared material, the number of aligned non-markup text chunks of unequal length, the correlation of lengths of the aligned non-markup chunks, and the significance level of the correlation Machine learn- ing, namely decision trees, are then used for filtering, based on these four values
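The dynamic programming alignment can be sketched as a generic global alignment (Needleman-Wunsch style) over the two token sequences; gaps then correspond to non-shared material. The cost model here is illustrative, not the one used in [1]:

```python
def align(a, b, gap=-1, match=2, mismatch=-2):
    """Globally align two token sequences; returns (x, y) pairs where
    None marks a gap (non-shared material)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Backtrace to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return list(reversed(pairs))
```

From such an alignment one can count the gapped tokens and compare the lengths of aligned [Chunk:…] tokens, which is the information the four STRAND values summarize.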
The PTMiner system [2] works on extracting bilingual English-Chinese documents. This system uses a search engine to locate hosts containing the parallel web pages. In order to generate candidate pairs, the PTMiner uses a URL-matching process (e.g., the Chinese counterpart of a URL such as “http://www.XXXX.com/.../eng/....html” might be “http://www.XXXX.com/.../chi/....html”) and other heuristics.
FIGURE 2.4: The workflow of the PTMiner system [2]
The PTMiner implements the following steps (illustrated in Figure 2.4):
1. Search for candidate sites - Using existing Web search engines, search for the candidate sites that may contain parallel pages.
2. Filename fetching - For each candidate site, fetch the URLs of Web pages that are indexed by the search engines.
3. Host crawling - Starting from the URLs collected in the previous step, search through each candidate site separately for more URLs.
4. Pair scan - From the obtained URLs of each site, scan for possible parallel pairs.
5. Download and verifying - Download the parallel pages; determine the file size, language, and character set of each page; and filter out non-parallel pairs.
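The pair-scan step can be sketched as below. The language-marker substitutions are illustrative examples in the spirit of the URL-matching process above, not PTMiner's actual rule set:

```python
import re

# Illustrative (source_marker, target_marker) pairs; PTMiner's real
# rules are more extensive.
LANG_PAIRS = [("eng", "chi"), ("en", "ch"), ("english", "chinese")]

def candidate_partner(url):
    """Return possible counterpart URLs by swapping language markers
    that appear as whole path/filename segments."""
    out = []
    for a, b in LANG_PAIRS:
        pat = re.compile(rf"(?<=[/_.]){a}(?=[/_.])")
        if pat.search(url):
            out.append(pat.sub(b, url))
    return out

def pair_scan(urls):
    """From the obtained URLs of one site, return (url, partner)
    pairs where both URLs actually exist on the site."""
    urlset = set(urls)
    pairs = []
    for u in urls:
        for v in candidate_partner(u):
            if v in urlset:
                pairs.append((u, v))
    return pairs
```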
In the experiment, several hundred selected pairs were evaluated manually. The results were quite promising: from a corpus of 250 MB of English-Chinese text, statistical evaluation showed that of the pairs identified, 90% were correct.
2.3 Content-based methods
The approach discussed thus far relies heavily on document structure. However, as Ma and Liberman [3] point out, not all translators create translated pages that look like the original page. Moreover, structure-based matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags. All these considerations motivate an approach to matching translations that pays attention to similarity of content, whether or not similarities of structure exist. In this section, we review systems of this kind: Bilingual Internet Text Search (BITS) [3], Parallel Text Identification (PTI) [4], and Dang's system [22].
In BITS, the similarity between two documents is measured by the algorithm in Figure 2.5: for each document A, the B which is most similar to A is found, and if the similarity between A and B is greater than a given threshold t, then A and B are declared a translation pair. The similarity between A and B is defined as

    sim(A, B) = Number of translation token pairs / Number of tokens in text A    (2.1)

In the experiment, Ma and Liberman use an English-German bilingual lexicon of 117,793 entries. The authors report 99.1% precision and 97.1% recall on a hand-picked set of 600 documents (half in each language) containing 240 translation pairs (as judged by humans).
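Equation (2.1) and the thresholded best-match step can be sketched as follows. The lexicon format and function names are illustrative, not the BITS code:

```python
def sim(a_tokens, b_tokens, lexicon):
    """Eq. (2.1): translation token pairs over the token count of A.
    `lexicon` maps a word of A's language to a set of translations."""
    if not a_tokens:
        return 0.0
    b_set = set(b_tokens)
    pairs = sum(1 for w in a_tokens if lexicon.get(w, set()) & b_set)
    return pairs / len(a_tokens)

def best_match(a_tokens, candidates, lexicon, t=0.3):
    """Find the candidate most similar to A; declare a translation
    pair only when the similarity exceeds the threshold t."""
    best = max(candidates, key=lambda b: sim(a_tokens, b, lexicon))
    return best if sim(a_tokens, best, lexicon) > t else None
```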
The PTI system (illustrated in Figure 2.6) crawls the Web to fetch parallel multilingual Web documents using a Web spider. To determine the parallelism between potential bilingual document pairs, two different modules are developed. A filename comparison module is used to check filename resemblance. A content analysis module is used to measure the degree of semantic similarity. It incorporates a novel content-based similarity scoring method for measuring the degree of parallelism for every potential document pair based on their semantic content
[Figure 2.5 appears here: pseudocode that, for each document A, computes sim(A, B) over candidate documents B and outputs the pair (A, B) when max_sim exceeds the threshold t.]
FIGURE 2.5: The algorithm of translation pairs finder [3]
using a bilingual wordlist. The results showed that the PTI system achieves a precision rate of 93% and a recall rate of 96% (180 correct instances in a total of 193 pairs extracted).
[Figure 2.6 appears here: the PTI pipeline, in which a Web spider feeds documents to a filename comparison module and a content analysis module, after which non-parallel documents are discarded.]
FIGURE 2.6: Architecture of the PTI system [4]
To our knowledge, there are few studies in this field related to Vietnamese. [22] built an English-Vietnamese parallel corpus based on content-based matching. Firstly, candidate web page pairs are found by using the features of sentence length and date. Then, they measure the similarity of content using a bilingual English-Vietnamese dictionary. However, this method only works for parallel pages that are good translations of each other, and they are required to be written in the same style. Moreover, using word-by-word translation will cause much ambiguity. Therefore, this approach is difficult to extend when the data increases, as well as when applying it to bilingual web sites with various styles.
Another instance of this approach is that, instead of using a bilingual dictionary, a simple word-based statistical machine translation system is used to translate texts in one language to the other. [26] uses this method to build an English-Chinese parallel corpus from a huge text collection of the Xinhua Web bilingual news corpora collected by the LDC¹. By adding the newly built parallel corpus to their existing corpus, they reported an increase in the translation quality of their word-based statistical machine translation in terms of word alignment. A bootstrapping approach [27] can also be applied to incrementally increase the number of both parallel sentences and bilingual lexical vocabulary.
The last version of STRAND [20] is another well-known web parallel text mining system. Its goal is to identify pairs of web pages that are mutual translations. The authors used the AltaVista search engine to search for multilingual web sites and generated candidate pairs based on manually created substitution rules. The heart of STRAND is a structural filtering process that relies on analysis of the pages' underlying HTML to determine a set of pair-specific structural values, and then uses those values to filter the candidate pairs. This system also proposes a new method that combines content-based and structure matching by using a cross-language similarity score as an additional parameter of the structure-based method. A translation lexicon is used to link tokens between pairs of parallel documents. A link is a pair (x, y) in which x is a word in language L1 and y is a word in L2. An example of two texts with links is illustrated in Figure 2.7. Using the results of MCBM², they defined the tsim translational similarity measure as

    tsim = Number of two-word links in best matching / Number of links in best matching    (2.2)
¹Linguistic Data Consortium, at http://www.ldc.upenn.edu/
²Problem of maximum cardinality bipartite matching
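The tsim computation in Eq. (2.2) can be sketched as follows: lexicon links form a bipartite graph between the two token sequences, a maximum cardinality matching is found (a simple augmenting-path version here), and tokens left unmatched count as links to NULL. This is an illustrative reconstruction, not the code of [20]:

```python
def max_matching(left, right, linked):
    """Maximum cardinality bipartite matching (Kuhn's augmenting
    paths). `linked(x, y)` says whether x may link to y."""
    match_r = {}  # index into `right` -> token of `left`

    def try_assign(x, seen):
        for j, y in enumerate(right):
            if linked(x, y) and j not in seen:
                seen.add(j)
                if j not in match_r or try_assign(match_r[j], seen):
                    match_r[j] = x
                    return True
        return False

    for x in left:
        try_assign(x, set())
    return match_r

def tsim(e_tokens, v_tokens, lexicon):
    """Eq. (2.2): two-word links over all links, where unmatched
    tokens are treated as links to NULL."""
    m = max_matching(e_tokens, v_tokens,
                     lambda x, y: y in lexicon.get(x, set()))
    two_word = len(m)
    total = two_word + (len(e_tokens) - two_word) + (len(v_tokens) - two_word)
    return two_word / total if total else 0.0
```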
[Figure 2.7 appears here: the English sentence “They plow the paddy fields and pull a cart” linked word by word to the Vietnamese “Họ cày ruộng lúa và kéo xe”, with unmatched tokens linked to NULL.]
FIGURE 2.7: An example of the two links in the text
In the experiment, approximately 400 pairs were evaluated by human annotators. STRAND produced fewer than 3500 English-Chinese pairs with a precision of 98% and a recall of 61%.
Among other systems, [19] proposed a method that combines length-based criteria with other features to detect parallel texts in a bilingual page. [28] proposed a similar approach: the author presents a system that automatically collects bilingual texts from the Internet, where the criteria for parallel text detection are based on the size, the HTML structures, and a word-by-word translation model.
In this chapter, we presented related works on mining parallel corpora from the Web. The content-based approach usually uses a bilingual dictionary to match word-word pairs in the two languages. Meanwhile, the structure-based approach relies on analysis of the HTML structure of the pages. In real implementations, both approaches are usually employed to get good performance. Generally, the structure-based
Trang 21The proposed approach
In this chapter, we introduce our proposed model, including the general architecture of the model and how structural features and content-based features are designed and estimated. We also present the classification modeling in our system.
In this work, our proposed approach combines content-based features and structure-based features of the HTML pages to extract parallel texts from the Web by using machine learning [20]. The machine learning algorithm used here is the Support Vector Machine (SVM). Figure 3.1 illustrates the general architecture of our system. As shown in the model, it includes the following tasks:
• Firstly, we use a crawler on the specified domains to extract bilingual English-Vietnamese pages, which are called raw data.
• Secondly, from the raw data, we create candidates of parallel web pages by using some thresholds on extracted features (content-based features and the date feature).
• Thirdly, we manually label these candidates to obtain training data. It means that we will obtain some pairs of parallel web pages which are assigned label 1, and some other pairs of web pages which are assigned label 0 (the detail of this task is presented in the experiment section).
FIGURE 3.1: Architecture of the Parallel Text Mining system
• Fourthly, we extract structural features and content-based features so that each web page pair can be represented as a vector of these features. This representation is required to fit a classification model.
• Finally, we use an SVM tool to train a classification system on this training data.
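The last two steps can be sketched as below. The thesis uses an SVM tool; here a tiny perceptron stands in as the linear classifier so the sketch stays self-contained, and the feature names are illustrative, not the thesis' actual feature set:

```python
def features(pair):
    """Turn one candidate page pair's raw measurements into a feature
    vector. The keys here are illustrative placeholders."""
    return [pair["structural_sim"], pair["cognate_sim"], pair["length_ratio"]]

def train_linear(xs, ys, epochs=50, lr=0.1):
    """Train a linear classifier by the perceptron rule (a stand-in
    for the SVM tool). The last weight is the bias."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else 0
            if pred != y:
                for i, xi in enumerate(x + [1.0]):
                    w[i] += lr * (y - pred) * xi
    return w

def predict(w, x):
    """Label 1 means 'parallel pair', label 0 means 'not parallel'."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else 0
```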
Programs that collect web pages are referred to as Web crawlers or spiders. In general terms, the working of a Web crawler is as in Figure 3.2. A typical Web crawler, starting from a set of seed pages, locates new pages by parsing the downloaded pages and extracting the hyperlinks (in short, links) within. Extracted links are stored in a FIFO fetch queue for further
[Figure 3.2 appears here: the World Wide Web feeding downloaded Web pages into the crawler, which outputs text and metadata.]
FIGURE 3.2: Architecture of a standard Web crawler
retrieval. Crawling continues until the fetch queue becomes empty or a satisfactory number of pages have been downloaded.
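The crawl loop just described can be sketched as follows; `fetch` is a stand-in for real HTTP retrieval and link extraction:

```python
from collections import deque

def crawl(seeds, fetch, max_pages=100):
    """fetch(url) -> (text, links). Visit pages breadth-first from the
    seeds until the FIFO queue empties or the page budget is reached;
    returns the list of visited URLs."""
    queue = deque(seeds)   # FIFO fetch queue
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        visited.append(url)
        for link in links:  # extracted hyperlinks go to the back
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```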
In our work, bilingual English-Vietnamese web pages are collected by crawling the Web using a Web spider, as in [4]. To execute this process, our system uses Teleport-Pro¹ to retrieve web pages from remote web sites. Teleport-Pro is a tool designed to download documents on the Web via the HTTP and FTP protocols and store the extracted data on disk [3]. Note that we select the URLs on the specified hosts from the three news sites: BBC, VietnamPlus, and VOA News. For example, the URL on the BBC site for English is “http://www.bbc.co.uk”, and “http://www.bbc.co.uk/vietnamese/” for Vietnamese. Then, we use Teleport-Pro to download the HTML pages for obtaining the candidate web pages.
3.1.2 Content-based filtering module
The HTML pages are converted to plain text after they are retrieved from the remote web sites. Note that the original web pages usually contain useless user interface components such as JavaScript, Flash, etc. So, we use a simple script to clean them and extract only the text for content-based matching.
¹http://www.tenmax.com/teleport/pro/home.htm
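A minimal stand-in for such a cleaning script is shown below. It strips script and style blocks and all remaining tags with the standard library; real pages need more care (entities, encodings), so this is only a sketch:

```python
import re

def html_to_text(html):
    """Keep only the visible text of an HTML page for content-based
    matching."""
    # drop script/style blocks entirely, then remove remaining tags
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(html.split())
```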
[Figure 3.3 appears here: screenshots of a candidate page pair, a Vietnamese news article and its English counterpart, “Steel prices on the rise”, reporting a sharp increase in steel prices per tonne.]
FIGURE 3.3: An example of a candidate pair
As a common understanding, using content-based features we want to determine whether two pages are mutual translations. However, as [3] pointed out, not all translators create translated pages that look like the original page. Moreover, structure-based matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags [20]. Many studies have used this approach to build a parallel corpus from the Web, such as [4, 22]. They use a bilingual dictionary to measure the similarity of the contents of two texts. However, this method can cause much ambiguity because a word usually has many translations. For English-Vietnamese, one word in English can correspond to multiple words in Vietnamese. To overcome this limitation, we propose two new methods in the content-based filtering module.
3.1.2.1 The method based on cognation
This method uses cognate information, which provides a cheap and reasonable resource. This proposal is based on the observation that a document usually contains some cognates, and if two documents are mutual translations then the cognates are usually kept the same in both of them. The cognates are words that are spelled similarly in two languages, or words that simply are not translated (e.g., abbreviations). For example, if the word “WTO” appears in an English text, it probably also appears as “WTO” in a Vietnamese text. Note that [30] also use cognates, but for sentence alignment. We divide the tokens that are considered cognates into three types as follows:
1. The abbreviations (e.g., “EU”, “WTO”),
2. The proper nouns in English (e.g., “Paris”), and
3. The numbers.
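Classifying tokens into these three types can be sketched with simple heuristics; the patterns below are illustrative rules, not the thesis' exact implementation:

```python
import re

def cognate_type(token):
    """Classify a token as one of the three cognate types, or None."""
    if re.fullmatch(r"\d+(?:[.,]\d+)?", token):
        return "number"
    if re.fullmatch(r"[A-Z]{2,}", token):        # e.g. EU, WTO
        return "abbreviation"
    if re.fullmatch(r"[A-Z][a-z]+", token):      # e.g. Paris
        return "proper_noun"
    return None

def extract_cognates(tokens):
    """Keep only the tokens considered cognates."""
    return [t for t in tokens if cognate_type(t)]
```

Note that a capitalized sentence-initial word would also match the proper-noun rule; a real system would need extra context to filter those out.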
The similarity is estimated by the ratio between the number of corresponding cognates in the two texts and the number of tokens in one text (e.g., the English text). Given a pair of texts (Etext, Vtext), where Etext stands for English and Vtext stands for Vietnamese, we respectively obtain the token sets of cognates T_E and T_V from Etext and Vtext. For a robust matching between cognates, we make some modifications of the original tokens:

• A number which is written as a sequence of letters in the English alphabet is converted into a real number. According to our observations, the units of numbers in English are often retained when translated into Vietnamese. So, we do not consider the case where the units are different (e.g., inch vs cm, pound vs kg, USD vs VND, etc.).

• We use a list which contains the corresponding names between English and Vietnamese. They include names of countries, continents, dates, etc. However, the names of countries in English can be translated into Vietnamese in different ways. Therefore, we only consider those names in English whose corresponding Vietnamese names have been published on the Wikipedia site².

Figure 3.5 is an example of two corresponding texts of English and Vietnamese. Having obtained T_E and T_V, we measure the similarity of cognates between Etext and Vtext by using the algorithm presented in Figure 3.6.
If sim_cognates(Etext, Vtext) is greater than a threshold, then the pair (Etext, Vtext) is a candidate. The sim_cognates(Etext, Vtext) is calculated as in formula (3.1).
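Since formula (3.1) itself is not reproduced in this excerpt, the following sketch implements only the verbal description above: count the cognates shared by the two texts and normalize by the token count of the English text. The function names and the pluggable `extract` step are assumptions for illustration:

```python
def sim_cognates(etext_tokens, vtext_tokens, extract):
    """Ratio of corresponding cognates to the number of tokens in the
    English text, per the description above. `extract(tokens)` returns
    the cognate tokens of a text."""
    if not etext_tokens:
        return 0.0
    t_e = extract(etext_tokens)
    t_v = set(extract(vtext_tokens))
    matched = sum(1 for t in t_e if t in t_v)
    return matched / len(etext_tokens)

def is_candidate(etext_tokens, vtext_tokens, extract, threshold=0.05):
    """Declare the pair a candidate when the similarity exceeds the
    threshold (the threshold value here is a placeholder)."""
    return sim_cognates(etext_tokens, vtext_tokens, extract) > threshold
```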